Michal Krupa

Log Amplification: When Observability Becomes the Workload

June 10, 2026

Modern production systems often treat logs as nearly free. They are not. On small virtual hosts, container platforms, Kubernetes nodes, and machine learning gateways, logging can become a meaningful workload in its own right: it consumes CPU, disk I/O, storage, and network egress, and it can add latency. The failure mode is subtle because the application may be doing very little useful work while the host is busy recording the fact that it is failing.

I call this log amplification.

A small event like one failed HTTP request can produce logs at multiple layers:

application log
  -> container stdout/stderr
  -> container runtime log driver
  -> journald or node log file
  -> reverse proxy access/error log
  -> service manager logs
  -> collector/forwarder logs
  -> centralized logging storage
  -> alerts, traces, dashboards, and terminal tails

In a healthy system this may be tolerable. In a failure loop, it can become the largest workload on the machine.

A concrete example

Suppose a CI runner polls an API endpoint while the backend Rails/Puma process is down. One request can produce a runner log, reverse proxy access log, workhorse/backend error, container log entry, journald entry, and possibly a forwarded log record. If it happens every second, the system is no longer simply “down”; it is also generating a steady stream of diagnostic traffic.

That diagnostic stream can be larger than the original application traffic.

Why this matters on virtual hosts

Small VMs often have enough CPU and RAM for the application, but weak virtual disk performance. Journaling, container log writes, fsyncs, compression, indexing, rotation, and log shipping can dominate the host during failure conditions.

Symptoms include:

Kubernetes makes this more visible because container stdout/stderr is collected through the node logging pipeline. Kubernetes documents that container runtimes redirect application stdout/stderr and that kubelet/container runtimes handle those logs using the CRI logging format. That is useful, but it also means every application log line enters a broader host-level log system.

Why this matters even more for ML systems

LLM and machine learning systems add another dimension. “Logging” may include:

Logging full prompts and completions by default is risky for performance, cost, and privacy. OpenTelemetry’s GenAI conventions include token-related metrics, which points toward a better pattern: measure token usage without necessarily storing full text payloads.

A safer default is:

Log metadata by default:
  request id
  tenant or service id
  model
  route
  latency
  input token count
  output token count
  status
  cache hit/miss
  cost estimate

Do not log full prompts/completions by default.

Full payload logging should be temporary, sampled, redacted, access-controlled, and short-retention.
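As a minimal sketch of the metadata-only default above, the following shell function prints one JSON line per request. The function name log_llm_request, the field names, and the sample values are illustrative assumptions, not part of any library or of OpenTelemetry's conventions. A real gateway would use its own language's JSON encoder rather than printf.

```shell
# Hypothetical helper (not from any library): emit one metadata-only JSON line.
# Args: request_id tenant model route latency_ms input_tokens output_tokens status cache cost_usd
# Assumes arguments contain no quotes or backslashes; this sketch does no JSON escaping.
log_llm_request() {
  if [ "$#" -ne 10 ]; then
    echo "usage: log_llm_request <10 fields>" >&2
    return 1
  fi
  printf '{"request_id":"%s","tenant":"%s","model":"%s","route":"%s",' "$1" "$2" "$3" "$4"
  printf '"latency_ms":%d,"input_tokens":%d,"output_tokens":%d,' "$5" "$6" "$7"
  printf '"status":"%s","cache":"%s","cost_usd":%s}\n' "$8" "$9" "${10}"
}

# Sample call with made-up values; no prompt or completion text is passed in.
log_llm_request req-123 team-a example-model /v1/chat 840 512 128 ok miss 0.0031
```

The point of the shape is that the record size stays roughly constant per request, regardless of how long the prompt or completion was.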

The diagnostic model

Treat logs as a workload. Measure:

log bytes/sec
log lines/sec
log storage growth per day
logging-agent CPU
logging-agent disk I/O
log egress volume
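One rough way to get the first of these numbers is to sample the size of a log directory twice and divide by the interval. The sketch below assumes GNU du (for the -b flag) and a positive whole number of seconds; the function name log_growth_rate is hypothetical.

```shell
# Hypothetical helper: log_growth_rate PATH [SECONDS]
# Prints the approximate growth of PATH in bytes per second over the interval.
# Assumes GNU du (-b reports apparent size in bytes). The result can be
# negative if rotation or vacuuming removes files during the sample.
log_growth_rate() {
  local path=$1
  local interval=${2:-10}
  local before after
  before=$(du -sb "$path" 2>/dev/null | cut -f1)
  sleep "$interval"
  after=$(du -sb "$path" 2>/dev/null | cut -f1)
  # Fall back to 0 if du could not read the path.
  echo $(( (${after:-0} - ${before:-0}) / interval ))
}
```

For example, sudo may be needed to read the journal directory: run the function from a root shell as log_growth_rate /var/log/journal 30. Integer division means very slow growth rounds down to 0.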

Then estimate:

log amplification ratio = stored or shipped log bytes / application-emitted log bytes

A ratio of 2x may be normal. A ratio of 10x is worth understanding. A ratio of 100x during a failure loop is an operational bug.
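The ratio is plain division, but it helps to compute it the same way every time. A minimal sketch, assuming both byte counts were measured over the same time window; the function name log_amp_ratio and the sample figures are hypothetical.

```shell
# Hypothetical helper: log_amp_ratio EMITTED_BYTES STORED_OR_SHIPPED_BYTES
# Prints the amplification ratio with one decimal place.
log_amp_ratio() {
  local emitted=$1
  local stored=$2
  if [ "$emitted" -le 0 ]; then
    echo "emitted bytes must be positive" >&2
    return 1
  fi
  # LC_ALL=C keeps the decimal separator a dot regardless of locale.
  LC_ALL=C awk -v e="$emitted" -v s="$stored" 'BEGIN { printf "%.1f\n", s / e }'
}

# Made-up example: the application wrote 2 MiB; the host stored or shipped 24 MiB.
log_amp_ratio 2097152 25165824   # prints 12.0
```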

Quick Linux checks

sudo journalctl --disk-usage
sudo du -sh /var/log /var/log/journal 2>/dev/null
sudo iotop -aoP
pidstat -d 1
pidstat -u 1

For recent journal volume:

journalctl --since "10 minutes ago" --no-pager | wc -c
journalctl --since "10 minutes ago" --no-pager | wc -l
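These totals count journalctl's rendered text, not on-disk journal bytes, and they do not say which source is responsible. The following sketch breaks the same output down per source. It assumes journalctl's default short output, where the fifth whitespace-separated field is the identifier with an optional [pid]; continuation lines of multi-line messages may be misattributed. The function name journal_top_sources is hypothetical.

```shell
# Hypothetical helper: journal_top_sources [N]
# Reads journalctl "short" output on stdin and prints "bytes lines source"
# for the N largest sources (default 10), largest first.
journal_top_sources() {
  LC_ALL=C awk '
    /^-- / { next }                      # skip journalctl marker lines
    NF >= 5 {
      src = $5
      sub(/\[[0-9]+\]:?$/, "", src)      # drop a trailing [pid] and colon
      sub(/:$/, "", src)                 # drop a bare trailing colon
      bytes[src] += length($0) + 1       # +1 for the newline
      lines[src]++
    }
    END {
      for (s in bytes) printf "%d %d %s\n", bytes[s], lines[s], s
    }
  ' | sort -k1,1nr | head -n "${1:-10}"
}
```

Typical use is to pipe the same command into it: journalctl --since "10 minutes ago" --no-pager | journal_top_sources 5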

For rootless Podman (this example assumes the containers run under a user named containers):

sudo -iu containers podman ps
sudo -iu containers podman logs --since 10m <container> | wc -c

For Kubernetes nodes:

sudo du -sh /var/log/containers /var/log/pods 2>/dev/null
# Entries under /var/log/containers are typically symlinks, so follow them with -L.
sudo find -L /var/log/containers -type f -printf '%s %p\n' | sort -nr | head

Baseline controls

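For journald, configure explicit retention and rate limits. The rate limit is applied per service: with the values below, journald drops messages from any one service once it exceeds roughly 1000 messages within a 30-second interval.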

# /etc/systemd/journald.conf.d/limits.conf
[Journal]
SystemMaxUse=1G
RuntimeMaxUse=256M
MaxRetentionSec=7day
RateLimitIntervalSec=30s
RateLimitBurst=1000

Then restart journald:

sudo systemctl restart systemd-journald

For smaller hosts, lower limits may be appropriate:

SystemMaxUse=512M
RuntimeMaxUse=128M
MaxRetentionSec=3day

Other controls:

A small audit script

The accompanying logamp-audit.sh script checks common signals:


Example:

./logamp-audit.sh --since "10 minutes ago"
sudo ./logamp-audit.sh --scan-path /srv:/opt

The script does not prove that a system is safe or unsafe. It is an inexpensive way to find where logs may be turning from observability into workload.

Production rule of thumb

Ask this of every service:

If this service fails in a tight loop for ten minutes, will logging become the biggest workload?

If the answer is “yes” or “unknown,” the system needs log budgets, rate limits, sampling, and redaction before it is truly production-hardened.
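A back-of-the-envelope estimate is often enough to answer the question. The sketch below multiplies failure rate, log records per failure, bytes per record, and duration. The function name failure_loop_bytes and all four input figures are hypothetical; substitute measured values where possible.

```shell
# Hypothetical helper:
#   failure_loop_bytes EVENTS_PER_SEC RECORDS_PER_EVENT BYTES_PER_RECORD SECONDS
# Prints the total log bytes produced across all layers during the loop.
failure_loop_bytes() {
  echo $(( $1 * $2 * $3 * $4 ))
}

# Assumed figures: 1000 failures/sec, 6 log records per failure across layers,
# about 300 bytes per record, for ten minutes (600 seconds).
failure_loop_bytes 1000 6 300 600   # prints 1080000000
```

Under those assumptions a ten-minute loop produces about 1.08 GB of log data, which is on the order of the 1G SystemMaxUse cap shown earlier. At one failure per second the same loop produces only about 1 MB, so the failure rate is the number that matters most.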

References