Log Amplification: When Observability Becomes the Workload

June 10, 2026

Modern production systems often treat logs as nearly free. They are not. On small virtual hosts, container platforms, Kubernetes nodes, and machine learning gateways, logging can become a meaningful workload in its own right, consuming CPU, disk I/O, storage, and network egress, and adding latency. The failure mode is subtle because the application may be doing very little useful work while the host is busy recording the fact that it is failing.

I call this log amplification.

A small event like one failed HTTP request can produce logs at multiple layers:

application log
  -> container stdout/stderr
  -> container runtime log driver
  -> journald or node log file
  -> reverse proxy access/error log
  -> service manager logs
  -> collector/forwarder logs
  -> centralized logging storage
  -> alerts, traces, dashboards, and terminal tails

In a healthy system this may be tolerable. In a failure loop, it can become the largest workload on the machine.

A concrete example

Suppose a CI runner polls an API endpoint while the backend Rails/Puma process is down. One request can produce a runner log, a reverse proxy access log, a workhorse/backend error, a container log entry, a journald entry, and possibly a forwarded log record. If the poll repeats every second, the system is no longer simply “down”; it is also generating a steady stream of diagnostic traffic.

That diagnostic stream can be larger than the original application traffic.

Why this matters on virtual hosts

Small VMs often have enough CPU and RAM for the application, but weak virtual disk performance. Journaling, container log writes, fsyncs, compression, indexing, rotation, and log shipping can dominate the host during failure conditions.

Symptoms include:

high CPU from systemd-journald, conmon, rsyslog, fluent-bit, vector, otelcol, or filebeat
high disk writes despite low useful traffic
large /var/log/journal, /var/log/containers, or /var/log/pods
slow SSH sessions or slow container startup
reverse proxy logs filled with repeated 502, 404, health-check, or auth redirect lines
logging agents using more CPU than the service they monitor

Kubernetes makes this more visible because container stdout/stderr is collected through the node logging pipeline. The Kubernetes documentation describes how the container runtime redirects application stdout/stderr and how the kubelet and the runtime handle those logs in the CRI logging format. That is useful, but it also means every application log line enters a broader host-level log system.

Why this matters even more for ML systems

LLM and machine learning systems add another dimension. “Logging” may include:

prompts
completions
tool calls
retrieved documents
token counts
model names
latency and cost metadata
safety annotations
traces and spans
request and response bodies

Logging full prompts and completions by default is risky for performance, cost, and privacy. OpenTelemetry’s GenAI conventions include token-related metrics, which points toward a better pattern: measure token usage without necessarily storing full text payloads.

A safer default is:

Log metadata by default:
  request id
  tenant or service id
  model
  route
  latency
  input token count
  output token count
  status
  cache hit/miss
  cost estimate

Do not log full prompts/completions by default.

Full payload logging should be temporary, sampled, redacted, access-controlled, and subject to short retention.
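
The metadata-first default above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the logger name, the field names, the 1% sample rate, and the placeholder redact function are all assumptions made for the example.

```python
import json
import logging
import random

logger = logging.getLogger("llm_gateway")  # hypothetical logger name


def build_log_record(request_id, tenant, model, route, latency_ms,
                     input_tokens, output_tokens, status, cache_hit,
                     cost_estimate_usd):
    """Return a metadata-only record: no prompt or completion text."""
    return {
        "request_id": request_id,
        "tenant": tenant,
        "model": model,
        "route": route,
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "status": status,
        "cache_hit": cache_hit,
        "cost_estimate_usd": cost_estimate_usd,
    }


def maybe_attach_payload(record, prompt, completion, *, enabled=False,
                         sample_rate=0.01,
                         redact=lambda text: "[REDACTED]",
                         rng=random.random):
    """Attach redacted payloads only when explicitly enabled and sampled.

    The default redact function is a placeholder that discards the text
    entirely; a real redactor would be application-specific.
    """
    if enabled and rng() < sample_rate:
        record = dict(record)  # copy so the metadata-only record is untouched
        record["prompt"] = redact(prompt)
        record["completion"] = redact(completion)
    return record


def log_request(record):
    """Emit the record as one JSON line and return that line."""
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Payload capture is off unless a caller passes enabled=True, so turning it on is a deliberate, temporary act. Access control and retention are left to the log pipeline; this sketch does not address them.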

The diagnostic model

Treat logs as a workload. Measure:

log bytes/sec
log lines/sec
log storage growth per day
logging-agent CPU
logging-agent disk I/O
log egress volume

Then estimate:

log amplification ratio = stored or shipped log bytes / application-emitted log bytes

A ratio of 2x may be normal. A ratio of 10x is worth understanding. A ratio of 100x during a failure loop is an operational bug.
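
As a small worked example, the ratio and the rough bands above can be computed like this. The function names are hypothetical, and mapping in-between values to the highest threshold they cross is an assumption of this sketch, not a standard.

```python
def amplification_ratio(stored_or_shipped_bytes, app_emitted_bytes):
    """Stored or shipped log bytes divided by application-emitted log bytes."""
    if app_emitted_bytes <= 0:
        raise ValueError("application-emitted bytes must be positive")
    return stored_or_shipped_bytes / app_emitted_bytes


def classify(ratio):
    """Map a ratio to the highest rough threshold it crosses."""
    if ratio >= 100:
        return "operational bug"
    if ratio >= 10:
        return "worth understanding"
    if ratio >= 2:
        return "may be normal"
    return "below 2x"


# Example: 1 MB emitted by the application, 12 MB stored or shipped.
ratio = amplification_ratio(12_000_000, 1_000_000)
print(ratio, classify(ratio))  # 12.0 worth understanding
```

Both byte counts need to cover the same time window for the ratio to mean anything.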

Quick Linux checks

sudo journalctl --disk-usage
sudo du -sh /var/log /var/log/journal 2>/dev/null
sudo iotop -aoP
pidstat -d 1
pidstat -u 1

For recent journal volume:

journalctl --since "10 minutes ago" --no-pager | wc -c
journalctl --since "10 minutes ago" --no-pager | wc -l
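
Dividing those two counts by the window length (600 seconds for ten minutes) gives bytes/sec and lines/sec. A minimal Python sketch of the same arithmetic follows; log_rates is a hypothetical helper name, and it counts newline bytes the way wc -l does.

```python
def log_rates(sample: bytes, window_seconds: float):
    """Return (bytes_per_sec, lines_per_sec) for a captured log sample."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    line_count = sample.count(b"\n")  # same definition of a line as wc -l
    return len(sample) / window_seconds, line_count / window_seconds


# Example: 600 two-byte lines captured over a ten-minute window.
print(log_rates(b"a\n" * 600, 600))  # (2.0, 1.0)
```

The sample could come from the stdout of the journalctl command above, captured with subprocess.run(..., capture_output=True).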

For rootless Podman (assuming the containers run under a dedicated user, here named containers):

sudo -iu containers podman ps
sudo -iu containers podman logs --since 10m <container> | wc -c

For Kubernetes nodes:

sudo du -sh /var/log/containers /var/log/pods 2>/dev/null
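# /var/log/containers usually holds symlinks into /var/log/pods, so list the real files there
sudo find /var/log/pods -type f -printf '%s %p\n' 2>/dev/null | sort -nr | head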

Baseline controls

For journald, configure explicit retention and rate limits:

# /etc/systemd/journald.conf.d/limits.conf
[Journal]
SystemMaxUse=1G
RuntimeMaxUse=256M
MaxRetentionSec=7day
RateLimitIntervalSec=30s
RateLimitBurst=1000

Then restart journald:

sudo systemctl restart systemd-journald
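
It helps to know what those rate-limit numbers allow. journald applies the limit per service, so RateLimitBurst=1000 with RateLimitIntervalSec=30s is nominally about 33 messages per second from each service. The sketch below works that out for a ten-minute failure loop. The 200-byte average message size is an assumption for illustration, and journald also scales the effective limit with free disk space, so treat the result as a rough ceiling rather than a guarantee.

```python
import math


def nominal_message_ceiling(burst, interval_seconds, window_seconds):
    """Approximate max messages one service can log in the window."""
    intervals = math.ceil(window_seconds / interval_seconds)
    return burst * intervals


messages = nominal_message_ceiling(burst=1000, interval_seconds=30,
                                   window_seconds=600)
avg_message_bytes = 200  # assumed average size, not a measured value
print(messages)                      # 20000
print(messages * avg_message_bytes)  # 4000000
```

That is roughly 4 MB per noisy service per ten minutes under the stated assumption, before any downstream copies are counted.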

For smaller hosts, lower limits may be appropriate:

SystemMaxUse=512M
RuntimeMaxUse=128M
MaxRetentionSec=3day

Other controls:

turn off production debug logs by default
suppress health-check access logs
avoid logging request and response bodies by default
sample noisy logs
set retention by service class
alert on log-rate anomalies
avoid down && up restart cycles (for example with Compose) where up -d alone is enough
do not leave journalctl -f, podman logs -f, or kubectl logs -f running unnecessarily
separate metrics from logs
use trace IDs instead of repeating full context everywhere
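
Sampling noisy logs can also be done inside the application before a line ever reaches stdout. Below is a minimal sketch using Python's standard logging.Filter. The class name, the default burst and interval, and the choice to key on the unformatted message template are all assumptions made for the example.

```python
import logging
import time


class BurstLimitFilter(logging.Filter):
    """Allow at most `burst` records per `interval` seconds per message template."""

    def __init__(self, burst=5, interval=30.0, clock=time.monotonic):
        super().__init__()
        self.burst = burst
        self.interval = interval
        self.clock = clock
        self._windows = {}  # key -> (window_start, count_in_window)

    def filter(self, record):
        # Key on the unformatted template so "request %s failed" counts as
        # one message regardless of its arguments.
        key = (record.name, record.levelno, str(record.msg))
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.interval:
            start, count = now, 0  # a new window begins
        count += 1
        self._windows[key] = (start, count)
        return count <= self.burst


# Usage: attach to a handler so repeated failures are capped at the source.
handler = logging.StreamHandler()
handler.addFilter(BurstLimitFilter(burst=5, interval=30.0))
```

The filter drops excess records silently and its table grows with the number of distinct templates, so a production version would probably also count suppressed records and evict old keys.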

A small audit script

The accompanying logamp-audit.sh script checks common signals:

journal disk usage
recent journal bytes/sec and lines/sec
noisy units/processes
/var/log, /var/log/journal, Kubernetes log directories, and Docker log directories
Podman and Docker container recent log volume
logging agent CPU/process presence
journald size/rate-limit configuration
optional source/config scan for risky logging patterns such as prompts, completions, request bodies, debug logs, and secrets

Example:

./logamp-audit.sh --since "10 minutes ago"
sudo ./logamp-audit.sh --scan-path /srv:/opt

The script is not a proof that a system is safe or unsafe. It is an inexpensive way to find where logs may be turning from observability into workload.

Production rule of thumb

Ask this of every service:

If this service fails in a tight loop for ten minutes, will logging become the biggest workload?

If the answer is “yes” or “unknown,” the system needs log budgets, rate limits, sampling, and redaction before it is truly production-hardened.
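
One way to move from “unknown” to a number is a back-of-the-envelope estimate. The sketch below multiplies failure rate, log records per failure, and average record size over the ten-minute window. The inputs are illustrative assumptions: one failure per second and six records per failure echo the CI runner example earlier, and the 300-byte record size is invented for the arithmetic.

```python
def failure_loop_log_bytes(failures_per_second, records_per_failure,
                           avg_record_bytes, duration_seconds=600):
    """Estimate total log bytes written during a sustained failure loop."""
    return (failures_per_second * records_per_failure
            * avg_record_bytes * duration_seconds)


estimate = failure_loop_log_bytes(failures_per_second=1,
                                  records_per_failure=6,
                                  avg_record_bytes=300)
print(estimate)  # 1080000
```

Comparing that estimate with the bytes the service normally handles in the same ten minutes gives a first answer to the question above. Measured values from the host should replace the assumed inputs as soon as they are available.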

References

systemd journald configuration supports storage and rate limit controls such as SystemMaxUse, RuntimeMaxUse, RateLimitIntervalSec, and RateLimitBurst.
Kubernetes node logging redirects container stdout/stderr through the container runtime and standardizes runtime integration through the CRI log format.
OpenTelemetry GenAI semantic conventions include token-related metrics, which are useful for observing model usage without always logging full text payloads.