Log Amplification: When Observability Becomes the Workload
June 10, 2026
Modern production systems often treat logs as nearly free. They are not. On small virtual hosts, container platforms, Kubernetes nodes, and machine learning gateways, logging can become a meaningful workload that consumes CPU, disk I/O, storage, and network egress, and can even add latency. The failure mode is subtle because the application may be doing very little useful work while the host is busy recording the fact that it is failing.
I call this log amplification.
A small event like one failed HTTP request can produce logs at multiple layers:
application log
-> container stdout/stderr
-> container runtime log driver
-> journald or node log file
-> reverse proxy access/error log
-> service manager logs
-> collector/forwarder logs
-> centralized logging storage
-> alerts, traces, dashboards, and terminal tails
In a healthy system this may be tolerable. In a failure loop, it can become the largest workload on the machine.
A concrete example
Suppose a CI runner polls an API endpoint while the backend Rails/Puma process is down. One request can produce a runner log, reverse proxy access log, workhorse/backend error, container log entry, journald entry, and possibly a forwarded log record. If it happens every second, the system is no longer simply “down”; it is also generating a steady stream of diagnostic traffic.
That diagnostic stream can be larger than the original application traffic.
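To get a feel for the scale, here is a back-of-the-envelope sketch in Python. The layer names follow the example above. The per-line byte sizes and the function name are made-up placeholders for illustration, not measurements from a real system, so substitute your own numbers.

```python
from typing import Mapping

# Hypothetical bytes written per failed request at each layer.
# These are illustrative placeholders, not measurements.
ASSUMED_BYTES_PER_LAYER = {
    "runner log": 200,
    "reverse proxy access log": 250,
    "workhorse/backend error": 400,
    "container log entry": 450,
    "journald entry": 600,
    "forwarded log record": 700,
}


def failure_loop_log_bytes(
    requests_per_second: float,
    duration_seconds: float,
    bytes_per_layer: Mapping[str, int],
) -> int:
    """Total log bytes produced while one failing request repeats at a fixed rate."""
    bytes_per_request = sum(bytes_per_layer.values())
    return int(requests_per_second * duration_seconds * bytes_per_request)


# One failing poll per second for ten minutes.
total = failure_loop_log_bytes(1, 600, ASSUMED_BYTES_PER_LAYER)
print(f"{total / 1_000_000:.2f} MB of log data from a service doing no useful work")
```

The absolute number matters less than the comparison: set the per-request log bytes against the size of the failed request itself, then multiply by the number of pollers.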
Why this matters on virtual hosts
Small VMs often have enough CPU and RAM for the application, but weak virtual disk performance. Journaling, container log writes, fsyncs, compression, indexing, rotation, and log shipping can dominate the host during failure conditions.
Symptoms include:
- high CPU from systemd-journald, conmon, rsyslog, fluent-bit, vector, otelcol, or filebeat
- high disk writes despite low useful traffic
- large /var/log/journal, /var/log/containers, or /var/log/pods
- slow SSH sessions or slow container startup
- reverse proxy logs filled with repeated 502, 404, health-check, or auth redirect lines
- logging agents using more CPU than the service they monitor
Kubernetes makes this more visible because container stdout/stderr is collected through the node logging pipeline. Kubernetes documents that container runtimes redirect application stdout/stderr and that kubelet/container runtimes handle those logs using the CRI logging format. That is useful, but it also means every application log line enters a broader host-level log system.
Why this matters even more for ML systems
LLM and machine learning systems add another dimension. “Logging” may include:
- prompts
- completions
- tool calls
- retrieved documents
- token counts
- model names
- latency and cost metadata
- safety annotations
- traces and spans
- request and response bodies
Logging full prompts and completions by default is risky for performance, cost, and privacy. OpenTelemetry’s GenAI conventions include token-related metrics, which points toward a better pattern: measure token usage without necessarily storing full text payloads.
A safer default is:
Log metadata by default:
request id
tenant or service id
model
route
latency
input token count
output token count
status
cache hit/miss
cost estimate
Do not log full prompts/completions by default.
Full payload logging should be temporary, sampled, redacted, access-controlled, and short-retention.
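One way to make the metadata-only default hard to bypass is to give the logging helper no parameter for payload text at all. The following is a minimal Python sketch under that assumption. The logger name, function names, and field names are hypothetical and simply mirror the list above; they do not come from any particular gateway or library.

```python
import json
import logging
import uuid
from typing import Optional

logger = logging.getLogger("llm_gateway")  # hypothetical logger name


def build_request_record(
    *,
    tenant: str,
    model: str,
    route: str,
    latency_ms: float,
    input_tokens: int,
    output_tokens: int,
    status: str,
    cache_hit: bool,
    cost_estimate_usd: float,
    request_id: Optional[str] = None,
) -> dict:
    """Build a metadata-only record.

    There is deliberately no parameter for prompt or completion text,
    so passing one raises TypeError instead of silently logging it.
    """
    return {
        "request_id": request_id or str(uuid.uuid4()),
        "tenant": tenant,
        "model": model,
        "route": route,
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "status": status,
        "cache_hit": cache_hit,
        "cost_estimate_usd": cost_estimate_usd,
    }


def log_request(**fields) -> str:
    """Emit one JSON line per request and return the line that was logged."""
    line = json.dumps(build_request_record(**fields), sort_keys=True)
    logger.info(line)
    return line
```

Temporary full-payload logging would then be a separate, explicitly enabled code path with its own sampling, redaction, and retention, rather than an extra field on this record.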
The diagnostic model
Treat logs as a workload. Measure:
log bytes/sec
log lines/sec
log storage growth per day
logging-agent CPU
logging-agent disk I/O
log egress volume
Then estimate:
log amplification ratio = stored or shipped log bytes / application-emitted log bytes
A ratio of 2x may be normal. A ratio of 10x is worth understanding. A ratio of 100x during a failure loop is an operational bug.
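The ratio and the rough bands above can be written down as a small Python sketch. The function names are hypothetical, and the exact cut-offs between bands are my reading of the 2x, 10x, and 100x guidance rather than fixed thresholds.

```python
def log_amplification_ratio(stored_or_shipped_bytes: int, app_emitted_bytes: int) -> float:
    """Stored or shipped log bytes divided by application-emitted log bytes."""
    if app_emitted_bytes <= 0:
        raise ValueError("application-emitted bytes must be positive")
    return stored_or_shipped_bytes / app_emitted_bytes


def classify_ratio(ratio: float) -> str:
    """Map a ratio onto the rough bands from the text (cut-offs are assumptions)."""
    if ratio >= 100:
        return "operational bug"
    if ratio >= 10:
        return "worth understanding"
    return "may be normal"


# Example: the application emitted 5 MB, but 60 MB were stored or shipped.
ratio = log_amplification_ratio(60_000_000, 5_000_000)
print(f"{ratio:.0f}x: {classify_ratio(ratio)}")
```

Both byte counts should cover the same time window, otherwise the ratio mostly reflects the mismatch between the windows.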
Quick Linux checks
sudo journalctl --disk-usage
sudo du -sh /var/log /var/log/journal 2>/dev/null
sudo iotop -aoP
pidstat -d 1
pidstat -u 1
For recent journal volume:
journalctl --since "10 minutes ago" --no-pager | wc -c
journalctl --since "10 minutes ago" --no-pager | wc -l
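To turn those two counts into the bytes/sec and lines/sec figures from the diagnostic model, a short Python sketch is enough. The helper names are hypothetical. The journalctl invocation is the same one shown above, and it may need sudo or journal group membership to see system-wide entries.

```python
import subprocess
from typing import Tuple


def log_rates(raw: bytes, window_seconds: float) -> Tuple[float, float]:
    """Return (bytes per second, lines per second) for a captured log window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return len(raw) / window_seconds, raw.count(b"\n") / window_seconds


def read_recent_journal(minutes: int = 10) -> bytes:
    """Capture recent journal output using the same command as the shell check."""
    result = subprocess.run(
        ["journalctl", "--since", f"{minutes} minutes ago", "--no-pager"],
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout
```

A typical use would be `log_rates(read_recent_journal(10), 600)`, sampled once while the system is healthy and once during an incident, so the two rates can be compared.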
For rootless Podman, run the commands as the user that owns the containers (a user named containers in this example):
sudo -iu containers podman ps
sudo -iu containers podman logs --since 10m <container> | wc -c
For Kubernetes nodes:
sudo du -sh /var/log/containers /var/log/pods 2>/dev/null
sudo find /var/log/containers -type f -printf '%s %p\n' | sort -nr | head
Baseline controls
For journald, configure explicit retention and rate limits:
# /etc/systemd/journald.conf.d/limits.conf
[Journal]
SystemMaxUse=1G
RuntimeMaxUse=256M
MaxRetentionSec=7day
RateLimitIntervalSec=30s
RateLimitBurst=1000
Then restart journald:
sudo systemctl restart systemd-journald
For smaller hosts, lower limits may be appropriate:
SystemMaxUse=512M
RuntimeMaxUse=128M
MaxRetentionSec=3day
Other controls:
- turn off production debug logs by default
- suppress health-check access logs
- avoid logging request and response bodies by default
- sample noisy logs
- set retention by service class
- alert on log-rate anomalies
- avoid down && up boot loops where up -d is enough
- do not leave journalctl -f, podman logs -f, or kubectl logs -f running unnecessarily
- separate metrics from logs
- use trace IDs instead of repeating full context everywhere
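Several of these controls can also be applied inside the application, before a line ever reaches stdout or journald. As one illustration of sampling noisy logs, here is a minimal Python sketch of a logging filter that lets a burst of identical messages through and drops the rest until the interval ends. The class name and its defaults are hypothetical. It is loosely modeled on journald's burst-per-interval idea and is not a drop-in equivalent.

```python
import logging
import time


class RepeatLimitFilter(logging.Filter):
    """Allow at most `burst` records per message template per `interval` seconds."""

    def __init__(self, burst=5, interval=30.0, clock=time.monotonic):
        super().__init__()
        self.burst = burst
        self.interval = interval
        self.clock = clock  # injectable so the behaviour can be tested
        self._windows = {}  # key -> (window start time, records seen in window)

    def filter(self, record):
        # Key on the unformatted template so "backend down: %s" counts as one
        # message regardless of its arguments.
        key = (record.name, record.levelno, str(record.msg))
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.interval:
            start, count = now, 0  # the interval elapsed, so open a new window
        count += 1
        self._windows[key] = (start, count)
        return count <= self.burst


# Attach to a handler so every record passing through it is rate limited.
handler = logging.StreamHandler()
handler.addFilter(RepeatLimitFilter(burst=5, interval=30.0))
```

This sketch keeps one entry per distinct message template and never evicts them, which is fine for a handful of noisy call sites but would need a size bound in a long-running service with many templates.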
A small audit script
The accompanying logamp-audit.sh script checks common signals:
- journal disk usage
- recent journal bytes/sec and lines/sec
- noisy units/processes
- /var/log, /var/log/journal, Kubernetes log directories, and Docker log directories
- Podman and Docker container recent log volume
- logging agent CPU/process presence
- journald size/rate-limit configuration
- optional source/config scan for risky logging patterns such as prompts, completions, request bodies, debug logs, and secrets
Example:
./logamp-audit.sh --since "10 minutes ago"
sudo ./logamp-audit.sh --scan-path /srv:/opt
The script does not prove that a system is safe or unsafe. It is an inexpensive way to find where logs may be turning from observability into workload.
Production rule of thumb
Ask this of every service:
If this service fails in a tight loop for ten minutes, will logging become the biggest workload?
If the answer is “yes” or “unknown,” the system needs log budgets, rate limits, sampling, and redaction before it is truly production-hardened.
References
- systemd journald configuration supports storage and rate limit controls such as SystemMaxUse, RuntimeMaxUse, RateLimitIntervalSec, and RateLimitBurst.
- Kubernetes node logging redirects container stdout/stderr through the container runtime and standardizes runtime integration through the CRI log format.
- OpenTelemetry GenAI semantic conventions include token-related metrics, which are useful for observing model usage without always logging full text payloads.