OpenTelemetry eBPF Instrumentation: Unbounded BPF internal metrics replay can exhaust CPU
Summary
OBI replays BPF probe hits into histogram observations by looping once per recorded run count. On busy systems, the run-count delta can become very large, causing the metrics exporter to spend excessive CPU time in a tight loop every collection interval.
Details
The vulnerable loop is in pkg/export/prom/prom_bpf.go. During each metrics tick, OBI iterates over probeMetrics and runs a loop of the form for range metric.count, invoking BpfProbeLatency(...) once for each recorded hit.
The count comes from calculateStats() in the same file, where deltaCount := bp.runCount - bp.prevRunCount is calculated and returned without any cap before the per-hit replay loop.
If probe activity spikes between scrape intervals, deltaCount can be very large. The exporter then spends CPU time proportional to the number of probe hits rather than the number of metric series.
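The scaling described above can be illustrated with a minimal, self-contained sketch. It is not the OBI source: probeStat, histogram, replayUnbounded, and replayCapped are hypothetical names chosen for illustration, and the capped variant is only one possible way to bound the work (it drops observations beyond the cap), not necessarily the fix the project adopted.

```go
package main

import "fmt"

// probeStat is a hypothetical stand-in for the per-probe counters the
// advisory describes; the real OBI types may differ.
type probeStat struct {
	runCount     uint64  // cumulative hits reported for the probe
	prevRunCount uint64  // value seen at the previous metrics tick
	avgLatency   float64 // average latency for the interval, in seconds
}

// histogram is a toy recorder that counts how many observations it receives.
type histogram struct {
	observations uint64
	sum          float64
}

// Observe records a single value, as a histogram metric would.
func (h *histogram) Observe(v float64) {
	h.observations++
	h.sum += v
}

// delta returns the uncapped number of hits since the previous tick.
func delta(p probeStat) uint64 {
	return p.runCount - p.prevRunCount
}

// replayUnbounded models the pattern described above: one Observe call per
// recorded hit. It returns the number of loop iterations performed.
func replayUnbounded(h *histogram, p probeStat) uint64 {
	n := delta(p)
	for i := uint64(0); i < n; i++ {
		h.Observe(p.avgLatency)
	}
	return n
}

// replayCapped is an assumed mitigation: it bounds the per-tick work at
// maxReplay iterations, at the cost of undercounting observations.
func replayCapped(h *histogram, p probeStat, maxReplay uint64) uint64 {
	n := delta(p)
	if n > maxReplay {
		n = maxReplay
	}
	for i := uint64(0); i < n; i++ {
		h.Observe(p.avgLatency)
	}
	return n
}

func main() {
	// A burst of two million hits between two ticks on a single probe.
	p := probeStat{runCount: 2000000, prevRunCount: 0, avgLatency: 0.0001}

	var unbounded, capped histogram
	fmt.Println("unbounded iterations:", replayUnbounded(&unbounded, p))
	fmt.Println("capped iterations:", replayCapped(&capped, p, 1000))
}
```

In the unbounded variant the iteration count equals the run-count delta, so the per-tick cost tracks probe activity. In the capped variant it is bounded by a constant per metric series regardless of how busy the probe was.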
PoC
Local testing with a small reproducer confirmed the replay-loop behavior and showed CPU scaling with the recorded hit count rather than the number of metric series.
Use a vulnerable build and enable internal metrics export:
git checkout v0.0.0-rc.1+build
make build
export OTEL_EBPF_INTERNAL_METRICS_PROMETHEUS_PORT=9090
sudo ./bin/obi
Create a high-rate workload that repeatedly exercises traced probes. For example, generate HTTP traffic against an instrumented service:
python3 -m http.server 18081
Then drive it:
seq 1 500000 | xargs -P 128 -I{} curl -s http://127.0.0.1:18081 >/dev/null
At the same time, scrape metrics repeatedly:
while true; do curl -s http://127.0.0.1:9090/metrics >/dev/null; done
On a vulnerable build, OBI CPU consumption rises sharply during the metrics loop because histogram updates are replayed once per counted probe execution. The effect is visible in top or pidstat and is most pronounced under sustained high request volume.
Impact
This is an availability issue in the internal metrics path. Any deployment that enables BPF internal metrics and traces busy workloads is affected. An attacker does not need access to the agent itself: driving enough activity through instrumented services is sufficient to consume CPU in the privileged agent.