GPU Incident Response in 60 Seconds: An SRE's Guide to eBPF-Based GPU Observability
TL;DR
You get paged at 3 a.m.: the GPU training pipeline missed its SLA. Datadog shows 95% GPU utilization. nvidia-smi agrees. Everything looks green, yet the job is running 3x slower than expected, and you have zero tools to diagnose why. Ingero gives you the causal chain in 60 seconds: monitoring agents were contending with the DataLoader workers for host CPU, starving the GPU of data. You fix it with `taskset` and go back to sleep, without waking the ML engineer.

Your PagerDuty fires:

```
[CRITICAL] GPU Training Pipeline SLA Breached
Cluster:  prod-gpu-01 (8x H100)
Job:      nightly-retraining-v3
Expected completion: 02:00 UTC
Current status:      47% complete at 03:12 UTC
```
You open your monitoring stack:

```
Datadog GPU Dashboard:
  GPU Utilization:  95%    ✅
  GPU Memory:       78%    ✅
  GPU Temperature:  72°C   ✅
  Power Draw:       680W   ✅

Grafana (DCGM Exporter):
  dcgm_gpu_utilization: 0.95     ✅
  dcgm_fb_used:         62GB     ✅
  dcgm_sm_clock:        1980MHz  ✅
```

nvidia-smi:

```
+---------------------------------------------------+
| GPU  Name      | GPU-Util | Memory-Usage          |
|================+==========+=======================|
|  0   H100 SXM  |   97%    | 62000MiB / 80GB       |
+---------------------------------------------------+
```
Every single dashboard says the GPU is fine. You have a breached SLA and zero signal to work with.

This is where most GPU incidents stall. The SRE has no tool that sees below the GPU utilization counter. The options are:

1. Wake the ML engineer (who'll spend 2 hours adding print statements)
2. Restart the job and hope it goes faster (it won't)
3. Stare at dashboards that all say green

Every GPU monitoring tool in your stack (Datadog, Grafana, DCGM, nvidia-smi) reports the same underlying metric: "did the GPU have at least one kernel scheduled?" That metric is useless for diagnosis. It's like monitoring a restaurant by checking "is someone sitting at each table?" without knowing if anyone is eating. The kitchen (the GPU's compute cores) could be idle 80% of the time between courses, and your dashboard would still say "97% utilized."

The real problems that cause GPU SLA breaches are host-side:

- CPU scheduling contention starving the data pipeline
- DataLoader workers preempted by monitoring agents (ironic)
- Memory pressure causing page faults in the data loading path
- Disk I/O bottlenecks blocking the next training batch
- Network retransmits stalling distributed training

These are all Linux kernel events. DCGM and nvidia-smi have zero visibility into them. Your GPU dashboards are structurally blind to the most common causes of GPU performance degradation.

Ingero is an eBPF-based observability agent that traces both sides: CUDA APIs (what the GPU is doing) and host kernel events (what the CPU scheduler, memory, and I/O subsystems are doing). It builds causal chains connecting host events to GPU latency, deploys as a K8s DaemonSet, and runs continuously with <2% overhead. No code changes, no NVIDIA SDK, no CUPTI.

Here's what incident response looks like:

```
$ ingero explain --since 1h
```
```
System Context:
  CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] CPU scheduling contention → CUDA throughput drop
  Root:   14,504 context switches on training process (PID 3821)
          Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
          CUDA op throughput dropped 47% from peak
  Contributing: 4 DataLoader workers + prometheus-node-exporter
                + fluent-bit competing for 8 cores
  Fix:    pin training to dedicated cores: taskset -c 0-5 python3 train.py
          set DataLoader persistent_workers=True
          nice -n 19 monitoring agents
```
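Ingero derives its off-CPU numbers from scheduler tracepoints, but you can get a coarse first approximation of CPU contention from `/proc/<pid>/schedstat`, which the kernel exposes on any Linux box. A minimal sketch (the sample values are made up to mirror the incident above; runqueue-wait time is not identical to off-CPU time, which also includes voluntary sleeps):

```python
def runqueue_wait_share(schedstat_line: str) -> float:
    """Fraction of scheduler-tracked time a process spent runnable
    but waiting for a CPU, rather than actually running.

    /proc/<pid>/schedstat has three cumulative fields:
      on-CPU time (ns), runqueue wait time (ns), timeslice count.
    """
    on_cpu_ns, wait_ns, _timeslices = (int(f) for f in schedstat_line.split())
    total = on_cpu_ns + wait_ns
    return wait_ns / total if total else 0.0

# Synthetic numbers shaped like the incident: 58s on-CPU, 62s waiting.
sample = "58000000000 62000000000 14504"
print(f"{runqueue_wait_share(sample):.1%}")  # → 51.7%
```

A process showing a wait share anywhere near that on a "healthy" node is a strong hint that something else on the box is stealing its cores.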
There it is. The training process was off-CPU 51.7% of the time. The GPU was waiting for data, not computing. Your monitoring agents (Prometheus node exporter, Fluent Bit) were stealing CPU from the training pipeline. nvidia-smi said 97% because kernels were queued, but the pipeline was running at half speed.

```
$ ingero explain --per-process --since 1h
```
```
Process Breakdown (last 1 hour):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
python3 (train.py) PID 3821:
  cudaStreamSync | 12,403 calls | p50=1.2ms | p99=7.2ms
  cudaMalloc     |    206 calls | p50=65µs  | p99=2.1ms
  cuLaunchKernel | 17,509 calls | p50=12µs  | p99=890µs
  ⚠ Off-CPU: 62.0s / 120s (51.7%)
  ⚠ Context switches: 14,504

pt_data_worker:0 PID 3822:
  ⚠ Off-CPU: 31.4s / 120s (26.2%)
  ⚠ Worst stall: 609ms

prometheus-node-exporter PID 1205:
  ⚠ Context switches: 3,201
  ⚠ CPU stolen: 8.7s
```
The training process and all 4 DataLoader workers are fighting for CPU with your monitoring stack. The worst single scheduling stall is 609ms — over half a second where a data worker was frozen while the GPU sat idle.
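The p50/p99 figures in that breakdown are just percentiles over the raw per-call durations. A minimal sketch of the computation using Python's standard `statistics` module, with synthetic latency samples (the real agent aggregates these in-kernel, not in Python):

```python
import statistics

def latency_percentiles(samples_us):
    """Return (p50, p99) from a list of per-call latencies in microseconds."""
    # quantiles(n=100) yields the 1st..99th percentile cut points.
    q = statistics.quantiles(samples_us, n=100)
    return q[49], q[98]  # p50, p99

# Synthetic cudaStreamSync latencies: mostly fast, plus a contended tail.
samples = [7.0] * 990 + [7200.0] * 10
p50, p99 = latency_percentiles(samples)
print(f"p50={p50:.0f}µs p99={p99:.0f}µs")
```

The shape is the point: a healthy median with a p99 three orders of magnitude worse is the classic signature of intermittent scheduling stalls, and it is invisible in any average.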
```
# Pin training to dedicated cores (leave 2 for monitoring + OS)
$ kubectl exec -it gpu-training-pod -- taskset -c 0-5 python3 train.py

# Or: deprioritize monitoring agents
$ kubectl exec -it monitoring-pod -- nice -n 19 prometheus-node-exporter
```

Or better yet, add to your DaemonSet config:

```yaml
# training pod
resources:
  limits:
    cpu: "6"
  requests:
    cpu: "6"
```
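`taskset` is a thin wrapper over the Linux `sched_setaffinity` syscall, and Python exposes the same call, so you can also pin from inside the training script before the DataLoader workers are forked. A minimal Linux-only sketch (the core IDs are illustrative, not a recommendation for your hardware):

```python
import os

def pin_to_cores(cores):
    """Restrict this process (and children forked afterwards,
    e.g. DataLoader workers) to the given CPU cores."""
    os.sched_setaffinity(0, set(cores))  # 0 = the current process
    return os.sched_getaffinity(0)

# Equivalent of `taskset -c 0-5`: leave the remaining cores for
# monitoring agents and the OS. Intersect with what's actually
# available so the sketch also runs on smaller machines.
available = os.sched_getaffinity(0)
wanted = (set(range(6)) & available) or available
print(pin_to_cores(wanted))
```

Doing this in-process has the advantage that forked workers inherit the mask, so one call covers the whole data pipeline.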
After the fix:

```
Context switches on training: 14,504 → 890
cudaStreamSync p99:           7.2ms → 45µs
Pipeline throughput:          restored to expected rate
```

SLA: back on track. ML engineer: still sleeping.

Ingero includes an MCP (Model Context Protocol) server that lets AI assistants investigate GPU incidents. If your team uses Claude, Cursor, or any MCP-compatible tool, the AI can query Ingero directly, turning a two-hour debugging session into a 30-second conversation.

Ingero deploys like any other observability agent in your K8s stack:
```
# Helm install (DaemonSet + RBAC)
helm install ingero ./deploy/helm/ingero \
  --set prometheus.enabled=true \
  --set otlp.enabled=true

# Or standalone
sudo ./bin/ingero trace --stack --prometheus :9090
```
What you get:

- DaemonSet: runs on every GPU node automatically
- Prometheus /metrics: GPU latency percentiles and causal chain counts; plug into your existing Grafana
- OTLP export: send traces to your existing backend (Jaeger, Tempo, Honeycomb)
- MCP server: AI-assisted investigation via Claude, Cursor, etc.
- SQLite local storage: 10GB rolling window that auto-prunes old events; no external database needed
- Pod metadata: enriches events with K8s pod name, namespace, and container ID

It slots into your existing monitoring stack. No rip-and-replace.
| Signal | DCGM / nvidia-smi | Ingero |
|---|---|---|
| GPU utilization % | Yes (misleading) | Yes (with causal context) |
| Per-CUDA-call latency | No | Yes (p50/p95/p99 for every API call) |
| CPU scheduling delays | No | Yes (sched_switch tracepoints) |
| DataLoader worker stalls | No | Yes (per-process off-CPU time) |
| Memory pressure → GPU impact | No | Yes (mm_page_alloc + CUDA correlation) |
| Disk I/O → GPU stalls | No | Yes (block_rq + CUDA correlation) |
| Network → distributed training | No | Yes (tcp_retransmit + CUDA correlation) |
| Root cause chain | No | Yes (automated causal chains with fix recommendations) |
| Python source line attribution | No | Yes (CPython frame extraction with --stack) |
For SREs managing GPU infrastructure, Ingero answers three questions:

- Incident response: "Why is the GPU slow right now?" → causal chain in 60 seconds
- Capacity planning: "Are we actually using these GPUs efficiently?" → real compute efficiency, not nvidia-smi lies
- Cost attribution: "Which team's workload is causing contention?" → per-process, per-namespace breakdown

You don't need to understand CUDA or ML model architectures. Ingero translates kernel-level GPU events into actionable SRE language: root cause, impact, fix.

No GPU required to see the pattern:

```
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident         # See a causal chain form in real time
./bin/ingero demo cpu-contention   # CPU scheduling causing GPU stalls
```
For real GPU tracing:

```
sudo ./bin/ingero check            # Verify system compatibility
sudo ./bin/ingero trace --stack    # Start tracing (runs continuously)
./bin/ingero explain --since 5min  # See causal chains
```
Ingero is open-source (Apache 2.0), deploys as a K8s DaemonSet, and traces CUDA APIs via standard Linux kernel uprobes. No NVIDIA SDK, no code changes, <2% overhead. Production-safe by design. GitHub: github.com/ingero-io/ingero