Your GPU Is 97% Utilized But Your Training Is 3x Slower Than Expected
TL;DR
Your GPU shows 97% utilization in nvidia-smi, but training throughput is a fraction of what benchmarks promise. The GPU isn't computing; it's waiting. Data-loading workers are starving the training loop because CPU contention, I/O bottlenecks, or scheduling delays prevent data from arriving fast enough. Ingero traces the full host-to-GPU pipeline to show you exactly where the bubble is.

You've got an H100 costing $3.50/hour. PyTorch Lightning says you're processing 200 samples/sec, but the model card says the same architecture should hit 600 samples/sec on this hardware. You open nvidia-smi:

```
+------------------+----------+---------------+
| GPU Name         | GPU-Util | Memory-Usage  |
|==================+==========+===============|
| 0 H100 SXM       | 97%      | 62000MiB/80GB |
+------------------+----------+---------------+
```
97% utilization. The GPU must be working hard, right?

Wrong. That number means "the GPU had at least one kernel running 97% of the time." It doesn't distinguish between:

- A massive matrix multiply that saturates all SMs
- A tiny 1ms kernel followed by 32ms of idle waiting for the next batch
- cudaMemcpy transferring data while compute cores sit idle

Your GPU is "utilized" the way a restaurant is "full" when one person sits at every table but nobody is eating. The kitchen (your compute cores) is idle.

This isn't hypothetical. A 100-GPU H100 cluster at 60% effective utilization (despite nvidia-smi reporting 95%+) wastes $1.4 million per year in capital alone. Add electricity, cooling, and engineering time spent debugging performance, and the number climbs past $2.5M. 75% of organizations report GPU utilization below 70% at peak. The gap between what nvidia-smi reports and actual compute efficiency is where millions of dollars disappear.

nvidia-smi samples GPU state once per second and reports a binary: "was a kernel running?" It has zero visibility into:

- Pipeline bubbles: GPU idle between kernel launches while waiting for data
- CPU scheduling delays: your DataLoader workers got preempted by other processes
- Host memory pressure: page faults in the data-loading path stalling cudaMemcpy
- I/O bottlenecks: disk or NFS reads blocking the next batch
- Scheduling storms: 10,000+ context switches per second on the training process

These are all host-side problems causing GPU-side underutilization, and nvidia-smi only sees the GPU side. Ingero traces both sides, CUDA APIs (what the GPU is doing) and host kernel events (what the CPU is doing), then builds causal chains connecting them.

```
$ ingero explain --since 120s

System Context: CPU: 94.2% | Memory: 78.1% | Load: 12.3 (8 cores) | Swap: 0 MB

Causal Chains (last 2 min):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[HIGH] CPU scheduling contention → CUDA throughput drop
  Root:   14,504 context switches on training process (PID 3821)
          Process off-CPU 62 of 120 seconds (51.7% of wall clock)
  Effect: cudaStreamSync p99 inflated 1,028x (7µs → 7.2ms)
          CUDA op throughput dropped 47% from peak (1,200 → 640 ops/sec)
  Contributing: 4 DataLoader workers + 3 background processes competing for 8 cores
  Fix:    pin training to dedicated cores: taskset -c 0-3 python3 train.py
          set DataLoader persistent_workers=True
          nice -n 19 background jobs
```
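The "busy but starved" pattern is easy to quantify with a toy duty-cycle model: a GPU that has *some* kernel in flight during most of nvidia-smi's 1 Hz samples can still spend the majority of each step waiting. The per-step numbers below are hypothetical, chosen only to mirror the scenario above:

```python
# Toy step model (hypothetical numbers): each training step runs a 10ms
# compute burst, then the loop blocks ~23ms waiting for the next batch.
compute_ms_per_step = 10.0
data_wait_ms_per_step = 23.0
step_ms = compute_ms_per_step + data_wait_ms_per_step

# Effective compute duty cycle: fraction of wall time doing real work.
duty_cycle = compute_ms_per_step / step_ms
print(f"effective compute: {duty_cycle:.0%}")  # ~30%, even though nvidia-smi
# can still show ~97% "utilization" if any kernel (or a memcpy) happens to
# be in flight during most of its once-per-second samples.

# Throughput follows the duty cycle, not the utilization counter:
peak_samples_per_sec = 600
print(f"observed: ~{peak_samples_per_sec * duty_cycle:.0f} samples/sec")
```

The point of the sketch: throughput tracks the compute duty cycle, while the utilization counter tracks only "was anything running," so the two can diverge by 3x or more.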
There it is. The training process was off-CPU for 51.7% of the wall clock. The GPU was waiting, not computing. nvidia-smi saw kernels queued and reported "97% utilized," but actual compute throughput was half of what it should be.

Using Ingero's MCP server:

Engineer: "Which processes caused the most scheduling contention in the last 2 minutes?"

```sql
SELECT pn.name as process,
       COUNT(*) as context_switches,
       SUM(duration_ns)/1e9 as total_off_cpu_sec,
       MAX(duration_ns)/1e6 as worst_stall_ms
FROM events e
JOIN process_names pn ON e.pid = pn.pid
WHERE op = 'sched_switch'
  AND timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY pn.name
ORDER BY total_off_cpu_sec DESC
LIMIT 10;
```
```
process              | switches | off_cpu_sec | worst_stall_ms
---------------------|----------|-------------|----------------
python3 (train.py)   | 14,504   | 62.0        | 790.3
pt_data_worker:0     | 8,217    | 31.4        | 609.1
pt_data_worker:1     | 7,932    | 29.8        | 642.7
pt_data_worker:2     | 8,104    | 30.1        | 611.3
pt_data_worker:3     | 7,889    | 28.9        | 587.6
prometheus-node-exp  | 3,201    | 8.7         | 45.2
fluent-bit           | 2,890    | 7.1         | 38.9
```
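The aggregation itself is plain SQL, so you can reproduce the shape of this query against a mock table to see what it computes. The schema and numbers below are illustrative, not Ingero's actual storage format:

```python
import sqlite3

# Mock `events` table: one row per sched_switch, duration_ns = time off-CPU.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (pid INTEGER, op TEXT, timestamp INTEGER, duration_ns INTEGER);
CREATE TABLE process_names (pid INTEGER, name TEXT);
INSERT INTO process_names VALUES (3821, 'python3 (train.py)'), (4001, 'pt_data_worker:0');
-- Two off-CPU intervals for the trainer, one for a worker
INSERT INTO events VALUES
  (3821, 'sched_switch', 1000, 500000000),
  (3821, 'sched_switch', 2000, 250000000),
  (4001, 'sched_switch', 1500, 100000000);
""")

rows = conn.execute("""
SELECT pn.name,
       COUNT(*)                 AS context_switches,
       SUM(e.duration_ns) / 1e9 AS total_off_cpu_sec,
       MAX(e.duration_ns) / 1e6 AS worst_stall_ms
FROM events e
JOIN process_names pn ON e.pid = pn.pid
WHERE e.op = 'sched_switch'
GROUP BY pn.name
ORDER BY total_off_cpu_sec DESC
""").fetchall()

for name, switches, off_cpu_s, worst_ms in rows:
    print(f"{name:<20} {switches:>3} switches  {off_cpu_s:.2f}s off-CPU  worst {worst_ms:.0f}ms")
```

Summing `duration_ns` over switch events gives total off-CPU time per process; the max gives the single worst stall, which is exactly the 790ms number that matters most for latency spikes.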
The training process and all four DataLoader workers are fighting for CPU, and the worst single stall is 790ms: nearly a full second where the training loop was frozen while the GPU sat idle. Background monitoring agents (Prometheus node exporter, Fluent Bit) are stealing another 15+ seconds of CPU time.

```sql
SELECT (timestamp / 10000000000) * 10 as window_sec,
       COUNT(CASE WHEN op = 'sched_switch' THEN 1 END) as ctx_switches,
       COUNT(CASE WHEN op = 'cudaStreamSync' THEN 1 END) as sync_calls,
       AVG(CASE WHEN op = 'cudaStreamSync' THEN duration_ns END)/1000 as sync_avg_us
FROM events
WHERE timestamp > (SELECT MAX(timestamp) - 120000000000 FROM events)
GROUP BY window_sec
ORDER BY window_sec;
```
```
window_sec | ctx_switches | sync_calls | sync_avg_us
-----------|--------------|------------|-------------
0          | 342          | 89         | 52     ← baseline
10         | 1,205        | 91         | 180    ← contention starts
20         | 2,847        | 78         | 890    ← throughput drops
30         | 3,102        | 64         | 1,420  ← GPU starving
40         | 2,956        | 61         | 2,100  ← worst period
50         | 1,834        | 72         | 780    ← partial recovery
```
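Per-window stats like these also make a simple automated check possible: flag any window whose context-switch count and sync latency are both far above the baseline window. The thresholds here are illustrative and should be tuned against your own quiet-period baseline:

```python
# (window_sec, ctx_switches, sync_calls, sync_avg_us), from the table above
windows = [
    (0, 342, 89, 52),
    (10, 1205, 91, 180),
    (20, 2847, 78, 890),
    (30, 3102, 64, 1420),
    (40, 2956, 61, 2100),
    (50, 1834, 72, 780),
]

# Treat the first window as the uncontended baseline.
baseline_ctx, baseline_sync_us = windows[0][1], windows[0][3]

def is_contended(ctx, sync_us, ctx_factor=5, sync_factor=10):
    """Contended = switch storm AND inflated sync latency, together."""
    return ctx > ctx_factor * baseline_ctx and sync_us > sync_factor * baseline_sync_us

contended = [w[0] for w in windows if is_contended(w[1], w[3])]
print("contended windows:", contended)  # flags the 20-50s stretch
```

Requiring both signals together avoids false positives: a switch storm with flat sync latency usually means some other process is churning, not that the GPU pipeline is stalling.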
At the 40-second mark, context switches hit 3,000 per 10-second window and cudaStreamSync average latency is 40x baseline. The GPU is making 30% fewer sync calls, not because it's working harder, but because it has nothing to sync on. The pipeline is empty.

With --stack enabled, Ingero captures exactly which Python function was on-CPU when the stall happened:

```
Top cudaStreamSync callers during contention window (t=20-50s):

train.py:142 → cudaStreamSync | 89 calls | avg 1.8ms | max 7.2ms
  ↳ loss.backward()
train.py:145 → cudaStreamSync | 34 calls | avg 2.1ms | max 4.9ms
  ↳ optimizer.step()

Top sched_switch victims:

train.py:138 → DataLoader.next() | preempted 4,201 times
  ↳ waiting for batch from workers
```
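Even without a tracer, you can confirm data starvation from inside the training loop by splitting each step's wall time into "waiting for the batch" and "compute". A framework-agnostic sketch, where the slow iterator and the sleep-based "compute" stand in for starved DataLoader workers and the model step:

```python
import time

def starved_batches(n, load_delay=0.02):
    """Simulates a DataLoader whose workers can't keep up with the loop."""
    for i in range(n):
        time.sleep(load_delay)  # host-side stall: I/O, preemption, ...
        yield i

data_wait = compute = 0.0
t_end = time.perf_counter()
for batch in starved_batches(20):
    t_batch = time.perf_counter()
    data_wait += t_batch - t_end   # time blocked in next(loader)
    time.sleep(0.005)              # stand-in for forward/backward/step
    t_end = time.perf_counter()
    compute += t_end - t_batch

wait_frac = data_wait / (data_wait + compute)
print(f"data wait: {wait_frac:.0%} of step time")  # high => the GPU is starving
```

In a real loop, the `next(loader)` boundary is wherever your `for batch in loader:` iterates; if the wait fraction is large while nvidia-smi reads 97%, you are looking at exactly the pattern in the trace above.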
The training loop at line 138 is blocked waiting for the next batch, and the DataLoader workers themselves are being preempted. The fix is clear:

1. Pin training to dedicated cores: taskset -c 0-3 python3 train.py isolates the trainer from background processes
2. Use persistent workers: DataLoader(persistent_workers=True) eliminates worker respawn overhead
3. Reduce background noise: nice -n 19 for monitoring agents, or move them to separate cgroups
4. Match worker count to available cores: don't set num_workers=8 on an 8-core machine; leave 2 cores for the training loop and OS
5. Check I/O: if workers are blocked on disk reads, pre-cache your dataset to tmpfs or use DataLoader(prefetch_factor=4)
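Fixes 1 and 4 can also be applied from inside the script rather than via taskset. A minimal sketch using only the standard library, with core counts as illustrative assumptions and the DataLoader call left as a comment since it depends on your setup:

```python
import os

def pin_and_size_workers(reserve=2, pin_count=4):
    """Pin this process (and future DataLoader workers, which inherit the
    mask) to `pin_count` cores, and pick a worker count that leaves
    `reserve` cores for the training loop and OS. Sketch only; the
    numbers are illustrative, not a tuned recommendation."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        available = sorted(os.sched_getaffinity(0))
        os.sched_setaffinity(0, set(available[:pin_count]))
        n_cores = len(os.sched_getaffinity(0))
    else:
        n_cores = os.cpu_count() or 1
    return max(1, n_cores - reserve)

num_workers = pin_and_size_workers()
print("num_workers =", num_workers)
# loader = DataLoader(dataset, num_workers=num_workers,
#                     persistent_workers=True, prefetch_factor=4)
```

Doing the pinning in-process has one advantage over taskset: workers forked later inherit the affinity mask, so the whole data pipeline stays inside the reserved cores.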
After applying fixes 1-3, the same training run:

- Context switches on training process: 14,504 → 890
- cudaStreamSync p99: 7.2ms → 45µs (160x improvement)
- Effective throughput: 200 → 540 samples/sec (2.7x)
- nvidia-smi still says 97%, but now it's real utilization

GPU underutilization is a trillion-dollar infrastructure problem hiding behind a misleading metric. Every ML team has hit this wall: training that should take 4 hours takes 12, and nobody can explain why, because all the dashboards say the GPU is "fine."

The problem is almost always on the host side: CPU scheduling, data loading, memory pressure, I/O contention. These are Linux kernel events, and the only way to see them alongside CUDA behavior is to trace both layers simultaneously. That's what Ingero does: eBPF uprobes on the CUDA libraries plus kernel tracepoints on the scheduler, memory subsystem, and I/O stack. No code changes, no SDK integration, <2% overhead. Production-safe.

No GPU needed to see the pattern:

```
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo cpu-contention     # CPU scheduling delays causing GPU stalls
./bin/ingero demo memcpy-bottleneck  # Data transfer dominating wall-clock time
```
For real GPU tracing:

```
sudo ./bin/ingero trace --stack --duration 120s
# ... run your training ...
./bin/ingero explain --since 120s
```
GitHub: github.com/ingero-io/ingero