feat: add -vvv trace mode dashboard for client overhead #334
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces a high-performance, binary event-tracing framework activated via the -vvv CLI flag, complete with a live terminal dashboard, ablation scripts, and comprehensive unit tests. Tracing is integrated across the benchmark executor, HTTP client, worker processes, and load generator. The reviewer's feedback focuses on critical thread-safety improvements to the Dashboard class using a reentrant lock (threading.RLock) to prevent race conditions between the reader and render threads. Additionally, the reviewer suggests key performance optimizations in the hot path, such as using a cached memoryview in _TraceEmitter for zero-copy writes and pre-computing the integer sid in InFlightRequest to avoid repetitive string parsing.
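The thread-safety point in the review can be illustrated with a minimal sketch. The `Dashboard` below is a hypothetical stand-in, not the class from this PR: the method names (`ingest`, `total`, `snapshot`) and the per-kind counter are assumptions made for illustration. It shows why a reentrant lock fits here: the render-side `snapshot()` holds the lock and calls another locked method on the same thread, which would deadlock with a plain `threading.Lock`.

```python
import threading
from collections import Counter


class Dashboard:
    """Hypothetical stand-in for the PR's Dashboard (names are assumptions)."""

    def __init__(self) -> None:
        # RLock: a locked method may call another locked method on the
        # same thread without deadlocking.
        self._lock = threading.RLock()
        self._counts = Counter()

    def ingest(self, kind: int) -> None:
        """Called from the reader thread for each decoded event."""
        with self._lock:
            self._counts[kind] += 1

    def total(self) -> int:
        with self._lock:
            return sum(self._counts.values())

    def snapshot(self) -> dict:
        """Called from the render thread; returns a consistent copy."""
        with self._lock:
            # Re-entrant acquisition: total() takes the same lock again.
            return {"total": self.total(), "by_kind": dict(self._counts)}


def run_demo(n_events: int = 10_000) -> dict:
    dash = Dashboard()
    reader = threading.Thread(
        target=lambda: [dash.ingest(i % 3) for i in range(n_events)]
    )
    reader.start()
    # Snapshots taken while the reader is still writing must be
    # internally consistent because both sides hold the same lock.
    for _ in range(100):
        snap = dash.snapshot()
        assert snap["total"] == sum(snap["by_kind"].values())
    reader.join()
    return dash.snapshot()
```

Calling `run_demo()` returns the final snapshot once the reader thread has finished; every intermediate snapshot is checked for consistency along the way.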
Opt-in (`-vvv`) per-request lifecycle tracing for the benchmark client, rendered as a live `rich` dashboard alongside the run. Off by default with no measurable overhead when off (emit is a no-op binding); the worker hot path stays lock-free and allocation-free.

Pipeline:
- utils/trace.py: lock-free SPSC ring emitter (~190 ns/event) into a per-process 512 MiB anonymous mmap (pages fault in on write). Per-pid POSIX FIFO transport: non-blocking open with bounded retry (raises rather than hanging if the dashboard died), O_NONBLOCK writes that drop on EAGAIN with an adaptive sampler + cumulative self-healing drop counter. bootstrap spawns the dashboard, tracks its Popen, and cleanup() reaps it (kills a wedged one as a backstop). 17-byte <BQQ frames.
- utils/trace_dashboard.py: pure aggregation + render (unit-tested in isolation): lifecycle fold into HDR-histogram stages, heat-graded %E2E table, client/server/backpressure verdict, e2e bar, backpressure cause tree (pickup-ipc heats independently; encode+tcp-acquire fused), per-proc loop-lag panel, and an ESTABLISHED tcp-conn gauge. Cross-process deltas floored at 0.
- scripts/trace_dashboard.py: TUI subprocess: FIFO reader thread, ZMQ SUB to the aggregator PUB (sidecar fallback), and an off-render-thread /proc tcp-conn sampler. Reads to true FIFO EOF with an authoritative final frame.
- Worker / session / agentic-inference emit sites; warmup excluded via a PERF_START reset; PERF_END freezes the lifecycle/verdict/tree.

Perf (B200 bare metal, #328 roofline, 3 reps): trace-off within run-to-run noise of main; -vvv ~5-8% at the sub-ms stub ceiling (worst case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
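The 17-byte `<BQQ` frame and the zero-copy write path mentioned in the review can be sketched as follows. This is an illustrative sketch, not the code in `utils/trace.py`: the field meanings (`kind`, `sid`, `t_ns`), the `FrameWriter` class, and the plain `bytearray` standing in for the mmap ring are all assumptions. Only the `<BQQ` layout and its 17-byte size come from the commit message.

```python
import struct

# Little-endian with no padding: one unsigned byte plus two unsigned
# 64-bit integers, i.e. 1 + 8 + 8 = 17 bytes per frame.
FRAME = struct.Struct("<BQQ")
assert FRAME.size == 17


class FrameWriter:
    """Hypothetical fixed-capacity frame ring (stand-in for the mmap ring)."""

    def __init__(self, capacity_frames: int) -> None:
        self._buf = bytearray(capacity_frames * FRAME.size)
        # Cache the memoryview once so emit() never allocates a new one.
        self._view = memoryview(self._buf)
        self._capacity = capacity_frames
        self._head = 0  # total frames emitted so far

    def emit(self, kind: int, sid: int, t_ns: int) -> None:
        # pack_into writes straight into the preallocated buffer; once the
        # ring is full, the oldest slot is overwritten.
        slot = self._head % self._capacity
        FRAME.pack_into(self._view, slot * FRAME.size, kind, sid, t_ns)
        self._head += 1

    def written(self) -> memoryview:
        """Zero-copy view over the frames that hold data."""
        n = min(self._head, self._capacity)
        return self._view[: n * FRAME.size]


def decode(buf) -> list:
    """Decode a whole buffer of frames in one pass."""
    return list(FRAME.iter_unpack(buf))
```

A short usage example, assuming the field order above:

```python
writer = FrameWriter(capacity_frames=4)
writer.emit(1, 42, 1_000)
writer.emit(2, 42, 2_500)
frames = decode(writer.written())  # list of (kind, sid, t_ns) tuples
```

`struct.Struct.iter_unpack` requires the buffer length to be a multiple of the frame size, which `written()` guarantees by slicing on frame boundaries.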
Summary
Opt-in (`-vvv`) binary trace pipeline: per-request lifecycle latency + per-process asyncio event-loop lag, rendered as a live `rich.Live` dashboard alongside the running benchmark. Off by default; no measurable overhead when off (see Perf impact).

Example (demo data)
What it shows
`ipc_2_worker`/`ipc_2_main` mark the cross-process hops.
tracked_*).

How it's wired
- `utils/trace.py`: lock-free SPSC emitter (~190 ns/event) into a per-process 512 MiB anonymous-mmap ring (pages fault in on demand); per-pid FIFO transport with non-blocking open (bounded retry, raises instead of hanging if the dashboard died) and `O_NONBLOCK` writes: drops on EAGAIN, adaptive sampling sheds load, and a cumulative self-healing drop counter is re-emitted every tick.
- `utils/trace_dashboard.py` (pure aggregation/render, unit-tested in isolation) + `scripts/trace_dashboard.py` (TUI subprocess): reads the FIFO via vectorized `iter_unpack` plus a dedicated ZMQ SUB on the aggregator PUB (sidecar fallback).
- `teardown()` writes the terminal loadgen snapshot to the sidecar before closing the FIFO, so the reader exits on true EOF with an authoritative closing frame, with no idle/time caps.
- `PERF_START` reset (lifecycle, rates, loop lag); `PERF_END` freezes the lifecycle/verdict/tree consistently.

Perf impact (B200 bare metal)
3-config e2e roofline (#328 recipe: `max_throughput_server`, concurrency 4000, 10 s, 3 reps, mean ± pstdev QPS) on an exclusive bare-metal B200 host (Xeon Platinum 8568Y+, 192 threads):

`-vvv`

Trace-off is within run-to-run noise of main: no measurable overhead when `-vvv` isn't used. `-vvv` costs ~5–8% at the sub-millisecond stub roofline, the worst case by construction; against real LLM endpoints (server-dominated e2e) the relative overhead shrinks accordingly.

Test plan
- `uv run pytest tests/unit/utils/`: 112 tests (counts, folds, drop self-heal, loop lag, warmup exclusion, verdict/bar/tree, freeze, tcp gauge)
- `uv run pre-commit run --all-files`: green
- `max_throughput_server` (streaming + offline), plus the 3-config roofline above
- `inference-endpoint -vvv benchmark ...` renders cleanly

🤖 Generated with Claude Code
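The drop-on-EAGAIN transport described under "How it's wired" can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the PR's transport: `DroppingWriter` and `demo` are hypothetical names, an anonymous `os.pipe()` stands in for the per-pid FIFO, and the adaptive sampler and per-tick re-emission of the drop counter are left out. It relies on POSIX semantics: a non-blocking write of at most `PIPE_BUF` bytes either succeeds in full or fails with EAGAIN, so a 17-byte frame is never split.

```python
import os


class DroppingWriter:
    """Hypothetical non-blocking frame writer that drops instead of blocking."""

    def __init__(self, fd: int) -> None:
        self._fd = fd
        self.sent = 0
        self.dropped = 0  # cumulative count, never reset

    def write_frame(self, frame: bytes) -> bool:
        try:
            # Frames are far smaller than PIPE_BUF, so this write is
            # all-or-nothing on a pipe or FIFO.
            os.write(self._fd, frame)
        except BlockingIOError:  # raised for EAGAIN / EWOULDBLOCK
            self.dropped += 1
            return False
        self.sent += 1
        return True


def demo(n_frames: int = 100_000, frame_size: int = 17) -> DroppingWriter:
    # Anonymous pipe used as a stand-in for the FIFO; nothing reads from
    # it, so the kernel buffer fills up and later writes hit EAGAIN.
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    writer = DroppingWriter(write_fd)
    try:
        frame = b"\x00" * frame_size
        for _ in range(n_frames):
            writer.write_frame(frame)
    finally:
        os.close(write_fd)
        os.close(read_fd)
    return writer
```

Because the counter is cumulative, a reader that misses one report of it can recover the true total from the next one, which is one plausible reading of "self-healing" in the description above.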