feat: add -vvv trace mode dashboard for client overhead by viraatc · Pull Request #334 · mlcommons/endpoints

viraatc · 2026-06-04T00:03:20Z

Summary

Opt-in (-vvv) binary trace pipeline: per-request lifecycle latency + per-process asyncio event-loop lag, rendered as a live rich.Live dashboard alongside the running benchmark. Off by default; no measurable overhead when off (see Perf impact).

Example (demo data)

═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
  uptime                              0.0s   status                      BACKPRESSURE   req/s                           12,747.2
  issued                            80,000   dropped frames                       512   tok/s                        5,353,810.0
═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════

  REQUEST LIFECYCLE  (ms)
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  stage                                                        N        avg        min        p50        p99        max     %E2E
  [client] issue -> ipc_2_worker -> conn acquired            400      33.87      25.00      34.01      43.02      43.00    20.5%
  [client] conn acquired -> payload written                  400       1.00       1.00       1.00       1.00       1.00     0.6%
  [server] payload written -> headers recvd                  400       2.00       2.00       2.00       2.00       2.00     1.2%
  [server] headers recvd -> 1st chunk                        400      31.55      28.00      31.60      35.23      35.20    19.1%
  [server] 1st chunk -> last chunk                           400      87.74      70.00      88.01     106.04     106.00    53.2%
  [client] last chunk -> ipc_2_main -> complete              400       8.89       8.00       8.90       9.81       9.80     5.4%
  E2E TOTAL  issue -> complete                               400     165.04     134.00     165.54     197.00     197.00   100.0%

  client work                        22.4%   server work                        73.5%   backpressure [workers busy]        20.5%
  e2e               │▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▒███████████████████████████████████████████████████████████▒▒▒▒│                 encode ─┤
                    ▒ client   ▓ backpressure   █ server                                                          tcp-acquire ─┤
                                                                                                                   sse-decode ─┤
                                                                                                                 final-decode ─┤
                                                                                                                 complete-ipc ─┘

  LOADGEN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  issued                            80,000   completed                         76,483   errors                                12
  issued/s                        13,333.3   completed/s                     12,747.2   tok/s                        5,353,810.0

  latency (ms)                                                 N        avg        min        p50        p99        max
  ttft                                                    76,483      33.00      20.00      32.00      95.00     120.00
  tpot                                                    76,483      11.50       5.00      11.00      45.00      60.00
  e2e                                                     76,483     151.00      90.00     150.00     380.00     410.00

  EVENT LOOP LAG  (ms)
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  fleet p99 1.17 ms   hot workers (p99 ≥ 5 ms)  2/6   tcp conns  3,988
  worker                                                       N        avg        min        p50        p99        max
  main                                                        20       0.49       0.40       0.48       0.57       0.57
  w4                                                          20      11.09      11.00      11.08      11.17      11.17
  w2                                                          20       6.59       6.50       6.58       6.67       6.67
  w0                                                          20       1.29       1.20       1.28       1.37       1.37
  w1                                                          20       0.89       0.80       0.88       0.97       0.97
  …

What it shows

REQUEST LIFECYCLE — per-stage N/avg/min/p50/p99/max + heat-graded %E2E across the request path; ipc_2_worker/ipc_2_main mark the cross-process hops.
Verdict + e2e bar + cause tree — client / server / backpressure split of E2E; the bar uses only the legend key colors, and the tree under the backpressure value names the worker-loop phases, colored by their stage's %E2E heat.
LOADGEN — authoritative counts/rates + ttft/tpot/e2e from the metrics aggregator (perf-window tracked_*).
EVENT LOOP LAG — fleet p99, hot-worker count, fleet-wide ESTABLISHED TCP conns (sampled from /proc by the dashboard process — zero producer cost), then main + top-16 workers by max lag.
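
As an illustration of the /proc-based TCP gauge, ESTABLISHED sockets can be counted on Linux by scanning `/proc/net/tcp` and `/proc/net/tcp6` for state code `01`. This is a minimal sketch under that assumption; the function names are hypothetical, it counts every socket in the network namespace rather than only the benchmark's, and the PR's sampler may filter differently.

```python
from pathlib import Path

TCP_ESTABLISHED = "01"  # state code used in /proc/net/tcp


def count_established(proc_net_tcp: str) -> int:
    """Count ESTABLISHED sockets in the text of /proc/net/tcp (or tcp6)."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Column 3 ("st") holds the socket state as two hex digits.
        if len(fields) > 3 and fields[3] == TCP_ESTABLISHED:
            count += 1
    return count


def sample_host_established() -> int:
    """Linux only: sum ESTABLISHED sockets over the IPv4 and IPv6 tables."""
    total = 0
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            total += count_established(path.read_text())
    return total


if __name__ == "__main__":
    print(f"tcp conns  {sample_host_established():,}")
```

Because the read happens entirely in the dashboard process, the benchmark workers never pay for it, which matches the "zero producer cost" note above.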

How it's wired

utils/trace.py — lock-free SPSC emitter (~190 ns/event) into a per-process 512 MiB anonymous-mmap ring (pages fault in on demand); per-pid FIFO transport with non-blocking open (bounded retry, raises instead of hanging if the dashboard died) and O_NONBLOCK writes — drops on EAGAIN, adaptive sampling sheds load, and a cumulative self-healing drop counter is re-emitted every tick.
utils/trace_dashboard.py (pure aggregation/render, unit-tested in isolation) + scripts/trace_dashboard.py (TUI subprocess) — reads the FIFO via vectorized iter_unpack plus a dedicated ZMQ SUB on the aggregator PUB (sidecar fallback). teardown() writes the terminal loadgen snapshot to the sidecar before closing the FIFO, so the reader exits on true EOF with an authoritative closing frame — no idle/time caps.
Warmup is excluded everywhere via the PERF_START reset (lifecycle, rates, loop lag); PERF_END freezes the lifecycle/verdict/tree consistently.
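
The transport behaviour described above (non-blocking open with bounded retry, writes that drop on EAGAIN, a cumulative drop counter) can be sketched roughly as follows. `open_fifo_writer` and `DroppingWriter` are hypothetical names, not the PR's API, and the retry count and delay are made-up defaults. The sketch relies on two POSIX properties: a non-blocking write-only open of a FIFO fails with ENXIO while no reader is attached, and a write of at most PIPE_BUF bytes is atomic, so a small frame is either written whole or rejected.

```python
import errno
import os
import time


def open_fifo_writer(path: str, retries: int = 50, delay_s: float = 0.1) -> int:
    """Open the write end of a FIFO without blocking, with bounded retry.

    A non-blocking write-only open fails with ENXIO while no reader has the
    FIFO open, so retry a fixed number of times and then give up loudly.
    """
    for _ in range(retries):
        try:
            return os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError as exc:
            if exc.errno != errno.ENXIO:
                raise
            time.sleep(delay_s)
    raise RuntimeError(f"no reader attached to {path}")


class DroppingWriter:
    """Writes small fixed-size frames to a non-blocking fd, dropping when full."""

    def __init__(self, fd: int) -> None:
        self.fd = fd
        self.written = 0
        self.dropped = 0  # cumulative, never reset

    def write(self, frame: bytes) -> bool:
        try:
            os.write(self.fd, frame)
        except BlockingIOError:  # EAGAIN: the pipe buffer is full
            self.dropped += 1
            return False
        self.written += 1
        return True


if __name__ == "__main__":
    # An anonymous pipe stands in for the FIFO; nobody reads, so it fills up.
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    writer = DroppingWriter(write_fd)
    for _ in range(200_000):
        writer.write(b"\x00" * 17)
    print(f"written={writer.written} dropped={writer.dropped}")
```

Keeping the counter cumulative means a reader that missed one report can still recover the true total from the next one, which is one plausible reading of "self-healing" here.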

Perf impact (B200 bare metal)

3-config e2e roofline (#328 recipe: max_throughput_server, concurrency 4000, 10 s, 3 reps, mean ± pstdev QPS) on an exclusive bare-metal B200 host (Xeon Platinum 8568Y+, 192 threads):

| mode | main | PR trace-off | PR `-vvv` |
|---|---|---|---|
| nonstream | 66,176 ± 971 | 66,921 ± 875 (+1.1%) | 63,211 ± 926 (−4.5% vs main) |
| stream | 51,436 ± 1,306 | 50,296 ± 730 (−2.2%) | 47,428 ± 289 (−7.8% vs main) |

Trace-off is within run-to-run noise of main, so there is no measurable overhead when -vvv isn't used. -vvv costs roughly 4.5–7.8% at the sub-millisecond stub roofline, which is the worst case by construction; against real LLM endpoints (server-dominated e2e) the relative overhead should shrink accordingly.
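
The percentages in the table can be reproduced from the reported mean QPS values. The snippet below is only a convenience check; `rel_delta_pct` is a hypothetical helper, and the numbers are copied from the table above.

```python
def rel_delta_pct(baseline_qps: float, candidate_qps: float) -> float:
    """Relative throughput change of candidate vs baseline, in percent."""
    return (candidate_qps - baseline_qps) / baseline_qps * 100.0


# Mean QPS values copied from the perf table.
results = {
    "nonstream": {"main": 66_176, "trace-off": 66_921, "-vvv": 63_211},
    "stream": {"main": 51_436, "trace-off": 50_296, "-vvv": 47_428},
}

for mode, qps in results.items():
    for config in ("trace-off", "-vvv"):
        delta = rel_delta_pct(qps["main"], qps[config])
        print(f"{mode:9s} {config:9s} {delta:+.1f}%")
```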

Test plan

uv run pytest tests/unit/utils/ — 112 tests (counts, folds, drop self-heal, loop lag, warmup exclusion, verdict/bar/tree, freeze, tcp gauge)
uv run pre-commit run --all-files — green
e2e smoke vs max_throughput_server (streaming + offline), plus the 3-config roofline above
Reviewer: inference-endpoint -vvv benchmark ... renders cleanly

🤖 Generated with Claude Code

github-actions · 2026-06-04T00:03:28Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces a high-performance, binary event-tracing framework activated via the -vvv CLI flag, complete with a live terminal dashboard, ablation scripts, and comprehensive unit tests. Tracing is integrated across the benchmark executor, HTTP client, worker processes, and load generator. The reviewer's feedback focuses on critical thread-safety improvements to the Dashboard class using a reentrant lock (threading.RLock) to prevent race conditions between the reader and render threads. Additionally, the reviewer suggests key performance optimizations in the hot path, such as using a cached memoryview in _TraceEmitter for zero-copy writes and pre-computing the integer sid in InFlightRequest to avoid repetitive string parsing.


This reverts commit 28f5ac1. The -vvv trace feature was merged to main before review. Reverting so main stays clean; it will be re-introduced via the reviewed feat/trace-events PR (#334). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Opt-in (`-vvv`) per-request lifecycle tracing for the benchmark client, rendered as a live `rich` dashboard alongside the run. Off by default with no measurable overhead when off (emit is a no-op binding); the worker hot path stays lock-free and allocation-free.

Pipeline:
- utils/trace.py: lock-free SPSC ring emitter (~190 ns/event) into a per-process 512 MiB anonymous mmap (pages fault in on write). Per-pid POSIX FIFO transport: non-blocking open with bounded retry (raises rather than hanging if the dashboard died), O_NONBLOCK writes that drop on EAGAIN with an adaptive sampler + cumulative self-healing drop counter. bootstrap spawns the dashboard, tracks its Popen, and cleanup() reaps it (kills a wedged one as a backstop). 17-byte <BQQ frames.
- utils/trace_dashboard.py: pure aggregation + render (unit-tested in isolation): lifecycle fold into HDR-histogram stages, heat-graded %E2E table, client/server/backpressure verdict, e2e bar, backpressure cause tree (pickup-ipc heats independently; encode+tcp-acquire fused), per-proc loop-lag panel, and an ESTABLISHED tcp-conn gauge. Cross-process deltas floored at 0.
- scripts/trace_dashboard.py: TUI subprocess: FIFO reader thread, ZMQ SUB to the aggregator PUB (sidecar fallback), and an off-render-thread /proc tcp-conn sampler. Reads to true FIFO EOF with an authoritative final frame.
- Worker / session / agentic-inference emit sites; warmup excluded via a PERF_START reset; PERF_END freezes the lifecycle/verdict/tree.

Perf (B200 bare metal, #328 roofline, 3 reps): trace-off within run-to-run noise of main; -vvv ~5-8% at the sub-ms stub ceiling (worst case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
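
To make the frame size concrete: the format string `<BQQ` is little-endian and unpadded, so one unsigned byte plus two unsigned 64-bit integers comes to 1 + 8 + 8 = 17 bytes. The sketch below decodes a byte stream of such frames with `struct.iter_unpack`, carrying over any trailing partial frame. The field meanings (event kind, request id, timestamp in ns) and the helper name `decode_frames` are assumptions for illustration; the commit message only gives the layout.

```python
import struct

FRAME = struct.Struct("<BQQ")  # little-endian, unpadded: 1 + 8 + 8 bytes
assert FRAME.size == 17


def decode_frames(buf: bytes) -> tuple[list[tuple[int, int, int]], bytes]:
    """Decode every complete frame in `buf`; return (frames, leftover bytes)."""
    usable = len(buf) - (len(buf) % FRAME.size)
    # iter_unpack requires a buffer whose length is a multiple of FRAME.size.
    frames = list(FRAME.iter_unpack(buf[:usable]))
    return frames, buf[usable:]


# Field meanings are hypothetical: (event kind, request id, timestamp in ns).
stream = FRAME.pack(1, 42, 1_000) + FRAME.pack(2, 42, 1_500) + b"\x03\x00"
frames, rest = decode_frames(stream)
print(frames)  # [(1, 42, 1000), (2, 42, 1500)]
print(rest)    # b'\x03\x00'
```

A reader loop would prepend `rest` to the next chunk it reads, so a frame split across two reads is still decoded whole.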

viraatc requested a review from a team June 4, 2026 00:03

github-actions Bot requested review from arekay-nv and nvzhihanj June 4, 2026 00:03

gemini-code-assist Bot reviewed Jun 4, 2026
