feat: add client-side profiling trigger by maxyanghu · Pull Request #326 · mlcommons/endpoints

maxyanghu · 2026-05-26T19:47:32Z

Summary

Add a client-side trigger that fires POST /start_profile at the beginning of the performance phase and POST /stop_profile at run end. Lets inference-endpoint capture a vLLM profile during a measured run without coupling endpoints to any vendor-specific server harness.

Motivated by MLPerf Inference v6.1 q3vl, which decouples LoadGen+vLLM into a separate HTTP server (mlperf-inf-mm-q3vl benchmark nv mpi-dynamo-vllm-serve). The server already supports HTTP-driven profiling — this PR is purely the client-side arm/disarm trigger.

What changed

config/schema.py — new ProfilerEngine str-enum (currently just vllm) and ProfilingConfig Pydantic model with two fields: engine: ProfilerEngine | None = None and urls: list[str] | None = None (CLI --profile-urls). Hung off Settings next to WarmupConfig. urls carries a scheme validator mirroring EndpointConfig's (rejects URLs without http:///https://, skips when None). Default-off; YAML and existing CLI behavior unchanged.
commands/benchmark/execute.py — at performance-phase start, POST /start_profile to each URL derived from the profiling endpoint list (strip trailing /v1, append the engine's path). The list is profiling.urls when set, otherwise endpoint_config.endpoints. In the run's finally block, POST /stop_profile to URLs whose start succeeded. Writes a Profiling section to report.txt and a sibling profiling.json. Warn-don't-fail throughout — profiling never fails a benchmark run.
commands/benchmark/cli.py — from-config gets --profile <engine> and --profile-urls <url...> override flags, same pattern as the existing --timeout / --mode overrides. offline / online already pick up both from the schema via cyclopts.

How to use

# Server, with profiling enabled at server start:
vllm serve \
    ... \
    --profiler-config.profiler=cuda \
    --profiler-config.delay_iterations=50 \
    ---profiler-config.max_iterations=100 \
    ...

# Client:
inference-endpoint benchmark from-config -c run.yaml
#  in YAML:
#   settings:
#     profiling:
#       engine: vllm

# When the profiler admin endpoint differs from the inference endpoint,
# override the trigger target (otherwise it derives from endpoint_config.endpoints):
# or in YAML:
#   settings:
#     profiling:
#       engine: vllm
#       urls:
#         - http://host:8001/v1

The actual trace window is bounded server-side by delay_iterations / max_iterations. The client's /stop_profile is a safety belt for the max_iterations=0 case and for aborted runs.

Caveats

Multi-replica fan-out is sequential; small skew (sub-iteration) between first and last server's cudaProfilerStop. Bounded well under max_iterations.
The client /stop_profile fires from the session-end finally, not from a per-phase end hook. Profiled runs are not expected to include the accuracy phase; if that changes, add an on_phase_end callback to BenchmarkSession.run.
URL derivation strips /v1 and assumes a bare host/port underneath. Reverse-proxy-prefixed deployments (e.g. gateway/api/v1) are not supported — but --profile-urls can supply the correct base directly when the profiler endpoint differs from the inference endpoint.

closes #324

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-05-26T19:47:45Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces client-side profiling triggers for inference engines (specifically supporting vLLM) around the performance phase of a benchmark run. It adds a --profile CLI option, config schema updates, and logic to trigger start/stop profiling endpoints. The review feedback highlights a critical issue where the profiling stop requests are synchronous blocking network calls executed within the main event loop during the cleanup phase. This can block the event loop or prevent subsequent cleanup steps if interrupted, and should be run asynchronously using an executor.

wangshangsam

LGTM but (probably in a separate PR) I have a feeling that _run_benchmark_async() needs refactoring. It's currently 342 lines long; not very readable.

viraatc · 2026-06-16T10:47:16Z

    config: Annotated[Path, cyclopts.Parameter(name=["--config", "-c"])],
    timeout: float | None = None,
    mode: TestMode | None = None,
+    profile: Annotated[


lets keep from-config cli minimal if possible,
is there special need --profile, --profile_urls in from-config mode? (instead of specifying in config yaml)

Thanks! I dropped --profile/profile_urls

viraatc

LGTM thanks!

viraatc

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.5) + Claude | Depth: standard

3 inline comments below. See the summary comment for the full picture, including findings already covered by existing reviewers.

viraatc · 2026-06-16T11:08:54Z

+            # Fire /stop_profile for URLs whose /start_profile succeeded.
+            # Unifies the clean phase-end path and the abort path —
+            # both reach this block, both fire stops.
+            if profile_starts:


[Codex] medium (correctness): The profiling window is not bounded to the performance phase when accuracy datasets are configured.

These /stop_profile calls run in the finally, which only executes after await session.run(phases, ...) (line 901) has finished every phase. _build_phases() appends phases as warmup → performance → accuracy (one PhaseConfig per accuracy dataset, execute.py:515-523), all driven by the single session.run. So /start_profile fires at performance-phase start but /stop_profile fires only after the accuracy phases complete — meaning that in submission / --mode both runs with separate accuracy datasets, the profiler trace and the reported Trigger span cover performance plus all accuracy traffic, contradicting the documented “around the performance phase” contract (ProfilingConfig docstring). Pure perf runs are unaffected.

Minimal fix: stop at performance-phase end rather than session end — e.g. add an on_phase_end hook to session.run (it currently exposes only on_phase_start, session.py:369-389) and fire the stops there when phase.phase_type == PhaseType.PERFORMANCE.

we won't run profiling with --mode both/acc, it's only for --mode perf. Plus there is no existing on_phase_end hook. We don't want to add it merely because we want a profiling functionality.

viraatc · 2026-06-16T11:08:54Z

+    if profile_urls is not None:
+        profiling_update["urls"] = profile_urls
+    if profiling_update:
+        new_profiling = resolved.settings.profiling.model_copy(update=profiling_update)


[Codex + Claude] medium (data-integrity): this from-config override bypasses the new URL-scheme validation.

In Pydantic v2 model_copy(update=...) does not run validators, so ProfilingConfig._validate_url_scheme (schema.py:666) never fires on this path. A scheme-less value such as benchmark from-config --profile-urls localhost:8000 is silently accepted; _derive_profile_urls then yields localhost:8000/start_profile and urlopen raises unknown url type, which is swallowed into the record so the trigger silently no-ops. The offline/online path constructs ProfilingConfig normally and does reject the same input — so the two CLI paths disagree (verified empirically on pydantic 2.12.0).

Minimal fix: re-validate instead of copying, e.g.

new_profiling = ProfilingConfig.model_validate( {**resolved.settings.profiling.model_dump(), **profiling_update} )

Related to @viraatc’s note above on keeping the from-config surface minimal — if these overrides are dropped, this code path (and the asymmetry) disappears entirely.

I was just testing this PR out as well, +1, I saw this issue too, malformed URLs are able to pass through using from-config

I've dropped --profile/profile_urls from from-config cli. This issue doesn't exist any more.

viraatc · 2026-06-16T11:08:54Z

+}
+
+
+def _derive_profile_urls(


[Claude] medium (testing): none of the new profiling code is exercised by tests (the only profil* test, tests/unit/test_profiler.py, covers the unrelated line-profiler module). The branches most worth covering — and the ones implicated by the two correctness issues in this review — are: _derive_profile_urls /v1 stripping and the empty-endpoints ValueError (here); the model_validate-vs-model_copy validation asymmetry (cli.py:132); and the start/stop record plumbing into profiling.json. A monkeypatched urlopen (or a stub server) keeps these as fast unit tests; per AGENTS.md the new branches should be covered before merge.

Tests added

viraatc · 2026-06-16T11:09:24Z

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.5, xhigh) + Claude | Depth: standard | Focus: code & integration quality, minimal changes

Posted 3 inline comments. Each was verified against the source at HEAD (cd84c53) and de-duplicated against the 4 existing human/bot comments.

Posted inline

#	File	Line	Severity	Category	Reviewer(s)	Summary
1	`commands/benchmark/execute.py`	913	medium	correctness	Codex	`/stop_profile` fires in the `finally` after all phases, so with accuracy datasets the trace + `Trigger span` cover performance plus accuracy traffic — not "around the performance phase". Bound the stop to perf-phase end (needs an `on_phase_end` hook).
2	`commands/benchmark/cli.py`	132	medium	data-integrity	Codex + Claude	`model_copy(update=…)` skips validators, so `--profile-urls` on `from-config` bypasses `_validate_url_scheme` (scheme-less URL silently no-ops the trigger). `offline`/`online` correctly rejects it. Re-validate via `ProfilingConfig.model_validate({dump, update})`.
3	`commands/benchmark/execute.py`	556	medium	testing	Claude	No tests exercise any new profiling code (URL derivation, validator asymmetry, start/stop plumbing). The untested branches are exactly the ones implicated by #1/#2.

Already covered / not re-posted (to avoid duplicate noise)

Blocking urllib.urlopen on the event loop — already filed by gemini-code-assist[bot] at execute.py:928 for the stop loop (recommends loop.run_in_executor + except BaseException so a 2nd Ctrl+C can't skip cleanup). Worth extending: the start loop (execute.py:887-897) has the identical synchronous-urlopen-on-the-loop pattern and isn't mentioned in that thread — the same run_in_executor fix should apply there too.
Should --profile/--profile-urls exist in from-config at all? — already raised by @viraatc at cli.py:102. Resolving that (dropping the overrides) would also eliminate finding [fix] Update review and branch name #2's code path.

Low / non-blocking (noted, not posted inline)

profiling.json + report section are skipped on the execution-failure path. If session.run raises, the finally still POSTs /stop_profile (good — the server profiler is stopped), but _run_benchmark_async re-raises as ExecutionError; run_benchmark (execute.py:1176) catches only KeyboardInterrupt, so finalize_benchmark never runs and the stop records (tagged stop_reason="abort") are discarded. Consequently the "(from abort handler)" suffix in _write_profiling_section is effectively unreachable — the only path that returns a BenchmarkResult is the clean/SIGINT path, where session_completed_normally=True ⇒ "phase_end". Low impact; the comment "both reach this block, both fire stops" is accurate for the HTTP call but not for the persisted record.
DRY: the --profile / --profile-urls flag names + help strings are duplicated between ProfilingConfig (schema.py, via alias=) and the explicit from_config params (cli.py:102-117). Partly unavoidable (from_config takes a Path, not the model), but the help text is copy-pasted and will drift — extract to shared constants if the overrides are kept.

Commit hygiene: clean. The main..HEAD range shows 6 commits, but 3 are a merge + already-merged PRs (#355, #331) pulled in via the merge; the PR contributes 2 feat: + 1 fix:.

_{🤖 Two independent AI reviews (Codex + Claude), synthesized and verified against HEAD. Comments are advisory (event: COMMENT); no approval state changed.}

maxyanghu · 2026-06-16T19:16:22Z

@viraatc do you know why Tests/audit is failing?

viraatc · 2026-06-16T22:40:25Z

its an audit failure, we need to update aiohttp on main first (ill make separate MR)

viraatc · 2026-06-21T21:49:11Z

merged now, can u pl. rebase?
#358

Adds an optional client-side trigger that fires POST /start_profile at the performance phase start and /stop_profile at run end, so a profiled run can be driven from a YAML/CLI flag without coupling endpoints to any vendor harness. Schema: ProfilerEngine enum (currently {vllm}) and ProfilingConfig hung off Settings. URLs are auto-derived per entry in endpoint_config.endpoints (strip /v1, append engine-specific path). Default-off; warn-don't-fail throughout. Report.txt gets a Profiling section and a sibling profiling.json is written next to result_summary.json when the trigger is enabled.

The profiler-trigger commit (a4fe30b) left the Field( call for metrics_tokenizer_workers unterminated, so config/schema.py raised SyntaxError and the inference-endpoint CLI could not import. Add the missing ) so the module compiles.

Add an optional profiling.endpoints (CLI --profile-endpoints) field so the profiler start/stop triggers can target a different host than the inference endpoint. When unset, URLs are still derived from endpoint_config.endpoints; when set, derivation runs over the override list using the same engine-specific protocol. Adds a scheme validator mirroring EndpointConfig and a matching --profile-endpoints override on the from-config subcommand.

Keep the from-config CLI surface minimal per review feedback: profiling is configured via the YAML settings.profiling block for from-config runs. This also removes the model_copy(update=...) path that bypassed ProfilingConfig URL-scheme validation. offline/online keep the schema-generated --profile/--profile-urls flags, which validate normally.

Adds unit tests for the client-side profiling trigger (review finding #3): - TestProfilingConfig (test_schema.py): defaults, engine enum coercion, and URL-scheme validation on both the direct-construction (offline/online) and model_validate (from-config YAML) paths. - TestProfilingHelpers (test_benchmark.py): _derive_profile_urls /v1 stripping and empty-endpoints ValueError, _post_profile 200/404/connection-failure via mocked urlopen, _render_profile_status, and _write_profiling_section output plus profiling.json serializability.

maxyanghu requested a review from a team May 26, 2026 19:47

gemini-code-assist Bot reviewed May 26, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py

wangshangsam requested a review from arekay-nv May 27, 2026 05:13

wangshangsam self-requested a review June 4, 2026 20:34

wangshangsam assigned maxyanghu Jun 4, 2026

maxyanghu force-pushed the feat/maxhu-add-profiler-trigger branch from a5d0534 to a4fe30b Compare June 10, 2026 18:53

wangshangsam requested changes Jun 11, 2026

View reviewed changes

Comment thread src/inference_endpoint/config/schema.py Outdated

maxyanghu force-pushed the feat/maxhu-add-profiler-trigger branch 2 times, most recently from 8413f09 to 757b5b5 Compare June 15, 2026 19:00

maxyanghu requested a review from wangshangsam June 15, 2026 19:16

wangshangsam approved these changes Jun 15, 2026

View reviewed changes

maxyanghu force-pushed the feat/maxhu-add-profiler-trigger branch from c3e1b7b to cd84c53 Compare June 15, 2026 21:17

viraatc reviewed Jun 16, 2026

View reviewed changes

viraatc approved these changes Jun 16, 2026

View reviewed changes

viraatc reviewed Jun 16, 2026

View reviewed changes

maxyanghu requested review from leopck and viraatc June 16, 2026 19:10

maxyanghu force-pushed the feat/maxhu-add-profiler-trigger branch from c2d22ec to 1a951ac Compare June 16, 2026 19:11

maxyanghu added 5 commits June 22, 2026 13:00

maxyanghu force-pushed the feat/maxhu-add-profiler-trigger branch from 1a951ac to 6cba0e0 Compare June 22, 2026 17:00

		}


		def _derive_profile_urls(

Conversation

maxyanghu commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

How to use

Caveats

Type of change

Checklist

Uh oh!

github-actions Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

wangshangsam left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viraatc left a comment

Choose a reason for hiding this comment

Uh oh!

viraatc left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viraatc commented Jun 16, 2026

Review Council — Multi-AI Code Review

Posted inline

Already covered / not re-posted (to avoid duplicate noise)

Low / non-blocking (noted, not posted inline)

Uh oh!

maxyanghu commented Jun 16, 2026

Uh oh!

viraatc commented Jun 16, 2026

Uh oh!

viraatc commented Jun 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

maxyanghu commented May 26, 2026 •

edited

Loading

github-actions Bot commented May 26, 2026 •

edited

Loading

wangshangsam left a comment •

edited

Loading