2. Add prompt presets#54

Merged
ErlisLushtaku merged 12 commits into
mainfrom
pr32-split-v2/02-prompt-presets-only
Jun 16, 2026

@ErlisLushtaku ErlisLushtaku commented May 28, 2026

Collaborator

Summary

  • Move shared judge prompt preset resolution into its own stack layer.
  • Wire prompt preset selection through CLI, pairwise judging, and MT-Bench preset judging.

Stack

  • Next: pr32-split-v2/02.5-mt-bench-preset-judging.

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true
Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true
@ErlisLushtaku

Collaborator Author

@kargibora this is the same as #48; I just split some of the MT-Bench logic out into a separate PR, #55. Feel free to give it another review, but it should match what you saw before. #55 is new, though.

@ErlisLushtaku ErlisLushtaku requested a review from kargibora May 28, 2026 09:04

@kargibora kargibora left a comment

Collaborator

Thanks for the clean-up! I left some more comments after reading it thoroughly.

provide_explanation=provide_explanation,
)

if preset is None:

Collaborator

If preset=None, we set it to DEFAULT_WITH_EXPLANATION_PRESET whenever provide_explanation is set, so task and provide_explanation cannot be used at the same time. Is this intended?

Collaborator Author

This precedence is unchanged from the original judge-prompt-registry PR (#40), where provide_explanation was intentionally documented as a legacy alias for default_with_explanation that overrides the task default. I also added a comment here to make this more clear.
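
For readers following the thread, the precedence under discussion can be sketched roughly as below. This is a minimal illustration, not the code in judgearena/prompts/registry.py: only `preset`, `task`, `provide_explanation`, and `DEFAULT_WITH_EXPLANATION_PRESET` come from the thread, while `resolve_preset_name`, `TASK_DEFAULT_PRESETS`, `DEFAULT_PRESET`, and the example task and preset names are hypothetical.

```python
from typing import Optional

# Named in the review thread; the string value follows the
# "default_with_explanation" alias mentioned in the reply.
DEFAULT_WITH_EXPLANATION_PRESET = "default_with_explanation"

# Hypothetical names, introduced only for this sketch.
DEFAULT_PRESET = "default"
TASK_DEFAULT_PRESETS = {"mt-bench": "mt_bench_pairwise"}


def resolve_preset_name(
    preset: Optional[str],
    task: Optional[str],
    provide_explanation: bool,
) -> str:
    """Pick a preset name using the precedence described in the thread."""
    # Assumption: an explicitly requested preset always wins.
    if preset is not None:
        return preset
    # Legacy alias: provide_explanation overrides the task default.
    if provide_explanation:
        return DEFAULT_WITH_EXPLANATION_PRESET
    # Otherwise fall back to the task default, then the global default.
    if task is not None and task in TASK_DEFAULT_PRESETS:
        return TASK_DEFAULT_PRESETS[task]
    return DEFAULT_PRESET
```

Under these assumptions, passing both a task and `provide_explanation=True` yields the explanation preset rather than the task default, which is the behaviour the reviewer asks about.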

Collaborator

In that case, do we really need to keep the legacy alias/arguments just to avoid breaking previous CLI commands? This is an overall design question, but I think it is more natural to drop the alias from the functions and emit a warning about its usage instead.

Collaborator Author

It might be a point of discussion; we could discuss it in a meeting later.

Collaborator Author

I'll note this and merge the PR for now, so we don't block further work or have to resolve more conflicts.

Comment thread judgearena/prompts/registry.py
Comment thread judgearena/mt_bench/preset_judging.py Outdated
Comment thread judgearena/mt_bench/mt_bench_utils.py Outdated
- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.
Resolve conflicts in evaluate.py and generate_and_evaluate.py by keeping
both sides:
- evaluate.py: keep preset-driven PairScore(parser_mode=...) and adopt
  main's _parse_and_warn parse-failure logging helper.
- generate_and_evaluate.py: keep resolved_prompt.metadata() alongside
  main's new "swap_mode"/"result_folder" result fields.
ErlisLushtaku added a commit that referenced this pull request Jun 16, 2026
…judging

Pulls main (via the updated 02 base) plus the PR #54 review fixes into the
MT-Bench branch. Resolve judge_mt_bench_with_preset signature conflict by
adopting the 3-tuple return (drops the unused always-0 int) while keeping
this branch's resolve_mt_bench_turn_flags refactor; the inline
assert turns_mode is redundant since resolve_mt_bench_turn_flags validates it.
ErlisLushtaku added a commit that referenced this pull request Jun 16, 2026
Propagate main + PR #54/#55 changes up the stack. Resolve conflicts:
- evaluate.py: keep this branch's parser_mode parameter while adopting
  main's _parse_and_warn parse-failure logging helper.
- tests/test_utils.py: keep both added imports (pytest and the mt_bench
  module used by the new tests).
ErlisLushtaku added a commit that referenced this pull request Jun 16, 2026
Integrate main + PR #54/#55/#57 changes under the thinking-model branch.
Resolve conflicts:
- preset_judging.py / mt_bench_utils.py: adopt the 3-tuple
  judge_mt_bench_with_preset return while keeping strip_thinking_before_judging.
- generate_and_evaluate.py: keep _build_generation_engine_kwargs (battle
  thinking-token sub-budget) and drop _build_judge_engine_kwargs in favour of
  03.5's inlined build_default_judge_model_kwargs; keep both sides' new result
  fields (thinking + swap_mode/result_folder).
- utils.py: adopt 03.5's refactored do_inference (no per-message metadata
  extraction). Keep strip_thinking_tags* and safe_parse_int; drop the now-dead
  _extract_ai_message_metadata and batch_with_metadata. thinking_token_budget
  enforcement is preserved; budget-exhaustion observability is dropped.
@ErlisLushtaku ErlisLushtaku merged commit 8d9e947 into main Jun 16, 2026
1 check passed
ErlisLushtaku added a commit that referenced this pull request Jun 16, 2026
#54 was squash-merged into main as a single commit (8d9e947), so the
original 02 prompt-preset commits in this branch's history collided with
main. main's tree is byte-identical to the 02 branch tip (301cba8), which
this branch already merged, so every conflict was resolved by keeping this
branch's version (a strict superset: 02 changes + the MT-Bench work).
Resulting tree is unchanged; the PR diff now shows only the MT-Bench changes.
ErlisLushtaku added a commit that referenced this pull request Jun 16, 2026
test_judge_mt_bench_with_preset_parses_and_inverts_swapped_scores still
unpacked 4 values and asserted num_inconsistent == 0, but the #54/#55
review dropped the always-zero num_inconsistent so judge_mt_bench_with_preset
now returns a 3-tuple. Unpack 3 values and drop the stale assertion.
ErlisLushtaku added a commit that referenced this pull request Jun 16, 2026
* Add native baselines and judge controls

* Fix versioned m-Arena-Hard download test

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

* remove backward compatibility code

* revert

* Add prompt presets and localized judge prompts

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true

* skywork prompt preset

* Add clarifying comment about localized presets

* remove localized and skywork presets

* Merge pr 40 logic, add registry and task defaults

* refactor and add tests

* add mt-bench reproducibility metadata and completion alignment

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

* Remove unnecessary cases added

* Address PR #54 review on prompt presets

- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.

* Address PR #55 review on MT-Bench preset judging
- Drop the always-constant swap_policy parameter and dead guard from
  the preset _append_results closure (and the now-unused
  _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
- Record the results "date" as UTC isoformat, consistent with the
  run-metadata started_at_utc timestamp.

* Drop MTBenchSwapPolicy in favor of inline comments

- The swap_policy parameter was always constant on both judging paths
(the only caller passed the matching module constant and the guard was
unreachable), and the enum was never serialized or used for dispatch.
- Remove it from _resolve_fastchat_item_result too (it was already removed
from the preset path), delete the unused enum, and document the two
swap_mode="both" combine strategies with inline comments instead:
conservative agreement for FastChat, append-inverted-score for presets.

* Fix MT-Bench preset test to unpack 3-tuple return

test_judge_mt_bench_with_preset_parses_and_inverts_swapped_scores still
unpacked 4 values and asserted num_inconsistent == 0, but the #54/#55
review dropped the always-zero num_inconsistent so judge_mt_bench_with_preset
now returns a 3-tuple. Unpack 3 values and drop the stale assertion.
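
The two swap_mode="both" combine strategies named above (conservative agreement for FastChat, append-inverted-score for presets) might look roughly like the sketch below. The function names, the "A"/"B"/"tie" labels, and the assumption that preset scores lie in [0, 1] and are inverted as 1 - score are all hypothetical and not taken from the repository. The sketch also suggests why an inconsistency count only makes sense on the FastChat path.

```python
from typing import List, Optional, Tuple


def combine_conservative(
    winner_original: str, winner_swapped: str
) -> Tuple[str, bool]:
    """FastChat-style combine: keep a winner only if both orders agree.

    winner_swapped is assumed to be already mapped back to the original
    A/B labels. Returns (verdict, is_inconsistent).
    """
    if winner_original == winner_swapped:
        return winner_original, False
    # Disagreement between the two orders is treated as a tie and
    # counted as an inconsistent judgment.
    return "tie", True


def combine_append_inverted(
    scores: List[float],
    score_original: Optional[float],
    score_swapped: Optional[float],
) -> None:
    """Preset-style combine: append both scores, inverting the swapped one.

    Scores are assumed to lie in [0, 1]; None marks a parse failure.
    No agreement check happens here, so there is nothing to count as
    inconsistent.
    """
    if score_original is not None:
        scores.append(score_original)
    if score_swapped is not None:
        scores.append(1.0 - score_swapped)
```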
ErlisLushtaku added a commit that referenced this pull request Jun 16, 2026
…cleanup, point to judgearena hf (#57)

* Add native baselines and judge controls

* Fix versioned m-Arena-Hard download test

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

* remove backward compatibility code

* revert

* Add prompt presets and localized judge prompts

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true

* skywork prompt preset

* Add clarifying comment about localized presets

* remove localized and skywork presets

* Merge pr 40 logic, add registry and task defaults

* refactor and add tests

* add mt-bench reproducibility metadata and completion alignment

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

* Remove unnecessary cases added

* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the
async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from
M_ARENA_HARD_BASELINES.

* Address PR #54 review on prompt presets

- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.

* Address PR #55 review on MT-Bench preset judging
- Drop the always-constant swap_policy parameter and dead guard from
  the preset _append_results closure (and the now-unused
  _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
- Record the results "date" as UTC isoformat, consistent with the
  run-metadata started_at_utc timestamp.

* Drop MTBenchSwapPolicy in favor of inline comments

- The swap_policy parameter was always constant on both judging paths
(the only caller passed the matching module constant and the guard was
unreachable), and the enum was never serialized or used for dispatch.
- Remove it from _resolve_fastchat_item_result too (it was already removed
from the preset path), delete the unused enum, and document the two
swap_mode="both" combine strategies with inline comments instead:
conservative agreement for FastChat, append-inverted-score for presets.

* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing

- cli.py: remove the re-introduced --model alias and _resolve_model_a;
  callers use --model_A directly.
- generate_and_evaluate.py: inline build_default_judge_model_kwargs at its
  single call site and delete the _build_judge_engine_kwargs wrapper.
- utils.py: add safe_parse_int(env_var) so malformed JUDGEARENA_* integers
  fall back to defaults (with a warning) instead of crashing at import; use
  it for the vLLM init constants and the judge concurrency cap (also removes
  a double int() call).
- utils.py: remove the unused finish_reason/stop_reason metadata scaffolding
  (_extract_ai_message_metadata, do_inference return_metadata,
  ChatVLLM.batch_with_metadata) whose only consumer is the deferred
  truncation-tracking work; fold batch_with_metadata back into batch and add
  docstrings to do_inference and batch.
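
The safe_parse_int behaviour described above (malformed JUDGEARENA_* integers fall back to a default with a warning instead of crashing at import) could be sketched as follows. The commit message only shows `safe_parse_int(env_var)`; the `default` parameter, the logger setup, and the variable name `JUDGEARENA_EXAMPLE_LIMIT` are assumptions made for illustration.

```python
import logging
import os

logger = logging.getLogger(__name__)


def safe_parse_int(env_var: str, default: int) -> int:
    """Read an integer from the environment, tolerating bad values."""
    raw = os.environ.get(env_var)
    # Unset or empty variables silently use the default.
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        # Warn rather than raise, so a typo cannot break module import.
        logger.warning(
            "Ignoring malformed %s=%r; falling back to %d",
            env_var, raw, default,
        )
        return default


# Hypothetical module-level constant evaluated at import time.
EXAMPLE_LIMIT = safe_parse_int("JUDGEARENA_EXAMPLE_LIMIT", 4)
```

Parsing once inside the helper also avoids the double int() call the commit message mentions.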
ErlisLushtaku added a commit that referenced this pull request Jun 16, 2026
* Add native baselines and judge controls

* Fix versioned m-Arena-Hard download test

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

* remove backward compatibility code

* revert

* Add prompt presets and localized judge prompts

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true

* skywork prompt preset

* Add clarifying comment about localized presets

* remove localized and skywork presets

* Merge pr 40 logic, add registry and task defaults

* refactor and add tests

* add mt-bench reproducibility metadata and completion alignment

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

* Remove unnecessary cases added

* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the
async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from
M_ARENA_HARD_BASELINES.

* Thinking model support (linearized net diff)

* Address PR #54 review on prompt presets

- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.

* Address PR #55 review on MT-Bench preset judging
- Drop the always-constant swap_policy parameter and dead guard from
  the preset _append_results closure (and the now-unused
  _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
- Record the results "date" as UTC isoformat, consistent with the
  run-metadata started_at_utc timestamp.

* Drop MTBenchSwapPolicy in favor of inline comments

- The swap_policy parameter was always constant on both judging paths
(the only caller passed the matching module constant and the guard was
unreachable), and the enum was never serialized or used for dispatch.
- Remove it from _resolve_fastchat_item_result too (it was already removed
from the preset path), delete the unused enum, and document the two
swap_mode="both" combine strategies with inline comments instead:
conservative agreement for FastChat, append-inverted-score for presets.

* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing

- cli.py: remove the re-introduced --model alias and _resolve_model_a;
  callers use --model_A directly.
- generate_and_evaluate.py: inline build_default_judge_model_kwargs at its
  single call site and delete the _build_judge_engine_kwargs wrapper.
- utils.py: add safe_parse_int(env_var) so malformed JUDGEARENA_* integers
  fall back to defaults (with a warning) instead of crashing at import; use
  it for the vLLM init constants and the judge concurrency cap (also removes
  a double int() call).
- utils.py: remove the unused finish_reason/stop_reason metadata scaffolding
  (_extract_ai_message_metadata, do_inference return_metadata,
  ChatVLLM.batch_with_metadata) whose only consumer is the deferred
  truncation-tracking work; fold batch_with_metadata back into batch and add
  docstrings to do_inference and batch.

* Finalize #53 merge: drop deferred thinking-budget observability (Option A)

Align the thinking-model branch with the current base, which removed
do_inference's metadata path (truncation/metadata tracking is deferred):
- utils.py: drop the now-unread _thinking_budget_marker/_value attributes
  (budget enforcement via sampling params is unchanged).
- test_chat_vllm.py: remove the two budget-exhaustion observability tests
  that exercised the dropped batch_with_metadata path.
- test_mt_bench_preset_judging.py: unpack the 3-tuple judge_mt_bench_with_preset
  return (drop the removed always-0 num_inconsistent element).
ErlisLushtaku added a commit that referenced this pull request Jun 16, 2026
* Add native baselines and judge controls

* Fix versioned m-Arena-Hard download test

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

* remove backward compatibility code

* revert

* Add prompt presets and localized judge prompts

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true

* skywork prompt preset

* Add clarifying comment about localized presets

* remove localized and skywork presets

* Merge pr 40 logic, add registry and task defaults

* refactor and add tests

* add mt-bench reproducibility metadata and completion alignment

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

* Remove unnecessary cases added

* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the
async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from
M_ARENA_HARD_BASELINES.

* Thinking model support (linearized net diff)

* add elo random 1k sampling

* Address PR #54 review on prompt presets

- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.

* Address PR #55 review on MT-Bench preset judging
- Drop the always-constant swap_policy parameter and dead guard from
  the preset _append_results closure (and the now-unused
  _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
- Record the results "date" as UTC isoformat, consistent with the
  run-metadata started_at_utc timestamp.

* Drop MTBenchSwapPolicy in favor of inline comments

- The swap_policy parameter was always constant on both judging paths
(the only caller passed the matching module constant and the guard was
unreachable), and the enum was never serialized or used for dispatch.
- Remove it from _resolve_fastchat_item_result too (it was already removed
from the preset path), delete the unused enum, and document the two
swap_mode="both" combine strategies with inline comments instead:
conservative agreement for FastChat, append-inverted-score for presets.

* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing

- cli.py: remove the re-introduced --model alias and _resolve_model_a;
  callers use --model_A directly.
- generate_and_evaluate.py: inline build_default_judge_model_kwargs at its
  single call site and delete the _build_judge_engine_kwargs wrapper.
- utils.py: add safe_parse_int(env_var) so malformed JUDGEARENA_* integers
  fall back to defaults (with a warning) instead of crashing at import; use
  it for the vLLM init constants and the judge concurrency cap (also removes
  a double int() call).
- utils.py: remove the unused finish_reason/stop_reason metadata scaffolding
  (_extract_ai_message_metadata, do_inference return_metadata,
  ChatVLLM.batch_with_metadata) whose only consumer is the deferred
  truncation-tracking work; fold batch_with_metadata back into batch and add
  docstrings to do_inference and batch.

* Address PR #58 review: restore judge engine kwargs, dedupe JSON helper

- Restore judge_extra_kwargs.update(args.engine_kwargs / args.judge_engine_kwargs)
  in estimate_elo_ratings so the Elo judge again honors --engine_kwargs /
  --judge_engine_kwargs (e.g. tensor_parallel_size); the removal was an
  unintended out-of-scope change.
- Drop the local _jsonable in favor of _to_jsonable from judgearena.repro
  (superset: also handles datetime/Path/NaN), removing the duplication.
- Use row["question_id"] instead of row.get("question_id", "") in
  _sample_fingerprint, matching the hard column access in
  select_seeded_random_arena_battles.

* Finalize #53 merge: drop deferred thinking-budget observability (Option A)

Align the thinking-model branch with the current base, which removed
do_inference's metadata path (truncation/metadata tracking is deferred):
- utils.py: drop the now-unread _thinking_budget_marker/_value attributes
  (budget enforcement via sampling params is unchanged).
- test_chat_vllm.py: remove the two budget-exhaustion observability tests
  that exercised the dropped batch_with_metadata path.
- test_mt_bench_preset_judging.py: unpack the 3-tuple judge_mt_bench_with_preset
  return (drop the removed always-0 num_inconsistent element).
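
The #58 review item above replaces a local _jsonable with _to_jsonable from judgearena.repro, described as a superset that also handles datetime/Path/NaN. A helper of that kind might look like the sketch below. This is not the repository's implementation: the function name `to_jsonable`, the choice to map NaN to None, and the recursion over dicts, lists, and tuples are assumptions.

```python
import json
import math
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


def to_jsonable(value: Any) -> Any:
    """Convert common non-JSON values into JSON-safe equivalents."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, Path):
        return str(value)
    # NaN is not valid strict JSON; represent it as null (assumption).
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    return value


record = {
    "date": datetime(2026, 6, 16, tzinfo=timezone.utc),
    "result_folder": Path("results"),
    "score": float("nan"),
}
# allow_nan=False makes json.dumps raise if any NaN slipped through.
encoded = json.dumps(to_jsonable(record), allow_nan=False)
```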
kargibora added a commit that referenced this pull request Jun 16, 2026
* Add soft elo

* Add temperature calibration

* Update README for soft-elo support

* Update temperature

* Update CLI to unify elo computation

* Remove duplication

* Fix an edge case when all the labels are the same

* ruff fix

* Make soft-elo default

* fix preference bug

* fix soft-elo dead flag bug

* fix: calibration error

* bug: fix the problem in the regex parser

* 2. Add prompt presets (#54)

* Add native baselines and judge controls

* Fix versioned m-Arena-Hard download test

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

* remove backward compatibility code

* revert

* Add prompt presets and localized judge prompts

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true

* skywork prompt preset

* Add clarifying comment about localized presets

* remove localized and skywork presets

* Merge pr 40 logic, add registry and task defaults

* Address PR #54 review on prompt presets

- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.

* 2.5. mt-bench changes (#55)

* Add native baselines and judge controls

* Fix versioned m-Arena-Hard download test

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

* remove backward compatibility code

* revert

* Add prompt presets and localized judge prompts

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true

* skywork prompt preset

* Add clarifying comment about localized presets

* remove localized and skywork presets

* Merge pr 40 logic, add registry and task defaults

* refactor and add tests

* add mt-bench reproducibility metadata and completion alignment

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

* Remove unnecessary cases added

* Address PR #54 review on prompt presets

- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.

* Address PR #55 review on MT-Bench preset judging
- Drop the always-constant swap_policy parameter and dead guard from
  the preset _append_results closure (and the now-unused
  _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
- Record the results "date" as UTC isoformat, consistent with the
  run-metadata started_at_utc timestamp.

* Drop MTBenchSwapPolicy in favor of inline comments

- The swap_policy parameter was always constant on both judging paths
(the only caller passed the matching module constant and the guard was
unreachable), and the enum was never serialized or used for dispatch.
- Remove it from _resolve_fastchat_item_result too (it was already removed
from the preset path), delete the unused enum, and document the two
swap_mode="both" combine strategies with inline comments instead:
conservative agreement for FastChat, append-inverted-score for presets.

* Fix MT-Bench preset test to unpack 3-tuple return

test_judge_mt_bench_with_preset_parses_and_inverts_swapped_scores still
unpacked 4 values and asserted num_inconsistent == 0, but the #54/#55
review dropped the always-zero num_inconsistent so judge_mt_bench_with_preset
now returns a 3-tuple. Unpack 3 values and drop the stale assertion.

* 3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf (#57)

* Add native baselines and judge controls

* Fix versioned m-Arena-Hard download test

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

* remove backward compatibility code

* revert

* Add prompt presets and localized judge prompts

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true

* skywork prompt preset

* Add clarifying comment about localized presets

* remove localized and skywork presets

* Merge pr 40 logic, add registry and task defaults

* refactor and add tests

* add mt-bench reproducibility metadata and completion alignment

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

* Remove unnecessary cases added

* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the
async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from
M_ARENA_HARD_BASELINES.

* Address PR #54 review on prompt presets

- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.

* Address PR #55 review on MT-Bench preset judging
- Drop the always-constant swap_policy parameter and dead guard from
  the preset _append_results closure (and the now-unused
  _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
- Record the results "date" as UTC isoformat, consistent with the
  run-metadata started_at_utc timestamp.

* Drop MTBenchSwapPolicy in favor of inline comments

- The swap_policy parameter was always constant on both judging paths
(the only caller passed the matching module constant and the guard was
unreachable), and the enum was never serialized or used for dispatch.
- Remove it from _resolve_fastchat_item_result too (it was already removed
from the preset path), delete the unused enum, and document the two
swap_mode="both" combine strategies with inline comments instead:
conservative agreement for FastChat, append-inverted-score for presets.

* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing

- cli.py: remove the re-introduced --model alias and _resolve_model_a;
  callers use --model_A directly.
- generate_and_evaluate.py: inline build_default_judge_model_kwargs at its
  single call site and delete the _build_judge_engine_kwargs wrapper.
- utils.py: add safe_parse_int(env_var) so malformed JUDGEARENA_* integers
  fall back to defaults (with a warning) instead of crashing at import; use
  it for the vLLM init constants and the judge concurrency cap (also removes
  a double int() call).
- utils.py: remove the unused finish_reason/stop_reason metadata scaffolding
  (_extract_ai_message_metadata, do_inference return_metadata,
  ChatVLLM.batch_with_metadata) whose only consumer is the deferred
  truncation-tracking work; fold batch_with_metadata back into batch and add
  docstrings to do_inference and batch.

* 4. Thinking model support (#53)

* Add native baselines and judge controls

* Fix versioned m-Arena-Hard download test

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

* remove backward compatibility code

* revert

* Add prompt presets and localized judge prompts

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true

* skywork prompt preset

* Add clarifying comment about localized presets

* remove localized and skywork presets

* Merge pr 40 logic, add registry and task defaults

* refactor and add tests

* add mt-bench reproducibility metadata and completion alignment

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

* Remove unnecessary cases added

* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.
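
Retrying a transient initialization failure, as the first bullet above describes, can be sketched with a small generic wrapper. Everything here is an assumption for illustration: the helper name `retry_transient`, the choice of `RuntimeError` as the transient error type, and the linear backoff are hypothetical and say nothing about how the repository detects a vLLM GPU-init race.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_transient(
    factory: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 1.0,
    transient: tuple[type[BaseException], ...] = (RuntimeError,),
) -> T:
    """Call `factory` until it succeeds or `attempts` is exhausted.

    Only exceptions listed in `transient` are retried; the last failure
    is re-raised unchanged so the caller still sees the real error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return factory()
        except transient:
            if attempt == attempts:
                raise
            # Linear backoff: wait a little longer after each failure.
            time.sleep(delay_s * attempt)
    raise AssertionError("unreachable")
```

Keeping the retried exception types explicit avoids masking configuration errors, which should fail immediately rather than after several attempts.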

* Thinking model support (linearized net diff)

* Address PR #54 review on prompt presets

- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.

* Address PR #55 review on MT-Bench preset judging

- Drop the always-constant swap_policy parameter and dead guard from
  the preset _append_results closure (and the now-unused
  _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
- Record the results "date" as UTC isoformat, consistent with the
  run-metadata started_at_utc timestamp.
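
Recording a timestamp as UTC isoformat can be done with the standard library alone. The sketch below is illustrative: the helper name `utc_now_iso` and the `record` dict are hypothetical, and only the "date" key comes from the commit message.

```python
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Return the current time as a timezone-aware ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


# Hypothetical results record; only the "date" key is taken from the source.
record = {"date": utc_now_iso()}
```

Using an aware datetime keeps the "+00:00" offset in the string, so the value stays unambiguous when results are compared across machines in different time zones.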

* Drop MTBenchSwapPolicy in favor of inline comments

- The swap_policy parameter was always constant on both judging paths
  (the only caller passed the matching module constant and the guard was
  unreachable), and the enum was never serialized or used for dispatch.
- Remove it from _resolve_fastchat_item_result too (it was already removed
  from the preset path), delete the unused enum, and document the two
  swap_mode="both" combine strategies with inline comments instead:
  conservative agreement for FastChat, append-inverted-score for presets.

* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing

- cli.py: remove the re-introduced --model alias and _resolve_model_a;
  callers use --model_A directly.
- generate_and_evaluate.py: inline build_default_judge_model_kwargs at its
  single call site and delete the _build_judge_engine_kwargs wrapper.
- utils.py: add safe_parse_int(env_var) so malformed JUDGEARENA_* integers
  fall back to defaults (with a warning) instead of crashing at import; use
  it for the vLLM init constants and the judge concurrency cap (also removes
  a double int() call).
- utils.py: remove the unused finish_reason/stop_reason metadata scaffolding
  (_extract_ai_message_metadata, do_inference return_metadata,
  ChatVLLM.batch_with_metadata) whose only consumer is the deferred
  truncation-tracking work; fold batch_with_metadata back into batch and add
  docstrings to do_inference and batch.

* Finalize #53 merge: drop deferred thinking-budget observability (Option A)

Align the thinking-model branch with the current base, which removed
do_inference's metadata path (truncation/metadata tracking is deferred):
- utils.py: drop the now-unread _thinking_budget_marker/_value attributes
  (budget enforcement via sampling params is unchanged).
- test_chat_vllm.py: remove the two budget-exhaustion observability tests
  that exercised the dropped batch_with_metadata path.
- test_mt_bench_preset_judging.py: unpack the 3-tuple judge_mt_bench_with_preset
  return (drop the removed always-0 num_inconsistent element).

* 5.5. random 1k battles elo sampling (#58)

* Add native baselines and judge controls

* Fix versioned m-Arena-Hard download test

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

* remove backward compatibility code

* revert

* Add prompt presets and localized judge prompts

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.

Includes-AI-Code: true

* skywork prompt preset

* Add clarifying comment about localized presets

* remove localized and skywork presets

* Merge pr 40 logic, add registry and task defaults

* refactor and add tests

* add mt-bench reproducibility metadata and completion alignment

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

* Remove unnecessary cases added

* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf

- Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
- Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
- Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.

* Thinking model support (linearized net diff)

* add elo random 1k sampling

* Address PR #54 review on prompt presets

- Drop the always-zero num_inconsistent return from
  judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in
  _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that
  intentionally overrides the task default preset.

* Address PR #55 review on MT-Bench preset judging

- Drop the always-constant swap_policy parameter and dead guard from
  the preset _append_results closure (and the now-unused
  _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
- Record the results "date" as UTC isoformat, consistent with the
  run-metadata started_at_utc timestamp.

* Drop MTBenchSwapPolicy in favor of inline comments

- The swap_policy parameter was always constant on both judging paths
  (the only caller passed the matching module constant and the guard was
  unreachable), and the enum was never serialized or used for dispatch.
- Remove it from _resolve_fastchat_item_result too (it was already removed
  from the preset path), delete the unused enum, and document the two
  swap_mode="both" combine strategies with inline comments instead:
  conservative agreement for FastChat, append-inverted-score for presets.

* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing

- cli.py: remove the re-introduced --model alias and _resolve_model_a;
  callers use --model_A directly.
- generate_and_evaluate.py: inline build_default_judge_model_kwargs at its
  single call site and delete the _build_judge_engine_kwargs wrapper.
- utils.py: add safe_parse_int(env_var) so malformed JUDGEARENA_* integers
  fall back to defaults (with a warning) instead of crashing at import; use
  it for the vLLM init constants and the judge concurrency cap (also removes
  a double int() call).
- utils.py: remove the unused finish_reason/stop_reason metadata scaffolding
  (_extract_ai_message_metadata, do_inference return_metadata,
  ChatVLLM.batch_with_metadata) whose only consumer is the deferred
  truncation-tracking work; fold batch_with_metadata back into batch and add
  docstrings to do_inference and batch.

* Address PR #58 review: restore judge engine kwargs, dedupe JSON helper

- Restore judge_extra_kwargs.update(args.engine_kwargs / args.judge_engine_kwargs)
  in estimate_elo_ratings so the Elo judge again honors --engine_kwargs /
  --judge_engine_kwargs (e.g. tensor_parallel_size); the removal was an
  unintended out-of-scope change.
- Drop the local _jsonable in favor of _to_jsonable from judgearena.repro
  (superset: also handles datetime/Path/NaN), removing the duplication.
- Use row["question_id"] instead of row.get("question_id", "") in
  _sample_fingerprint, matching the hard column access in
  select_seeded_random_arena_battles.
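
A JSON-normalizing helper that covers datetime, Path, and NaN values might look like the sketch below. It is not the `_to_jsonable` from judgearena.repro: the name `to_jsonable_sketch` is hypothetical, and the specific conversions (isoformat strings, `str(path)`, NaN to `None`) are assumptions about reasonable behavior rather than a description of the real function.

```python
import json
import math
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


def to_jsonable_sketch(value: Any) -> Any:
    """Recursively convert a value into something json.dumps accepts."""
    if isinstance(value, dict):
        return {str(k): to_jsonable_sketch(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable_sketch(v) for v in value]
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, Path):
        return str(value)
    if isinstance(value, float) and math.isnan(value):
        # NaN is not valid strict JSON, so map it to null.
        return None
    return value


payload = to_jsonable_sketch(
    {
        "started": datetime(2026, 1, 1, tzinfo=timezone.utc),
        "output": Path("results") / "run.json",
        "score": float("nan"),
    }
)
encoded = json.dumps(payload, allow_nan=False)
```

The same review item prefers `row["question_id"]` over `row.get("question_id", "")`: a hard lookup raises `KeyError` on a missing column, whereas a silent empty-string default would let malformed rows pass into the fingerprint unnoticed.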

* Finalize #53 merge: drop deferred thinking-budget observability (Option A)

Align the thinking-model branch with the current base, which removed
do_inference's metadata path (truncation/metadata tracking is deferred):
- utils.py: drop the now-unread _thinking_budget_marker/_value attributes
  (budget enforcement via sampling params is unchanged).
- test_chat_vllm.py: remove the two budget-exhaustion observability tests
  that exercised the dropped batch_with_metadata path.
- test_mt_bench_preset_judging.py: unpack the 3-tuple judge_mt_bench_with_preset
  return (drop the removed always-0 num_inconsistent element).

---------

Co-authored-by: Erlis Lushtaku <59629249+ErlisLushtaku@users.noreply.github.com>