2.5. mt-bench changes by ErlisLushtaku · Pull Request #55 · OpenEuroLLM/JudgeArena

ErlisLushtaku · 2026-05-28T08:58:52Z

Summary

This PR pulls the MT-Bench-specific follow-up work out of the prompt preset PR and keeps it as a focused stacked cleanup.

It now does two things:

- refactors the non-delegated MT-Bench judging path to reuse shared pairwise helpers instead of duplicating batching, prompt grouping, answer swapping, and item construction
- adds MT-Bench-specific reproducibility/orchestration follow-ups that were originally mixed into PR 52, without bringing over the deferred truncation tracking or cache-key changes

Concretely, this PR:

- extracts shared MT-Bench pairwise judging helpers into judgearena/mt_bench/pairwise_judging.py
- keeps FastChat-specific verdict parsing and preset-specific score parsing separate, while reusing the same lower-level MT-Bench mechanics
- replaces loose shared item dicts with a typed MTBenchJudgeItem
- makes the two swap_mode="both" semantics explicit in code:
  - FastChat: conservative agreement
  - preset judging: append inverted swapped scores
- factors shared MT-Bench result finalization in mt_bench_utils.py
- aligns pre-generated or cached MT-Bench completions to the requested question order and raises a clear error if rows are missing
- writes MT-Bench run metadata, including input payload and prompt metadata, alongside saved artifacts
- lets run_mt_bench derive its default result folder and artifact name when callers do not pass them explicitly
- adds focused MT-Bench tests for both the FastChat and preset judging paths, plus coverage for completion alignment and metadata writing
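
The two swap_mode="both" strategies listed above can be sketched roughly as follows. This is an illustration only, not the code in pairwise_judging.py: the function names, the "A"/"B"/"tie" verdict labels, the fallback to "tie" on disagreement, and the 0-to-1 score scale (where inverting means 1 - score) are all assumptions made for the example.

```python
from typing import Literal

Verdict = Literal["A", "B", "tie"]


def combine_conservative(original: Verdict, swapped: Verdict) -> tuple[Verdict, bool]:
    """FastChat-style combine (assumed): keep a winner only if both orderings agree.

    `swapped` is the verdict from the answer-swapped run, already mapped back
    to the original A/B labels. Returns (verdict, inconsistent).
    """
    if original == swapped:
        return original, False
    # Disagreement between the two orderings is treated as a tie.
    return "tie", True


def append_inverted(scores: list[float], swapped_score: float) -> list[float]:
    """Preset-style combine (assumed): keep both judgments, flipping the swapped one.

    Scores are assumed to lie in [0, 1], so the swapped run's score is
    inverted as 1 - score before it is appended.
    """
    return scores + [1.0 - swapped_score]
```

Under these assumptions the FastChat path yields one verdict per item plus an inconsistency flag, while the preset path yields two scores per item, which would explain why only the FastChat path has a meaningful inconsistency count.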

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.
Includes-AI-Code: true

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.
Includes-AI-Code: true

…pt-presets-localized

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.
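
A minimal sketch of the fail-early alignment described here, assuming completions are held in a mapping keyed by question id. The helper name align_completions and the error wording are hypothetical; the actual implementation may work on DataFrames and report missing rows differently.

```python
def align_completions(question_ids, completions_by_id):
    """Return completions in the requested question order.

    Raises ValueError up front if any requested question has no completion,
    instead of failing later during judging.
    """
    missing = [qid for qid in question_ids if qid not in completions_by_id]
    if missing:
        preview = ", ".join(str(qid) for qid in missing[:5])
        raise ValueError(
            f"Missing completions for {len(missing)} question(s), e.g.: {preview}"
        )
    return [completions_by_id[qid] for qid in question_ids]
```

Collecting every missing id before raising keeps the error message actionable when a cached file is only partially populated.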

- Drop the always-zero num_inconsistent return from judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in _run_mt_bench_preset (the FastChat path keeps its real value).
- Document that provide_explanation is a legacy alias that intentionally overrides the task default preset.

- Drop the always-constant swap_policy parameter and dead guard from the preset _append_results closure (and the now-unused _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
- Record the results "date" as UTC isoformat, consistent with the run-metadata started_at_utc timestamp.
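
For reference, a UTC ISO-8601 timestamp of the kind mentioned for the results "date" field can be produced with the standard library alone. The helper name utc_now_iso is hypothetical, and whether the project formats the value exactly this way is an assumption.

```python
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Return the current time as a timezone-aware UTC ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()
```

Using a timezone-aware datetime means the string carries an explicit +00:00 offset, so it can be compared with other UTC timestamps without guessing the local zone.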

- The swap_policy parameter was always constant on both judging paths (the only caller passed the matching module constant and the guard was unreachable), and the enum was never serialized or used for dispatch.
- Remove it from _resolve_fastchat_item_result too (it was already removed from the preset path), delete the unused enum, and document the two swap_mode="both" combine strategies with inline comments instead: conservative agreement for FastChat, append-inverted-score for presets.

Resolve conflicts in evaluate.py and generate_and_evaluate.py by keeping both sides:
- evaluate.py: keep preset-driven PairScore(parser_mode=...) and adopt main's _parse_and_warn parse-failure logging helper.
- generate_and_evaluate.py: keep resolved_prompt.metadata() alongside main's new "swap_mode"/"result_folder" result fields.

…judging
Pulls main (via the updated 02 base) plus the PR #54 review fixes into the MT-Bench branch. Resolve judge_mt_bench_with_preset signature conflict by adopting the 3-tuple return (drops the unused always-0 int) while keeping this branch's resolve_mt_bench_turn_flags refactor; the inline assert turns_mode is redundant since resolve_mt_bench_turn_flags validates it.

Propagate main + PR #54/#55 changes up the stack. Resolve conflicts:
- evaluate.py: keep this branch's parser_mode parameter while adopting main's _parse_and_warn parse-failure logging helper.
- tests/test_utils.py: keep both added imports (pytest and the mt_bench module used by the new tests).

Integrate main + PR #54/#55/#57 changes under the thinking-model branch. Resolve conflicts:
- preset_judging.py / mt_bench_utils.py: adopt the 3-tuple judge_mt_bench_with_preset return while keeping strip_thinking_before_judging.
- generate_and_evaluate.py: keep _build_generation_engine_kwargs (battle thinking-token sub-budget) and drop _build_judge_engine_kwargs in favour of 03.5's inlined build_default_judge_model_kwargs; keep both sides' new result fields (thinking + swap_mode/result_folder).
- utils.py: adopt 03.5's refactored do_inference (no per-message metadata extraction). Keep strip_thinking_tags* and safe_parse_int; drop the now-dead _extract_ai_message_metadata and batch_with_metadata. thinking_token_budget enforcement is preserved; budget-exhaustion observability is dropped.

#54 was squash-merged into main as a single commit (8d9e947), so the original 02 prompt-preset commits in this branch's history collided with main. main's tree is byte-identical to the 02 branch tip (301cba8), which this branch already merged, so every conflict was resolved by keeping this branch's version (a strict superset: 02 changes + the MT-Bench work). Resulting tree is unchanged; the PR diff now shows only the MT-Bench changes.

test_judge_mt_bench_with_preset_parses_and_inverts_swapped_scores still unpacked 4 values and asserted num_inconsistent == 0, but the #54/#55 review dropped the always-zero num_inconsistent so judge_mt_bench_with_preset now returns a 3-tuple. Unpack 3 values and drop the stale assertion.

#55 was squash-merged into main (57548e9), so main now carries the full 02 + 02.5 changes that already live in this branch's history, plus the 3-tuple test fix from the #55 review. main's tree is identical to the final 02.5 tip, so the overlapping evaluate.py / generate_and_evaluate.py conflicts were resolved by keeping this branch's version (02.5 content + the 03.5 runtime/CLI work). test_mt_bench_preset_judging.py took main's fixed 3-tuple unpack. Net change vs the prior tip is only that one-line test fix.

…cleanup, point to judgearena hf (#57)

* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
  Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.
  Includes-AI-Code: true
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
  Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top.
  Includes-AI-Code: true
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* refactor and add tests
* add mt-bench reproducibility metadata and completion alignment
  Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.
* Remove unnecessary cases added
* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf
  - Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
  - Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
  - Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.
* Address PR #54 review on prompt presets
  - Drop the always-zero num_inconsistent return from judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in _run_mt_bench_preset (the FastChat path keeps its real value).
  - Document that provide_explanation is a legacy alias that intentionally overrides the task default preset.
* Address PR #55 review on MT-Bench preset judging
  - Drop the always-constant swap_policy parameter and dead guard from the preset _append_results closure (and the now-unused _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
  - Record the results "date" as UTC isoformat, consistent with the run-metadata started_at_utc timestamp.
* Drop MTBenchSwapPolicy in favor of inline comments
  - The swap_policy parameter was always constant on both judging paths (the only caller passed the matching module constant and the guard was unreachable), and the enum was never serialized or used for dispatch.
  - Remove it from _resolve_fastchat_item_result too (it was already removed from the preset path), delete the unused enum, and document the two swap_mode="both" combine strategies with inline comments instead: conservative agreement for FastChat, append-inverted-score for presets.
* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing
  - cli.py: remove the re-introduced --model alias and _resolve_model_a; callers use --model_A directly.
  - generate_and_evaluate.py: inline build_default_judge_model_kwargs at its single call site and delete the _build_judge_engine_kwargs wrapper.
  - utils.py: add safe_parse_int(env_var) so malformed JUDGEARENA_* integers fall back to defaults (with a warning) instead of crashing at import; use it for the vLLM init constants and the judge concurrency cap (also removes a double int() call).
  - utils.py: remove the unused finish_reason/stop_reason metadata scaffolding (_extract_ai_message_metadata, do_inference return_metadata, ChatVLLM.batch_with_metadata) whose only consumer is the deferred truncation-tracking work; fold batch_with_metadata back into batch and add docstrings to do_inference and batch.
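
The safe_parse_int behaviour described in the PR #57 review item could look roughly like the sketch below. The commit message only shows safe_parse_int(env_var); the explicit default parameter, the warning text, and the variable name JUDGEARENA_EXAMPLE_LIMIT in the usage note are assumptions for illustration.

```python
import os
import warnings


def safe_parse_int(env_var: str, default: int) -> int:
    """Read an integer from the environment, falling back to `default`.

    A missing variable returns the default silently; a malformed value
    returns the default and emits a warning instead of raising.
    """
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        warnings.warn(
            f"Ignoring malformed {env_var}={raw!r}; using default {default}.",
            stacklevel=2,
        )
        return default
```

With this shape, a module-level constant such as `LIMIT = safe_parse_int("JUDGEARENA_EXAMPLE_LIMIT", 8)` no longer crashes the import when the variable holds a non-numeric string.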

* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* refactor and add tests
* add mt-bench reproducibility metadata and completion alignment
* Remove unnecessary cases added
* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf
* Thinking model support (linearized net diff)
* Address PR #54 review on prompt presets
* Address PR #55 review on MT-Bench preset judging
* Drop MTBenchSwapPolicy in favor of inline comments
* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing
* Finalize #53 merge: drop deferred thinking-budget observability (Option A)
  Align the thinking-model branch with the current base, which removed do_inference's metadata path (truncation/metadata tracking is deferred):
  - utils.py: drop the now-unread _thinking_budget_marker/_value attributes (budget enforcement via sampling params is unchanged).
  - test_chat_vllm.py: remove the two budget-exhaustion observability tests that exercised the dropped batch_with_metadata path.
  - test_mt_bench_preset_judging.py: unpack the 3-tuple judge_mt_bench_with_preset return (drop the removed always-0 num_inconsistent element).

* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* refactor and add tests
* add mt-bench reproducibility metadata and completion alignment
* Remove unnecessary cases added
* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf
* Thinking model support (linearized net diff)
* add elo random 1k sampling
* Address PR #54 review on prompt presets
* Address PR #55 review on MT-Bench preset judging
* Drop MTBenchSwapPolicy in favor of inline comments
* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing
* Address PR #58 review: restore judge engine kwargs, dedupe JSON helper
  - Restore judge_extra_kwargs.update(args.engine_kwargs / args.judge_engine_kwargs) in estimate_elo_ratings so the Elo judge again honors --engine_kwargs / --judge_engine_kwargs (e.g. tensor_parallel_size); the removal was an unintended out-of-scope change.
  - Drop the local _jsonable in favor of _to_jsonable from judgearena.repro (superset: also handles datetime/Path/NaN), removing the duplication.
  - Use row["question_id"] instead of row.get("question_id", "") in _sample_fingerprint, matching the hard column access in select_seeded_random_arena_battles.
* Finalize #53 merge: drop deferred thinking-budget observability (Option A)
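
The PR #58 review item notes that _to_jsonable also handles datetime, Path, and NaN. A helper with that coverage might look like the sketch below. It is not the code in judgearena.repro: the name to_jsonable, the choice to map NaN to None, and the ISO-8601 rendering of datetimes are assumptions.

```python
import math
from datetime import datetime
from pathlib import Path


def to_jsonable(value):
    """Recursively convert common non-JSON types into JSON-friendly values."""
    if isinstance(value, dict):
        return {str(key): to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, Path):
        return str(value)
    if isinstance(value, float) and math.isnan(value):
        # NaN is not valid JSON; None (null) is used as a stand-in here.
        return None
    return value
```

Keeping a single helper of this kind avoids two converters drifting apart, which is the duplication the review item removes.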

* Add soft elo
* Add temperature calibration
* Update READMe for soft-elo support
* Update temperature
* Update CLI to unify elo computation
* Remove duplication
* Fix a edge case when all the labels are same
* ruff fix
* Make soft-elo default
* fix preference bug
* fix soft-elo dead flag bug
* fix: calibration error
* bug: fix the problem in the regex parser
* 2. Add prompt presets (#54)
* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* Address PR #54 review on prompt presets
* 2.5. mt-bench changes (#55)
* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* refactor and add tests
* add mt-bench reproducibility metadata and completion alignment
* Remove unnecessary cases added
* Address PR #54 review on prompt presets
* Address PR #55 review on MT-Bench preset judging
* Drop MTBenchSwapPolicy in favor of inline comments
* Fix MT-Bench preset test to unpack 3-tuple return
* 3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf (#57)
* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* refactor and add tests
* add mt-bench reproducibility metadata and completion alignment
* Remove unnecessary cases added
* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf
* Address PR #54 review on prompt presets
* Address PR #55 review on MT-Bench preset judging
* Drop MTBenchSwapPolicy in favor of inline comments
* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing
* 4. Thinking model support (#53)
* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* refactor and add tests
* add mt-bench reproducibility metadata and completion alignment
* Remove unnecessary cases added
* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf
* Thinking model support (linearized net diff)
* Address PR #54 review on prompt presets
* Address PR #55 review on MT-Bench preset judging
* Drop MTBenchSwapPolicy in favor of inline comments
* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing
* Finalize #53 merge: drop deferred thinking-budget observability (Option A)
* 5.5. random 1k battles elo sampling (#58)
* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* refactor and add tests
* add mt-bench reproducibility metadata and completion alignment
* Remove unnecessary cases added
* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf
* Thinking model support (linearized net diff)
* add elo random 1k sampling
* Address PR #54 review on prompt presets
* Address PR #55 review on MT-Bench preset judging
* Drop MTBenchSwapPolicy in favor of inline comments - The swap_policy parameter was always constant on both judging paths (the only caller passed the matching module constant and the guard was unreachable), and the enum was never serialized or used for dispatch. - Remove it from _resolve_fastchat_item_result too (it was already removed from the preset path), delete the unused enum, and document the two swap_mode="both" combine strategies with inline comments instead: conservative agreement for FastChat, append-inverted-score for presets. * Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing - cli.py: remove the re-introduced --model alias and _resolve_model_a; callers use --model_A directly. - generate_and_evaluate.py: inline build_default_judge_model_kwargs at its single call site and delete the _build_judge_engine_kwargs wrapper. - utils.py: add safe_parse_int(env_var) so malformed JUDGEARENA_* integers fall back to defaults (with a warning) instead of crashing at import; use it for the vLLM init constants and the judge concurrency cap (also removes a double int() call). - utils.py: remove the unused finish_reason/stop_reason metadata scaffolding (_extract_ai_message_metadata, do_inference return_metadata, ChatVLLM.batch_with_metadata) whose only consumer is the deferred truncation-tracking work; fold batch_with_metadata back into batch and add docstrings to do_inference and batch. * Address PR #58 review: restore judge engine kwargs, dedupe JSON helper - Restore judge_extra_kwargs.update(args.engine_kwargs / args.judge_engine_kwargs) in estimate_elo_ratings so the Elo judge again honors --engine_kwargs / --judge_engine_kwargs (e.g. tensor_parallel_size); the removal was an unintended out-of-scope change. - Drop the local _jsonable in favor of _to_jsonable from judgearena.repro (superset: also handles datetime/Path/NaN), removing the duplication. 
- Use row["question_id"] instead of row.get("question_id", "") in _sample_fingerprint, matching the hard column access in select_seeded_random_arena_battles. * Finalize #53 merge: drop deferred thinking-budget observability (Option A) Align the thinking-model branch with the current base, which removed do_inference's metadata path (truncation/metadata tracking is deferred): - utils.py: drop the now-unread _thinking_budget_marker/_value attributes (budget enforcement via sampling params is unchanged). - test_chat_vllm.py: remove the two budget-exhaustion observability tests that exercised the dropped batch_with_metadata path. - test_mt_bench_preset_judging.py: unpack the 3-tuple judge_mt_bench_with_preset return (drop the removed always-0 num_inconsistent element). --------- Co-authored-by: Erlis Lushtaku <59629249+ErlisLushtaku@users.noreply.github.com>
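The commit messages above name two swap_mode="both" combine strategies: conservative agreement for FastChat and append-inverted-score for presets. The sketch below illustrates how the two could differ. It is not the repository's code: the function names, the "A"/"B"/"tie" verdict labels, and the assumption that preset scores lie in [0, 1] and are inverted as 1 - score are all hypothetical.

```python
# Hypothetical helpers; names, labels, and score range are assumptions.

# Maps a verdict from the swapped (B, A) run back to the original (A, B) frame.
_FLIP = {"A": "B", "B": "A", "tie": "tie"}


def combine_fastchat_verdicts(original: str, swapped: str) -> str:
    """Conservative agreement: keep a winner only if both orderings agree.

    `original` is the verdict with answers shown as (A, B); `swapped` is the
    verdict from the (B, A) run, expressed in the swapped frame.
    """
    swapped_in_original_frame = _FLIP[swapped]
    if original == swapped_in_original_frame:
        return original
    # Disagreement between the two orderings is downgraded to a tie.
    return "tie"


def combine_preset_scores(original_scores, swapped_scores):
    """Append-inverted: flip swapped-run scores back and append them."""
    return list(original_scores) + [1.0 - s for s in swapped_scores]
```

Under these assumptions, the FastChat-style path yields one verdict per item and can report inconsistency, while the preset-style path yields two scores per item and has no inconsistency count. That is consistent with the note above that only the FastChat path keeps a real num_inconsistent value.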

ErlisLushtaku added 11 commits May 19, 2026 00:31

Add native baselines and judge controls

d0f4604

Fix versioned m-Arena-Hard download test

092c0e8

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split. Includes-AI-Code: true

remove backward compatibility code

28c21b8

revert

2213b68

Add prompt presets and localized judge prompts

6112785

Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top. Includes-AI-Code: true

skywork prompt preset

366c911

Merge remote-tracking branch 'origin/main' into pr32-split-v2/02-prompt-presets-localized

dfa3231

Add clarifying comment about localized presets

3280534

remove localized and skywork presets

b1b3241

Merge pr 40 logic, add registry and task defaults

ec06ad3

refactor and add tests

ddde004

ErlisLushtaku changed the base branch from main to pr32-split-v2/02-prompt-presets-only May 28, 2026 08:59

This was referenced May 28, 2026

2. Add prompt presets and mt-bench changes #48

Closed

2. Add prompt presets #54

Merged

ErlisLushtaku requested a review from kargibora May 28, 2026 09:04

ErlisLushtaku added 2 commits May 29, 2026 14:21

add mt-bench reproducibility metadata and completion alignment

76809e0

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

Remove unnecessary cases added

0d6bc76
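Commit 76809e0 above fails early when cached or baseline completions are missing question rows. A minimal sketch of that kind of alignment follows, assuming completions are plain dicts keyed by a question_id field. The function name and the row shape are illustrative and not taken from the repository.

```python
def align_completions(question_ids, completions):
    """Reorder `completions` to match `question_ids`, failing on gaps.

    Hypothetical helper: each completion is assumed to be a dict with a
    "question_id" key. Extra rows not requested are silently dropped.
    """
    by_id = {row["question_id"]: row for row in completions}
    missing = [qid for qid in question_ids if qid not in by_id]
    if missing:
        # Fail before judging starts, naming the missing rows explicitly.
        raise ValueError(
            f"Completions are missing {len(missing)} question(s): {missing}"
        )
    return [by_id[qid] for qid in question_ids]
```

Raising before any judge call is made keeps a partially cached run from silently pairing answers with the wrong questions.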

This was referenced Jun 1, 2026

3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf #56

Closed

3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf #57

Merged
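The review follow-up for PR #57 describes a safe_parse_int(env_var) helper that lets malformed JUDGEARENA_* integers fall back to defaults with a warning instead of crashing at import. The sketch below shows one way such a helper could behave. The explicit default parameter, the warning category, and the example variable name JUDGEARENA_EXAMPLE_LIMIT are assumptions, so the real signature may differ.

```python
import os
import warnings


def safe_parse_int(env_var: str, default: int) -> int:
    """Read an integer from the environment, falling back to `default`.

    Hypothetical signature: the source only shows safe_parse_int(env_var).
    """
    raw = os.environ.get(env_var)
    if raw is None or raw.strip() == "":
        # An unset or empty variable is not an error; use the default quietly.
        return default
    try:
        return int(raw)
    except ValueError:
        warnings.warn(
            f"{env_var}={raw!r} is not a valid integer; using {default}.",
            RuntimeWarning,
            stacklevel=2,
        )
        return default


# Example with an assumed variable name.
EXAMPLE_LIMIT = safe_parse_int("JUDGEARENA_EXAMPLE_LIMIT", default=8)
```

Parsing once through a helper like this also avoids wrapping the same value in int() twice, which the commit message lists as a side benefit.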

kargibora reviewed Jun 8, 2026

Comment thread judgearena/mt_bench/preset_judging.py Outdated

Comment thread judgearena/mt_bench/mt_bench_utils.py

ErlisLushtaku added 5 commits June 15, 2026 10:42

ErlisLushtaku changed the base branch from pr32-split-v2/02-prompt-presets-only to main June 16, 2026 07:43

ErlisLushtaku added 2 commits June 16, 2026 09:46

ErlisLushtaku merged commit 57548e9 into main Jun 16, 2026
1 check passed

ErlisLushtaku merged 20 commits into main from pr32-split-v2/02.5-mt-bench-preset-judging


2 participants
