Integration of Soft-ELO by kargibora · Pull Request #42 · OpenEuroLLM/JudgeArena

kargibora · 2026-05-12T13:10:13Z

Implements the Soft-Elo pipeline: feed the judge's calibrated score-difference into the Bradley–Terry fit as a soft preference $\tilde y = \sigma(\beta s)$ instead of discretising to win/loss/tie. Optionally MLE-fit $\beta$ on human-labeled arena battles before the main run.
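As a rough illustration of the difference between the soft target $\tilde y = \sigma(\beta s)$ and the discretised label, here is a minimal sketch. The function names are hypothetical and not taken from the PR, and it assumes $s$ is oriented so that positive values favour the side encoded as 1.

```python
import numpy as np


def soft_preference(score_diff, beta):
    """Map a judge score difference s to a soft target sigma(beta * s) in (0, 1)."""
    s = np.asarray(score_diff, dtype=float)
    return 1.0 / (1.0 + np.exp(-beta * s))


def hard_preference(score_diff):
    """Discretise the same difference to {0, 0.5, 1} (loss, tie, win)."""
    s = np.asarray(score_diff, dtype=float)
    return 0.5 * (np.sign(s) + 1.0)


# Assumed orientation: positive s favours the side encoded as 1.
diffs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
soft = soft_preference(diffs, beta=0.5)
hard = hard_preference(diffs)
```

The hard labels treat a margin of 0.5 and a margin of 3.0 identically, while the soft targets keep the size of the margin, scaled by $\beta$.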

What changed

- fit_bradley_terry (estimate_elo_ratings.py) replaces compute_bradley_terry. Takes a soft target pref_col ∈ [0, 1] (0 = A wins, 1 = B wins, 0.5 = tie) and uses the standard soft-CE → weighted-LR decomposition. Hard labels ({0, 0.5, 1}) reduce to the previous fit.
- Temperature calibration (evaluate.calibrate_temperature): concave MLE for $\beta^\star$ via scipy.optimize.minimize_scalar on $\sum\log\sigma(\beta(2y-1)\Delta s)$. Driven from estimate_elo_ratings.main: samples human battles, reruns the judge on them, parses raw scores with PairScore(temperature=1.0), fits $\beta^\star$, then re-parses all cached judge completions with the calibrated temperature (handles swap_mode="both" reconstruction).
- Reporting: human-only BT ratings are computed as a ground-truth reference, and the MAE vs Human-Elo is printed on overlapping models; the return dict gains human_elo, mae_vs_human, method, calibrated_temperature.
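The soft-CE → weighted-LR decomposition mentioned above can be sketched as follows. This is not the PR's fit_bradley_terry: the function name, the plain gradient-descent solver, the small ridge term, and the Elo scaling (400 / ln 10 points per unit of log-strength, centred at 1000) are all assumptions made for illustration.

```python
import numpy as np


def fit_soft_bradley_terry(model_a, model_b, pref, n_iter=2000, lr=0.5, l2=1e-6):
    """Fit BT log-strengths from soft targets pref in [0, 1] (0 = A wins, 1 = B wins)."""
    models = sorted(set(model_a) | set(model_b))
    index = {m: i for i, m in enumerate(models)}
    n = len(pref)

    # Design matrix: -1 for model A, +1 for model B, so sigma(X @ theta) = P(B wins).
    X = np.zeros((n, len(models)))
    for row, (a, b) in enumerate(zip(model_a, model_b)):
        X[row, index[a]] = -1.0
        X[row, index[b]] = 1.0
    p = np.asarray(pref, dtype=float)

    # Soft cross-entropy as weighted logistic regression: each battle becomes
    # two hard rows, label 1 with weight p and label 0 with weight 1 - p.
    X2 = np.vstack([X, X])
    y2 = np.concatenate([np.ones(n), np.zeros(n)])
    w2 = np.concatenate([p, 1.0 - p])

    theta = np.zeros(len(models))
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X2 @ theta)))
        grad = X2.T @ (w2 * (prob - y2)) / n + l2 * theta
        theta -= lr * grad

    theta -= theta.mean()  # BT strengths are only identified up to a shift
    scale = 400.0 / np.log(10.0)  # assumed Elo convention
    return {m: 1000.0 + scale * theta[index[m]] for m in models}


ratings = fit_soft_bradley_terry(
    ["m1", "m2", "m1"], ["m2", "m3", "m3"], [0.2, 0.3, 0.1]
)
```

With hard labels the weights are 0 or 1 (or 0.5 each for a tie), so the duplicated rows collapse to an ordinary win/loss/tie fit, which is consistent with the note that hard labels reduce to the previous behaviour.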

New flags

| Flag | Default | Effect |
|---|---|---|
| `--soft-elo` | off | Use soft BT targets instead of hard {0, 0.5, 1} labels. |
| `--soft-elo-temperature` | `0.3` | Initial $\beta$; overridden if calibration runs. Empirical range across judges in the paper: `[0.36, 0.60]`. |
| `--calibrate-temperature` | off | MLE-fit $\beta^\star$ on human-labeled arena battles before the run. Requires `--soft-elo`; warns and skips otherwise. |
| `--calibration-size` | all human battles | Number of human battles to sample for calibration. Needs `--calibrate-temperature`. |
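For readers unfamiliar with the calibration objective, the following is a minimal sketch of a one-dimensional MLE for $\beta^\star$ on $\sum\log\sigma(\beta(2y-1)\Delta s)$. It is not the PR's evaluate.calibrate_temperature: the function name, the search bounds, and the orientation of $\Delta s$ (positive favours the side with $y = 1$) are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def calibrate_beta(score_diff, human_label, bounds=(1e-3, 10.0)):
    """Maximise sum(log sigma(beta * (2y - 1) * ds)) over beta within bounds."""
    ds = np.asarray(score_diff, dtype=float)
    y = np.asarray(human_label, dtype=float)
    sign = 2.0 * y - 1.0  # +1 / -1 for decisive labels, 0 for ties (constant term)

    def neg_log_lik(beta):
        # -log sigma(z) = log(1 + exp(-z)), evaluated stably via logaddexp.
        return np.logaddexp(0.0, -beta * sign * ds).sum()

    result = minimize_scalar(neg_log_lik, bounds=bounds, method="bounded")
    return float(result.x)


# Synthetic check: labels drawn from sigma(0.5 * ds) should recover beta near 0.5.
rng = np.random.default_rng(0)
ds = rng.normal(0.0, 3.0, size=5000)
y = (rng.random(5000) < 1.0 / (1.0 + np.exp(-0.5 * ds))).astype(float)
beta_hat = calibrate_beta(ds, y)
```

The negative log-likelihood is convex in $\beta$, so a bounded scalar search is sufficient. If every human label agrees with the sign of $\Delta s$, the likelihood has no finite maximiser, which is why this sketch bounds the search interval.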

How to run

Hard-Elo (unchanged behavior):

judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200

Soft-Elo with calibration (recommended):

judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200 \
  --soft-elo --calibrate-temperature --calibration-size 300

How to test

uv run pytest tests/test_cli.py tests/test_estimate_elo_ratings.py

test_cli.py covers the new flags routing through the unified entrypoint;
test_estimate_elo_ratings.py covers fit_bradley_terry and the main pipeline.

kargibora · 2026-05-27T11:55:48Z

As of the latest commit, d53cf64, --soft-elo is enabled by default (pass --no-soft-elo to use the hard-label fit). However, I did not make --calibrate-temperature a default, because I think it requires some extra computation.
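One way to get a paired --soft-elo / --no-soft-elo switch with a default of on is argparse.BooleanOptionalAction (Python 3.9+). The sketch below is an assumption about how such a CLI could be wired, not the actual judgearena parser; build_parser and resolve_flags are hypothetical names.

```python
import argparse
import logging


def build_parser():
    """Hypothetical parser mirroring the flags described in this PR."""
    parser = argparse.ArgumentParser(prog="judgearena-sketch")
    # BooleanOptionalAction generates both --soft-elo and --no-soft-elo.
    parser.add_argument(
        "--soft-elo", action=argparse.BooleanOptionalAction, default=True
    )
    parser.add_argument("--soft-elo-temperature", type=float, default=0.3)
    parser.add_argument("--calibrate-temperature", action="store_true")
    parser.add_argument("--calibration-size", type=int, default=None)
    return parser


def resolve_flags(args):
    """Warn and skip calibration when soft targets are disabled."""
    if args.calibrate_temperature and not args.soft_elo:
        logging.warning("--calibrate-temperature requires --soft-elo; skipping.")
        args.calibrate_temperature = False
    return args


args = resolve_flags(build_parser().parse_args(["--no-soft-elo"]))
```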

ErlisLushtaku · 2026-05-27T20:38:05Z

@kargibora is this ready for review?

kargibora · 2026-05-28T07:28:22Z

> @kargibora is this ready for review?

Yes it is!

ErlisLushtaku · 2026-06-02T13:46:38Z

+                completions_A=cal_completions_a,
+                completions_B=cal_completions_b,
+                swap_mode=args.swap_mode,
+                truncate_input_chars=args.truncate_all_input_chars,


Calibration should mirror the main judge run so that T* matches the score distribution it is applied to. I think we should use truncate_input_chars=args.truncate_judge_input_chars and pass provide_explanation=args.provide_explanation; otherwise the calibrated temperature is fit on a different prompt/truncation regime than the evaluation.

ErlisLushtaku · 2026-06-02T13:50:04Z

+            _cal_n = (
+                min(args.calibration_size, len(df_arena))
+                if args.calibration_size is not None
+                else len(df_arena)


Defaulting calibration to all arena battles can mean tens of thousands of uncached judge calls, which is a large API cost. We could use a default cap (e.g. a few hundred) and wrap this judge pass in cache_function_dataframe so reruns don't pay again.

This is something we discussed with David. We agreed that the defaults should be the parameters that work best in our experience, but I lean towards making this optional, as you suggested. @geoalgo
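The cap-and-cache suggestion in this thread could look roughly like the sketch below. It does not use the repository's cache_function_dataframe, whose signature is not shown here; cached_judge_pass is a hypothetical stand-in built on a JSON file, and the cap of 300 is only an example value.

```python
import hashlib
import json
import tempfile
from pathlib import Path

DEFAULT_CALIBRATION_CAP = 300  # hypothetical default, "a few hundred"


def calibration_sample_size(requested, n_available, cap=DEFAULT_CALIBRATION_CAP):
    """Use the requested size if given, else the cap; never exceed the data."""
    limit = cap if requested is None else requested
    return min(limit, n_available)


def cached_judge_pass(battle_ids, judge_fn, cache_dir):
    """Run judge_fn once per battle and reuse the stored result on reruns."""
    key = hashlib.sha256(json.dumps(sorted(battle_ids)).encode()).hexdigest()[:16]
    path = Path(cache_dir) / f"calibration_{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    results = {str(b): judge_fn(b) for b in battle_ids}
    path.write_text(json.dumps(results))
    return results


# Usage: the second call hits the cache, so the (stub) judge runs only once per battle.
calls = []
with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_judge_pass([3, 1, 2], lambda b: calls.append(b) or b * 2, cache_dir)
    second = cached_judge_pass([1, 2, 3], lambda b: calls.append(b) or b * 2, cache_dir)
```

Keying the cache on the sorted battle IDs means a rerun with the same sample in a different order still reuses the stored judge outputs.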

kargibora · 2026-06-08T09:13:53Z

I have pushed the changes, thank you @ErlisLushtaku! This will be ready to merge once we decide whether soft-elo should be the default. I think that is worth discussing again.

* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
  Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split. Includes-AI-Code: true
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
  Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top. Includes-AI-Code: true
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* refactor and add tests
* add mt-bench reproducibility metadata and completion alignment
  Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.
* Remove unnecessary cases added
* Address PR #54 review on prompt presets
  - Drop the always-zero num_inconsistent return from judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in _run_mt_bench_preset (the FastChat path keeps its real value).
  - Document that provide_explanation is a legacy alias that intentionally overrides the task default preset.
* Address PR #55 review on MT-Bench preset judging
  - Drop the always-constant swap_policy parameter and dead guard from the preset _append_results closure (and the now-unused _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
  - Record the results "date" as UTC isoformat, consistent with the run-metadata started_at_utc timestamp.
* Drop MTBenchSwapPolicy in favor of inline comments
  - The swap_policy parameter was always constant on both judging paths (the only caller passed the matching module constant and the guard was unreachable), and the enum was never serialized or used for dispatch.
  - Remove it from _resolve_fastchat_item_result too (it was already removed from the preset path), delete the unused enum, and document the two swap_mode="both" combine strategies with inline comments instead: conservative agreement for FastChat, append-inverted-score for presets.
* Fix MT-Bench preset test to unpack 3-tuple return
  test_judge_mt_bench_with_preset_parses_and_inverts_swapped_scores still unpacked 4 values and asserted num_inconsistent == 0, but the #54/#55 review dropped the always-zero num_inconsistent so judge_mt_bench_with_preset now returns a 3-tuple. Unpack 3 values and drop the stale assertion.

* Add native baselines and judge controls
* Fix versioned m-Arena-Hard download test
  Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split. Includes-AI-Code: true
* remove backward compatibility code
* revert
* Add prompt presets and localized judge prompts
  Move prompt preset resolution, MT-Bench preset judging, and localized m-Arena-Hard prompt assets into their own stack layer so inference metadata can build on top. Includes-AI-Code: true
* skywork prompt preset
* Add clarifying comment about localized presets
* remove localized and skywork presets
* Merge pr 40 logic, add registry and task defaults
* refactor and add tests
* add mt-bench reproducibility metadata and completion alignment
  Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.
* Remove unnecessary cases added
* Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf
  - Make model setup and inference more robust: retry transient vLLM GPU-init races, strip vLLM-only engine kwargs for remote API providers, and propagate finish_reason/stop_reason through both the async and batch inference paths.
  - Centralize judge engine kwargs (incl. FP8 KV default) in build_default_judge_model_kwargs, accept an explicit parser_mode in judge_and_parse_prefs, restore the deprecated --model alias for --model_A, and tidy generation orchestration.
  - Point dataset downloads at the judge-arena/judge-arena-dataset repo and drive download_all from M_ARENA_HARD_BASELINES.
* Thinking model support (linearized net diff)
* add elo random 1k sampling
* Address PR #54 review on prompt presets
  - Drop the always-zero num_inconsistent return from judge_mt_bench_with_preset; return a 3-tuple and unpack 3 in _run_mt_bench_preset (the FastChat path keeps its real value).
  - Document that provide_explanation is a legacy alias that intentionally overrides the task default preset.
* Address PR #55 review on MT-Bench preset judging
  - Drop the always-constant swap_policy parameter and dead guard from the preset _append_results closure (and the now-unused _PRESET_SWAP_POLICY constant / MTBenchSwapPolicy import).
  - Record the results "date" as UTC isoformat, consistent with the run-metadata started_at_utc timestamp.
* Drop MTBenchSwapPolicy in favor of inline comments
  - The swap_policy parameter was always constant on both judging paths (the only caller passed the matching module constant and the guard was unreachable), and the enum was never serialized or used for dispatch.
  - Remove it from _resolve_fastchat_item_result too (it was already removed from the preset path), delete the unused enum, and document the two swap_mode="both" combine strategies with inline comments instead: conservative agreement for FastChat, append-inverted-score for presets.
* Address PR #57 review: simplify CLI/judge kwargs, safe env parsing, drop unused metadata plumbing
  - cli.py: remove the re-introduced --model alias and _resolve_model_a; callers use --model_A directly.
  - generate_and_evaluate.py: inline build_default_judge_model_kwargs at its single call site and delete the _build_judge_engine_kwargs wrapper.
  - utils.py: add safe_parse_int(env_var) so malformed JUDGEARENA_* integers fall back to defaults (with a warning) instead of crashing at import; use it for the vLLM init constants and the judge concurrency cap (also removes a double int() call).
  - utils.py: remove the unused finish_reason/stop_reason metadata scaffolding (_extract_ai_message_metadata, do_inference return_metadata, ChatVLLM.batch_with_metadata) whose only consumer is the deferred truncation-tracking work; fold batch_with_metadata back into batch and add docstrings to do_inference and batch.
* Address PR #58 review: restore judge engine kwargs, dedupe JSON helper
  - Restore judge_extra_kwargs.update(args.engine_kwargs / args.judge_engine_kwargs) in estimate_elo_ratings so the Elo judge again honors --engine_kwargs / --judge_engine_kwargs (e.g. tensor_parallel_size); the removal was an unintended out-of-scope change.
  - Drop the local _jsonable in favor of _to_jsonable from judgearena.repro (superset: also handles datetime/Path/NaN), removing the duplication.
  - Use row["question_id"] instead of row.get("question_id", "") in _sample_fingerprint, matching the hard column access in select_seeded_random_arena_battles.
* Finalize #53 merge: drop deferred thinking-budget observability (Option A)
  Align the thinking-model branch with the current base, which removed do_inference's metadata path (truncation/metadata tracking is deferred):
  - utils.py: drop the now-unread _thinking_budget_marker/_value attributes (budget enforcement via sampling params is unchanged).
  - test_chat_vllm.py: remove the two budget-exhaustion observability tests that exercised the dropped batch_with_metadata path.
  - test_mt_bench_preset_judging.py: unpack the 3-tuple judge_mt_bench_with_preset return (drop the removed always-0 num_inconsistent element).

kargibora added 11 commits April 14, 2026 15:33

Add soft elo

af4bced

Add temperature calibration

898b1e4

Update READMe for soft-elo support

e4498b6

Update temperature

6b401e8

Merge branch 'main' into feat/soft-elo

6f960af

Update CLI to unify elo computation

995db21

Remove duplication

b357116

Fix a edge case when all the labels are same

be53e8c

ruff fix

61f1f84

Merge branch 'main' into feat/soft-elo

22aa56e

Make soft-elo default

d53cf64

ErlisLushtaku reviewed Jun 2, 2026

Comment thread judgearena/estimate_elo_ratings.py Outdated

ErlisLushtaku reviewed Jun 2, 2026

Comment thread judgearena/estimate_elo_ratings.py Outdated

ErlisLushtaku reviewed Jun 2, 2026

Comment thread judgearena/estimate_elo_ratings.py Outdated

kargibora added 5 commits June 8, 2026 10:39

fix preference bug

734c08b

fix soft-elo dead flag bug

7411407

fix: calibration error

8b72171

bug: fix the problem in the regex parser

9855baf

Merge branch 'main' into feat/soft-elo

8247957

ErlisLushtaku added 2 commits June 16, 2026 09:42

ErlisLushtaku approved these changes Jun 16, 2026

ErlisLushtaku added 2 commits June 16, 2026 10:01

ErlisLushtaku and others added 2 commits June 16, 2026 10:15

Merge branch 'feat/soft-elo' into merge/soft-elo-to-main

5654b6c

kargibora merged commit 3c74288 into main Jun 16, 2026
1 check passed

kargibora merged 22 commits into main from feat/soft-elo
