Skip to content

perf(na): incremental exact_verify rescore — default-on (~25-30% native-NA mission wall)#254

Merged
ms609 merged 5 commits into
cpp-searchfrom
claude/na-incremental-rescore
Jun 23, 2026
Merged

perf(na): incremental exact_verify rescore — default-on (~25-30% native-NA mission wall)#254
ms609 merged 5 commits into
cpp-searchfrom
claude/na-incremental-rescore

Conversation

@ms609

@ms609 ms609 commented Jun 23, 2026

Copy link
Copy Markdown
Owner

What

Makes the incremental dirty rescore the production default for exact_verify_sweep candidates on inapplicable (NA) data.

exact_verify_sweep is the native-NA unrooted-TBR completeness pass — NA's indirect scan is only approximate, so it certifies a true unrooted optimum by an exact full-neighbourhood sweep, which TS_NA_CLIPPROF measurement showed is ~95% of native-NA TBR wall and ~96% of native-NA MaximizeParsimony mission wall (even at maxReplicates=5). It scored every TBR neighbour by apply_tbr_move + a full three-pass full_rescore.

This PR scores each candidate incrementally instead:

  • 3-seed dirty passesfitch_na_dirty_{down,up}pass gain an optional third seed (default -1 = no-op, so the SPR accept path is unchanged). exact_verify seeds {nz, nx, clip_node}; the clip_node seed covers the reversed-subtree path, so one path handles both SPR and reroot candidates.
  • fused Pass 3fitch_na_pass3_score optionally buckets per-pattern steps during its existing walk, so the IW path skips a redundant extract_char_steps full walk.
  • On reject, restore_prealloc_undo/restore_saved_states undoes the dirty state before restore_topology, keeping the base state valid for the next candidate.

Kill-switch TS_NA_NOINCR restores the legacy full_rescore path.

Result (Hamilton, native NA, MaximizeParsimony default recipe)

1.30× mean speedup (24.7% pooled wall reduction, p = 2×10⁻⁶), all cells byte-identical. Dikow2009 1.48×, Zanol2014 1.34×, Zhu2013 1.21×, Giles2015 1.17×.

Correctness

  • TS_NA_INCR_AUDIT: every candidate's incremental score byte-matched full_rescore (5k–14k candidates per run, EW + IW).
  • Default vs TS_NA_NOINCR (legacy) byte-identical: all 30 bundled NA datasets, 6/6 each (EW + IW × 3 starts = 180 climbs, 23–88 taxa), zero diffs; plus 40/40 mission cells.
  • Default-off paths byte-identical (3-seed param defaults to no-op; fitch_na_pass3_score/dirty-pass callers unchanged).
  • New regression tests in test-ts-na-incremental.R (default vs kill-switch + audit).

Notes

  • getenv read is per-exact_verify call (per-convergence), not per-candidate.
  • Remaining algorithmic headroom (dirty Pass 3) is capped by NA's ~0.4–0.48 dirty-region fraction and needs intricate incremental char-step maintenance — diminishing returns, not pursued.

🤖 Generated with Claude Code

ms609 and others added 5 commits June 23, 2026 08:46
…ult-off)

exact_verify_sweep is ~95% of native-NA TBR/mission wall: it scores every TBR
neighbour of the converged tree by apply_tbr_move + full three-pass full_rescore
(NA needs the exact sweep because its indirect scan is approximate). This makes
each candidate an INCREMENTAL rescore instead.

- 3-seed dirty passes: fitch_na_dirty_{down,up}pass gain an optional third seed
  (default -1 = no-op, so the SPR accept path is unchanged). exact_verify seeds
  {nz (sibling reconnect), nx (regraft node), clip_node (covers the reversed
  subtree path + its rootward chain)} -- covering BOTH SPR and reroot candidates.
- Per candidate: dirty Pass1/Pass2 (O(dirty) not O(n)) + full Pass3 + per-pattern
  extract + compute_weighted (IW/XPIWE/PROFILE) or +ew_offset (EW); on reject,
  restore_prealloc_undo/restore_saved_states undoes the dirty state before
  restore_topology, keeping the base state valid for the next candidate.

Behind TS_NA_INCR (fast path) and TS_NA_INCR_AUDIT (cross-check vs full_rescore);
both DEFAULT OFF, so production is byte-identical (519 testthat pass; 3-seed
default -1).

VALIDATION:
- TS_NA_INCR_AUDIT: every candidate's incremental score byte-matched full_rescore
  -- 5486..13674 candidates each on Vinther/Longrich/DeAssis x {EW, IW}.
- End-to-end byte-identical search outcome (TS_NA_INCR on vs off): 18/18
  (3 datasets x 2 regimes x 3 starts).
- Element wall (exact_verify-heavy direct climb): Zanol 1.12x, Dikow 1.21x,
  scores identical. (Below the ~2.5x ceiling: NA final_ uppass propagation makes
  the dirty region large; Pass3/extract stay full -- headroom remains.)

Mission A/B pending (should translate, unlike the washed extract-fusion: this
speeds exact_verify directly rather than relocating work). NOT default-on /
not for cpp-search until the mission A/B + broader byte-identity confirm.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…emental path)

The incremental exact_verify path ran fitch_na_pass3_score AND a separate
extract_char_steps over the full tree -- both recompute the identical per-node
step bits (standard from local_cost, NA from the Pass3 needs_step formula).
Add an optional char_steps_out to fitch_na_pass3_score that buckets per-pattern
during its existing walk; the IW incremental path uses it and skips extract.

Unlike the dirty Pass1/2 (capped by NA's ~0.4-0.48 dirty fraction), this is a
redundant-walk ELIMINATION, not dirty-fraction-limited. Default-nullptr param
=> all other callers byte-identical.

Element wall (exact_verify-heavy climb), incremental + this fold vs legacy:
Zanol 1.23x (was 1.12x), Dikow 1.32x (was 1.21x); scores byte-identical.
Audit byte-matches full_rescore on every candidate; fast-vs-legacy 18/18
byte-identical; 519 tests pass (production unchanged).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Flip the na_incr gate to default-ON (kill-switch TS_NA_NOINCR restores the
legacy full_rescore path). The incremental dirty rescore is byte-identical to
legacy (per-candidate audit + 180/180 full-roster climbs + 40/40 mission cells)
and cuts native-NA mission wall ~25-30% (1.30x mean, p=2e-06). Disabled in
TS_NA_INCR_AUDIT mode (which uses full_rescore for decisions). getenv read is
per-convergence (exact_verify), not per-candidate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…INCR + audit)

Two enduring tests: (1) the default incremental path is byte-identical to the
legacy full_rescore (TS_NA_NOINCR) across Vinther/DeAssis x {EW, IW} x starts;
(2) TS_NA_INCR_AUDIT runs clean (per-candidate incr == full_rescore, else stop).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
@ms609 ms609 merged commit 47c0494 into cpp-search Jun 23, 2026
2 of 7 checks passed
@ms609 ms609 deleted the claude/na-incremental-rescore branch June 23, 2026 09:27
ms609 added a commit that referenced this pull request Jun 23, 2026
…pt-in) (#256)

* perf(iw): x4 reroot batch + extract_char_steps dirty-region for IW TBR

Cherry-pick of 69febb4 onto cpp-search (post-#254). Resolved the
exact_verify candidate-scoring conflict by keeping #254's incremental
rescore path and dropping the directional-audit (na_dir_audit) block
that #254 deliberately removed. Stripped the settled-dead
ts_iw_gather_bench microbench (gather micro-opt proven a dead heat) and
trimmed superseded dev A/B harnesses, keeping only bench_iw_realized.R.

Opts remain kill-switched (TS_IW_NOX4 / TS_IW_NODIRTY); a follow-up
commit flips them default-off for the shared branch. Mission-null at
morphology scale; kept for the large-N / recipe-retune reopen.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(tbr): fire IW dirty/x4 opts on XPIWE; fix nx_cs active_mask underflow

The extract_char_steps dirty-region + x4 reroot-batch opts were gated to
ScoringMode::IW, but production MaximizeParsimony defaults to extended IW
(ScoringMode::XPIWE, ts_data.cpp:297). So the opts were dead on the production
path for every dataset, and the mission-wall A/B (which toggles TS_IW_NOX4/
NODIRTY through MaximizeParsimony) was toggling code that never executed -> the
1.005x "null result". Widen the gate to iw_family = (IW || XPIWE): both modes
score via compute_iw/precompute_iw_delta from the same weighting-agnostic
char_steps, differing only in per-pattern eff_k/phi (which those functions
already consume). PROFILE stays excluded (compute_profile + info_amounts +
precomputed_steps is a genuinely different char_steps convention).

Perf-only / byte-identical: opts-on vs opts-off scores identical across
Dikow/Vinther/Zanol/Giles/Wortley x 3 seeds; SCANCHK clean on XPIWE.

Firing the opts on the production path exposed a pre-existing dirty-region bug:
the nx_cs accumulation did not skip active_mask==0 blocks, but extract_char_steps
(which builds the F cache) does. Under a ratchet ZERO_ONLY perturbation (the
default, ratchetCycles>=3) that fully deactivates a block, divided_steps =
F + cs_delta - nx_cs went negative (TS_IW_DIRTYCHK: dirty=-1 ref=0 on
Dikow2009/Vinther2008). Weighting-independent (plain IW mismatches identically;
reachable today via MaximizeParsimony(extended_iw=FALSE)). The candidate scan
masks needs_step by active_mask so scores were unaffected, but the invariant was
broken. Fix: skip active_mask==0 in the nx_cs loop only; EW nx_cost is left
counting all blocks (its best_score/delta length convention is self-consistent).
DIRTYCHK now clean across 5 datasets x ratchetCycles {0,3,6}.

Add a testthat guard pinning XPIWE opts-on == opts-off (tolerance=0) under
ratchet (the discriminating config; prior tests ran plain-IW or ratchetCycles=0
and were tautological for this path). 398 existing tests still pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(tbr): NA-IW 4-wide reroot batch (indirect_na_iw_cached_flat_x4) [perf TBD]

Port the T-245 4-wide reroot ILP batch to the implied-weights + inapplicable
(IW/XPIWE + NA) TBR scan -- the one scoring path that never received it. The NA
reroot scan already had an EW-NA x4 (fitch_na_indirect_cached_flat_x4) but
IW-NA/XPIWE-NA fell to the one-at-a-time scalar indirect_na_iw_length_cached;
the IW x4 branch was !has_na-gated.

New kernel indirect_na_iw_cached_flat_x4 (src/ts_fitch.cpp) fuses the NA
active-mask candidate logic of the EW-NA x4 (from1 reduce, shared
clip_has_active, per-candidate below_actives AND) with the per-candidate
iw_delta ctz-gather of indirect_iw_cached_flat_x4. Each accumulator keeps the
scalar add order of indirect_na_iw_length_cached, so per-candidate results are
bit-identical; the shared all-4-exceed-cutoff bail only changes early-exit on
cutoff-losing candidates. Wired by widening the IW x4 branch gate
(ts_tbr.cpp:1985) to dispatch on has_na, mirroring its own per-candidate skip
re-check and main_edges[ei] indexing; the no-NA path stays byte-identical.

Gated on iw_family (IW||XPIWE), so it fires on the production XPIWE+NA path
(MaximizeParsimony default) -- the first banked IW kernel opt that lands on the
NATIVE inapplicable-bearing corpus rather than only recoded matrices.

The dirty-region scan shortcut is deliberately NOT ported (stays !has_na): the
top-down down2 pass breaks the path-bounded F+cs_delta-nx_cs decomposition, and
it is the smaller (once-per-clip) prize.

CORRECTNESS validated, PERF still unknown (Hamilton A/B pending):
- byte-identical x4-on vs TS_IW_NOX4-off: 24/24 (15 direct IW-NA ts_tbr_search +
  9 MaximizeParsimony XPIWE-NA under ratchet; datasets at 10-38% NA)
- existing independent full-rescore oracle passes; 519 testthat pass
- kernel confirmed to execute (throw-probe), so the byte-identity is not vacuous
- new regression guard in test-ts-tbr-dirty-rescore.R keeps inapplicables

Kill-switch TS_IW_NOX4 (shared with the no-NA x4). NOT for cpp-search merge
until the A/B reports and the getenv-consolidation merge-gate is addressed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(iw): land IW/NA x4 + dirty-region opts default-OFF (opt-in)

Flip the kill-switch semantics from default-on (TS_IW_NOX4 /
TS_IW_NODIRTY) to opt-in (TS_IW_X4 / TS_IW_DIRTY). These opts are
mission-null at morphology scale (validated byte-identical, 1.005x wall)
and are preserved on the shared branch only for the large-N /
recipe-retune reopen, so they should impose nothing by default.

Test guards updated in lockstep: the "on" arm now explicitly enables the
opt so the kernel still fires (data triggers the iw_family + has_na gate
as before); byte-identity vs the scalar baseline is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* dev(benchmarks): NA-IW x4 A/B harnesses + large-N matrices for reopen

Preserve the dilution-free element-level and mission-wall A/B harnesses
for the NA-IW x4 reroot batch (updated to the TS_IW_X4 opt-in env var),
plus the two large real inapplicable-bearing matrices (Sun2018, lobo)
the large-N reopen will need to re-test whether the scan share rises.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(tbr): drop directional-audit orphan bundled in cherry-picked ts_tbr.cpp

The worktree's 69febb4 bundled the exact directional (Regime-C) NA
scoring audit/scorer into ts_tbr.cpp alongside the x4/dirty opts; the
header it needs (ts_fitch_na_directional.h, added separately by b2e03a9)
is not on cpp-search and was deliberately excluded (the directional path
is dead -- 24-89x slower than SIMD full_rescore). Strip the include,
na_dir_audit / na_dir_scorer setup, whole-tree cross-check,
build_clip_folds, and the per-candidate directional fast-path so
exact_verify_sweep matches cpp-search's (#254 incremental) version. Kept
<chrono> (needed by the iw_timing diagnostic). Builds clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants