perf(na): incremental exact_verify rescore — default-on (~25-30% native-NA mission wall) by ms609 · Pull Request #254 · ms609/TreeSearch

ms609 · 2026-06-23T07:48:52Z

What

Makes the incremental dirty rescore the production default for exact_verify_sweep candidates on inapplicable (NA) data.

exact_verify_sweep is the native-NA unrooted-TBR completeness pass. Because NA's indirect scan is only approximate, the sweep certifies a true unrooted optimum by scoring the full TBR neighbourhood exactly. TS_NA_CLIPPROF measurements showed that this sweep accounts for ~95% of native-NA TBR wall and ~96% of native-NA MaximizeParsimony mission wall (even at maxReplicates=5). Until now it scored every TBR neighbour with apply_tbr_move followed by a full three-pass full_rescore.

This PR scores each candidate incrementally instead:

3-seed dirty passes — fitch_na_dirty_{down,up}pass gain an optional third seed (default -1 = no-op, so the SPR accept path is unchanged). exact_verify seeds {nz, nx, clip_node}; the clip_node seed covers the reversed-subtree path, so one path handles both SPR and reroot candidates.
fused Pass 3 — fitch_na_pass3_score optionally buckets per-pattern steps during its existing walk, so the IW path skips a redundant extract_char_steps full walk.
On reject, restore_prealloc_undo/restore_saved_states undoes the dirty state before restore_topology, keeping the base state valid for the next candidate.
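
The three points above can be illustrated with a deliberately small sketch. It assumes a plain single-character Fitch pass on a toy tree, not the three-pass NA algorithm, and every name in it (ToyTree, Saved, dirty_pass, restore_saved, make_toy) is hypothetical rather than taken from the TreeSearch sources. It shows only the shape of the idea: recompute the rootward chains from up to three seeds, treat a seed of -1 as a no-op, and log overwritten values so that a rejected candidate can be undone.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy tree, NOT the TreeSearch data structure.
// Node 0 is the root; children always have larger indices than parents.
struct ToyTree {
    std::vector<int> parent, left, right;  // -1 where absent
    std::vector<std::uint32_t> state;      // Fitch state set (bitmask)
    std::vector<int> cost;                 // steps charged at each node
};

struct Saved { int node; std::uint32_t state; int cost; };

inline bool is_leaf(const ToyTree& t, int n) { return t.left[n] < 0; }

// Plain Fitch rule for one internal node.
inline void recompute(ToyTree& t, int n) {
    const std::uint32_t a = t.state[t.left[n]], b = t.state[t.right[n]];
    const std::uint32_t both = a & b;
    t.state[n] = both ? both : (a | b);
    t.cost[n] = both ? 0 : 1;
}

inline int total_cost(const ToyTree& t) {
    int total = 0;
    for (int c : t.cost) total += c;
    return total;
}

// Legacy-style full rescore: recompute every internal node, tips upward.
inline int full_rescore(ToyTree& t) {
    for (int n = static_cast<int>(t.parent.size()) - 1; n >= 0; --n) {
        if (!is_leaf(t, n)) recompute(t, n);
    }
    return total_cost(t);
}

// Dirty pass: only the rootward chains from the seeds are recomputed.
// seed_c defaults to -1, which is skipped, so two-seed callers are unchanged.
inline void dirty_pass(ToyTree& t, std::vector<Saved>& undo,
                       int seed_a, int seed_b, int seed_c = -1) {
    const int seeds[3] = {seed_a, seed_b, seed_c};
    for (int seed : seeds) {
        for (int n = seed; n >= 0; n = t.parent[n]) {
            if (is_leaf(t, n)) continue;
            undo.push_back({n, t.state[n], t.cost[n]});  // save before overwrite
            recompute(t, n);
        }
    }
}

// On reject: replay the log backwards so the base state is valid again.
inline void restore_saved(ToyTree& t, std::vector<Saved>& undo) {
    for (auto it = undo.rbegin(); it != undo.rend(); ++it) {
        t.state[it->node] = it->state;
        t.cost[it->node] = it->cost;
    }
    undo.clear();
}

// Four tips ((A,A),(B,B)) under root 0.
inline ToyTree make_toy() {
    ToyTree t;
    t.parent = {-1, 0, 0, 1, 1, 2, 2};
    t.left   = {1, 3, 5, -1, -1, -1, -1};
    t.right  = {2, 4, 6, -1, -1, -1, -1};
    t.state  = {0, 0, 0, 1, 1, 2, 2};
    t.cost.assign(7, 0);
    return t;
}
```

In this toy a "candidate" is just a changed tip state; the caller runs dirty_pass from the changed node, reads total_cost, and on reject calls restore_saved before undoing its own change.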

Kill-switch TS_NA_NOINCR restores the legacy full_rescore path.

Result (Hamilton, native NA, MaximizeParsimony default recipe)

1.30× mean speedup (24.7% pooled wall reduction, p = 2×10⁻⁶), all cells byte-identical. Dikow2009 1.48×, Zanol2014 1.34×, Zhu2013 1.21×, Giles2015 1.17×.
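
As an arithmetic cross-check of the figures above, assuming (the PR does not say so explicitly) that the 1.30× mean is the unweighted mean of the four per-dataset speedups listed, and that the pooled reduction is measured on total wall time:

```latex
\[
\frac{1.48 + 1.34 + 1.21 + 1.17}{4} = \frac{5.20}{4} = 1.30,
\qquad
\frac{1}{1 - 0.247} = \frac{1}{0.753} \approx 1.33 .
\]
```

Under those assumptions the mean-of-ratios and pooled figures describe slightly different aggregates, which is why a 1.30× mean and a 24.7% pooled reduction (about 1.33× pooled) can both be reported.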

Correctness

TS_NA_INCR_AUDIT: every candidate's incremental score byte-matched full_rescore (5k–14k candidates per run, EW + IW).
Default vs TS_NA_NOINCR (legacy) byte-identical: all 30 bundled NA datasets, 6/6 each (EW + IW × 3 starts = 180 climbs, 23–88 taxa), zero diffs; plus 40/40 mission cells.
Default-off paths byte-identical (3-seed param defaults to no-op; fitch_na_pass3_score/dirty-pass callers unchanged).
New regression tests in test-ts-na-incremental.R (default vs kill-switch + audit).
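
The audit idea can be sketched as follows. This is a hypothetical wrapper, not the code in the PR: the names byte_identical and audited_score are invented, and the bitwise comparison of doubles is an assumption about what "byte-matched" means. It scores a candidate both ways, stops on any mismatch, and returns the full-rescore value, in line with the commit note that audit mode uses full_rescore for decisions.

```cpp
#include <cassert>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <string>

// Bitwise comparison: stricter than ==, which treats 0.0 and -0.0 as equal.
inline bool byte_identical(double a, double b) {
    return std::memcmp(&a, &b, sizeof(double)) == 0;
}

// Hypothetical audit wrapper: run both scorers, fail loudly on any mismatch.
inline double audited_score(const std::function<double()>& incremental,
                            const std::function<double()>& full,
                            long candidate_index) {
    const double inc = incremental();
    const double ref = full();
    if (!byte_identical(inc, ref)) {
        throw std::runtime_error("incremental/full mismatch at candidate " +
                                 std::to_string(candidate_index));
    }
    return ref;  // audit mode decides on the full-rescore value
}
```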

Notes

getenv read is per-exact_verify call (per-convergence), not per-candidate.
Remaining algorithmic headroom (dirty Pass 3) is capped by NA's ~0.4–0.48 dirty-region fraction and needs intricate incremental char-step maintenance — diminishing returns, not pursued.
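
The getenv note can be pictured with a small sketch. The environment variable names come from this PR; everything else (env_flag, SweepConfig, read_sweep_config, count_incremental, and the rule that any non-empty value counts as "set") is an assumption made for illustration. The point is only that the flags are read once per sweep and the per-candidate loop consults a cached bool.

```cpp
#include <cassert>
#include <cstdlib>

// True when the variable is set to a non-empty value.
// Assumption: the PR does not say which values count as "set".
inline bool env_flag(const char* name) {
    const char* v = std::getenv(name);
    return v != nullptr && v[0] != '\0';
}

struct SweepConfig {
    bool audit;        // TS_NA_INCR_AUDIT: cross-check against full_rescore
    bool incremental;  // fast path: default on, off under kill-switch or audit
};

inline SweepConfig read_sweep_config() {
    SweepConfig c;
    c.audit = env_flag("TS_NA_INCR_AUDIT");
    c.incremental = !c.audit && !env_flag("TS_NA_NOINCR");
    return c;
}

// Hypothetical sweep loop: counts how many candidates take the fast path.
inline int count_incremental(int n_candidates) {
    const SweepConfig cfg = read_sweep_config();  // one read per sweep
    int fast = 0;
    for (int i = 0; i < n_candidates; ++i) {
        if (cfg.incremental) ++fast;  // no getenv inside the loop
    }
    return fast;
}
```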

🤖 Generated with Claude Code

…ult-off)

exact_verify_sweep is ~95% of native-NA TBR/mission wall: it scores every TBR neighbour of the converged tree by apply_tbr_move + full three-pass full_rescore (NA needs the exact sweep because its indirect scan is approximate). This makes each candidate an INCREMENTAL rescore instead.

- 3-seed dirty passes: fitch_na_dirty_{down,up}pass gain an optional third seed (default -1 = no-op, so the SPR accept path is unchanged). exact_verify seeds {nz (sibling reconnect), nx (regraft node), clip_node (covers the reversed subtree path + its rootward chain)} -- covering BOTH SPR and reroot candidates.
- Per candidate: dirty Pass1/Pass2 (O(dirty) not O(n)) + full Pass3 + per-pattern extract + compute_weighted (IW/XPIWE/PROFILE) or +ew_offset (EW); on reject, restore_prealloc_undo/restore_saved_states undoes the dirty state before restore_topology, keeping the base state valid for the next candidate.

Behind TS_NA_INCR (fast path) and TS_NA_INCR_AUDIT (cross-check vs full_rescore); both DEFAULT OFF, so production is byte-identical (519 testthat pass; 3-seed default -1).

VALIDATION:
- TS_NA_INCR_AUDIT: every candidate's incremental score byte-matched full_rescore -- 5486..13674 candidates each on Vinther/Longrich/DeAssis x {EW, IW}.
- End-to-end byte-identical search outcome (TS_NA_INCR on vs off): 18/18 (3 datasets x 2 regimes x 3 starts).
- Element wall (exact_verify-heavy direct climb): Zanol 1.12x, Dikow 1.21x, scores identical. (Below the ~2.5x ceiling: NA final_ uppass propagation makes the dirty region large; Pass3/extract stay full -- headroom remains.)

Mission A/B pending (should translate, unlike the washed extract-fusion: this speeds exact_verify directly rather than relocating work). NOT default-on / not for cpp-search until the mission A/B + broader byte-identity confirm.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…emental path)

The incremental exact_verify path ran fitch_na_pass3_score AND a separate extract_char_steps over the full tree -- both recompute the identical per-node step bits (standard from local_cost, NA from the Pass3 needs_step formula). Add an optional char_steps_out to fitch_na_pass3_score that buckets per-pattern during its existing walk; the IW incremental path uses it and skips extract. Unlike the dirty Pass1/2 (capped by NA's ~0.4-0.48 dirty fraction), this is a redundant-walk ELIMINATION, not dirty-fraction-limited. Default-nullptr param => all other callers byte-identical.

Element wall (exact_verify-heavy climb), incremental + this fold vs legacy: Zanol 1.23x (was 1.12x), Dikow 1.32x (was 1.21x); scores byte-identical. Audit byte-matches full_rescore on every candidate; fast-vs-legacy 18/18 byte-identical; 519 tests pass (production unchanged).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
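
The fused-walk idea in this commit message can be sketched with a toy stand-in. The StepBit record, the flat walk vector, and the function names below are hypothetical and much simpler than fitch_na_pass3_score; only the pattern is the same: an optional output pointer that defaults to nullptr, so existing callers are unaffected, and that is filled during the walk that already computes the total.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-visit record: the pattern visited and whether it costs a step.
struct StepBit { int pattern; bool needs_step; };

// Toy Pass 3 walk. Returns the total step count. If char_steps_out is non-null
// it must already be sized to the number of patterns and zero-filled; steps are
// then bucketed per pattern in the same walk.
inline int pass3_score(const std::vector<StepBit>& walk,
                       std::vector<int>* char_steps_out = nullptr) {
    int total = 0;
    for (const StepBit& s : walk) {
        if (!s.needs_step) continue;
        ++total;
        if (char_steps_out) ++(*char_steps_out)[s.pattern];
    }
    return total;
}

// Toy stand-in for the separate extraction walk that the fused path avoids.
inline std::vector<int> extract_steps(const std::vector<StepBit>& walk,
                                      int n_patterns) {
    std::vector<int> out(n_patterns, 0);
    for (const StepBit& s : walk) {
        if (s.needs_step) ++out[s.pattern];
    }
    return out;
}
```

Calling pass3_score(walk, &buckets) yields the same buckets as extract_steps(walk, n) with one traversal instead of two, while pass3_score(walk) behaves exactly as before.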

Flip the na_incr gate to default-ON (kill-switch TS_NA_NOINCR restores the legacy full_rescore path). The incremental dirty rescore is byte-identical to legacy (per-candidate audit + 180/180 full-roster climbs + 40/40 mission cells) and cuts native-NA mission wall ~25-30% (1.30x mean, p=2e-06). Disabled in TS_NA_INCR_AUDIT mode (which uses full_rescore for decisions). getenv read is per-convergence (exact_verify), not per-candidate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…INCR + audit)

Two enduring tests: (1) the default incremental path is byte-identical to the legacy full_rescore (TS_NA_NOINCR) across Vinther/DeAssis x {EW, IW} x starts; (2) TS_NA_INCR_AUDIT runs clean (per-candidate incr == full_rescore, else stop).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

…pt-in) (#256)

* perf(iw): x4 reroot batch + extract_char_steps dirty-region for IW TBR

Cherry-pick of 69febb4 onto cpp-search (post-#254). Resolved the exact_verify candidate-scoring conflict by keeping #254's incremental rescore path and dropping the directional-audit (na_dir_audit) block that #254 deliberately removed. Stripped the settled-dead ts_iw_gather_bench microbench (gather micro-opt proven a dead heat) and trimmed superseded dev A/B harnesses, keeping only bench_iw_realized.R.

Opts remain kill-switched (TS_IW_NOX4 / TS_IW_NODIRTY); a follow-up commit flips them default-off for the shared branch. Mission-null at morphology scale; kept for the large-N / recipe-retune reopen.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(tbr): fire IW dirty/x4 opts on XPIWE; fix nx_cs active_mask underflow

The extract_char_steps dirty-region + x4 reroot-batch opts were gated to ScoringMode::IW, but production MaximizeParsimony defaults to extended IW (ScoringMode::XPIWE, ts_data.cpp:297). So the opts were dead on the production path for every dataset, and the mission-wall A/B (which toggles TS_IW_NOX4/NODIRTY through MaximizeParsimony) was toggling code that never executed -> the 1.005x "null result".

Widen the gate to iw_family = (IW || XPIWE): both modes score via compute_iw/precompute_iw_delta from the same weighting-agnostic char_steps, differing only in per-pattern eff_k/phi (which those functions already consume). PROFILE stays excluded (compute_profile + info_amounts + precomputed_steps is a genuinely different char_steps convention). Perf-only / byte-identical: opts-on vs opts-off scores identical across Dikow/Vinther/Zanol/Giles/Wortley x 3 seeds; SCANCHK clean on XPIWE.

Firing the opts on the production path exposed a pre-existing dirty-region bug: the nx_cs accumulation did not skip active_mask==0 blocks, but extract_char_steps (which builds the F cache) does. Under a ratchet ZERO_ONLY perturbation (the default, ratchetCycles>=3) that fully deactivates a block, divided_steps = F + cs_delta - nx_cs went negative (TS_IW_DIRTYCHK: dirty=-1 ref=0 on Dikow2009/Vinther2008). Weighting-independent (plain IW mismatches identically; reachable today via MaximizeParsimony(extended_iw=FALSE)). The candidate scan masks needs_step by active_mask so scores were unaffected, but the invariant was broken.

Fix: skip active_mask==0 in the nx_cs loop only; EW nx_cost is left counting all blocks (its best_score/delta length convention is self-consistent). DIRTYCHK now clean across 5 datasets x ratchetCycles {0,3,6}. Add a testthat guard pinning XPIWE opts-on == opts-off (tolerance=0) under ratchet (the discriminating config; prior tests ran plain-IW or ratchetCycles=0 and were tautological for this path). 398 existing tests still pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(tbr): NA-IW 4-wide reroot batch (indirect_na_iw_cached_flat_x4) [perf TBD]

Port the T-245 4-wide reroot ILP batch to the implied-weights + inapplicable (IW/XPIWE + NA) TBR scan -- the one scoring path that never received it. The NA reroot scan already had an EW-NA x4 (fitch_na_indirect_cached_flat_x4) but IW-NA/XPIWE-NA fell to the one-at-a-time scalar indirect_na_iw_length_cached; the IW x4 branch was !has_na-gated.

New kernel indirect_na_iw_cached_flat_x4 (src/ts_fitch.cpp) fuses the NA active-mask candidate logic of the EW-NA x4 (from1 reduce, shared clip_has_active, per-candidate below_actives AND) with the per-candidate iw_delta ctz-gather of indirect_iw_cached_flat_x4. Each accumulator keeps the scalar add order of indirect_na_iw_length_cached, so per-candidate results are bit-identical; the shared all-4-exceed-cutoff bail only changes early-exit on cutoff-losing candidates.

Wired by widening the IW x4 branch gate (ts_tbr.cpp:1985) to dispatch on has_na, mirroring its own per-candidate skip re-check and main_edges[ei] indexing; the no-NA path stays byte-identical. Gated on iw_family (IW||XPIWE), so it fires on the production XPIWE+NA path (MaximizeParsimony default) -- the first banked IW kernel opt that lands on the NATIVE inapplicable-bearing corpus rather than only recoded matrices.

The dirty-region scan shortcut is deliberately NOT ported (stays !has_na): the top-down down2 pass breaks the path-bounded F+cs_delta-nx_cs decomposition, and it is the smaller (once-per-clip) prize.

CORRECTNESS validated, PERF still unknown (Hamilton A/B pending):
- byte-identical x4-on vs TS_IW_NOX4-off: 24/24 (15 direct IW-NA ts_tbr_search + 9 MaximizeParsimony XPIWE-NA under ratchet; datasets at 10-38% NA)
- existing independent full-rescore oracle passes; 519 testthat pass
- kernel confirmed to execute (throw-probe), so the byte-identity is not vacuous
- new regression guard in test-ts-tbr-dirty-rescore.R keeps inapplicables

Kill-switch TS_IW_NOX4 (shared with the no-NA x4). NOT for cpp-search merge until the A/B reports and the getenv-consolidation merge-gate is addressed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(iw): land IW/NA x4 + dirty-region opts default-OFF (opt-in)

Flip the kill-switch semantics from default-on (TS_IW_NOX4 / TS_IW_NODIRTY) to opt-in (TS_IW_X4 / TS_IW_DIRTY). These opts are mission-null at morphology scale (validated byte-identical, 1.005x wall) and are preserved on the shared branch only for the large-N / recipe-retune reopen, so they should impose nothing by default.

Test guards updated in lockstep: the "on" arm now explicitly enables the opt so the kernel still fires (data triggers the iw_family + has_na gate as before); byte-identity vs the scalar baseline is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* dev(benchmarks): NA-IW x4 A/B harnesses + large-N matrices for reopen

Preserve the dilution-free element-level and mission-wall A/B harnesses for the NA-IW x4 reroot batch (updated to the TS_IW_X4 opt-in env var), plus the two large real inapplicable-bearing matrices (Sun2018, lobo) the large-N reopen will need to re-test whether the scan share rises.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(tbr): drop directional-audit orphan bundled in cherry-picked ts_tbr.cpp

The worktree's 69febb4 bundled the exact directional (Regime-C) NA scoring audit/scorer into ts_tbr.cpp alongside the x4/dirty opts; the header it needs (ts_fitch_na_directional.h, added separately by b2e03a9) is not on cpp-search and was deliberately excluded (the directional path is dead -- 24-89x slower than SIMD full_rescore). Strip the include, na_dir_audit / na_dir_scorer setup, whole-tree cross-check, build_clip_folds, and the per-candidate directional fast-path so exact_verify_sweep matches cpp-search's (#254 incremental) version. Kept <chrono> (needed by the iw_timing diagnostic). Builds clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
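
The active_mask underflow described in the #256 commit message above comes down to two accumulators over the same blocks applying different skip rules. The sketch below is a toy reconstruction under stated assumptions: the Block layout, the use of a popcount over step_bits, and the function names are invented, and the real quantities are per-pattern rather than a single integer. It reproduces only the arithmetic shape of divided_steps = F + cs_delta - nx_cs going to -1 when one side counts a fully deactivated block and the other does not.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical block: which characters are active, and which carry a step.
struct Block { std::uint64_t active_mask; std::uint64_t step_bits; };

inline int popcount64(std::uint64_t x) {
    return static_cast<int>(std::bitset<64>(x).count());
}

// F-cache convention in this toy: fully inactive blocks contribute nothing.
inline int cached_steps(const std::vector<Block>& blocks) {
    int total = 0;
    for (const Block& b : blocks) {
        if (b.active_mask == 0) continue;
        total += popcount64(b.step_bits);
    }
    return total;
}

// nx_cs accumulation; skip_inactive = false models the pre-fix behaviour.
inline int nx_steps(const std::vector<Block>& blocks, bool skip_inactive) {
    int total = 0;
    for (const Block& b : blocks) {
        if (skip_inactive && b.active_mask == 0) continue;
        total += popcount64(b.step_bits);
    }
    return total;
}

// The decomposition named in the commit message.
inline int divided_steps(int F, int cs_delta, int nx_cs) {
    return F + cs_delta - nx_cs;
}
```

With one deactivated block that still carries a step bit, the unskipped accumulation exceeds the cache by one and the difference is -1; applying the same skip on both sides restores 0.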

ms609 and others added 5 commits June 23, 2026 08:46

Fix SearchControl expectation and spelling terms (#255)

b336124

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

ms609 merged commit 47c0494 into cpp-search Jun 23, 2026
2 of 7 checks passed

ms609 deleted the claude/na-incremental-rescore branch June 23, 2026 09:27

ms609 mentioned this pull request Jun 23, 2026

perf(tbr): IW/XPIWE/NA x4 reroot batch + dirty-region (default-OFF, opt-in) #256

Merged

ms609 merged 5 commits into cpp-search from claude/na-incremental-rescore
