perf(na): incremental exact_verify rescore — default-on (~25-30% native-NA mission wall)#254
Merged
Merged
Conversation
…ult-off)
exact_verify_sweep is ~95% of native-NA TBR/mission wall: it scores every TBR
neighbour of the converged tree by apply_tbr_move + full three-pass full_rescore
(NA needs the exact sweep because its indirect scan is approximate). This makes
each candidate an INCREMENTAL rescore instead.
- 3-seed dirty passes: fitch_na_dirty_{down,up}pass gain an optional third seed
(default -1 = no-op, so the SPR accept path is unchanged). exact_verify seeds
{nz (sibling reconnect), nx (regraft node), clip_node (covers the reversed
subtree path + its rootward chain)} -- covering BOTH SPR and reroot candidates.
- Per candidate: dirty Pass1/Pass2 (O(dirty) not O(n)) + full Pass3 + per-pattern
extract + compute_weighted (IW/XPIWE/PROFILE) or +ew_offset (EW); on reject,
restore_prealloc_undo/restore_saved_states undoes the dirty state before
restore_topology, keeping the base state valid for the next candidate.
Behind TS_NA_INCR (fast path) and TS_NA_INCR_AUDIT (cross-check vs full_rescore);
both DEFAULT OFF, so production is byte-identical (519 testthat pass; 3-seed
default -1).
VALIDATION:
- TS_NA_INCR_AUDIT: every candidate's incremental score byte-matched full_rescore
-- 5486..13674 candidates each on Vinther/Longrich/DeAssis x {EW, IW}.
- End-to-end byte-identical search outcome (TS_NA_INCR on vs off): 18/18
(3 datasets x 2 regimes x 3 starts).
- Element wall (exact_verify-heavy direct climb): Zanol 1.12x, Dikow 1.21x,
scores identical. (Below the ~2.5x ceiling: NA final_ uppass propagation makes
the dirty region large; Pass3/extract stay full -- headroom remains.)
Mission A/B pending (should translate, unlike the washed extract-fusion: this
speeds exact_verify directly rather than relocating work). NOT default-on /
not for cpp-search until the mission A/B + broader byte-identity confirm.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…emental path) The incremental exact_verify path ran fitch_na_pass3_score AND a separate extract_char_steps over the full tree -- both recompute the identical per-node step bits (standard from local_cost, NA from the Pass3 needs_step formula). Add an optional char_steps_out to fitch_na_pass3_score that buckets per-pattern during its existing walk; the IW incremental path uses it and skips extract. Unlike the dirty Pass1/2 (capped by NA's ~0.4-0.48 dirty fraction), this is a redundant-walk ELIMINATION, not dirty-fraction-limited. Default-nullptr param => all other callers byte-identical. Element wall (exact_verify-heavy climb), incremental + this fold vs legacy: Zanol 1.23x (was 1.12x), Dikow 1.32x (was 1.21x); scores byte-identical. Audit byte-matches full_rescore on every candidate; fast-vs-legacy 18/18 byte-identical; 519 tests pass (production unchanged). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Flip the na_incr gate to default-ON (kill-switch TS_NA_NOINCR restores the legacy full_rescore path). The incremental dirty rescore is byte-identical to legacy (per-candidate audit + 180/180 full-roster climbs + 40/40 mission cells) and cuts native-NA mission wall ~25-30% (1.30x mean, p=2e-06). Disabled in TS_NA_INCR_AUDIT mode (which uses full_rescore for decisions). getenv read is per-convergence (exact_verify), not per-candidate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…INCR + audit)
Two enduring tests: (1) the default incremental path is byte-identical to the
legacy full_rescore (TS_NA_NOINCR) across Vinther/DeAssis x {EW, IW} x starts;
(2) TS_NA_INCR_AUDIT runs clean (per-candidate incr == full_rescore, else stop).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
ms609
added a commit
that referenced
this pull request
Jun 23, 2026
…pt-in) (#256) * perf(iw): x4 reroot batch + extract_char_steps dirty-region for IW TBR Cherry-pick of 69febb4 onto cpp-search (post-#254). Resolved the exact_verify candidate-scoring conflict by keeping #254's incremental rescore path and dropping the directional-audit (na_dir_audit) block that #254 deliberately removed. Stripped the settled-dead ts_iw_gather_bench microbench (gather micro-opt proven a dead heat) and trimmed superseded dev A/B harnesses, keeping only bench_iw_realized.R. Opts remain kill-switched (TS_IW_NOX4 / TS_IW_NODIRTY); a follow-up commit flips them default-off for the shared branch. Mission-null at morphology scale; kept for the large-N / recipe-retune reopen. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(tbr): fire IW dirty/x4 opts on XPIWE; fix nx_cs active_mask underflow The extract_char_steps dirty-region + x4 reroot-batch opts were gated to ScoringMode::IW, but production MaximizeParsimony defaults to extended IW (ScoringMode::XPIWE, ts_data.cpp:297). So the opts were dead on the production path for every dataset, and the mission-wall A/B (which toggles TS_IW_NOX4/ NODIRTY through MaximizeParsimony) was toggling code that never executed -> the 1.005x "null result". Widen the gate to iw_family = (IW || XPIWE): both modes score via compute_iw/precompute_iw_delta from the same weighting-agnostic char_steps, differing only in per-pattern eff_k/phi (which those functions already consume). PROFILE stays excluded (compute_profile + info_amounts + precomputed_steps is a genuinely different char_steps convention). Perf-only / byte-identical: opts-on vs opts-off scores identical across Dikow/Vinther/Zanol/Giles/Wortley x 3 seeds; SCANCHK clean on XPIWE. Firing the opts on the production path exposed a pre-existing dirty-region bug: the nx_cs accumulation did not skip active_mask==0 blocks, but extract_char_steps (which builds the F cache) does. Under a ratchet ZERO_ONLY perturbation (the default, ratchetCycles>=3) that fully deactivates a block, divided_steps = F + cs_delta - nx_cs went negative (TS_IW_DIRTYCHK: dirty=-1 ref=0 on Dikow2009/Vinther2008). Weighting-independent (plain IW mismatches identically; reachable today via MaximizeParsimony(extended_iw=FALSE)). The candidate scan masks needs_step by active_mask so scores were unaffected, but the invariant was broken. Fix: skip active_mask==0 in the nx_cs loop only; EW nx_cost is left counting all blocks (its best_score/delta length convention is self-consistent). DIRTYCHK now clean across 5 datasets x ratchetCycles {0,3,6}. Add a testthat guard pinning XPIWE opts-on == opts-off (tolerance=0) under ratchet (the discriminating config; prior tests ran plain-IW or ratchetCycles=0 and were tautological for this path). 398 existing tests still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(tbr): NA-IW 4-wide reroot batch (indirect_na_iw_cached_flat_x4) [perf TBD] Port the T-245 4-wide reroot ILP batch to the implied-weights + inapplicable (IW/XPIWE + NA) TBR scan -- the one scoring path that never received it. The NA reroot scan already had an EW-NA x4 (fitch_na_indirect_cached_flat_x4) but IW-NA/XPIWE-NA fell to the one-at-a-time scalar indirect_na_iw_length_cached; the IW x4 branch was !has_na-gated. New kernel indirect_na_iw_cached_flat_x4 (src/ts_fitch.cpp) fuses the NA active-mask candidate logic of the EW-NA x4 (from1 reduce, shared clip_has_active, per-candidate below_actives AND) with the per-candidate iw_delta ctz-gather of indirect_iw_cached_flat_x4. Each accumulator keeps the scalar add order of indirect_na_iw_length_cached, so per-candidate results are bit-identical; the shared all-4-exceed-cutoff bail only changes early-exit on cutoff-losing candidates. Wired by widening the IW x4 branch gate (ts_tbr.cpp:1985) to dispatch on has_na, mirroring its own per-candidate skip re-check and main_edges[ei] indexing; the no-NA path stays byte-identical. Gated on iw_family (IW||XPIWE), so it fires on the production XPIWE+NA path (MaximizeParsimony default) -- the first banked IW kernel opt that lands on the NATIVE inapplicable-bearing corpus rather than only recoded matrices. The dirty-region scan shortcut is deliberately NOT ported (stays !has_na): the top-down down2 pass breaks the path-bounded F+cs_delta-nx_cs decomposition, and it is the smaller (once-per-clip) prize. CORRECTNESS validated, PERF still unknown (Hamilton A/B pending): - byte-identical x4-on vs TS_IW_NOX4-off: 24/24 (15 direct IW-NA ts_tbr_search + 9 MaximizeParsimony XPIWE-NA under ratchet; datasets at 10-38% NA) - existing independent full-rescore oracle passes; 519 testthat pass - kernel confirmed to execute (throw-probe), so the byte-identity is not vacuous - new regression guard in test-ts-tbr-dirty-rescore.R keeps inapplicables Kill-switch TS_IW_NOX4 (shared with the no-NA x4). NOT for cpp-search merge until the A/B reports and the getenv-consolidation merge-gate is addressed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(iw): land IW/NA x4 + dirty-region opts default-OFF (opt-in) Flip the kill-switch semantics from default-on (TS_IW_NOX4 / TS_IW_NODIRTY) to opt-in (TS_IW_X4 / TS_IW_DIRTY). These opts are mission-null at morphology scale (validated byte-identical, 1.005x wall) and are preserved on the shared branch only for the large-N / recipe-retune reopen, so they should impose nothing by default. Test guards updated in lockstep: the "on" arm now explicitly enables the opt so the kernel still fires (data triggers the iw_family + has_na gate as before); byte-identity vs the scalar baseline is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * dev(benchmarks): NA-IW x4 A/B harnesses + large-N matrices for reopen Preserve the dilution-free element-level and mission-wall A/B harnesses for the NA-IW x4 reroot batch (updated to the TS_IW_X4 opt-in env var), plus the two large real inapplicable-bearing matrices (Sun2018, lobo) the large-N reopen will need to re-test whether the scan share rises. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(tbr): drop directional-audit orphan bundled in cherry-picked ts_tbr.cpp The worktree's 69febb4 bundled the exact directional (Regime-C) NA scoring audit/scorer into ts_tbr.cpp alongside the x4/dirty opts; the header it needs (ts_fitch_na_directional.h, added separately by b2e03a9) is not on cpp-search and was deliberately excluded (the directional path is dead -- 24-89x slower than SIMD full_rescore). Strip the include, na_dir_audit / na_dir_scorer setup, whole-tree cross-check, build_clip_folds, and the per-candidate directional fast-path so exact_verify_sweep matches cpp-search's (#254 incremental) version. Kept <chrono> (needed by the iw_timing diagnostic). Builds clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Makes the incremental dirty rescore the production default for
exact_verify_sweepcandidates on inapplicable (NA) data.exact_verify_sweepis the native-NA unrooted-TBR completeness pass — NA's indirect scan is only approximate, so it certifies a true unrooted optimum by an exact full-neighbourhood sweep, whichTS_NA_CLIPPROFmeasurement showed is ~95% of native-NA TBR wall and ~96% of native-NAMaximizeParsimonymission wall (even at maxReplicates=5). It scored every TBR neighbour byapply_tbr_move+ a full three-passfull_rescore.This PR scores each candidate incrementally instead:
fitch_na_dirty_{down,up}passgain an optional third seed (default-1= no-op, so the SPR accept path is unchanged).exact_verifyseeds{nz, nx, clip_node}; theclip_nodeseed covers the reversed-subtree path, so one path handles both SPR and reroot candidates.fitch_na_pass3_scoreoptionally buckets per-pattern steps during its existing walk, so the IW path skips a redundantextract_char_stepsfull walk.restore_prealloc_undo/restore_saved_statesundoes the dirty state beforerestore_topology, keeping the base state valid for the next candidate.Kill-switch
TS_NA_NOINCRrestores the legacyfull_rescorepath.Result (Hamilton, native NA, MaximizeParsimony default recipe)
1.30× mean speedup (24.7% pooled wall reduction, p = 2×10⁻⁶), all cells byte-identical. Dikow2009 1.48×, Zanol2014 1.34×, Zhu2013 1.21×, Giles2015 1.17×.
Correctness
TS_NA_INCR_AUDIT: every candidate's incremental score byte-matchedfull_rescore(5k–14k candidates per run, EW + IW).TS_NA_NOINCR(legacy) byte-identical: all 30 bundled NA datasets, 6/6 each (EW + IW × 3 starts = 180 climbs, 23–88 taxa), zero diffs; plus 40/40 mission cells.fitch_na_pass3_score/dirty-pass callers unchanged).test-ts-na-incremental.R(default vs kill-switch + audit).Notes
getenvread is per-exact_verifycall (per-convergence), not per-candidate.🤖 Generated with Claude Code