Skip to content

fix(parallel): MinGW emutls thread_local teardown heap corruption (Windows CI hang)#258

Merged
ms609 merged 3 commits into
cpp-searchfrom
claude/fix-parallel-emutls-charsteps
Jun 24, 2026
Merged

fix(parallel): MinGW emutls thread_local teardown heap corruption (Windows CI hang)#258
ms609 merged 3 commits into
cpp-searchfrom
claude/fix-parallel-emutls-charsteps

Conversation

@ms609

@ms609 ms609 commented Jun 24, 2026

Copy link
Copy Markdown
Owner

The real Windows CI failure

cpp-search's Windows R CMD check failed (1 error), with the test output truncating at the NA_INCR_AUDIT print from test-ts-na-incremental.R. That print was a red herring — the NA incremental rescore path is clean (gcc-ASan and valgrind on Linux: 0 errors, scores byte-identical). The failure is in the next file, test-ts-parallel.R.

Localised by running the full suite in one Windows R process with a crash-durable per-test marker: test-ts-na-incremental.R completes, and the process then hangs in test-ts-parallel.R at "Parallel hits_to_best" (reproduced 2/2; test-ts-parallel.R alone reproduces it in ~50 s).

Root cause

fitch_score_ew's IW/profile full-rescore path kept its per-pattern step buffer in a function-local static thread_local std::vector<int>. On MinGW a thread_local with a non-trivial destructor is torn down via emutls when each std::thread worker exits, and that teardown corrupts the heap across the parallel search's repeated worker spawn/exit cycles. A corrupted worker's replicate never completes, so the driver's main poll loop (no timeout in this test) spins forever → hang (a hard error on CI). Linux uses native TLS, so it was invisible to every Linux sanitizer — exactly the failure class the codebase already hit and documented for evs_false_cache.

Fix

Move the scratch off thread_local onto DataSet as mutable std::vector<int> char_steps_scratch, mirroring the existing evs_false_cache fix. Each worker owns a private ds_local copy, so this keeps the same per-thread, cross-call capacity persistence with no emutls teardown and no synchronisation (single-writer per copy). Scores are byte-identical; this was the last static thread_local object with a destructor on the worker scoring path.

Second commit relaxes the test-ts-parallel.R timeout assertion from a fragile elapsed < 15.0 wall-clock ceiling to a working-vs-broken discriminator (supersedes #257).

Verification

  • Windows: the full test-ts-parallel.R sequence, which hung 2/2 before, completes 3/3 after the fix (3 back-to-back rounds in one process).
  • Linux: gcc-ASan + valgrind on the NA incremental path remain clean (the fix doesn't touch scoring logic).

🤖 Generated with Claude Code

ms609 and others added 2 commits June 24, 2026 09:55
…GW emutls)

`fitch_score_ew`'s IW/profile full-rescore path kept its per-pattern step
buffer in a function-local `static thread_local std::vector<int>`. On MinGW a
thread_local with a non-trivial destructor is torn down via emutls when each
std::thread worker exits, and that teardown corrupted the heap across the
parallel search's repeated worker spawn/exit cycles — a Windows-only failure
in test-ts-parallel.R (the suite reached "Parallel hits_to_best" and the
process hung on a corrupted worker whose replicate never completed, so the
driver's main poll loop spun forever; on CI it surfaced as a hard error).
Linux uses native TLS and was unaffected (gcc-ASan + valgrind both clean).

Fix mirrors the evs_false_cache change already in this file's history: the
scratch now lives on DataSet as a `mutable std::vector<int> char_steps_scratch`.
Each parallel worker owns a private `ds_local` copy for its whole lifetime, so
this preserves the same per-thread, cross-call capacity persistence the
thread_local provided — without any emutls teardown. Single-writer per copy
(workers touch only their own ds_local; the prototype is used only in the
post-join single-threaded MPT phase), so no synchronisation is required.

Byte-identical scores; this is the last remaining `static thread_local` object
with a destructor on the worker scoring path. Verified on Windows: the full
test-ts-parallel.R sequence, which hung 2/2 before, completed 3/3 after.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…iling

"Parallel search respects timeout" asserted `elapsed < 15.0` for a 2 s budget.
That is a wall-clock precision assertion a shared CI runner cannot guarantee:
the per-test wall time is observed to vary from ~5 s to ~30 s on the same
machine (one in-flight sectorial pass per worker, plus scheduling jitter).

Replace it with a working-vs-broken discriminator: `expect_lt(elapsed, 60)`
(a working timeout returns in seconds-to-low-tens; a broken one runs the full
1000-replicate budget, >1000 s — a ~100x gap) plus `expect_lt(replicates,
1000L)`, keeping the `timed_out` flag check. Supersedes #257.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…bust)

The relaxed `elapsed < 60` bound still failed the Windows covr step at 84.7 s:
covr rebuilds the C++ engine -O0 + gcov-instrumented, so the single in-flight
sectorial pass that overruns the 2 s deadline runs ~10-20x slower than the
optimised build — slow code, not a broken timeout. Any wall-clock ceiling is
fundamentally unworkable under instrumentation.

Drop the elapsed assertion: `result$timed_out` (set only on the deadline/cancel
path) plus `result$replicates < maxReplicates` (proves the search stopped before
exhausting its budget) are a robust working-vs-broken discriminator with zero
wall-clock dependency, holding on any hardware and under covr. A real unbounded
run would exhaust the budget (replicates == 1000) and fail the second check; a
true hang would never return (caught as a job timeout, not this assertion).

This surfaced only after the emutls fix let R CMD check pass on Windows, so the
covr step finally ran the parallel tests to completion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ms609 ms609 merged commit d25a0f6 into cpp-search Jun 24, 2026
4 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant