Improve applyCompMatr accumulation accuracy#784
Conversation
|
I pushed a follow-up regression test in Local checks:
On the removed |
|
Great! The next step of the challenge:
|
… benchmark Adds a compile-time switch (QUEST_DENSE_ACCUM_NAIVE) to cpu_statevec_anyCtrlAnyTargDenseMatr_sub() so the compensated (Kahan) and original uncompensated accumulation can be benchmarked apples-to-apples. Default stays Kahan (preserving the PR's regression test). artifacts/ contains a standalone benchmark (bench_kahan.cpp) that replicates the exact inner-product kernel for fp1/fp2/fp4 against a genuine 113-bit __float128 reference, an end-to-end driver against the real library, the raw data, plots, and a writeup. On this Apple-Silicon host sizeof(long double)==8, so fp4 is not genuine quad and equals fp2 in accuracy -- documented honestly in the writeup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Fix the end-to-end writeup: the 12-target adversarial CompMatr has exact amp0 = numRows-2 = 4094 (not '~4096'). Kahan recovers 4096 (residual +2), naive yields 0; state this precisely instead of rounding to the matrix dim. - Fix stale comment in e2e_compmatr.cpp (constants are 1e18, not 1e15). - Scope the ~3.3-4.2x cost as an in-cache micro-benchmark upper bound; note the real per-apply path at large target counts is memory-bandwidth bound. - Note the 30x/50x benefit is for an adversarial non-unitary matrix; the gain on well-conditioned/unitary CompMatrs is much milder (worst-case bound). - Soften 'exact kernel' wording: harness uses std::complex, arithmetically identical to base_qcomp but not literally QuEST's struct. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Ran the requested before/after benchmark for To get an apples-to-apples comparison inside the actual function, I made the accumulation compile-time selectable: defining Honest caveat on fp4 / quad. On this Apple-Silicon macOS host, both AppleClang and Homebrew GCC report Findings (worst-case abs error vs the 113-bit reference; adversarial non-unitary matrix, entries spanning 1e-6 .. 1e6 with cancelling large pairs):
Scope caveats worth stating up front:
Verification: the touched well-conditioned tests still pass on the Kahan build ( Conclusion: Kahan is clearly worthwhile for large or ill-conditioned dense matrices, but the blanket 3–4x micro-benchmark cost is too steep for small/well-conditioned ones. A target-count threshold (~6–8) is the natural compromise, and the (No CHANGELOG in the repo, so noting here: this PR adds the compile-time Benchmark harness, end-to-end driver, raw data, plots and writeup are in |
Fixes #598.
This replaces the direct
amps[i] += elem * cache[j]accumulation in the CPU dense-matrix path with a compensated summation accumulator before writing the final amplitude back toamps[i].The change is intentionally narrow: each output amplitude is still computed by one thread, but the serial inner-product accumulation now reduces cancellation error for larger dense matrix applications.
Tests run locally:
/opt/homebrew/bin/cmake -S . -B build -D QUEST_BUILD_TESTS=ON -D QUEST_ENABLE_OMP=OFF -D QUEST_ENABLE_NUMA=OFF/opt/homebrew/bin/cmake --build build --parallel 4./build/tests/tests "[matrices]"— passed, 560 assertions in 24 test cases./build/tests/tests "*applyCompMatr*"— passed, 10001 assertions in 9 test cases