Skip to content

Improve applyCompMatr accumulation accuracy#784

Open
agentlaunchops-ai wants to merge 5 commits into
QuEST-Kit:develfrom
agentlaunchops-ai:unitaryhack-598-compensated-accumulation
Open

Improve applyCompMatr accumulation accuracy#784
agentlaunchops-ai wants to merge 5 commits into
QuEST-Kit:develfrom
agentlaunchops-ai:unitaryhack-598-compensated-accumulation

Conversation

@agentlaunchops-ai

Copy link
Copy Markdown

Fixes #598.

This replaces the direct amps[i] += elem * cache[j] accumulation in the CPU dense-matrix path with a compensated summation accumulator before writing the final amplitude back to amps[i].

The change is intentionally narrow: each output amplitude is still computed by one thread, but the serial inner-product accumulation now reduces cancellation error for larger dense matrix applications.

Tests run locally:

  • /opt/homebrew/bin/cmake -S . -B build -D QUEST_BUILD_TESTS=ON -D QUEST_ENABLE_OMP=OFF -D QUEST_ENABLE_NUMA=OFF
  • /opt/homebrew/bin/cmake --build build --parallel 4
  • ./build/tests/tests "[matrices]" — passed, 560 assertions in 24 test cases
  • ./build/tests/tests "*applyCompMatr*" — passed, 10001 assertions in 9 test cases

@agentlaunchops-ai

Copy link
Copy Markdown
Author

I pushed a follow-up regression test in tests/unit/operations.cpp for the compensated accumulation path. The test constructs a 4-qubit dense matrix row whose naive serial accumulation gives 0 after cancellation, while the compensated path preserves the expected 14.

Local checks:

  • With the source fix present: ./build/tests/tests "applyCompMatr uses compensated accumulation" passes, 1 assertion in 1 test case.
  • Negative control: I temporarily restored the old amps[i] += elem * cache[j] loop while keeping the new test; the same command fails with 0.0 == 14.

On the removed @todo/BEWARE: I checked quest/src/core/base_qcomp.hpp before implementing compensation. Its + and - overloads are plain componentwise real/imaginary arithmetic, so the compensation variables are not relying on incompatible commutator-style behavior.

@TysonRayJones

Copy link
Copy Markdown
Member

Great! The next step of the challenge:

Modify cpu_statevec_anyCtrlAnyTargDenseMatr_sub()) to use Kahan summation and measure the runtime and accuracy effect (compared to the current implementation without compensation) at different system sizes. Visualise the results to show the costs and benefits of Kahan summation, elucidating when (if ever) it is worthwhile. Repeat this for the three possible qcomp precisions (see here).

agentlaunchops-ai and others added 2 commits June 9, 2026 11:26
… benchmark

Adds a compile-time switch (QUEST_DENSE_ACCUM_NAIVE) to
cpu_statevec_anyCtrlAnyTargDenseMatr_sub() so the compensated (Kahan) and
original uncompensated accumulation can be benchmarked apples-to-apples.
Default stays Kahan (preserving the PR's regression test).

artifacts/ contains a standalone benchmark (bench_kahan.cpp) that replicates the
exact inner-product kernel for fp1/fp2/fp4 against a genuine 113-bit __float128
reference, an end-to-end driver against the real library, the raw data, plots,
and a writeup. On this Apple-Silicon host sizeof(long double)==8, so fp4 is not
genuine quad and equals fp2 in accuracy -- documented honestly in the writeup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Fix the end-to-end writeup: the 12-target adversarial CompMatr has exact
  amp0 = numRows-2 = 4094 (not '~4096'). Kahan recovers 4096 (residual +2),
  naive yields 0; state this precisely instead of rounding to the matrix dim.
- Fix stale comment in e2e_compmatr.cpp (constants are 1e18, not 1e15).
- Scope the ~3.3-4.2x cost as an in-cache micro-benchmark upper bound; note the
  real per-apply path at large target counts is memory-bandwidth bound.
- Note the 30x/50x benefit is for an adversarial non-unitary matrix; the gain on
  well-conditioned/unitary CompMatrs is much milder (worst-case bound).
- Soften 'exact kernel' wording: harness uses std::complex, arithmetically
  identical to base_qcomp but not literally QuEST's struct.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@agentlaunchops-ai

Copy link
Copy Markdown
Author

Ran the requested before/after benchmark for cpu_statevec_anyCtrlAnyTargDenseMatr_sub().

To get an apples-to-apples comparison inside the actual function, I made the accumulation compile-time selectable: defining QUEST_DENSE_ACCUM_NAIVE reverts the inner loop to the original uncompensated sum += elem * cache[j], while the default stays Kahan (so this PR's behaviour and its regression test are preserved). I then benchmarked all three qcomp precisions on a single CPU (OMP/MPI off, Release -O3 -DNDEBUG, no fast-math), from 2 to 14 target qubits, against a deliberately adversarial ill-conditioned matrix.

Honest caveat on fp4 / quad. On this Apple-Silicon macOS host, both AppleClang and Homebrew GCC report sizeof(long double)==8 with a 53-bit mantissa — so long double is double, and the QuEST fp4 build is not genuine quadruple precision here; its error is bit-for-bit identical to fp2 (visible in the data). I therefore did not use long double as the accuracy reference. The ground truth is a genuine 113-bit __float128 (quadmath) recomputation of every output amplitude, and each precision's error is measured against that. The fp4 row is "what this host produces", not a valid quad measurement.

Findings (worst-case abs error vs the 113-bit reference; adversarial non-unitary matrix, entries spanning 1e-6 .. 1e6 with cancelling large pairs):

  • Accuracy benefit grows with target count: negligible at 2–4 targets (no cancellation yet), then ~30x (fp1) and ~50x (fp2) error reduction by 14 targets.
  • Cost in this in-cache micro-benchmark is a roughly constant ~3.3–4.2x slowdown at every non-trivial size: the matrix+cache stay resident across repetitions, so the loop is compute-bound and the extra FLOPs don't amortise. Caveat: the real per-apply cost at large target counts is memory-bandwidth bound (a 14-target fp2 matrix is ~4 GB), so the relative Kahan cost in production may be smaller than this. Treat 3.3–4.2x as a standalone-kernel upper bound, not a real-library figure; Kahan is not "time-free" either way.
  • End-to-end against the real built library, a 12-target adversarial CompMatr (exact amp0 = numRows-2 = 4094) goes from amp0 = 0 under naive accumulation (total cancellation, result entirely lost) to amp0 = 4096 under Kahan — i.e. Kahan recovers the answer to within a residual +2 (an O(1) error that survives subtracting 1e18-scale terms), versus complete loss.

Scope caveats worth stating up front:

  • These benefit magnitudes are the worst case. Real CompMatr inputs are typically unitary with O(1) entries, where cancellation — and hence the benefit — is far milder at any size. So the threshold argument below is a worst-case bound, not the expected benefit on well-conditioned operators.
  • The GPU backend is unchanged. gpu_kernels.cuh still does naive amps[i] += elem * cache[...] and still carries the original @todo. This PR (and the new regression test) is CPU-only, so a CompMatr that needs Kahan would give different results on GPU. Worth deciding whether the GPU path should follow as a separate change or whether the regression test should be explicitly CPU-gated.

Verification: the touched well-conditioned tests still pass on the Kahan build ([matrices] 560 assertions; applyCompMatr 10002 assertions). Negative control: rebuilding with QUEST_DENSE_ACCUM_NAIVE makes the new regression test fail with exactly 0.0 == 14 (total cancellation) while [matrices] still passes — confirming the toggle switches real behaviour surgically. Accuracy data reproduces bit-for-bit on rebuild. Under -O3 (and even -ffp-contract=fast) the compensation survives; it is only defeated by an explicit -ffast-math/-Ofast, which QuEST does not pass.

Conclusion: Kahan is clearly worthwhile for large or ill-conditioned dense matrices, but the blanket 3–4x micro-benchmark cost is too steep for small/well-conditioned ones. A target-count threshold (~6–8) is the natural compromise, and the QUEST_DENSE_ACCUM_NAIVE switch is the hook for it.

(No CHANGELOG in the repo, so noting here: this PR adds the compile-time QUEST_DENSE_ACCUM_NAIVE toggle around the CPU dense-matrix accumulation; default behaviour is unchanged.)

Benchmark harness, end-to-end driver, raw data, plots and writeup are in artifacts/. Figures attached.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants