perf: use BMI2 bit deposit and extract helpers#779
Conversation
Change-Id: I4bdf73fd74f7b4d8d699124cdca2c2bd7b292e8b
Change-Id: I1fe81d656fe85b0c893e4baf0562c871bc275b81
|
Neat! Could you prepare a quick benchmark to demonstrate the performance benefit? Some candidate sizes are mentioned in the issue. Single-threaded, CPU-only (no GPU), single-process is sufficient! For ease of cross-platform benchmarking, you can define a benchmarking script which just prints the timings, placed in the To get a comparison with the original code, we can first get the benchmark running in CI with the new BMI stuff in this PR. Then we can open a new PR with the same benchmarking example and no other changes, to get the old numbers. A simple benchmarker is shown in this example EDIT: given the BMI are used when inputs are sorted, you can ctrl-F for |
|
Thanks, I prepared a quick single-threaded, CPU-only microbenchmark for the two helper paths. My local machine is Apple Silicon, so I used a temporary fork branch and a GitHub-hosted native x86_64 runner with BMI2 exposed:
So the clearest win in this quick check is the candidate This is intentionally just a helper-level microbenchmark rather than a full simulation throughput claim; memory bandwidth and surrounding kernel work can still dilute the effect in end-to-end operations. I kept the benchmark on the temporary fork branch rather than adding it to this PR, but I can add or adjust a benchmark if you would like it kept in-tree. |
Change-Id: Iea9e065dafcbd514b16353eb1df9cc0b3b24f879
|
Following up on your suggestion, I added an in-tree automated benchmark in
Local verification on my Apple Silicon machine: That local run built and printed fallback timings successfully. On x86_64 GCC/Clang runners where |
|
The previously run CI confirms no compilation issues have been introduced on any backend, so I've tailored the CI to run only what's relevant for this diff (single-CPU, automated examples) |
|
The CI results: Only Linux used the intrinsincs (MacOS did not). Is this expected? |
Change-Id: Ia987f8b01088101ddbb8f43c180c19f3e07c085c
|
Yes, this is expected for the current conservative gate. The BMI2 path is only enabled when the target compile actually defines That explains the current CI shape:
I also pushed |
|
One more CI note: the latest runs for |
|
Yes I understand the fallback - put the human on for a second :) You mentioned you were running on an "Apple Silicon machine", initially suggesting you couldn't use BMI2 (i.e. are using ARM) so spun up the x86 image, but subsequently you claim to have "locally verified" BMI2 on your Apple same machine. Which is it? I've updated the CI to include x86 MacOS (if those images are themselves not an AI hallucination - God help us!) |
|
Fair callout. My wording was muddy there. The machine on my desk is Apple Silicon / arm64. On that machine I only built and ran the fallback path; it cannot natively execute BMI2. When I said the Apple Silicon local verification passed for the in-tree benchmark, I meant: the benchmark builds and runs locally and correctly prints the fallback timings / The BMI2 performance numbers came from the separate GitHub-hosted native x86_64 Ubuntu runner in my fork workflow, not from the Apple Silicon machine. Earlier x86_64 So the intended split is:
Sorry for making that sound more mysterious than it was. |
Change-Id: I54386dfde9f0cf209e54e79c80937170e35e622b
|
I also pushed That fixes the two accidental side effects from adding
The refreshed runs for |
|
I've re-triggered the tests. Can you repeat back to me what the next step will be, as earlier discussed? |
|
Yes. The next step is to create a separate baseline PR that adds the same That baseline PR should let CI run the identical benchmark against the current fallback/original implementation. Then we can compare:
That should give a like-for-like CI comparison for the old and new helper paths before making a final call on this optimization. |
|
Created the requested baseline PR here: https://github.com/QuEST-Kit/QuEST/pull/789\n\nIt adds the same automated benchmark on top of current , without the BMI2 helper changes, so CI can produce fallback/original timings for direct comparison with this PR. |
|
Supplemental benchmark update while #789 is still waiting on upstream workflow approval. I ran the same in-tree Runs:
Both selected an AMD EPYC 7763 runner with Ubuntu 24.04 / GCC 13.3.0. Results:
So the fork-only comparison matches the earlier direction: the issue's clearest |
Summary
Closes #717.
This adds guarded x86 BMI2 fast paths for QuEST's hot bit insertion/extraction helpers:
_pdep_u64forinsertBits(...)when BMI2 is available;_pdep_u64directly ininsertBitsWithMaskedValues(...), avoiding the extra loop in its heavily used masked path;_pext_u64ingetValueOfBits(...)only whenbitIndicesare strictly increasing, preserving the existing arbitrary-order behavior by falling back to the loop otherwise;INLINEhost/device functions remain kernel-compatible.The PR also adds an in-tree automated benchmark:
examples/automated/benchmark_bmi2_bitwise.cppexamples/automated/CMakeLists.txtconditionally adds-mbmi2when supported, so unsupported platforms keep building/running the fallback path and printBMI2 intrinsics: disabled.I also added internal unit coverage for:
Benchmark Signal
The upstream PR CI is green. The automated example prints BMI2 timings on Linux/macOS runners where
-mbmi2is enabled, and fallback timings where it is not.As requested in review, I also opened the separate baseline benchmark-only PR #789 so maintainers can compare this PR against the original helper implementation.
While #789 is waiting for upstream workflow approval, I ran the same benchmark on temporary fork-only CI branches. On AMD EPYC 7763 / Ubuntu 24.04 / GCC 13.3.0:
getValueOfBits 2 bitsgetValueOfBits 5 bitsgetValueOfBits 6 bitsinsertBitsWithMask 2 bitsinsertBitsWithMask 5 bitsinsertBitsWithMask 6 bitsValidation
Local validation run on Apple Silicon:
# Syntax-checked the BMI2 branch by targeting x86_64 with -mbmi2. c++ -target x86_64-apple-macos15 -mbmi2 -std=c++20 \ -I/tmp/quest_cfg -I. -x c++ -fsyntax-only -The targeted
bitwisetest run passed with 8 assertions in 3 test cases.GitHub CI for this PR is passing for the maintainer-tailored compile and serial test matrix.
AI Assistance Disclosure
I used Codex as a coding assistant to inspect the existing bitwise implementation, draft the guarded intrinsic path, prepare local validation commands, and assemble the benchmark comparison. I reviewed the issue requirements and the resulting diff, preserved the arbitrary-order semantics of
getValueOfBits(...), and ran the validation listed above before submitting.Closes #717