fix(gpu): avoid neighbor-list prefix scan deadlock #5575
Codecov Report

All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## master #5575 +/- ##
=======================================
Coverage 82.14% 82.15%
=======================================
Files 900 900
Lines 104139 104138 -1
Branches 4471 4470 -1
=======================================
Hits 85550 85550
Misses 17178 17178
+ Partials 1411 1410 -1
Fixes #5117.

The CUDA neighbor-list prefix scan used `threadIdx.x` as the loop induction variable around `cub::BlockScan` and `__syncthreads()`. When `nall` is not a multiple of the block size, only the threads covered by the tail segment enter the final iteration, so the block-wide collective diverges and the GPU op can hang.

This changes the loop to iterate by segment base so every thread in the block participates in every `BlockScan`/`__syncthreads()` iteration. Threads outside the tail segment load a sentinel and skip output writes.

A CUDA regression test covers the tail segment case with `nall = TPB + 68`, which matches the divergent pattern observed in #5117.

Validation:

- `CMAKE_CUDA_ARCHITECTURES=120`.
- `libdeepmd.so`, `libdeepmd_op.so`, `libdeepmd_op_cuda.so`, `libdeepmd_dyn_cudart.so`, and `libop_grads.so`.
- `srun --gres=gpu:1 dp --tf train input.json`; training passed the previous `data stating...` hang point and completed `stop_batch = 2000` with exit code 0.
- `git diff --check` passed.
- `clang-format`.

Notes:

- `ruff check` on git-tracked Python files passed.
- `ruff format --check` reports an existing unrelated format issue in `source/3rdparty/implib/implib-gen.py`; this PR only changes C++/CUDA files.
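
For reviewers who want to see the divergence without a GPU, here is a minimal host-side C++ sketch of the two loop shapes. It is not the kernel code from this PR and does not call CUDA or `cub`. The names `Participation`, `per_thread_loop`, `segment_base_loop`, and `segmented_exclusive_scan` are hypothetical. The block size of 256 in the usage note, the exclusive sum scan, and the sentinel value `0` are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <vector>

// Per-iteration participation of one simulated thread block.
struct Participation {
  std::vector<int> arrivals;  // threads that reach the block-wide collective
  std::vector<int> in_range;  // threads whose element index is < nall
};

// Old shape: for (ii = tid; ii < nall; ii += tpb) { scan; barrier; }
// The loop condition depends on the thread index, so trip counts differ.
Participation per_thread_loop(int nall, int tpb) {
  Participation p;
  for (int tid = 0; tid < tpb; ++tid) {
    int iter = 0;
    for (int ii = tid; ii < nall; ii += tpb, ++iter) {
      if (iter == static_cast<int>(p.arrivals.size())) {
        p.arrivals.push_back(0);
        p.in_range.push_back(0);
      }
      ++p.arrivals[iter];
      ++p.in_range[iter];
    }
  }
  return p;
}

// New shape: for (base = 0; base < nall; base += tpb) { idx = base + tid; ... }
// The loop condition is uniform across the block; only the work is guarded.
Participation segment_base_loop(int nall, int tpb) {
  Participation p;
  for (int tid = 0; tid < tpb; ++tid) {
    int iter = 0;
    for (int base = 0; base < nall; base += tpb, ++iter) {
      if (iter == static_cast<int>(p.arrivals.size())) {
        p.arrivals.push_back(0);
        p.in_range.push_back(0);
      }
      ++p.arrivals[iter];                        // every thread arrives
      if (base + tid < nall) ++p.in_range[iter]; // only some own an element
    }
  }
  return p;
}

// Sequential emulation of a segment-by-segment exclusive sum scan.
// Out-of-range lanes load the sentinel 0 (assumed) and skip the output write.
std::vector<int> segmented_exclusive_scan(const std::vector<int>& in, int tpb) {
  const int nall = static_cast<int>(in.size());
  std::vector<int> out(in.size());
  int carry = 0;  // running total carried across segments
  for (int base = 0; base < nall; base += tpb) {
    int running = carry;
    for (int tid = 0; tid < tpb; ++tid) {
      const int idx = base + tid;
      const bool valid = idx < nall;
      const int value = valid ? in[idx] : 0;  // sentinel for padded lanes
      if (valid) out[idx] = running;          // guarded write
      running += value;
    }
    carry = running;
  }
  return out;
}
```

Calling `per_thread_loop(256 + 68, 256)` gives 256 arrivals in the first iteration and 68 in the second, which is the partial arrival at the collective described above. `segment_base_loop` with the same arguments gives 256 arrivals in both iterations, with 68 lanes in range for the tail segment.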