Skip to content

fix(gpu): avoid neighbor-list prefix scan deadlock#5575

Open
njzjz wants to merge 1 commit into
deepmodeling:masterfrom
njzjz:fix/gpu-nlist-prefix-scan
Open

fix(gpu): avoid neighbor-list prefix scan deadlock#5575
njzjz wants to merge 1 commit into
deepmodeling:masterfrom
njzjz:fix/gpu-nlist-prefix-scan

Conversation

@njzjz

@njzjz njzjz commented Jun 22, 2026

Copy link
Copy Markdown
Member

Fixes #5117.

The CUDA neighbor-list prefix scan used threadIdx.x as the loop induction variable around cub::BlockScan and __syncthreads(). When nall is not a multiple of the block size, only the threads covered by the tail segment enter the final iteration, so the block-wide collective diverges and the GPU op can hang.

This changes the loop to iterate by segment base so every thread in the block participates in every BlockScan/__syncthreads() iteration. Threads outside the tail segment load a sentinel and skip output writes.

A CUDA regression test covers the tail segment case with nall = TPB + 68, which matches the divergent pattern observed in #5117.

Validation:

  • Built current master plus this patch from source with CUDA 12.9, TensorFlow 2.19.1, and CMAKE_CUDA_ARCHITECTURES=120.
  • Confirmed the test runtime loaded the freshly built libdeepmd.so, libdeepmd_op.so, libdeepmd_op_cuda.so, libdeepmd_dyn_cudart.so, and libop_grads.so.
  • Re-ran the issue input on RTX 5090 with srun --gres=gpu:1 dp --tf train input.json; training passed the previous data stating... hang point and completed stop_batch = 2000 with exit code 0.
  • git diff --check passed.
  • pre-commit hooks for the committed C++ files passed, including clang-format.

Notes:

  • ruff check on git-tracked Python files passed. ruff format --check reports an existing unrelated format issue in source/3rdparty/implib/implib-gen.py; this PR only changes C++/CUDA files.

Summary by CodeRabbit

  • Bug Fixes

    • Improved GPU neighbor list computation to correctly handle edge cases in segment processing, ensuring accurate results across all particle configurations.
  • Tests

    • Added GPU test to verify neighbor list computation correctness in edge cases.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 283bc3cc-2df9-45f9-8a76-cc8ac38eaa9b

📥 Commits

Reviewing files that changed from the base of the PR and between 03682bf and 335e2ff.

📒 Files selected for processing (2)
  • source/lib/src/gpu/neighbor_list.cu
  • source/lib/tests/test_neighbor_list.cc

📝 Walkthrough

Walkthrough

The parallel_prefix_scan GPU kernel loop is rewritten from a per-thread strided iteration to a block-uniform segment loop stepping by THREADS_PER_BLOCK. Tail threads with ii >= nall now supply masked inputs so every thread participates in each BlockScan call. A new GoogleTest case exercises the tail-segment path with nall = TPB + 68.

Changes

GPU Prefix Scan Tail-Segment Fix

Layer / File(s) Summary
Block-uniform scan loop with tail masking and regression test
source/lib/src/gpu/neighbor_list.cu, source/lib/tests/test_neighbor_list.cc
Kernel loop replaced with base-stepping uniform iteration; threads past nall supply masked inputs (i_data=-1, o_data=0) so BlockScan and __syncthreads() run on every iteration for all threads; output and numneigh writes are index-guarded. New test TestNeighborListStandalone.gpu_tail_segment_prefix_scan allocates GPU buffers for nloc=2, nall=TPB+68, calls build_nlist_gpu, and asserts zero neighbor counts, directly exercising the partially-filled tail segment.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(gpu): avoid neighbor-list prefix scan deadlock' accurately and concisely describes the main change—fixing a CUDA deadlock in neighbor-list prefix scan operations.
Linked Issues check ✅ Passed The PR addresses the core issue #5117 by fixing the CUDA neighbor-list prefix scan deadlock that caused dpgen to hang. The fix modifies the loop iteration pattern to ensure uniform thread participation and includes a regression test.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the neighbor-list prefix scan deadlock: modifications to neighbor_list.cu loop logic and a new regression test in test_neighbor_list.cc.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@njzjz njzjz requested a review from denghuilu June 22, 2026 19:08
@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.15%. Comparing base (03682bf) to head (335e2ff).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5575   +/-   ##
=======================================
  Coverage   82.14%   82.15%           
=======================================
  Files         900      900           
  Lines      104139   104138    -1     
  Branches     4471     4470    -1     
=======================================
  Hits        85550    85550           
  Misses      17178    17178           
+ Partials     1411     1410    -1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] dpgen is stucked in the stage of "data stating... (this step may take long time)"

1 participant