fix(pt/train): keep epoch-derived num_steps consistent across ranks #5570
Per-rank sampler-weight noise made total_numb_batch (and thus num_steps) differ by one between ranks, desyncing the full-validation start step and deadlocking mismatched collectives. Broadcast total_numb_batch from rank 0 so every rank derives the same num_steps.
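To make the failure mode concrete, here is a minimal sketch of how noise in the last bits of a sampler weight can shift a rounded batch count by one. The formula in `toy_total_numb_batch` and all names and numbers are assumptions chosen for illustration; this is not the actual computation in deepmd/pt/train/training.py.

```python
import math


def toy_total_numb_batch(batches_per_system, sampler_weights):
    """Toy estimate of the steps needed to visit every system once.

    Assumed formula for illustration only, not DeePMD-kit's actual code.
    """
    return math.ceil(
        max(n / w for n, w in zip(batches_per_system, sampler_weights))
    )


batches_per_system = [3, 9]
weights_rank0 = [0.25, 0.75]
# Hypothetical rank 1: same weights up to a tiny relative perturbation.
weights_rank1 = [0.25 * (1 - 1e-12), 0.75]

total_rank0 = toy_total_numb_batch(batches_per_system, weights_rank0)
total_rank1 = toy_total_numb_batch(batches_per_system, weights_rank1)
print(total_rank0, total_rank1)  # 12 13
```

With one epoch, the two ranks would plan 12 and 13 steps respectively, so any step-indexed event that triggers a collective (such as the full-validation start step) would fire on different iterations.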
PR Summary by Qodo: Fix distributed training num_steps drift from per-rank sampler noise
njzjz-bot
left a comment
I reviewed the change in deepmd/pt/train/training.py. The fix looks sound to me.
What I checked:
- The broadcast happens before deriving num_steps in both the single-task numb_epoch and multi-task num_epoch_dict paths, so full-validation scheduling should stay rank-consistent.
- The helper is a no-op outside distributed training, keeping non-distributed behavior unchanged.
- dist.broadcast_object_list(..., src=0, device=DEVICE) is consistent with the existing distributed setup and supports both the scalar single-task total and the multi-task list.
- The change is intentionally narrow and does not alter the sampler probabilities or actual batch sampling behavior.
I do not see a blocking issue. I would merge after the remaining CI jobs finish green. A focused distributed regression test for rank-wise one-step drift would be nice as follow-up coverage, but I would not block this hotfix on it.
Reviewed by OpenClaw 2026.6.8 (844f405) (model: custom-chat-jinzhezeng-group/gpt-5.5).
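The behavior described in the review (rank 0 as the source, a no-op outside distributed training, support for both a scalar and a list) can be sketched roughly as follows. The helper name `sync_from_rank0` and its signature are hypothetical; the PR's actual helper may differ. Only `dist.is_available`, `dist.is_initialized`, and `dist.broadcast_object_list` are real `torch.distributed` calls.

```python
import torch.distributed as dist


def sync_from_rank0(value, device=None):
    """Return rank 0's copy of `value` on every rank (hypothetical helper)."""
    # Outside distributed training there is nothing to synchronize.
    if not (dist.is_available() and dist.is_initialized()):
        return value
    # broadcast_object_list overwrites the list in place on non-source ranks.
    holder = [value]
    dist.broadcast_object_list(holder, src=0, device=device)
    return holder[0]


# Single-process run: both the scalar (single-task) and list (multi-task)
# forms pass through unchanged.
single_task_total = sync_from_rank0(1001)
multi_task_totals = sync_from_rank0([1001, 250])
```

Because the object is pickled for transport, the same call handles an integer and a list without separate code paths, at the cost of a small serialization overhead that is negligible for a one-time setup step.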
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5570 +/- ##
==========================================
- Coverage 82.18% 82.18% -0.01%
==========================================
Files 898 899 +1
Lines 103576 103586 +10
Branches 4434 4434
==========================================
- Hits 85129 85128 -1
- Misses 17056 17064 +8
- Partials 1391 1394 +3