Skip to content

fix(pt/train): keep epoch-derived num_steps consistent across ranks#5570

Merged
njzjz merged 1 commit into
deepmodeling:masterfrom
OutisLi:pr/epochmismatch
Jun 22, 2026
Merged

fix(pt/train): keep epoch-derived num_steps consistent across ranks#5570
njzjz merged 1 commit into
deepmodeling:masterfrom
OutisLi:pr/epochmismatch

Conversation

@OutisLi

@OutisLi OutisLi commented Jun 21, 2026

Copy link
Copy Markdown
Collaborator

Per-rank sampler-weight noise made total_numb_batch (and thus num_steps) differ by one unit, desyncing the full-validation start step and deadlocking mismatched collectives. Broadcast it from rank 0.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed inconsistencies in training schedules across distributed ranks that could result from floating-point differences in batch count calculations.
    • Ensured consistent step count resolution for multi-task training configurations across all ranks in distributed training setups.

Per-rank sampler-weight noise made total_numb_batch (and thus num_steps)
differ by one unit, desyncing the full-validation start step and
deadlocking mismatched collectives. Broadcast it from rank 0.
@dosubot dosubot Bot added the bug label Jun 21, 2026
@qodo-free-for-open-source-projects

Copy link
Copy Markdown

PR Summary by Qodo

Fix distributed training num_steps drift from per-rank sampler noise
🐞 Bug fix 🕐 10-20 Minutes

Grey Divider

Description

• Broadcast epoch-derived batch totals from rank 0 to keep num_steps identical.
• Prevent full-validation schedule desync that can deadlock distributed collectives.
• Reuse the same broadcast helper for single- and multi-task epoch-based step resolution.
Diagram

graph TD
  A["Per-rank sampler weights"] --> B["Compute batch totals"] --> C{"Distributed?"} -->|"Yes"| D["Broadcast rank0 totals"] --> E["Derive num_steps"] --> F["Full-validation barrier"]
  C -->|"No"| E
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Broadcast a typed tensor (int/float) instead of broadcast_object_list
  • ➕ More type-safe and typically faster than Python object broadcast
  • ➕ Avoids Python pickling overhead and potential device argument pitfalls
  • ➖ Needs separate handling for scalar vs list (multi-task) or additional packing/unpacking logic
  • ➖ Slightly more code than the current generic helper
2. Make sampler/batch counting deterministic across ranks
  • ➕ Eliminates the root cause (per-rank floating-point noise) instead of synchronizing results
  • ➕ Keeps each rank fully self-sufficient (no dependency on rank 0)
  • ➖ Potentially invasive: may require changes to sampler construction, RNG seeding, or weight normalization
  • ➖ Higher risk of unintended training behavior changes
3. All-reduce with rounding/consensus (e.g., max/min) on totals
  • ➕ Avoids single-rank source of truth; can choose conservative consensus (e.g., max)
  • ➖ Still requires picking a rule; may hide underlying inconsistencies rather than removing them
  • ➖ More complex to explain and reason about than rank-0 pinning

Recommendation: The PR’s approach (pinning epoch-derived batch totals to rank 0 via broadcast) is the best tradeoff: it is minimal, directly addresses the deadlock by enforcing a single schedule, and is low-risk. If performance or type-safety becomes a concern, consider switching the helper to broadcast a tensor (with list packing for multi-task) later.

Files changed (1) +24 / -0

Bug fix (1) +24 / -0
training.pyBroadcast epoch-derived batch totals to prevent cross-rank num_steps drift +24/-0

Broadcast epoch-derived batch totals to prevent cross-rank num_steps drift

• Synchronizes total_numb_batch (single-task) and per_task_total (multi-task) by broadcasting rank 0’s computed values before deriving num_steps from num_epoch. Adds a small distributed helper using dist.broadcast_object_list to keep training and full-validation schedules consistent and avoid deadlocks from mismatched collectives.

deepmd/pt/train/training.py

@OutisLi OutisLi requested a review from njzjz June 21, 2026 14:05
@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 53bcb177-91a2-4c5c-a6eb-74189610289b

📥 Commits

Reviewing files that changed from the base of the PR and between aae3485 and ed33828.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py

📝 Walkthrough

Walkthrough

A new Trainer._broadcast_value_from_rank0 method is added to deepmd/pt/train/training.py, wrapping dist.broadcast_object_list for distributed runs. It is called at two points in Trainer.__init__: once to synchronize total_numb_batch for single-task training and once to synchronize per_task_total for multi-task training before step-schedule resolution.

Changes

Distributed rank-0 broadcast for training step schedule

Layer / File(s) Summary
Rank-0 broadcast helper
deepmd/pt/train/training.py
Adds Trainer._broadcast_value_from_rank0(value), which is a no-op outside distributed mode and uses dist.broadcast_object_list to synchronize a Python object from rank 0 to all other ranks.
Broadcast call sites in __init__
deepmd/pt/train/training.py
Applies _broadcast_value_from_rank0 to total_numb_batch (single-task path, lines 656–663) and to per_task_total (multi-task num_epoch_dict path, lines 692–695), pinning both values to rank 0 before num_steps and model_prob are resolved.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: ensuring epoch-derived num_steps remains consistent across ranks in distributed training, which is the core issue addressed in the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-free-for-open-source-projects

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider

Great, no issues found!

Qodo reviewed your code and found no material issues that require review

Grey Divider

Qodo Logo

@njzjz-bot njzjz-bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the change in deepmd/pt/train/training.py. The fix looks sound to me.

What I checked:

  • The broadcast happens before deriving num_steps in both single-task numb_epoch and multi-task num_epoch_dict paths, so full-validation scheduling should stay rank-consistent.
  • The helper is a no-op outside distributed training, keeping non-distributed behavior unchanged.
  • dist.broadcast_object_list(..., src=0, device=DEVICE) is consistent with the existing distributed setup and supports both the scalar single-task total and the multi-task list.
  • The change is intentionally narrow and does not alter the sampler probabilities or actual batch sampling behavior.

I do not see a blocking issue. I would merge after the remaining CI jobs finish green. A focused distributed regression test for rank-wise one-step drift would be nice as follow-up coverage, but I would not block this hotfix on it.

Reviewed by OpenClaw 2026.6.8 (844f405) (model: custom-chat-jinzhezeng-group/gpt-5.5).

@codecov

codecov Bot commented Jun 21, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 12.50000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.18%. Comparing base (c5c57d6) to head (ed33828).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
deepmd/pt/train/training.py 12.50% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5570      +/-   ##
==========================================
- Coverage   82.18%   82.18%   -0.01%     
==========================================
  Files         898      899       +1     
  Lines      103576   103586      +10     
  Branches     4434     4434              
==========================================
- Hits        85129    85128       -1     
- Misses      17056    17064       +8     
- Partials     1391     1394       +3     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@njzjz njzjz added this pull request to the merge queue Jun 21, 2026
Merged via the queue into deepmodeling:master with commit 2651de9 Jun 22, 2026
73 checks passed
@OutisLi OutisLi deleted the pr/epochmismatch branch June 22, 2026 07:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants