fix(pt/train): keep epoch-derived num_steps consistent across ranks by OutisLi · Pull Request #5570 · deepmodeling/deepmd-kit

OutisLi · 2026-06-21T14:04:17Z

Per-rank sampler-weight noise made total_numb_batch (and thus num_steps) differ by one unit, desyncing the full-validation start step and deadlocking mismatched collectives. Broadcast it from rank 0.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed inconsistencies in training schedules across distributed ranks that could result from floating-point differences in batch count calculations.
- Ensured consistent step count resolution for multi-task training configurations across all ranks in distributed training setups.

Per-rank sampler-weight noise made total_numb_batch (and thus num_steps) differ by one unit, desyncing the full-validation start step and deadlocking mismatched collectives. Broadcast it from rank 0.

qodo-free-for-open-source-projects · 2026-06-21T14:05:06Z

PR Summary by Qodo

Fix distributed training num_steps drift from per-rank sampler noise
🐞 Bug fix 🕐 10-20 Minutes

Description

• Broadcast epoch-derived batch totals from rank 0 to keep num_steps identical.
• Prevent full-validation schedule desync that can deadlock distributed collectives.
• Reuse the same broadcast helper for single- and multi-task epoch-based step resolution.

Diagram

graph TD
  A["Per-rank sampler weights"] --> B["Compute batch totals"] --> C{"Distributed?"} -->|"Yes"| D["Broadcast rank0 totals"] --> E["Derive num_steps"] --> F["Full-validation barrier"]
  C -->|"No"| E

High-Level Assessment

The following are alternative approaches to this PR:

1. Broadcast a typed tensor (int/float) instead of broadcast_object_list

➕ More type-safe and typically faster than Python object broadcast
➕ Avoids Python pickling overhead and potential device argument pitfalls
➖ Needs separate handling for scalar vs list (multi-task) or additional packing/unpacking logic
➖ Slightly more code than the current generic helper

2. Make sampler/batch counting deterministic across ranks

➕ Eliminates the root cause (per-rank floating-point noise) instead of synchronizing results
➕ Keeps each rank fully self-sufficient (no dependency on rank 0)
➖ Potentially invasive: may require changes to sampler construction, RNG seeding, or weight normalization
➖ Higher risk of unintended training behavior changes

3. All-reduce with rounding/consensus (e.g., max/min) on totals

➕ Avoids single-rank source of truth; can choose conservative consensus (e.g., max)
➖ Still requires picking a rule; may hide underlying inconsistencies rather than removing them
➖ More complex to explain and reason about than rank-0 pinning

Recommendation: The PR’s approach (pinning epoch-derived batch totals to rank 0 via broadcast) is the best tradeoff: it is minimal, directly addresses the deadlock by enforcing a single schedule, and is low-risk. If performance or type-safety becomes a concern, consider switching the helper to broadcast a tensor (with list packing for multi-task) later.

Files changed (1) +24 / -0

Bug fix (1) +24 / -0

training.pyBroadcast epoch-derived batch totals to prevent cross-rank num_steps drift +24/-0

Broadcast epoch-derived batch totals to prevent cross-rank num_steps drift

• Synchronizes total_numb_batch (single-task) and per_task_total (multi-task) by broadcasting rank 0’s computed values before deriving num_steps from num_epoch. Adds a small distributed helper using dist.broadcast_object_list to keep training and full-validation schedules consistent and avoid deadlocks from mismatched collectives.

deepmd/pt/train/training.py

coderabbitai · 2026-06-21T14:07:16Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 53bcb177-91a2-4c5c-a6eb-74189610289b

📥 Commits

Reviewing files that changed from the base of the PR and between aae3485 and ed33828.

📒 Files selected for processing (1)

deepmd/pt/train/training.py

📝 Walkthrough

Walkthrough

A new Trainer._broadcast_value_from_rank0 method is added to deepmd/pt/train/training.py, wrapping dist.broadcast_object_list for distributed runs. It is called at two points in Trainer.__init__: once to synchronize total_numb_batch for single-task training and once to synchronize per_task_total for multi-task training before step-schedule resolution.

Changes

Distributed rank-0 broadcast for training step schedule

Layer / File(s)	Summary
Rank-0 broadcast helper `deepmd/pt/train/training.py`	Adds `Trainer._broadcast_value_from_rank0(value)`, which is a no-op outside distributed mode and uses `dist.broadcast_object_list` to synchronize a Python object from rank 0 to all other ranks.
Broadcast call sites in `__init__` `deepmd/pt/train/training.py`	Applies `_broadcast_value_from_rank0` to `total_numb_batch` (single-task path, lines 656–663) and to `per_task_total` (multi-task `num_epoch_dict` path, lines 692–695), pinning both values to rank 0 before `num_steps` and `model_prob` are resolved.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: ensuring epoch-derived num_steps remains consistent across ranks in distributed training, which is the core issue addressed in the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

qodo-free-for-open-source-projects · 2026-06-21T14:08:48Z

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Great, no issues found!

Qodo reviewed your code and found no material issues that require review

njzjz-bot

I reviewed the change in deepmd/pt/train/training.py. The fix looks sound to me.

What I checked:

The broadcast happens before deriving num_steps in both single-task numb_epoch and multi-task num_epoch_dict paths, so full-validation scheduling should stay rank-consistent.
The helper is a no-op outside distributed training, keeping non-distributed behavior unchanged.
dist.broadcast_object_list(..., src=0, device=DEVICE) is consistent with the existing distributed setup and supports both the scalar single-task total and the multi-task list.
The change is intentionally narrow and does not alter the sampler probabilities or actual batch sampling behavior.

I do not see a blocking issue. I would merge after the remaining CI jobs finish green. A focused distributed regression test for rank-wise one-step drift would be nice as follow-up coverage, but I would not block this hotfix on it.

Reviewed by OpenClaw 2026.6.8 (844f405) (model: custom-chat-jinzhezeng-group/gpt-5.5).

codecov · 2026-06-21T15:35:03Z

Codecov Report

❌ Patch coverage is 12.50000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.18%. Comparing base (c5c57d6) to head (ed33828).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
deepmd/pt/train/training.py	12.50%	7 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5570      +/-   ##
==========================================
- Coverage   82.18%   82.18%   -0.01%     
==========================================
  Files         898      899       +1     
  Lines      103576   103586      +10     
  Branches     4434     4434              
==========================================
- Hits        85129    85128       -1     
- Misses      17056    17064       +8     
- Partials     1391     1394       +3

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

fix(pt/train): keep epoch-derived num_steps consistent across ranks

ed33828

Per-rank sampler-weight noise made total_numb_batch (and thus num_steps) differ by one unit, desyncing the full-validation start step and deadlocking mismatched collectives. Broadcast it from rank 0.

dosubot Bot added the bug label Jun 21, 2026

OutisLi requested a review from njzjz June 21, 2026 14:05

github-actions Bot added the Python label Jun 21, 2026

njzjz-bot reviewed Jun 21, 2026

View reviewed changes

njzjz approved these changes Jun 21, 2026

View reviewed changes

njzjz added this pull request to the merge queue Jun 21, 2026

Merged via the queue into deepmodeling:master with commit 2651de9 Jun 22, 2026
73 checks passed

OutisLi deleted the pr/epochmismatch branch June 22, 2026 07:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pt/train): keep epoch-derived num_steps consistent across ranks#5570

fix(pt/train): keep epoch-derived num_steps consistent across ranks#5570
njzjz merged 1 commit into
deepmodeling:masterfrom
OutisLi:pr/epochmismatch

OutisLi commented Jun 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

qodo-free-for-open-source-projects Bot commented Jun 21, 2026

Uh oh!

coderabbitai Bot commented Jun 21, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

qodo-free-for-open-source-projects Bot commented Jun 21, 2026

Uh oh!

njzjz-bot left a comment

Uh oh!

codecov Bot commented Jun 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

OutisLi commented Jun 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

qodo-free-for-open-source-projects Bot commented Jun 21, 2026

PR Summary by Qodo

Uh oh!

coderabbitai Bot commented Jun 21, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

qodo-free-for-open-source-projects Bot commented Jun 21, 2026

Code Review by Qodo

Great, no issues found!

Uh oh!

njzjz-bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

OutisLi commented Jun 21, 2026 •

edited by coderabbitai Bot

Loading

codecov Bot commented Jun 21, 2026 •

edited

Loading