build(ci): ensure C++ toolchain before pytest #5573

Closed
wanghan-iapcm wants to merge 1 commit into deepmodeling:master from wanghan-iapcm:build/ci-ensure-cxx-toolchain

Conversation

wanghan-iapcm commented Jun 22, 2026

Collaborator

Summary

The Test Python job (.github/workflows/test_python.yml) does checkout → setup-python → pip install → pytest and never verifies the C++ compiler, yet torch inductor shells out to the system g++ to JIT-compile C++ kernels at test time (e.g. the _CompiledModel / torch.compile tests).

When a GitHub-hosted runner VM boots with a broken gcc (the g++ driver is present but its backend cc1plus is missing), the problem surfaces as an opaque failure deep inside the suite:

torch._inductor.exc.InductorError: CppCompileError: C++ compile error
g++: fatal error: cannot execute 'cc1plus': execvp: No such file or directory

This was observed once, on an unrelated docs-only PR. The re-run on a fresh VM with identical code passed, confirming it is a transient per-VM toolchain corruption, not a code defect. A survey of other recent Test Python failures showed that all of them were genuine code or test failures; none had this cc1plus signature.
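
A survey like the one described can be approximated by scanning saved job logs for the error text quoted above. The following is a minimal sketch, not part of the PR: `has_cc1plus_signature` is a hypothetical helper name, and the loose pattern is an assumption made because the quote characters around `cc1plus` can differ between locales.

```shell
# has_cc1plus_signature
# Reads a job log on stdin and succeeds (exit 0) if it contains the
# "cannot execute 'cc1plus'" error, whatever quote style the driver used.
has_cc1plus_signature() {
  grep -q "cannot execute .*cc1plus"
}
```

A log file saved locally could then be checked with `has_cc1plus_signature < job.log && echo "toolchain signature found"`, where `job.log` is a placeholder file name.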

Change

Add a step right after setup-python that:

  • reinstalls build-essential (re-laying down gcc, which ships cc1plus), self-healing a corrupted VM, and
  • verifies a trivial C++ compile, so if the toolchain is still broken the job fails immediately with a clear message instead of 40 minutes into pytest.
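
The verification half of that step can be sketched as a small shell function. This is an illustration under stated assumptions, not the code from the PR: `check_cxx_toolchain` and the `probe.cpp` file name are hypothetical, and the compiler is passed as an argument so the function can be exercised without a real g++.

```shell
# check_cxx_toolchain [compiler]
# Compiles a trivial C++ program with the given compiler (default: g++).
# Returns 0 if the compile succeeds, and the compiler's exit code otherwise.
check_cxx_toolchain() {
  local cxx="${1:-g++}"
  local workdir rc
  workdir="$(mktemp -d)" || return 1

  # A minimal translation unit: compiling it forces the driver to run its
  # backend (cc1plus for g++), the component reported missing.
  printf 'int main() { return 0; }\n' > "$workdir/probe.cpp"

  rc=0
  "$cxx" "$workdir/probe.cpp" -o "$workdir/probe" || rc=$?

  rm -rf "$workdir"

  if [ "$rc" -ne 0 ]; then
    echo "C++ toolchain check failed: '$cxx' could not compile a trivial program" >&2
  fi
  return "$rc"
}
```

In a workflow `run:` block, a line such as `check_cxx_toolchain g++ || exit 1` would stop the job before pytest starts.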

Notes / known limitations

  • This makes the flake self-correct without a manual job re-run. It does not eliminate the underlying hosted-runner image variability; pinning to a container image would be the fully-deterministic alternative but is a larger change.
  • pytest-rerunfailures would not help here: retrying the same test on the same VM hits the same missing cc1plus.

Summary by CodeRabbit

  • Chores
    • Updated internal CI/CD workflow to enhance build validation during automated testing.

Hosted-runner VMs occasionally boot with a broken gcc where the g++
driver exists but cc1plus is missing. torch inductor JIT-compiles C++
kernels at test time, so this surfaces as an opaque
'g++: cannot execute cc1plus' CppCompileError deep inside the Test
Python suite (observed once on an unrelated docs PR; a re-run on a
fresh VM passed with identical code).

Add a step that reinstalls build-essential and verifies a trivial C++
compile before the test run, self-healing a corrupted VM and failing
fast with a clear message if the toolchain is still broken.
@dosubot dosubot Bot added the build label Jun 22, 2026
@wanghan-iapcm wanghan-iapcm requested a review from njzjz June 22, 2026 11:30

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed5b92ddc6

# fail fast here with a clear message if it is still broken.
run: |
  sudo apt-get update
  sudo apt-get install -y --reinstall build-essential

P2: Reinstall the compiler package, not the metapackage

For the runner corruption described here (the g++ driver exists but cc1plus is missing), this step does not actually restore the missing backend: build-essential is only a small informational metapackage whose dependency list includes g++, and an apt-get -s install --reinstall build-essential dry run reinstalls exactly one package, build-essential, not its already-installed dependencies. As a result, a VM with a damaged g++/g++-11 package will still fail the compile check instead of self-healing; reinstall g++/the versioned compiler package directly.
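
One way to act on this suggestion is to ask dpkg which package owns cc1plus and reinstall that package. The sketch below is an assumption-laden illustration, not code from the PR or the review: both function names are hypothetical, and it presumes a Debian or Ubuntu runner where `dpkg -S` prints lines of the form `<package>: <path>`.

```shell
# package_names_from_dpkg_output
# Reads `dpkg -S` output on stdin ("<package>: <path>" per line) and
# prints the unique package names, one per line.
package_names_from_dpkg_output() {
  sed 's/: \/.*$//' | sort -u
}

# reinstall_cc1plus_owners
# Looks up every installed package that owns a file named cc1plus and
# reinstalls it. `dpkg -S` consults the dpkg database, so the lookup
# still works when the file itself has gone missing from disk.
reinstall_cc1plus_owners() {
  local pkgs
  pkgs="$(dpkg -S '*/cc1plus' | package_names_from_dpkg_output)"
  [ -n "$pkgs" ] || return 1
  # Word splitting is intended here: one argument per package name.
  # shellcheck disable=SC2086
  sudo apt-get install -y --reinstall $pkgs
}
```

If several compiler versions are installed, this reinstalls all of them, which is slower but avoids guessing which version the `g++` driver points at.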

coderabbitai Bot commented Jun 22, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 68815877-1ac9-411b-aea9-72397e0e288c

📥 Commits

Reviewing files that changed from the base of the PR and between 03682bf and ed5b92d.

📒 Files selected for processing (1)
  • .github/workflows/test_python.yml

📝 Walkthrough

The Python CI workflow gains a new step in the testpython job that reinstalls build-essential via apt-get and compiles a minimal C++ program with g++ to verify that cc1plus is present before the torch inductor tests execute.

Changes

Python CI C++ Toolchain Verification

| Layer / File(s) | Summary |
| --- | --- |
| g++ compile verification step (.github/workflows/test_python.yml) | Adds a step that reinstalls build-essential and compiles a trivial C++ program with g++, failing the job early if cc1plus is absent. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'build(ci): ensure C++ toolchain before pytest' directly and specifically describes the main change: adding a CI step to verify the C++ toolchain before running pytest tests. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

codecov Bot commented Jun 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.15%. Comparing base (03682bf) to head (ed5b92d).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5573   +/-   ##
=======================================
  Coverage   82.14%   82.15%           
=======================================
  Files         900      900           
  Lines      104139   104138    -1     
  Branches     4471     4470    -1     
=======================================
  Hits        85550    85550           
  Misses      17178    17178           
+ Partials     1411     1410    -1     

@wanghan-iapcm

Collaborator Author

Closing this — the diagnosis is wrong, as @njzjz pointed out: if the gcc toolchain were broken at boot, the editable pip install -e . (which compiles deepmd's PyTorch C++ ops) would have failed first.

The failed-run log confirms it. On the affected VM (run 27929887537, job Test Python (10, 3.13)):

  • 04:36:50 -> 04:38:16: deepmd-kit's C++ built successfully (~86s, needs cc1plus)
  • 04:38:47: dp --version works -> install fully succeeded
  • 04:38:49: pytest starts
  • ~05:32:16 (~53 min later): torch inductor's g++ ... main.cpp fails with cannot execute 'cc1plus'

So the toolchain was healthy at install and only failed deep into the test run. A reinstall-at-start (this PR) runs while the toolchain is still fine and does nothing for a mid-run degradation, so it would not reliably prevent this. Only 1 of 24 shards failed and a re-run on a fresh VM passed, i.e. a transient single-VM event. Closing as misdiagnosed; investigating the real mid-run cause separately.
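
The intervals quoted in the timeline can be rechecked with a little arithmetic on the log timestamps. This is a minimal sketch: `to_seconds` and `elapsed_seconds` are hypothetical helper names, and the code assumes zero-padded `HH:MM:SS` timestamps that fall on the same day.

```shell
# to_seconds HH:MM:SS
# Converts a zero-padded timestamp to seconds since midnight.
to_seconds() {
  local h m s rest
  h="${1%%:*}"
  rest="${1#*:}"
  m="${rest%%:*}"
  s="${rest#*:}"
  # Drop one leading zero so values such as "08" are not parsed as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# elapsed_seconds START END
# Prints the number of seconds between two same-day timestamps.
elapsed_seconds() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed_seconds 04:36:50 04:38:16   # C++ build window: prints 86
elapsed_seconds 04:38:49 05:32:16   # pytest start to failure: prints 3207
```

The second value, 3207 seconds, is 53 minutes and 27 seconds, matching the "~53 min later" figure above.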
