build(ci): ensure C++ toolchain before pytest #5573
Hosted-runner VMs occasionally boot with a broken gcc where the g++ driver exists but cc1plus is missing. torch inductor JIT-compiles C++ kernels at test time, so this surfaces as an opaque 'g++: cannot execute cc1plus' CppCompileError deep inside the Test Python suite (observed once on an unrelated docs PR; a re-run on a fresh VM passed with identical code). Add a step that reinstalls build-essential and verifies a trivial C++ compile before the test run, self-healing a corrupted VM and failing fast with a clear message if the toolchain is still broken.
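A minimal sketch of a step like the one described, assuming an Ubuntu GitHub-hosted runner with apt and bash as the step shell. The step name and the probe output path under $RUNNER_TEMP are illustrative assumptions, not the PR's actual diff. The compile probe reads a one-line program from stdin so that g++ has to invoke cc1plus.

```yaml
# Hypothetical step: place it after setup-python and before the pytest step.
- name: Ensure C++ toolchain
  run: |
    sudo apt-get update
    # Caveat: --reinstall on the metapackage reinstalls build-essential itself,
    # not its already-installed dependencies such as g++.
    sudo apt-get install -y --reinstall build-essential
    # Compile and link a trivial program from stdin; this exercises cc1plus.
    if ! echo 'int main() { return 0; }' | g++ -x c++ - -o "$RUNNER_TEMP/cxx_probe"; then
      echo "::error::C++ toolchain is broken: g++ cannot compile a trivial program"
      exit 1
    fi
```

The `::error::` workflow command makes the failure show up as an annotation on the run, which is the "clear message" part. The explicit `exit 1` keeps the step failing regardless of whether the shell runs with `-e`.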
Codex Review: automated review suggestions for this pull request (reviewed commit ed5b92ddc6).
      # fail fast here with a clear message if it is still broken.
      run: |
        sudo apt-get update
        sudo apt-get install -y --reinstall build-essential
Reinstall the compiler package, not the metapackage
For the runner corruption described here (the g++ driver exists but cc1plus is missing), this step does not actually restore the missing backend. build-essential is only a small informational metapackage whose dependency list includes g++, and a dry run of apt-get -s install --reinstall build-essential reinstalls exactly one package, build-essential, not its already-installed dependencies. As a result, a VM with a damaged g++/g++-11 package will still fail the compile check instead of self-healing. Reinstall g++ or the versioned compiler package directly.
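A minimal sketch of that suggestion, assuming a Debian/Ubuntu runner where the default g++ comes from a versioned package named g++-<major> (for example g++-11) that owns cc1plus. The step name is hypothetical, and the package naming is an assumption to check on the actual runner image (for example with dpkg -S on the cc1plus path).

```yaml
# Hypothetical step: reinstall the package that owns cc1plus, not the metapackage.
- name: Ensure C++ toolchain
  run: |
    sudo apt-get update
    # Major version of the default g++ (e.g. "11"). The driver answers this
    # without running cc1plus, so it works even when cc1plus is missing.
    GCC_MAJOR="$(g++ -dumpversion | cut -d. -f1)"
    # Assumption: cc1plus lives in g++-<major>,
    # e.g. /usr/lib/gcc/x86_64-linux-gnu/11/cc1plus in g++-11.
    sudo apt-get install -y --reinstall "g++-${GCC_MAJOR}"
```

Deriving the package name from `g++ -dumpversion` avoids hard-coding g++-11 when the runner image moves to a newer default compiler.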
CodeRabbit review: No actionable comments were generated in the recent review (review profile: CHILL; files selected for processing: 1).
Walkthrough: The Python CI workflow gains a new step in the … Changes: Python CI C++ Toolchain Verification.
Estimated code review effort: 1 (Trivial), ~3 minutes. Pre-merge checks: 5 of 5 passed.
Codecov Report: All modified and coverable lines are covered by tests.

    @@            Coverage Diff             @@
    ##           master    #5573     +/-   ##
    =======================================
      Coverage   82.14%   82.15%
    =======================================
      Files         900      900
      Lines      104139   104138       -1
      Branches     4471     4470       -1
    =======================================
      Hits        85550    85550
      Misses      17178    17178
    + Partials     1411     1410       -1
Closing this: the diagnosis is wrong, as @njzjz pointed out. If the gcc toolchain were broken at boot, the editable … The failed-run log confirms it. On the affected VM (run 27929887537, job …
So the toolchain was healthy at install and only failed deep into the test run. A reinstall-at-start (this PR) runs while the toolchain is still fine and does nothing for a mid-run degradation, so it would not reliably prevent this. Only 1 of 24 shards failed and a re-run on a fresh VM passed, i.e. a transient single-VM event. Closing as misdiagnosed; investigating the real mid-run cause separately.
Summary
The Test Python job (.github/workflows/test_python.yml) does checkout → setup-python → pip install → pytest and never verifies the C++ compiler, yet torch inductor shells out to the system g++ to JIT-compile C++ kernels at test time (e.g. the _CompiledModel / torch.compile tests).

When a GitHub-hosted runner VM occasionally boots with a broken gcc (the g++ driver present but its backend cc1plus missing), this surfaces as an opaque CppCompileError deep inside the suite, reporting that g++ cannot execute cc1plus.

This was observed once on an unrelated docs-only PR. The re-run on a fresh VM with identical code passed, confirming it is a transient per-VM toolchain corruption, not a code defect. A survey of other recent Test Python failures showed that all of them were genuine code/test failures, none with this cc1plus signature.

Change
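Add a step right after setup-python that:
- reinstalls build-essential (re-laying down gcc, which ships cc1plus), self-healing a corrupted VM, and
- verifies a trivial C++ compile before the test run, failing fast with a clear message if the toolchain is still broken.

Notes / known limitations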
pytest-rerunfailures would not help here: retrying the same test on the same VM hits the same missing cc1plus.