ci_analysis: focus on post-merge develop runs; medians, trend, critical path by tautschnig · Pull Request #9035 · diffblue/cbmc

tautschnig · 2026-06-12T13:09:23Z

The CI performance analysis previously selected the most recent successful runs of the workflow regardless of trigger, so it mixed in pull_request runs (which execute on throw-away merge commits and run far more often) and excluded the failed/timed-out runs that a performance investigation most needs. It also reported per-run means, which are easily skewed by the large run-to-run variance of the macOS runners.

This reworks scripts/ci_analysis.py to:

- select runs by branch (--branch develop) and event (--event push) so the default is post-merge develop runs, and include completed runs of any conclusion (failed/timed-out included);
- report medians (robust to runner variance) with min/max/n rather than means;
- add a per-run trend table (date, commit, result, slowest job) so a genuine upward trend is distinguishable from one-off slow runs;
- add a per-job "critical path" section reporting the longest single ctest suite (which cannot be parallelised and hence bounds wall-clock under -jN) alongside the sum of all suites, highlighting suites worth splitting.
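
The run selection and the trend rows described above could be sketched roughly as follows. This is a minimal illustration over plain dicts, not the code in scripts/ci_analysis.py: the field names (`head_branch`, `event`, `status`, `conclusion`, `jobs`), the helper names and the sample data are all assumptions made for the example.

```python
from typing import Any


def select_runs(runs: list[dict[str, Any]], branch: str = "develop",
                event: str = "push") -> list[dict[str, Any]]:
    """Keep completed runs for one branch/event, whatever their conclusion."""
    return [
        r for r in runs
        if r.get("head_branch") == branch
        and r.get("event") == event
        and r.get("status") == "completed"
    ]


def slowest_job(run: dict[str, Any]) -> tuple[str, float]:
    """Return (job name, duration in seconds) of the slowest job in a run."""
    jobs = run.get("jobs", {})
    if not jobs:
        return ("?", 0.0)
    name = max(jobs, key=jobs.get)
    return (name, jobs[name])


# Hypothetical sample data (assumed shape, not real CI output).
runs = [
    {"head_branch": "develop", "event": "push", "status": "completed",
     "conclusion": "success", "jobs": {"linux": 1800.0, "macos": 2400.0}},
    {"head_branch": "develop", "event": "pull_request", "status": "completed",
     "conclusion": "success", "jobs": {"linux": 1700.0}},
    {"head_branch": "develop", "event": "push", "status": "completed",
     "conclusion": "failure", "jobs": {"linux": 3600.0, "macos": 2500.0}},
    {"head_branch": "develop", "event": "push", "status": "in_progress",
     "conclusion": None, "jobs": {}},
]

selected = select_runs(runs)
# One trend row per selected run: (result, slowest job, its duration).
trend = [(r["conclusion"], *slowest_job(r)) for r in selected]
```

In this sketch the pull_request run and the in-progress run are dropped, while the failed push run is kept, which matches the intent of including failed and timed-out runs in the analysis.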

Each commit message has a non-empty body, explaining why the change was made.
n/a Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
n/a The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
n/a My commit message includes data points confirming performance improvements (if claimed).
My PR is restricted to a single feature or bugfix.
n/a White-space or formatting changes outside the feature-related changed lines are in commits of their own.

Copilot

Pull request overview

This PR refines scripts/ci_analysis.py to make CI performance investigations more representative of post-merge behavior by filtering to branch/event runs (default: develop + push), including failed/timed-out runs, and switching summary statistics from mean to median. It also expands reporting to include a per-run trend view and a per-job “critical path” section to highlight serial bottlenecks.

Changes:

- Filter analysed workflow runs by branch/event (defaults to post-merge develop push) while including all completed conclusions (not just successes).
- Replace cross-run mean-based summaries with median/min/max/n reporting.
- Add run trend output (slowest job per run) and per-job critical path reporting (longest single ctest suite vs total).


+    longest: dict[str, list[float]] = defaultdict(list)
+    longest_name: dict[str, str] = {}
+    total: dict[str, list[float]] = defaultdict(list)
+    for r in results:
+        for job_name, detail in r.get("details", {}).items():
+            suites = detail.get("ctest_tests", [])
+            if not suites:
+                continue
+            top_suite = max(suites, key=lambda s: s["duration_s"])
+            longest[job_name].append(top_suite["duration_s"])
+            total[job_name].append(sum(s["duration_s"] for s in suites))
+            # Remember the name of the longest suite seen most recently.
+            longest_name[job_name] = top_suite["name"]
+    out: dict[str, dict] = {}
+    for job_name in longest:
+        out[job_name] = {
+            "longest_med": median(longest[job_name]),
+            "longest_suite": longest_name.get(job_name, "?"),
+            "total_med": median(total[job_name]),
+            "n": len(longest[job_name]),
+        }
+    return out
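
Note that in the fragment above `longest_name` is overwritten on every run, so the reported suite name comes from the last run processed and need not be the suite whose duration equals the reported median. One possible way to keep the two consistent is sketched below, assuming `results` has the shape used above (`details` → job → `ctest_tests` entries with `name` and `duration_s`). The function names are hypothetical, and the sketch deliberately uses the lower median (an actual data point) for the longest suite, which differs from an interpolating median when the number of runs is even.

```python
from collections import defaultdict
from statistics import median


def critical_paths(results: list[dict]) -> dict[str, dict]:
    """Per job: median longest-suite time with a suite name from the same run."""
    longest: dict[str, list[tuple[float, str]]] = defaultdict(list)
    total: dict[str, list[float]] = defaultdict(list)
    for r in results:
        for job_name, detail in r.get("details", {}).items():
            suites = detail.get("ctest_tests", [])
            if not suites:
                continue
            top = max(suites, key=lambda s: s["duration_s"])
            # Keep duration and name together so they cannot drift apart.
            longest[job_name].append((top["duration_s"], top["name"]))
            total[job_name].append(sum(s["duration_s"] for s in suites))
    out: dict[str, dict] = {}
    for job_name, pairs in longest.items():
        ordered = sorted(pairs)
        # Lower median: always an observed (duration, name) pair.
        dur, name = ordered[(len(ordered) - 1) // 2]
        out[job_name] = {
            "longest_med": dur,
            "longest_suite": name,
            "total_med": median(total[job_name]),
            "n": len(pairs),
        }
    return out


def wall_clock_lower_bound(longest_s: float, total_s: float, jobs: int) -> float:
    """No schedule on `jobs` workers can beat max(longest suite, total / jobs)."""
    return max(longest_s, total_s / jobs)


# Hypothetical data: the slowest suite differs between runs.
results = [
    {"details": {"linux": {"ctest_tests": [
        {"name": "A", "duration_s": 100.0}, {"name": "B", "duration_s": 50.0}]}}},
    {"details": {"linux": {"ctest_tests": [
        {"name": "A", "duration_s": 90.0}, {"name": "B", "duration_s": 120.0}]}}},
    {"details": {"linux": {"ctest_tests": [
        {"name": "A", "duration_s": 300.0}, {"name": "B", "duration_s": 60.0}]}}},
]
cp = critical_paths(results)
bound = wall_clock_lower_bound(cp["linux"]["longest_med"],
                               cp["linux"]["total_med"], jobs=4)
```

With this sample data the median longest suite is B at 120 s, whereas a "most recently seen" name would report A. The lower bound illustrates why the longest suite matters: with four workers the total would allow 52.5 s, but the 120 s suite dominates.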


+def _stats(durs: list[float]) -> str:
+    """``<median> (med) · <min>–<max> · n=<count>`` for a list of durations."""
+    return (f"{fmt_duration(median(durs))} (med) · "
+            f"{fmt_duration(min(durs))}–{fmt_duration(max(durs))} · "
+            f"n={len(durs)}")
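
As shown, the expression cannot format an empty list, since `min([])` and `max([])` raise `ValueError`. The PR's tests are said to cover the empty-input behaviour of `_stats`, but how the final code handles it is not visible in this excerpt. The sketch below shows one possible guard, and also why a median is preferred over a mean here. `fmt_duration` is a stand-in written for this example (the real formatter is not shown), `statistics.median` stands in for the script's own `median` helper, and the durations are made up.

```python
from statistics import mean, median


def fmt_duration(seconds: float) -> str:
    """Stand-in formatter (assumption): whole minutes and seconds."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m{secs:02d}s"


def stats(durs: list[float]) -> str:
    """``<median> (med) · <min>–<max> · n=<count>``, or ``n=0`` when empty."""
    if not durs:
        return "n=0"
    return (f"{fmt_duration(median(durs))} (med) · "
            f"{fmt_duration(min(durs))}–{fmt_duration(max(durs))} · "
            f"n={len(durs)}")


# Hypothetical durations in seconds: four typical runs and one slow outlier.
durs = [600.0, 610.0, 620.0, 630.0, 1800.0]
summary = stats(durs)          # "10m20s (med) · 10m00s–30m00s · n=5"
mean_s = mean(durs)            # 852.0, pulled up by the single outlier
```

The single 30-minute outlier moves the mean to 14m12s while the median stays at 10m20s, and the min–max range still makes the outlier visible.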


@@ -580,14 +666,10 @@
    if test_durations:
        info(f"\n  Slowest individual ctest tests (by mean, across runs):")


@@ -600,14 +682,10 @@
    if suite_durations:
        info(f"\n  Slowest make test suites (by mean, across runs):")


        info(f"\n  Slowest individual tests within ctest suites "
              f"(by mean, across runs):")
        sorted_indiv = sorted(indiv_durations.items(),


codecov · 2026-06-12T14:03:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.65%. Comparing base (40cbfd8) to head (0a0b544).

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #9035   +/-   ##
========================================
  Coverage    80.65%   80.65%           
========================================
  Files         1713     1713           
  Lines       189427   189427           
  Branches        73       73           
========================================
  Hits        152783   152783           
  Misses       36644    36644


…al path

The CI performance analysis previously selected the most recent successful runs of the workflow regardless of trigger, so it mixed in pull_request runs (which execute on throw-away merge commits and run far more often) and excluded the failed/timed-out runs that a performance investigation most needs. It also reported per-run means, which are easily skewed by the large run-to-run variance of the macOS runners.

This reworks scripts/ci_analysis.py to:
- select runs by branch (--branch develop) and event (--event push) so the default is post-merge develop runs, and include completed runs of any conclusion (failed/timed-out included);
- report medians (robust to runner variance) with min/max/n rather than means;
- add a per-run trend table (date, commit, result, slowest job) so a genuine upward trend is distinguishable from one-off slow runs;
- add a per-job "critical path" section reporting the longest single ctest suite (which cannot be parallelised and hence bounds wall-clock under -jN) alongside the sum of all suites, highlighting suites worth splitting.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

The PR introducing the median-based aggregation, trend table and critical-path section had no test coverage. Most of ci_analysis.py is I/O-bound on the gh CLI, but the new helpers (median, _stats, _slowest_job_per_run, _critical_paths) are pure functions over plain dicts/lists and are worth pinning down. Add a stdlib-only, pytest-style module covering odd/even/empty/single-element medians, the empty-input behaviour of _stats, slowest-job selection, and -- in particular -- that the critical-path output reports a suite name consistent with the median it shows even when the slowest suite differs between runs.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
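
A test module of the kind described above might look roughly like this. It is only a sketch: `statistics.median` stands in for the script's own `median` helper, whose empty-input behaviour is not shown here and may differ, and the test names are hypothetical. The functions use plain `assert` statements so that pytest can collect them without the module importing anything beyond the standard library.

```python
from statistics import StatisticsError, median


def test_median_odd():
    assert median([3.0, 1.0, 2.0]) == 2.0


def test_median_even():
    # An even-length list yields the mean of the two middle values.
    assert median([1.0, 2.0, 3.0, 4.0]) == 2.5


def test_median_single():
    assert median([7.0]) == 7.0


def test_median_empty():
    # statistics.median raises on empty input; the script's helper may differ.
    try:
        median([])
    except StatisticsError:
        return
    raise AssertionError("expected StatisticsError for empty input")
```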

We have a number of Python helper scripts under scripts/ but no automated tests for any of them. Add a workflow that runs the pytest suite over scripts/, so test coverage for these scripts can be built out incrementally (ci_analysis.py is the first to gain tests). The job is restricted to pull requests and path-filtered to runs that touch a Python script under scripts/ (or this workflow), so it stays off the critical path of unrelated changes.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

tautschnig self-assigned this Jun 12, 2026

Copilot AI review requested due to automatic review settings June 12, 2026 13:09

tautschnig requested review from a team, kroening and peterschrammel as code owners June 12, 2026 13:09

Copilot started reviewing on behalf of tautschnig June 12, 2026 13:14

Copilot AI reviewed Jun 12, 2026


tautschnig and others added 3 commits June 12, 2026 20:16

tautschnig force-pushed the ci-analysis-improvement branch from 406aabb to 0a0b544 Compare June 12, 2026 20:17

tautschnig assigned peterschrammel and unassigned tautschnig Jun 12, 2026

tautschnig wants to merge 3 commits into diffblue:develop from tautschnig:ci-analysis-improvement
