diff --git a/.felt/dependabot-pr-triage/dependabot-pr-triage.md b/.felt/dependabot-pr-triage/dependabot-pr-triage.md
index 804ea499e..48a81c4f8 100644
--- a/.felt/dependabot-pr-triage/dependabot-pr-triage.md
+++ b/.felt/dependabot-pr-triage/dependabot-pr-triage.md
@@ -1,4 +1,5 @@
---
+id: 01KTCHWZSHQXRQHTHH9Q7Q4MT7
name: Triage open dependabot uv.lock PRs
status: closed
tags:
@@ -84,3 +85,7 @@ Update this fiber's `outcome:` to reflect what landed (e.g. "5 merged, 1 deferre
## Skills
Activate `felt` (you'll touch the fiber). The `shuttle` skill is already loaded by virtue of being the dispatched worker.
+
+## Follow-on: PR #742 (2026-06-11)
+
+PR #742 ("chore(deps): bump the lockfile-minor-patch group, 4 updates": mpi4py 4.1.1→4.1.2, numpy 2.4.4→2.4.6, snakemake 9.20.0→9.21.0, black) arrived after the original batch. CI green (`build-test-publish` passes). Not merged — pyproject.toml modified alongside uv.lock: lower bounds tightened (`mpi4py>=4.0→4.1.2`, `numpy>=2.0→2.4.6`, `snakemake>=9.20.0→9.21.0`). Tightening `>=` floors is a policy call, not a mechanical bump. Comment posted at https://github.com/CosmoStat/shapepipe/pull/742#issuecomment-4675762821 asking maintainers to confirm intent before merging.
diff --git a/.felt/docker-multistage/docker-multistage.md b/.felt/docker-multistage/docker-multistage.md
index 94951a708..8b5ed1076 100644
--- a/.felt/docker-multistage/docker-multistage.md
+++ b/.felt/docker-multistage/docker-multistage.md
@@ -1,4 +1,5 @@
---
+id: 01KTCHWZV7263MYPNB8DT1FXPX
name: 'Docker multi-stage: runtime + dev targets'
tags:
- shapepipe
diff --git a/.felt/docker-uv-revert/docker-uv-revert.md b/.felt/docker-uv-revert/docker-uv-revert.md
index dd138a97f..a1f4a16d0 100644
--- a/.felt/docker-uv-revert/docker-uv-revert.md
+++ b/.felt/docker-uv-revert/docker-uv-revert.md
@@ -1,16 +1,18 @@
---
+id: 01KTCHWZW8DT8VRYQJ4CEPDRGB
name: 'Docker: revert skaha→python base, switch to uv lockfile'
-status: active
+status: closed
tags:
- shapepipe
- docker
- infra
created-at: 2026-04-27T11:26:45.677512058+02:00
-outcome: 'PR #719 (chore: switch Dockerfile to slim Python + uv lockfile) opened and CI-green on first try (3m31s); ready for review. Drops conda double-install, makes pyproject SSOT + uv.lock the pinned manifest, switches WeightWatcher from sed-patched source build to Debian''s pre-patched 1.12+dfsg-3 package, adds binary smoke tests to deploy-image.yml.'
+closed-at: 2026-06-10T17:14:45.965931602+02:00
+outcome: 'Superseded: conda removal landed via #733 (merged); the #719-era docker-uv work is absorbed into ci-green-on-develop, which tracks remaining follow-ups.'
decisions:
base:
label: Base image
- rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. The canfar deployment concern is satisfied as long as the slim image works on canfar.
+ rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. Martin's canfar concern is satisfied as long as the slim image works on canfar.
default: python-slim
options:
python-slim:
@@ -50,7 +52,7 @@ decisions:
label: uv + pyproject + uv.lock; uv sync --frozen in Dockerfile
modernize:
label: Modernize package versions
- rationale: 'We determined which versions MUST stay pinned: only ngmix (pinned to a stable_version fork branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.'
+ rationale: 'We determined which versions MUST stay pinned: only ngmix (Axel''s stable_version branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.'
default: stay-current
options:
stay-conservative:
@@ -58,7 +60,7 @@ decisions:
excluded: true
excluded_reason: Drift between pyproject signal and lockfile reality; loses the chance to surface numpy-2/pandas-3 incompatibilities at PR time when CI is fast
stay-current:
- label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to its stable_version fork branch
+ label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to Axel's stable_version branch
insights:
ci-fast:
claim: 'First CI run on PR #719 went green in 3m31s. uv installed 238 packages in 322ms — everything resolved to prebuilt wheels, no source compilation of galsim/mpi4py/python-pysap/etc. Massive speedup vs. previous build.'
@@ -97,10 +99,11 @@ The `--frozen` flag is the discipline mechanism: a stale lockfile cannot ship.
## Followups
- Watch CI on #719. The slim-base apt list is conjectural — galsim/mpi4py/python-pysap pull a lot of system deps and we may need to add more (`libatlas-base-dev`, `libblas-dev`, etc).
-- If CI needs anything beyond what's in the apt block, that's worth noting for next time.
-- After this lands, PRs #708 and #714 may need a small rebase.
-- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — floated as a possibility, but slim should work on canfar.
+- If CI needs anything beyond what's in the apt block, that's the surface that benefits from a [[shapepipe/prs-in-flight]] note for next time.
+- After this lands, [[shapepipe/prs-in-flight]] PRs #708 and #714 may need a small rebase.
+- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — Martin floated it but we agreed slim should work on canfar.
## Connections
- [[shapepipe]] — root
+- [[shapepipe/prs-in-flight]] — touches the testing-scaffold xfail set and the develop-bugs PR
diff --git a/.felt/fabian-coord-bug/fabian-coord-bug.md b/.felt/fabian-coord-bug/fabian-coord-bug.md
new file mode 100644
index 000000000..bcbbc1bb2
--- /dev/null
+++ b/.felt/fabian-coord-bug/fabian-coord-bug.md
@@ -0,0 +1,11 @@
+---
+id: 01KTCHWZWEXTJ337APFSH4NF6X
+name: Fabian's coord-propagation bug + image-sim code on github
+tags:
+ - shapepipe
+ - bug
+ - collaboration
+ - future
+created-at: 2026-04-27T11:26:52.878118978+02:00
+outcome: 'Fabian: 1-line fix in shapepipe needs porting; first need him to put image-sim code/configs on github so it''s testable. Beg if necessary.'
+---
diff --git a/.felt/ngmix-update/ngmix-update.md b/.felt/ngmix-update/ngmix-update.md
index 723871e65..bd11d135f 100644
--- a/.felt/ngmix-update/ngmix-update.md
+++ b/.felt/ngmix-update/ngmix-update.md
@@ -1,9 +1,10 @@
---
-name: ngmix library upgrade + wrapper sync
+id: 01KTCHWZWY48FVCEDX6DGE6CTY
+name: ngmix library upgrade + Lucy wrapper sync
tags:
- shapepipe
- ngmix
- future
created-at: 2026-04-27T11:26:51.026191639+02:00
-outcome: 'Replace the pinned ngmix fork (a stable_version branch carrying not-yet-upstreamed fixes) with upstream ngmix once those land; reconcile the wrapper afterward.'
+outcome: 'Future: replace Axel''s stable_version fork with upstream ngmix; reconcile with Lucy''s cleaned-up wrapper from her visit'
---
diff --git a/.felt/prs-in-flight/prs-in-flight.md b/.felt/prs-in-flight/prs-in-flight.md
new file mode 100644
index 000000000..364db869f
--- /dev/null
+++ b/.felt/prs-in-flight/prs-in-flight.md
@@ -0,0 +1,78 @@
+---
+id: 01KTCHWZWYYTDSYHAP4FPE7V3V
+name: PRs in flight after v2 merge
+tags:
+ - shapepipe
+ - pr
+created-at: 2026-04-27T11:26:49.300097608+02:00
+outcome: 'Post-v2 + post-propagation: infra stream now landed (#718 setuptools, #719 uv-lockfile, #728 dependabot+SHA-pin), supply-chain hygiene done (20 → 0 alerts). Issue #712 empirically verified resolved against current `:develop` (all 11 packages in Martin''s May 18 list import in both read-only and writable sandbox modes); comment posted, awaiting Martin reply before closing. Science PRs still open: #714 develop-bugs (closes #709 + #711 only — #712 closes separately), #708 testing-scaffold (mine); #725 centroid shift (Axel), several older Martin PRs (#704 #703 #699 #660 #650 #636), #670 lbaumo file_io. Next thread: merge #714.'
+insights:
+ 714-already-redundant:
+ claim: 'Surprise from rebasing #714: its Dockerfile commit (cf304f8f, adding astroquery/numba/fitsio + setuptools<81 pin) was *already* redundant on current develop — the v2 merge silently put astroquery/numba/fitsio into pyproject and the v2 Dockerfile installs them via ''pip install -e ".[fitsio]"'' at the end. setuptools<81 went away via #718. So ''rebase to drop the obsolete commit'' wasn''t waiting on #719 — it was already obsolete the moment v2 merged. Worth checking sooner next time before assuming a fix is still load-bearing.'
+ xfail-mostly-fixable:
+ claim: 'Most #708 xfails are about to be resolved: canfar_monitor IndentationError (4 xfails) and summary_run -h (1 xfail) are fixed in #714; astroquery/numba/fitsio import xfails (5 modules) resolve in #719 because uv sync installs them from pyproject. Only stile/treecorr corr2 (4 modules) is a separate issue requiring stile removal or upstream patch.'
+ dependabot-policy:
+ claim: 'shapepipe now ships `.github/dependabot.yml` (#728) with 14-day cooldown, monthly grouped lockfile PRs, github-actions ecosystem opted in, and SHA-pinned actions across all four workflows. Reasoning lives in the file itself + the #728 PR body. Companion fiber [[shapepipe/sqlitedict-pickle-smell]] tracks the single dismissed alert.'
+ 712-empirically-resolved:
+ claim: 'Issue #712 is empirically resolved against current `ghcr.io/cosmostat/shapepipe:develop` (dev target, post-#728). Both the original packages (astroquery, numba, fitsio) and Martin''s May 18 follow-up list (scipy, joblib, importlib_metadata, tqdm, LSSTDESC.Coord, pyyaml, astropy_iers_data, pyerfa) import cleanly in both read-only and writable sandbox modes, as do the three originally-flagged runner modules. Pyproject confirms astroquery/numba/joblib/tqdm are core deps; the rest are transitives of astropy/mccd/modopt/galsim; fitsio is gated in both runtime (`--extra jupyter --extra fitsio`) and dev (`--extra dev`) targets. Comment posted; awaiting Martin reply before closing. Likely root cause of the May 18 report: cached/older image.'
+decisions:
+ setuptools-pin:
+ label: drop setuptools<81 pin
+ default: merged
+ options:
+ merged:
+ label: 'Already merged as #718 (c9e71df8) — small one-liner, agreed in transcript'
+---
+
+Snapshot of CosmoStat/shapepipe PR state, maintained as a living index.
+
+## Open — infra
+
+(All infra PRs landed. The dependabot stream is resolved; supply-chain
+posture set; SHA-pins in place. See [[shapepipe/sqlitedict-pickle-smell]]
+for the one open security-fiber.)
+
+## Open — issues (mine)
+
+| # | What | Status |
+|---|---|---|
+| #712 | Dockerfile missing runtime deps | Empirically resolved against current `:develop` ([comment](https://github.com/CosmoStat/shapepipe/issues/712#issuecomment-4562085977)). Both original list (astroquery/numba/fitsio) and Martin's May 18 follow-up (scipy/joblib/importlib_metadata/tqdm/LSSTDESC.Coord/pyyaml/astropy_iers_data/pyerfa) import cleanly in read-only + writable sandbox modes. Awaiting Martin reply before closing. |
+| #711 | summary_run -h crashes | Fixed by #714 (auto-closes on merge) |
+| #709 | canfar_monitor IndentationError | Fixed by #714 (auto-closes on merge) |
+
+## Open — mine (science / fixes)
+
+| # | Branch | What | Status |
+|---|---|---|---|
+| #731 | `chore/smoke-test-read-only` | smoke-test in read-only mode | Open. Adds `shapepipe_run_example` wrapper; CI now runs the entry-point smoke under `docker --read-only --tmpfs /tmp:rw`. See [[shapepipe/smoke-test-read-only]]. |
+| #714 | `fix/develop-bugs` | small develop bugs (#709, #711) | Open. Originally a multi-bug fix; the Dockerfile portion got absorbed into #719. Worth checking what's still load-bearing here vs already-fixed-upstream. |
+| #708 | `chore/testing-scaffold` | Tier 0–2 test scaffolding | Open. Some xfails should have flipped to xpass after the v2 + uv-lockfile work; needs a rebase + xfail-list audit. |
+
+## Open — others' PRs awaiting attention
+
+| # | Author | What |
+|---|---|---|
+| #741 | martinkilbinger / lbaumo | **Ngmix v2.0** — upstream ngmix 2.4.0 + Lucy's new classes. Canonical PR (CI mirror; fork PR #740 closed by Martin). OPEN, mergeable, CI green. Two-part review + next-steps triage delivered; 11 findings open (5 cut-and-dry, 5 decisions, 1 resume), 2 are merge-gates (weight-norm, `*_psfo`). See [[review-ngmix-v2-pr740]]. |
+| #725 | aguinot | Fix centroid shift (overlaps #741 centroid work — cross-ref/supersede?) |
+| #704 | martinkilbinger | Contributors |
+| #703 | martinkilbinger | V1.3.x |
+| #699 | martinkilbinger | Coverage mask |
+| #670 | lbaumo | file_io handles sextractor header |
+| #660 | martinkilbinger | Existing output directory |
+| #650 | martinkilbinger | Third-party catalogue for tile objects |
+| #636 | martinkilbinger | Rho statistics: flexible training/test split |
+
+## Recently closed
+
+- **#728** `chore/dependabot-config` — dependabot.yml + SHA-pin all actions. Merged 2026-05-28.
+- **#727, #726, #724, #722, #721, #720** — dependabot security bumps for idna/urllib3/gitpython/mistune/jupyter-server/jupyterlab. All squash-merged 2026-05-28 (see [[shapepipe/dependabot-pr-triage]]).
+- **#719** `chore/uv-lockfile` — merged 2026-05-05 (Martin).
+- **#718** `chore/drop-setuptools-pin` — merged.
+- **v2.0 PR** — merged. Source of the skaha/conda situation that #719 unwound.
+
+## Connections
+
+- [[shapepipe]] — root
+- [[shapepipe/docker-uv-revert]] — drove #719
+- [[shapepipe/dependabot-pr-triage]] — drove the 6 security-bump merges (closed)
+- [[shapepipe/sqlitedict-pickle-smell]] — future-work fiber for the one dismissed alert
diff --git a/.felt/review-ngmix-v2-pr740/deep-dive-report.html b/.felt/review-ngmix-v2-pr740/deep-dive-report.html
new file mode 100644
index 000000000..267991b77
--- /dev/null
+++ b/.felt/review-ngmix-v2-pr740/deep-dive-report.html
@@ -0,0 +1,199 @@
+
+
+
A fan-out of six investigators + two synthesizers chased the two science-shaped
+ findings from the #741 review to ground. Both came back high-confidence, with verified arithmetic
+ and — for the weight regression — an observed empirical confirmation. Each problem
+ now has a concrete recommendation and a clean scoping plan. Full reports:
+ weights-report.md and size-report.md in this fiber.
+
+
+
+
+
+
§ Problem A · weights / inverse-variance
+
v2.0 has two real, coupled weight-map regressions.
+
+
The live weight path is prepare_ngmix_weights (ngmix.py:871). Against
+ the v1 baseline (develop), v2.0 dropped two things:
+
+
+
Regr.
What changed
Effect
+
R1
Noise estimate regressed from the object-free windowed get_noise (still in the file at :826, now dead — zero callers) to whole-stamp sigma_mad(gal), contaminated by galaxy flux.
Size/flux-dependent mis-weighting of the χ² → biased g_err/T_err/s2n; the signature of a multiplicative shear bias m (cf. old-path m~-2.8e-2). Dominant today.
+
R2
Lost the binarization weight_map[weight_map != 0] = 1, so line 906 multiplies the raw weight by 1/sig_noise².
For 0/1 masks: per-epoch 1/Fscale² scale error. Latent hazard: feed a real inverse-variance map (THELI / BACKGROUND_RMS) and v2.0 double-counts the variance.
+
+
+
Observed, not inferred
+
Fed make_data's truth inverse-variance map (1/noise²) through the actual
+ function in the dev container:
+
+ noise=1e-3 → truth ivar 1e6 | recovered 8.81e11 | ratio 8.81e5 (correct = 1.0)
+ noise=1e-4 → truth ivar 1e8 | recovered 7.35e15 | ratio 7.35e7 (correct = 1.0)
+
+
The weight is off by ≈1/noise² — exactly the R2 double-count. The regression is a
+ clean red→green test target (make_data injects the oracle; no ngmix
+ fit needed). Today's only weight-path test asserts run(42)==run(42) — determinism,
+ never truth. Coverage gap.
+
+
+
recommendation — split into two PRs
+
In #741 (now): minimal v1-restore — reinstate the
+ binarization + call get_noise instead of sigma_mad(gal) — plus the
+ red→green unit test. Two lines; fixes a regression this PR introduced; restores robustness to
+ real ivar maps.
+
Separate PR (next): the in-house SExtractor
+ BACKGROUND_RMS baseline you steered toward (THELI is external). Mostly
+ config — the check-image & vignetmaker ME loops are already list-driven, so delivering the
+ RMS stamps is near-free; the earned work is rewriting prepare_ngmix_weights to use
+ per-pixel 1/RMS² with sigma_mad as fallback, + before/after validation.
+ Effort S–M, ~4–8 h. Closes #604.
+
+
+
+ Open questions for Martin (6)
+
+
Confirm SExtractor-BACKGROUND_RMS is the baseline, THELI a later add-on (non-blocking)?
+
RMS input optional (with sigma_mad/get_noise fallback) or required? (rec: optional)
+
#741 minimal fix: restore windowed get_noise (reintroduces the gal-guess machinery the PR may have deliberately stripped), or accept bare sigma_mad until #604? R1 is dominant, so it matters even short-term.
+
Does the RMS map feed anything beyond weight + noise-fill (e.g. background subtraction)? (rec: weight + noise-fill only)
+
Config-edit scope for #604: cfis only, or cfis + cfis_simu + canfar batch?
+
Merge bar for #604: smoke before/after on the example tile, or a full sim m/c calibration?
+
+
+
+
+
+
+
+
+
§ Problem B · r50 / T / σ size naming
+
The v2.0 “r50” columns aren’t half-light radii.
+
+
All five r50* columns are new in v2.0 (v1 wrote sizes only as
+ T). None is a true half-light radius:
+
+
+ galaxy r50 (l.409) = pars[4] = T = 2σ² — an area, bit-for-bit a copy of T
+ PSF r50psf (l.411) = √(T/2) = σ — a length, but off by ×1.1774
+ true half-light radius = 1.1774·σ = 1.1774·√(T/2) — no column equals this
+
+
+
So the same name root means an area on the galaxy side and a length
+ on the PSF side. Meanwhile the official catalogue paper (UNIONS-3500 WL I,
+ arXiv:2605.13549) reports the half-light radius r_h (= r50) as the primary
+ size for both galaxy and PSF, with the size cut written as the dimensionless ratio
+ 0.707 ≤ r_h/r_h,psf ≤ 3; T is given only as the derived DES area (Eq. 1).
+ The paper wants exactly the quantity these columns claim but fail to provide.
+
+
sp_validation reads none of the r50* columns (only the
+ dimensionless ratio T/Tpsf, which is robust) — so the mislabel harms nothing today,
+ but it's a live foot-gun for any future consumer and contradicts the paper.
+
+
+
recommendation — transform at the source
+
Make the ngmix writer emit honest columns: galaxy
+ r50 = 1.1774·√(T/2) with propagated error; PSF r50psf = existing σ ×
+ 1.1774. Keep all T* columns. Then every "r50" is a length meaning the same thing on
+ both sides, matching the paper. De-duplicate (T_psfo_ngmix≡Tpsf,
+ r50_psfo_ngmix≡r50psf).
+
Put the conversion web in cs_util once
+ (T_to_r50, r50_to_T, sigma_to_fwhm, corrected
+ T_to_fwhm) — the shared producer/consumer layer. Beats a pure rename (throws away
+ the paper's headline quantity) and a read-side-only fix (leaves the foot-gun on disk).
+
+
+
Bonus find: a real downstream bug in sp_validation
+
galaxy.py:T_to_fwhm is T/1.17741*2.355 — dimensionally wrong
+ (2.355/1.17741 = 2.000; it treats the area T as if it were σ). Its own
+ MKDEBUG comment already doubts it. It builds the fwhm_PSF regressor for the
+ scale-dependent PSF-leakage fit (run_object.py:510), so it distorts
+ the PSF-size axis and biases α(size). The correct form is
+ 2.355·√(T/2). A clean cs_util surface lets this be fixed and the trap retired.
+
+
+ Open questions for Lucy / Martin (6)
+
+
Standardize on r50-as-primary (per UNIONS-3500 I) while keeping T, or keep T as working column with r50 derived?
+
Drop or alias the duplicate columns — is anything outside sp_validation src/ (notebooks, older catalogues) bound to the _psfo_ngmix names?
+
cs_util as the home for the converter web — agreed? Naming?
+
Were in-flight scale-dependent leakage results produced with the buggy T_to_fwhm, and how much does α(size) move once fixed?
+
Galaxy r50_err: simple propagation, or from the full pars covariance?
+
Catalogue-format version bump / changelog for the r50* semantics change?
+
+
+
+
+
+
+
+
§ Where this leaves us
+
Two work-streams, both scoped.
+
+
Weights → a Codex shuttle + separate PR (the #604 BACKGROUND_RMS baseline), with the minimal v1-restore landing in #741 first. The science calls above are for Martin.
+
Size → transform-at-source in ngmix + a cs_util converter web; the sp_validation T_to_fwhm fix rides along. Touches the producer/shared/consumer stack — Lucy + Martin on the convention.
+
Still no Martin/Lucy reply drafted — that waits for Cail. These reports are the raw material for it.
+
+
+
+
+
+
diff --git a/.felt/review-ngmix-v2-pr740/draft-review.md b/.felt/review-ngmix-v2-pr740/draft-review.md
new file mode 100644
index 000000000..31cd8852e
--- /dev/null
+++ b/.felt/review-ngmix-v2-pr740/draft-review.md
@@ -0,0 +1,117 @@
+# Draft review — PR #740 "Ngmix v2.0" (staging, NOT posted)
+
+Staged for the call with Cail. Posting/pushing happens *with* him, in his voice.
+Authorship convention: "— Claude on behalf of Cail" unless he says otherwise.
+
+---
+
+## Reviewer checklist (Martin's template)
+
+| Box | Verdict |
+|---|---|
+| Targets `develop` | ✅ head `ngmix_v2.0` → base `develop` |
+| Assigned to developer | ⬜ (Cail to check on GitHub) |
+| Appropriate labels | ☑️ `enhancement` only — fine; could add a `dependencies` label given the ngmix bump |
+| Projects/milestones | ⬜ (Cail to check) |
+| Clear description | ⚠️ Summary is terse (4 bullets). For a +3894/−3410, 67-file PR that bumps a core dep and rewrites the shape-measurement module, a fuller description would help — esp. the centroid-bias result. |
+| "closes #" links | ❌ No issue links. #725 (Axel's centroid-shift PR) overlaps this work — should this PR close/supersede it? Worth an explicit cross-ref. |
+| Code/doc style | ⚠️ Mostly good. Pockets of debug cruft (see line-level). |
+| Docs updated | ❌ No `docs/` changes. The ngmix pin note in `CLAUDE.md` ("don't modernize this line") is now stale — this PR *is* the modernization. Needs updating on/with merge. |
+| **CI passing** | ✅ **GREEN** (full suite). Got here by pushing a same-repo mirror (PR #741, branch `ngmix_v2.0` on origin) and merging `develop` into it (clean except `uv.lock`, regenerated via `uv lock`) to pull in the modernized CI. `build-test-publish` ran every step green: images build (new `uv.lock` + ngmix 2.4.0 resolve), binary smokes, entry point, **pytest suite** (incl. develop's `test_ngmix.py`), example pipeline; publish correctly skipped on the PR. Original fork PR #740 got no CI (Actions approval gate). | The modern test-running workflow (`pull_request → develop`) doesn't queue because GitHub evaluates the trigger from the *head branch's* workflow file, which predates the modernization. `ci-release.yml` runs the full suite only on `pull_request → main/master`. **To run the full suite + example pipeline, the branch must have `develop` merged in** (pulls in modern CI + resolves drift). Original #740 got no CI at all because fork PRs don't trigger our Actions without maintainer approval. |
+| API docs built | ➖ n/a-ish (no public API doc surface changed materially) |
+| All files checked + comments | ✅ done in this review |
+
+---
+
+## Science-quality read (the part the checklist misses)
+
+### 1. ngmix 2.4.0 API correctness — ✅ verified empirically
+Installed ngmix **2.4.0** (from `esheldon/ngmix@v2.4.0`, the PR's own source) over the
+container stack and ran the **actual module code**. All 13 ngmix symbols the module
+uses resolve in 2.4.0 (`fitting.Fitter`, `guessers.TPSFFluxAndPriorGuesser`/`TFluxGuesser`,
+`runners.PSFRunner`/`Runner`, `metacal.MetacalBootstrapper`, `joint_prior.PriorSimpleSep`,
+`priors.*`, `observation.*`, `Jacobian`). The metacal/bootstrap path runs end-to-end.
+
+### 2. Shear recovery — ✅ unbiased
+Ran `do_ngmix_metacal` on 200 galsim galaxies (sub-pixel shifted, the centroid regime),
+recovered shear via the metacal response:
+- **m = +2×10⁻⁴ ± 3×10⁻⁴** (multiplicative bias consistent with zero)
+- **c₂ = −1×10⁻⁵** (additive consistent with zero), R₁₁≈R₂₂≈0.924, 200/200 fits OK.
+- Author's own `centroid_bias_v2.py` (noiseless, 60 trials, ±-cancellation): **m = +3.7×10⁻⁴ ± 0.5×10⁻⁴ (1σ)**, c₂ = (−0.3 ± 0.15)×10⁻⁵, R₁₁≈0.9234.
+- **Read:** few×10⁻⁴ residual m — small, likely within UNIONS requirements, but the noiseless limit resolves a marginal *positive* residual (≈7σ given tiny noiseless bars). Not the centroid fix (fix-off gives the same) → metacal/fitter higher-order. Worth stating the m-requirement we're holding this to.
+
+### 3. The centroid fix — present, correct in form, but **necessity not demonstrated in idealized sim**
+The fix (`make_ngmix_observation`, `ngmix.py:987-1004`) re-centers the galaxy Jacobian on
+the HSM `FindAdaptiveMom` centroid so the centroid prior (centered at the Jacobian origin)
+doesn't bias the fit. Sound idea. **But** running the same ensemble with the recenter
+disabled gives the *same* unbiased m (both ~3×10⁻⁴) — i.e. in this small-shift (±½ px),
+high-S/N regime the fix is harmless but not measurably necessary. **Question for the call:**
+in what regime does the bias the fix targets actually appear? The author's centroid scripts
+(`centroid_bias.py`, configs bug/fix/v2) presumably demonstrate it — is the before/after
+result captured anywhere we can point to? This is the single most important thing to pin down.
+
+### 4. `r50` vs `T` labeling — ⚠️ likely mislabel, worth clarifying
+In `compile_results` (`ngmix.py:415-417`): `output_dict[name]['r50'] = results[...]['pars'][4]`.
+For an ngmix `gauss` model, `pars[4]` is **T** (≈2σ², arcsec²), *not* a half-light radius —
+and `output_dict[name]['T']` stores the same value. And `r50_psfo = sqrt(T_psf/2)` is the
+Gaussian **σ**, not r50 (true Gaussian r50 = 1.1774σ). So columns named `r50*` actually carry
+T / σ. Downstream (sp_validation) needs to know what these columns mean. Is "r50" intentional
+shorthand, or should these be true half-light radii? (Constitution flagged "r50 as the size
+guess changed from T" — but the *guesser* still uses `T=0.25`; the change is in output column
+naming, not the guess. Worth confirming Martin's intent.)
+
+---
+
+## Code-review pass (multi-agent, high-effort) — findings
+
+Ran a 6-angle code-review over the ngmix delta. CI is green, so none of these break the
+test suite — but the suite is shallow on these paths. Ranked.
+
+### Correctness — confirmed
+1. **`get_guess` ImportError — shipped script dead on arrival.** `scripts/validation/centroid/centroid_bias.py:37` does `from ...ngmix import get_guess, get_noise`, but this PR removed `get_guess` from the module. The script ImportErrors on load. (Only `centroid_bias.py`; `test_centroid_shift.py` imports just `get_noise`.) Fix the import, restore `get_guess`, or drop the v1 script.
+2. **Reproducibility break — unseeded RNG.** `ngmix.py:948-949` `prepare_ngmix_weights` uses global `np.random.randn` for the noise image + masked-pixel fill, while the module deliberately switched to a seeded `self._rng` (`:266`). Same tile + same seed → different masked-pixel values and metacal noise → non-reproducible shears. Thread `rng` into `prepare_ngmix_weights`/`make_ngmix_observation`.
+3. **`*_psfo` columns now carry the metacal-RECONVOLVED PSF, not the original PSFEx/MCCD PSF.** `average_multiepoch_psf` reads `obsdict['noshear'][n_e].psf.meta['result']` — the reconvolved metacal PSF — and feeds `g_PSFo`/`T_PSFo`/`r50_PSFo`. But `compile_results` documents these as "the original image psf from psfex or mccd," and PSF-leakage / ρ-stat / null tests consume them as the input PSF. The old code fit the raw pre-metacal PSF separately. Confirm intent; the comment is now wrong either way.
+4. **`CHECK_EXISTING_DIR` resume silently dropped.** New `process()` never reads `self._check_existing_dir` or calls `get_last_id` (both now dead). A configured resume reprocesses from scratch and overwrites — a documented feature silently no-ops.
+
+### Correctness — plausible, needs Martin/Lucy intent (raise as questions)
+5. **Weight-map normalization changed.** Old `prepare_ngmix_weights` binarized the mask (`weight[weight!=0]=1`) then applied a flat `1/sig_noise²`. New keeps the per-pixel (already FSCALE-rescaled) weight AND multiplies by `1/sigma_mad(gal)²`, with `sigma_mad` taken over the *object-containing* stamp (flux-contaminated, vs the old windowed object-free `get_noise`). Looks like double inverse-variance + a biased noise estimate → shifts the fit's χ² weighting, errors, s2n, and PSF-averaging weights vs the validated v1. Intentional?
+6. **Per-type PSF columns collapsed.** `Tpsf`/`g1_psf`/`g2_psf` are now identical across all 5 metacal HDUs (all from `T_PSFo`/`g_PSFo`); old code stored per-type values. May be fine if the metacal target PSF is type-independent — but if any response/selection step differences them it now gets exactly zero.
+7. **Galaxy zero-pixel guard narrowed (`any`→`all`).** `prepare_postage_stamps:737` now skips an epoch only if `np.all(gal_vign==0)`; the old code skipped if *any* pixel was 0. Partial CCD-edge stamps (a band of exact-zero pixels) now enter the fit, relying on flags/weights to catch them. Intentional loosening or a regression?
+8. **PSF fit uses the galaxy joint prior + galaxy `FLUX_AUTO` as the PSF flux guess** (`do_ngmix_metacal`). Matches the upstream ngmix example's prior reuse, but the galaxy-flux guess for the PSF is odd; possible PSF-fit convergence/quality impact. Lower confidence.
+
+### Altitude / structure
+9. **Runner input-contract mismatch (4 runners).** `sextractor_runner`, `read_ext_sexcat_runner`, `psfex_interp_runner`, `vignetmaker_runner` moved the WCS/exp-numbers paths from named config options (`LOG_WCS`/`ME_LOG_WCS`/`ME_DOT_PSF_DIR`) to positional `input_file_list[i]` — but, unlike `ngmix_runner`, their `@module_runner` decorators were **not** updated. The real input contract now lives entirely in each config's `FILE_PATTERN`/`FILE_EXT` ordering; the shipped v2.0 example configs are correct, but a hand-written/v1-style config silently mis-maps inputs or IndexErrors with no schema to catch it. `ngmix_runner` is the model — extend the other four decorators to match.
+10. **`r50`/`T` mislabel (units).** `compile_results:415-417` stores `pars[4]` (= `T`, arcsec²) under `r50`/`r50_err`, and `r50_PSFo = sqrt(T_psf/2)` is the Gaussian σ, not the half-light radius (`r50 = 1.1774σ`). sp_validation consumes these size columns. Intentional shorthand or a mislabel?
+
+### Cleanups (bundle into one comment)
+- `save_results`: `ngmix.py:547` non-`f` string → error prints literal `{ext_name}`; leftover `print(...)` debug at `:564-565`; the `np.append`-per-batch FITS append is O(n²) over a tile.
+- Dead code: unreachable `print` after `return` in `MegaCamFlip` (`:293`); unused `sextractor_e1e2` (`:1131`, with a copy-pasted wrong docstring); `# I still don't know how to handle this` (`:63`); commented `#self._gal_vignet_path` block (`:240-244`); `# MKDEBUG` markers; redundant `self._f_wcs_path`.
+- `51 * 51` hardcoded masked-fraction denominator (`:766`) → use `v_flag_tmp.size`; near-term goal is configurable stamp sizes, so this will silently go wrong.
+- Duplication: `scripts/python/fitting.py` and `scripts/jupyter/test_centroid_shift.py` each redefine their own `make_data`/`get_prior`, duplicating `testing/simulate.py` + the module `get_prior` added in this same PR (defeats simulate.py's "stable across branches" purpose). `fitting.py` is a stray upstream-ngmix example (`fitting_bd_empsf.py` header, `espy`/`images` refs) — likely an unintended add; delete or make it import the canonical helpers.
+- Minor efficiency: doubled `galsim.Image(gal, scale=1.0)` in `make_ngmix_observation`; repeated uncached `SqliteDict[str(obj_id)]` deserialization per object/epoch; `get_exp_output_dirs` re-globs identical dirs when a runner repeats in `ME_IMAGE_EXP_RUNNERS`.
+
+---
+
+## Line-level (earlier staged notes — superseded by the code-review pass above for ngmix.py)
+
+- **`pyproject.toml`** — ngmix now `>=2.4` with `[tool.uv.sources] ngmix = git esheldon/ngmix tag v2.4.0`. Why pin to a git tag rather than the PyPI release `ngmix==2.4.0`? PyPI would be more reproducible and drop the git dependency. (PyPI was unreachable from here to confirm 2.4.0 is published — Cail can check.)
+- **`ngmix.py:63`** — stray `# I still don't know how to handle this` above `Tile_cat`.
+- **`ngmix.py:254`,`:528`-ish** — `# MKDEBUG` markers and `print(...)` debug statements (e.g. `:293` unreachable `print('rotating megapipe image')` after a `return`; `:564-565` `print("output_dict"...)` in an error branch). Clean up before merge.
+- **`ngmix.py:766`** — masked-fraction cut hardcodes `51 * 51`; if vignet size ever changes this silently breaks. Use `flag_vign.size`.
+- **`scripts/python/fitting.py`** — looks like a lightly-edited copy of esheldon's `fitting_bd_empsf.py` example; docstring still references `fracdev` / `fitting_bd_empsf.py` and a bulge+disk fit that no longer applies. Either trim to match or drop — is it meant to ship?
+- **`scripts/jupyter/test_centroid_shift.py`** — jupyter-exported (cell markers `# In[14]:`, commented-out code, inline-duplicated `make_data`). Now that `shapepipe.testing.simulate.make_data` exists, this could import it instead of carrying its own copy. Ship as-is or tidy?
+- **`src/shapepipe/testing/simulate.py`** — 👍 clean, dependency-light (galsim+numpy only), gives exactly the cheap validation path. Nice addition.
+- **Supporting runners** (`psfex_interp_runner`, `vignetmaker_runner`, `ngmix_runner`) — the `$SP_EXP`-tree resolver (`get_exp_output_dirs`) with config-gated fallback to the v1 run-log path is clean and backward-compatible. `f_wcs_path` now threaded via `input_file_list` consistently. No concerns.
+
+---
+
+## Suggested PR-level summary comment (draft)
+> Reviewed the diff and exercised it on candide against **ngmix 2.4.0** (installed from the
+> PR's own source). The 2.x API is used correctly and the metacal path runs end-to-end; on
+> simulated galaxies the v2.0 fitter recovers shear with **m ≈ 2×10⁻⁴** (consistent with zero).
+> Two things before I can tick the checklist: (1) **CI hasn't run** — it's a fork PR, needs an
+> "Approve and run" or a push to origin; (2) I couldn't reproduce a centroid bias in the
+> idealized sim with the fix *off*, so I'd like to understand the regime where the centroid fix
+> matters — is the before/after result from `centroid_bias.py` captured somewhere? Plus a few
+> cleanups (debug prints, the `r50`/`T` column naming, stale `CLAUDE.md` ngmix note). Details inline.
+> — Claude on behalf of Cail
diff --git a/.felt/review-ngmix-v2-pr740/report.html b/.felt/review-ngmix-v2-pr740/report.html
new file mode 100644
index 000000000..3004d77f9
--- /dev/null
+++ b/.felt/review-ngmix-v2-pr740/report.html
@@ -0,0 +1,248 @@
+
+
+
+
+
+Report — review-ngmix-v2-pr740
+
+
+
+
+
+
+
+
+
The two-part review is delivered, CI is green, and Martin has consolidated the
+ work onto PR #741 — closing the fork PR #740. This morning (06-05)
+ Martin replied to six of the eleven findings: he greenlit two cleanups, explained two
+ changes as intentional, pointed the weight-map concern at a long-standing issue, and poked Lucy
+ on the size-column naming — explicitly asking us to check the r50/T
+ arithmetic. I ran that check and confirmed the bug. Five findings — including
+ the second merge-gate, *_psfo — still have no response. This report carries the
+ updated dispositions and the sharpened sequence.
+
+
+
+
+
§ Current state
+
#741 is the PR of record — green, mergeable, and Martin has started answering.
+
+
Topology. Martin closed #740 (his fork branch
+ martinkilbinger:ngmix_v2.0) on 06-03 and consolidated onto #741 —
+ the CI mirror on CosmoStat:ngmix_v2.0. #741 is the canonical PR:
+ OPEN, MERGEABLE,
+ CI green (build-test-publish pass, full suite). It carries
+ the three part-1 fixes (seeded-RNG thread, scripts-import smoke, v1
+ centroid_bias.py deletion) and the develop merge. No code has
+ changed since the part-2 review — every finding still sits at its anchored line.
+
+
Martin engaged this morning (06-05, 07:14–07:33Z) — six inline replies across
+ the findings, his first real pass on the science:
+
+
Greenlit two cleanups.254 resume:
+ "a hack to resume interrupted ngmix runs… can be removed now, and if needed implemented
+ better within the module."766 hardcoded 51*51:
+ "yes would be good to not hard-code this", pointing at the existing
+ STAMP_SIZE = 51 / VIGNET_SIZE = 51 config convention.
+
Explained one change as intentional.737 zero-pixel guard:
+ "the any == 0 test had pass close to no postage stamps. I thought this was too
+ restrictive." So any→all was deliberate (loosen, not tighten).
+
Pointed the weight-map concern at a standing issue.949:
+ links #604 — Lucy's 2023
+ "weight handling" issue. That thread establishes the known problem: megapipe weights
+ are 0/1 masks, ngmix needs inverse-variance, and the proper fix is a real RMS map from
+ SExtractor BACKGROUND_RMS (or THELI weights), not a per-stamp
+ sigma_mad estimate. Context, not a direct answer — see the triage note.
+
Opened the size-naming question and poked Lucy.415:
+ "the intent is to resolve the confusion about 'T'… We need to check that pars[4]
+ is indeed 'T', and that r50 should be 1.1774·σ and not just
+ σ. Would be good to not hard-code these transformations — functions exist in
+ sp_validation:galaxy.py, we'd move them to cs_util." + "Poking
+ @lbaumo".
+
+
+
I ran Martin's check. The model is gauss (both PSF and galaxy),
+ and the code itself settles the question: line 911 builds
+ galsim.Gaussian(sigma=np.sqrt(pars[4]/2)), which is only valid if
+ pars[4] = T = 2σ². So yes, pars[4] is T — and the
+ r50 columns are worse off than the question implies:
+
galaxy r50 (l.415) = pars[4] = T = 2σ² — an area, arcsec²
+ PSF r50_PSFo (l.676) = sqrt(T/2) = σ — the std-dev, arcsec
+ true half-light radius = 1.1774·σ — neither column equals this
+
The two r50 columns are on different scales (T vs σ) and neither
+ is a half-light radius. Martin's instinct was right; the fix is a single honest
+ transformation, ideally the one moved into cs_util.
+
+
Still unanswered (5):293 dead print, 1140
+ sextractor_e1e2, the 4 runner decorators, fitting.py,
+ 1068 PSF prior/flux — and crucially 1045
+ *_psfo reconvolved-PSF, the second merge-gate, which is the one open item
+ that genuinely changes what downstream null-tests read.
+
+
+
+
+
+
§ Triage
+
The eleven findings, by what unblocks them.
+
+
Martin has now spoken to six; five are still open. The ✓ column marks his
+ morning disposition. The action cut is still by kind:
+
+
A · cut-and-dry mechanical, no science — push as one cleanup
+ commit. B · decision needs Martin/Lucy intent; some move
+ downstream science. C · resume dead feature — decide
+ rewire vs remove.
+
+
+
Loc
Finding
Kind
Martin ✓
Action
+
+
293
Unreachable print after return in the megapipe-rotate branch.
A
—
Delete the line.
+
1140
sextractor_e1e2 is unused and its docstring is copy-pasted from another function.
A
—
Drop it, or fix the docstring if kept.
+
766
Masked-fraction cut divides by a literal 51 * 51; breaks when stamp size becomes configurable (on the roadmap).
A
agreed — make configurable; use the STAMP_SIZE/VIGNET_SIZE config convention.
Add STAMP_SIZE config (default 51); divide by it, not the literal.
+
sextractor_ runner:74 +3
WCS moved to a positional input_file_list[-1] in 4 runners (sextractor, read_ext_sexcat, psfex_interp, vignetmaker) but their @module_runner decorators weren't updated (only ngmix_runner was). Shipped configs are correct, so not a live bug — but the decorator is the input contract.
A
—
Declare the WCS input in each decorator's file_pattern/file_ext.
+
fitting.py
Near-verbatim copy of esheldon's ngmix example; not wired to the pipeline, and duplicates the new testing/simulate.py helpers it ships alongside.
A
—
Confirm it's not meant to ship → drop, or trim to import the canonical helpers.
+
+
1045
*_psfo now carries the metacal-reconvolved PSF.g/T/r50_PSFo read from obsdict['noshear'].psf.meta (reconvolved), and the per-type PSF columns collapse to one value. PSF-leakage / ρ-stat / null tests consume these as the input PSF; the column doc still says "original image psf."
B
— unanswered 2nd merge-gate
Confirm null-test consumers tolerate reconvolved-PSF columns, or restore the raw pre-metacal PSF fit. Fix the doc either way.
+
949
Weight-map normalization changed vs v1.sigma_mad(gal) is taken over the whole (flux-contaminated) stamp where v1 used the windowed, object-free get_noise (still defined, now unused); and the per-pixel weight map is then multiplied by 1/sig_noise² — reads as double inverse-variance. Shifts χ²-weighting, reported errors/s2n, and PSF-averaging weights.
B
→ #604 — known weight-handling issue (0/1 masks → need real inv-var). Contextualizes, doesn't resolve the v1-regression sub-finding.
Distinguish the two: (a) the long-game proper-variance fix lives in #604; (b) the whole-stamp sigma_mad + double 1/sig² vs v1's windowed estimate is a this-PR question.
+
415 / 676
r50 columns aren't half-light radii. Verified: pars[4]is T (l.911 uses Gaussian(σ=sqrt(pars[4]/2))). Galaxy r50 = T = 2σ² (area); PSF r50_PSFo = sqrt(T/2) = σ. Different scales; neither is the true r50 = 1.1774·σ. sp_validation consumes these size columns.
B
verified + poked Lucy. Martin wants the transform un-hard-coded (move sp_validation:galaxy.py → cs_util).
One honest transform in cs_util: galaxy 1.1774·sqrt(pars[4]/2), PSF 1.1774·sqrt(T/2) — or rename to T/sigma. Cail's call as consumer.
+
737
Zero-pixel guard loosened any→all. v1 skipped an epoch if any pixel was exactly 0; now only if the whole stamp is 0. Partial CCD-edge stamps now enter the fit.
B
intended — "any==0 passed close to no postage stamps, too restrictive" (commit d01503d0).
Resolved as deliberate. Residual: is a fractional-zero threshold wanted, or is all==0 fine? Low priority.
+
1068
PSF fit reuses the galaxy joint prior + galaxy FLUX_AUTO as the PSF flux guess. Mirrors the upstream example, but a galaxy-flux guess for the PSF is unusual. Lower confidence.
B
—
Sanity-check PSF-fit convergence; confirm or adjust the guess.
+
+
254 / 257 / 464
CHECK_EXISTING_DIR resume is dead._check_existing_dir is set but never read in process(); get_last_id is defined but never called. A configured resume silently reprocesses from scratch and overwrites.
C
remove — "a hack to resume interrupted runs… can be removed now, implemented better in future if needed."
Remove the option + get_last_id + the MKDEBUG marker. Folds into the Bucket-A commit.
+
+
+
Tally: 6 answered (2 greenlit cleanups, 1 verified, 1 intended, 1 → #604, 1 RNG ack) · 5 still open, of which 1045 *_psfo is the lone unanswered merge-gate.
+
+
+
+
+
+
§ Recommended sequence
+
How to get #741 mergeable.
+
+
+
Land the cleanups + Martin's greenlit items (Bucket A+C). One commit on
+ #741: drop the unreachable print (293) and dead sextractor_e1e2
+ (1140), fix the four runner decorators, drop/trim fitting.py, remove the
+ CHECK_EXISTING_DIR resume + get_last_id (254, Martin: remove)
+ and the MKDEBUG marker, and add the STAMP_SIZE config
+ replacing the literal 51*51 (766, Martin: agreed). Pure mechanical, keeps CI green,
+ narrows the review surface to the two real science questions. (Cail's push offer stands.)
+
+
Resolve the size columns (415/676). Verified bug — the fix is one honest
+ transform. Either land it in cs_util as Martin suggested (galaxy
+ 1.1774·sqrt(pars[4]/2), PSF 1.1774·sqrt(T/2)) or rename the columns to
+ T/sigma. Cail decides as the sp_validation consumer; Lucy's been poked.
+
+
The lone unanswered merge-gate: *_psfo (1045). Martin/Lucy
+ still need to confirm that the null-test / ρ-stat consumers tolerate the metacal-reconvolved
+ PSF in g/T/r50_PSFo, or restore the raw pre-metacal fit. Doc fix either way. This
+ is the one open item that changes what downstream reads.
+
+
Weight-norm (949) — split the thread. The proper-variance redesign is
+ tracked in #604 (post-merge,
+ long game). But ask separately whether the whole-stamp sigma_mad + double
+ 1/sig² is an intended v2.0 change or a regression from v1's windowed estimate — a
+ this-PR question #604 doesn't cover.
+
+
Loose ends: PSF prior/flux guess (1068) sanity-check; zero-pixel guard
+ (737) resolved as intended. Merge #741 — Cail/Martin's gesture, never the
+ worker's — then delete the stale fork branch behind closed #740.
+
+
+
The honest read on "mergeable today": CI is green and nothing here breaks the
+ suite. With Martin's morning pass, the merge-gate count drops to one clear item —
+ *_psfo (1045), which silently changes the PSF columns sp_validation / null-tests
+ consume — plus the size-column fix (415/676), now verified and mechanical once the convention is
+ chosen. Weight-norm (949) is reframed: the design fix is #604's (post-merge), leaving only the
+ narrower "is the sigma_mad change a regression?" question. Everything else can merge-then-iterate.
+
+
Housekeeping still open (from
+ the part-2 summary, none blocking): no closes # link (overlaps Axel's #725 centroid
+ work — cross-ref or supersede?); no docs update; the CLAUDE.md "don't modernize
+ ngmix" note is now stale; the pyproject ngmix pin should land on a tagged
+ esheldon/ngmix release rather than a fork branch.
+
+
+
+
+
+
§ Open questions
+
For Cail.
+
+
Push Bucket A now? Martin greenlit 254 (remove) and 766 (configurable);
+ the rest are mechanical. One commit on #741 clears five-plus findings and narrows the review to
+ the science. Push it, or wait for an explicit "go the whole batch"? (My read: push — it's
+ what Martin's replies invite.)
+
r50/T — your call as consumer. Honest transform in cs_util
+ (1.1774·sqrt(T/2), what Martin leans toward) vs rename columns to
+ T/sigma. The transform keeps the r50 name meaningful;
+ the rename is blunter but unambiguous. Which churns sp_validation less for you?
+
*_psfo (1045) — the open gate. Do you already know whether
+ the null-tests can read a reconvolved PSF, or is this a real question for Martin/Lucy? If the
+ former, we can answer it in the thread now.
+
Weight-norm framing. Agree that #604 owns the long-game fix and we only
+ press Martin on the narrower sigma_mad-vs-v1 regression question?
+
How do you want to reply? Inline thread replies to Martin's six (in your
+ voice / "— Claude on behalf of Cail"), a summary comment, or talk it through here first and I
+ draft once we've landed the calls?
+
+
+
+
+
+
diff --git a/.felt/review-ngmix-v2-pr740/review-ngmix-v2-pr740.md b/.felt/review-ngmix-v2-pr740/review-ngmix-v2-pr740.md
new file mode 100644
index 000000000..c44543eb8
--- /dev/null
+++ b/.felt/review-ngmix-v2-pr740/review-ngmix-v2-pr740.md
@@ -0,0 +1,75 @@
+---
+id: 01KTCHWZX873VG28AA663A3ZQE
+name: 'Review + work: ngmix v2.0 (PR #740)'
+status: closed
+tags:
+ - constitution
+ - shapepipe
+ - ngmix
+ - review
+created-at: 2026-06-01T11:50:17.967872934+02:00
+closed-at: 2026-06-05T23:59:38.502Z
+outcome: 'INTERACTIVE — live with Cail. Bucket-A cleanups PUSHED to ngmix_v2.0/#741 (dd4f656a..bd60dc8e: 293/1140/254-resume/766 + fitting.py deleted); CI running. Dev image built (/n17data/cdaley/containers/shapepipe-dev) + ngmix 2.4.0 installed → full suite + example pipeline now runnable. Ran a 6+2-agent WORKFLOW on the two hard problems → two decision-ready reports in fiber: weights-report.md, size-report.md, deep-dive-report.html (sent to Cail). WEIGHTS (#604+949): found TWO coupled regressions in prepare_ngmix_weights (ngmix.py:871) — R1 noise estimator regressed from object-free windowed get_noise (now DEAD at :826) to flux-contaminated whole-stamp sigma_mad; R2 lost the v1 binarization → double-counts a real inverse-variance map. EMPIRICALLY CONFIRMED (truth ivar 1e6 → recovered 8.8e11, ratio≈1/noise²). Clean red→green test target via make_data. Rec: SPLIT — minimal v1-restore (reinstate binarization + get_noise) + test in #741; SExtractor BACKGROUND_RMS baseline as a SEPARATE PR (Codex shuttle, ~4-8h, closes #604). SIZE (r50/T): galaxy r50=pars[4]=T (area), PSF r50psf=σ; neither is 1.1774σ; all 5 r50* cols new in v2.0. UNIONS-3500 WL I (arXiv:2605.13549) reports half-light radius r_h as PRIMARY. Rec: TRANSFORM at source (honest r50 in ngmix) + cs_util converter web; bonus find — sp_validation galaxy.py:T_to_fwhm is dimensionally wrong (feeds the scale-dependent PSF-leakage fit). Both work-streams now DISPATCHED as shuttles (Cail''s go): [[ngmix-weights-ivar]] (Codex — minimal #741 regression fix w/ red→green test + the SExtractor BACKGROUND_RMS baseline as a separate PR closing #604) and [[ngmix-size-columns]] (claude-opus — honest r50 at the ngmix source + cs_util converter web + the sp_validation T_to_fwhm leakage fix). Workers prepare branches/commits/PR-descriptions + reports; merge/PR-creation stays Cail''s. STILL OPEN for the eventual #741 reply (not in these streams): *_psfo reconvolved-PSF merge-gate (1045) + the runner-decorator contract (testable via the example pipeline). PLAN: let the two shuttles finish → reassess → write the Martin reply. No reply drafted yet.'
+horizon: now
+shuttle:
+ enabled: true
+ kind: oneshot
+ interactive: true
+ host: candide
+ project_dir: /automnt/n17data/cdaley/unions/shapepipe
+ agent: claude-opus
+---
+
+# Review + work: ngmix v2.0 (PR #740)
+
+Martin requested Cail's review on **[CosmoStat/shapepipe #740 "Ngmix v2.0"](https://github.com/CosmoStat/shapepipe/pull/740)** (head `ngmix_v2.0` → base `develop`, +3894/−3410 across 67 files). This is the PR that **realizes [[ngmix-update]]**: replace Axel's `stable_version` ngmix fork with **upstream ngmix 2.4.0** and adopt **@lbaumo's (Lucy Baumont's) new ngmix classes + interface**, reconciling the cleaned-up wrapper from her visit. Track it in [[prs-in-flight]]; it sits in the broader Martin collaboration under [[shapepipe]].
+
+## Desired State
+
+PR #740 has a **thorough, scientifically-grounded review**, and the collaboration moves forward:
+
+1. **A structured review exists**, worked against Martin's reviewer checklist (targets `develop` ✓, labels, clear description, "closes #" links, code/doc style, docs updated, **CI passing**, all changed files read with comments) — *and* against a **science-quality bar** that the checklist doesn't capture (below).
+2. **Comments are posted to GitHub** in Cail's voice (signed "— Claude on behalf of Cail" per the authorship convention, or Cail's own voice as he chooses on the call) — line-level where it's a specific code point, PR-level for the broader read.
+3. Where Cail decides, **fixes/suggestions are pushed** (as PR review suggestions or a branch off `ngmix_v2.0`) or **requested** from Martin/Lucy. **Merge/approve is Cail's gesture, never the worker's.**
+
+### The science-quality bar (what the checklist misses)
+
+The diff is dominated by `src/shapepipe/modules/ngmix_package/ngmix.py` (~1252 lines changed) — the module overhaul. The review must actually reason about the science, not just the diff mechanics:
+
+- **ngmix 2.4.0 API correctness.** The 1.x→2.x ngmix API changed substantially (observation/fitter/bootstrapper interfaces). Does the new code use the 2.4.0 API correctly — priors, PSF fitting, the metacal/bootstrapping path, guesses? @lbaumo's new classes: are the abstractions sound, and does the interface match how shapepipe calls them (`ngmix_runner.py`)?
+- **The centroid-bias fix.** A large part of this PR is centroid-bias work (`scripts/validation/centroid/centroid_bias{,_v2}.py`, the bug/fix/v2 configs, `test_centroid_shift.py`). Centroid shift biases shapes — this matters for UNIONS shear. **Does the fix actually remove the bias?** Look for the before/after evidence; if it's not in the PR, that's the key thing to ask for (or to *run* on candide).
+- **`r50` as the galaxy-size guess** (changed from `T`). Reasonable? Convergence/robustness implications for the fitter guesses.
+- **New `src/shapepipe/testing/simulate.py` + `scripts/python/fitting.py`** — what do they test/simulate, and do they give a cheap way to validate the v2.0 fitter independent of a full pipeline run?
+
+### Run it on candide (the repo is right here)
+
+`project_dir` is the live shapepipe checkout. Don't review the diff cold — exercise it:
+
+- `gh pr checkout 740` (or fetch `martin/ngmix_v2.0`) into a worktree so `develop`/`docs-rework` stays clean.
+- Run the **ngmix module / its tests**; run `test_centroid_shift.py` and the centroid configs if feasible; check the v2.0 fitter produces sane output on a small input. Use the repo's `CLAUDE.md` for build/run conventions and the `uv` lockfile.
+- Confirm **CI status** via `gh pr checks 740`.
+- Capture what you ran and what you saw — that empirical check is the most valuable part of the review (the curious eye; don't report a result you didn't observe).
+
+## Interactive — how this runs
+
+`claude-opus`, **interactive**. Do the full autonomous prep first: read the diff (`gh pr diff 740`), read the changed `ngmix.py` / runner / Lucy's classes closely, run what you can on candide, and assemble (a) the checklist pass and (b) the science-quality read, with specific line-level notes staged as *draft* comments. **Then hold and wait for Cail to attach** — he'll talk through the science, sharpen the comments, and decide what's posted, what's pushed, and whether to approve. Posting to GitHub and any push happen *with* him, in his voice; the worker drafts, Cail directs. Do not close or `kill` until Cail signals done.
+
+## Context
+
+- **Repo:** `/automnt/n17data/cdaley/unions/shapepipe` on candide. Remotes: `origin` = CosmoStat/shapepipe, `martin` = martinkilbinger/shapepipe-1. `gh` is authenticated here.
+- **PR #740** by Martin Kilbinger, opened 2026-06-01; head `ngmix_v2.0`, base `develop`; OPEN. Reviewer checklist is in the PR body.
+- **@lbaumo = Lucy Baumont** — her ngmix classes/interface are central; her wrapper cleanup came out of her visit (see [[ngmix-update]]).
+- **Prior art / related:** [[ngmix-update]] (the future-intent this realizes), [[prs-in-flight]] (the PR tracker — add #740 and its disposition), [[shapepipe]] (the Martin collaboration root). The repo `CLAUDE.md` carries build/test/run conventions — read it.
+- **Authorship:** GitHub comments in Cail's voice, signed "— Claude on behalf of Cail" unless Cail asks otherwise on the call.
+
+## Round 2 (2026-06-05): next-steps report from Martin's responses — SUPERSEDES the merge-comment scope for THIS round
+
+**Cail's ask (night of 2026-06-05):** read Martin's comments/responses to the standing two-part review (now on #741), do a *further code-review pass*, and produce a **report of recommended next steps** given where Martin has landed.
+
+**Desired State (round 2):**
+- Read every Martin (`martinkilbinger`) response across **#740** and **#741** — inline replies, review threads, and any new commits/force-pushes since the part-2 review (2026-06-03). Note the `# MKDEBUG` markers still in the code. (As of dispatch, Martin's engagement is light: one COMMENTED review + one inline reply at `ngmix.py:951` "Good. This probably was previous code before we changed to a random seed per tile." Verify whether more has landed.)
+- **Further code-review pass** over the ngmix module + the 4 runners whose decorator contracts part-2 flagged (`sextractor_runner` et al.): for each of the 11 line-anchored findings, determine whether Martin's v2.0 now addresses it or it still stands.
+- Produce a **next-steps report**: `report.html` in the fiber dir **and** a posted PR summary comment on #741 (Cail's voice / "— Claude on behalf of Cail"). For each open item: `{resolved by Martin | still open | needs a decision}`, owner (`Cail / Martin / Lucy`), and the **recommended sequence to get #740/#741 mergeable**.
+- **REPORT ONLY — do NOT push new commits to the ngmix branch this round.** Pure analysis + PR comment. Keeps it collision-free while other shapepipe/pure_eb work is in flight.
+
+**Exit:** autonomous — produce the report + PR comment, then close for Cail's review.
diff --git a/.felt/review-ngmix-v2-pr740/size-report.md b/.felt/review-ngmix-v2-pr740/size-report.md
new file mode 100644
index 000000000..e43e997c3
--- /dev/null
+++ b/.felt/review-ngmix-v2-pr740/size-report.md
@@ -0,0 +1,94 @@
+# ngmix v2.0 size-column naming: ground truth, official convention, recommendation
+
+**Scope:** the five new `r50*` size columns added in ngmix v2.0 (PR #741), what they actually contain, what the official UNIONS papers report, and how to make the producer (shapepipe) → shared (cs_util) → consumer (sp_validation) stack honest and consistent.
+
+**Verification:** code claims verified in the worktree; conversion arithmetic verified numerically; paper convention from arXiv:2605.13549 and arXiv:2204.04798; downstream consumption from the sp_validation source tree.
+
+**Headline:** the v2.0 `r50` columns are mislabeled — galaxy `r50` == the area `T`; PSF `r50psf` == `σ` (not `1.1774σ`). Cleanest fix: make the ngmix writer emit honest, correctly-converted columns at the source (true half-light radius `r50 = 1.1774·√(T/2)` for both galaxy and PSF, alongside the existing `T`), rather than a rename or a downstream-only transform. A *separate* load-bearing bug lives in sp_validation's `T_to_fwhm`, which a clean r50/σ surface lets us retire.
+
+---
+
+## 1. Code ground truth — what each size column actually holds
+
+In ngmix `gauss`, `pars = [cen1, cen2, g1, g2, T, flux]` and **`pars[4] = T = 2·σ²` is an AREA** (arcsec²). The half-light radius is `r50 = 1.1774·σ = 1.1774·√(T/2)`. The `σ = √(T/2)` mapping is confirmed twice in-file (`galsim.Gaussian(sigma=np.sqrt(guess[4]/2))`; `r50_psfo = np.sqrt(max(psf_res['T_psf'],0)/2)`).
+
+| Column (FITS) | Source | Value in σ | True r50 = 1.1774σ? | Status |
+|---|---|---|---|---|
+| `T`, `T_err` | `results[…]["T"]` | `2σ²` (area) | — | Correctly named area |
+| `r50`, `r50_err` | `pars[4]`, `pars_err[4]` | **`2σ²` (area)** | **No** | **MISLABELED: == `T`/`T_err` bit-for-bit. Neither √ nor ×1.1774 applied.** |
+| `Tpsf` | `T_PSFo` | `2σ_psf²` (area) | — | Correctly named area |
+| `T_psfo_ngmix`, `T_err_psfo_ngmix` | `T_PSFo`, `T_err_PSFo` | `2σ_psf²` | — | Correct; `T_psfo_ngmix` **duplicates** `Tpsf` |
+| `r50psf` | `r50_PSFo` = `√(T_psf/2)` | **`σ_psf` (length)** | **No — it's σ, off by 1.1774×** | Genuine length, missing the factor |
+| `r50_psfo_ngmix` | `r50_PSFo` | `σ_psf` | No — it's σ | **Duplicates `r50psf`** |
+| `r50_err_psfo_ngmix` | `T_psf_err/(2·r50_psfo)` | `σ_psf_err` | No — error on σ | ~~Correct error-prop of σ~~ **also a factor-2 over-estimate: dσ/dT = 1/(4σ), not 1/(2σ)** (caught in [[shapepipe/ngmix-size-columns]] implementation); NaN when σ=0 |
+
+**Headline hazard:** the same name root `r50` means an **area** on the galaxy side and a **length** on the PSF side. A user ratioing galaxy `r50` against `r50psf` divides an area by a length. **Zero columns in the file are a true half-light radius** — every "r50" is either `T` (galaxy) or `σ` (PSF).
+
+**Arithmetic check (verified):** `√(2·ln2) = 1.17741`; `2.355/1.17741 = 2.000` (why the downstream `T_to_fwhm` is dimensionally wrong — §4).
+
+**v1 provenance:** `origin/develop` wrote sizes **only as `T`**. **All five `r50*` columns are NEW in v2.0**, and the galaxy two were wired to the area `pars[4]` — introduced as mislabeled duplicates.
+
+---
+
+## 2. Official convention + downstream consumption
+
+**UNIONS-3500 WL I — A Galaxy Shape Catalogue (arXiv:2605.13549), the current shape-catalogue paper:**
+- Reports the **half-light radius `r_h` (= ngmix `r50`)** as the **PRIMARY** size for **both galaxy and PSF**.
+- Size cut is the dimensionless **`0.707 ≤ r_h/r_h,psf ≤ 3`**, applied *inside Metacalibration* (enters the selection response).
+- Resolution factor `R = r_PSF/(r_PSF + r_h)` — written in `r`, not `T`.
+- **`T` is derived only**, via Eq. 1 `T = (r_h²/ln2)·(1+g1²+g2²)/(1−g1²−g2²)`, "to relate r_h to the DES size definition." Round-object case `T = r_h²/ln2 = 2σ²` — identical to standard ngmix `T = 2σ²`.
+
+**Guinot et al. 2022 (arXiv:2204.04798):** mixed — galaxy cut in `T` (`T_gal/T_PSF > 0.5`), PSF size `T = 2σ²`, but ngmix fit param + a prior in `r50` (arcsec).
+
+**sp_validation:** treats galaxy/PSF size only as the dimensionless **ratio** `T/Tpsf` (size cut, DES-weight & leakage binning) — robust regardless of absolute meaning — plus one place where PSF `T` is converted to FWHM for the leakage fit (§4). **It reads NONE of the five `r50*` columns** (0 hits in `src/`).
+
+**Takeaway:** the official paper wants `r_h` (= r50) as the headline size — exactly the quantity the v2.0 `r50*` columns *claim* to provide but don't. Standardize on **r50 = half-light radius as primary, T = 2σ² as the derived DES area.**
+
+---
+
+## 3. Recommendation — TRANSFORM at the source, thin cs_util surface
+
+Three options: a **pure RENAME** (`r50`→`T`/`σ`) is honest but discards the chance to ship the paper's headline quantity; a **cs_util-only TRANSFORM** (fix on read) leaves a foot-gun baked into every FITS file; **TRANSFORM at the source** (recommended) makes the writer emit honest, paper-consistent columns with cs_util holding the convention once.
+
+### 3a. ngmix.py — every size column honest and on the same scale
+Keep all `T*` columns unchanged (correct areas). Change the radius columns to true half-light radii:
+- **Galaxy:** `r50 = 1.1774·√(T/2)` (= `√(ln2·T)`); error `r50_err = (1.1774/(2√2))·T_err/√T`. *(Currently append the area `pars[4]`/`pars_err[4]`.)*
+- **PSF:** multiply existing σ values by `1.17741` so `r50psf`/`r50_psfo_ngmix`/`r50_err_psfo_ngmix` are true half-light radii, not σ.
+
+After this, **every "r50" column is a length meaning the same thing (1.1774σ)** on both sides, matching UNIONS-3500 I Eq. 1 for round objects.
+
+### 3b. De-duplicate
+`T_psfo_ngmix` duplicates `Tpsf`; `r50_psfo_ngmix` duplicates `r50psf`. Retire the redundant names (or document as aliases). Cleanest end state — one name per quantity: galaxy `{T, T_err, r50, r50_err}`, PSF `{Tpsf, Tpsf_err, r50psf, r50psf_err}`. Also fix the asymmetry: a `T_err_psfo_ngmix` exists but no `Tpsf_err` twin.
+
+### 3c. cs_util — single source of truth for the size web
+Expose thin, correctly-named, tested converters (home: `cs_util`, since `sp_validation/galaxy.py` already does `from cs_util import cfis`, and cs_util is the shared layer):
+
+```
+T_to_r50(T) = 1.1774 * sqrt(T / 2) # = sqrt(ln2 * T)
+r50_to_T(r50) = 2 * (r50 / 1.1774)**2
+sigma_to_fwhm(s) = 2.355 * s # already correct in galaxy.py
+T_to_fwhm(T) = 2.355 * sqrt(T / 2) # CORRECTED (see §4)
+```
+
+---
+
+## 4. Migration impact on sp_validation
+
+**No change for the size-ratio paths.** Galaxy `T` and PSF `Tpsf` enter only as the dimensionless ratio `T/Tpsf` (size cut, `size_ratio` binning) — robust by construction. None of the five `r50*` columns is read in sp_validation `src/` (0 hits), so the mislabel harms no current computation.
+
+**One genuine, load-bearing fix — the PSF-leakage size regressor.** `fwhm_PSF = T_to_fwhm(T_PSFo)` (built in `notebooks/extract_info.py`) is the third regressor in the scale-dependent PSF-leakage fit (`run_object.py:510`, `size_PSF_col` at `calibration.py:540`). The current `galaxy.py:T_to_fwhm` is `T/1.17741*2.355` and its own `MKDEBUG` comment doubts it ("Why is FWHM not quadratic in T???"). It is **wrong**: `T = 2σ²` is an area, so `FWHM = 2.355·σ = 2.355·√(T/2)` — a √ is required, not a linear factor (`2.355/1.17741 = 2.000` shows it treats T as if already σ). Result: a monotonic-but-nonlinear distortion of the PSF-size axis, biasing any `α(PSF-size)` trend and per-bin leakage coefficients. Spatially-constant leakage unaffected.
+
+**Migration:** point the `fwhm_PSF` builder at the new corrected PSF radius column via `cs_util.sigma_to_fwhm` (or import the fixed `cs_util.T_to_fwhm`), retire the local `galaxy.py:T_to_fwhm`. This is the change that actually moves a science number; the column rename/transform is hygiene + paper-consistency that also removes the foot-gun.
+
+**Sequencing:** the catalogue-writer change (§3a/3b) and the sp_validation regressor fix (§4) are independent — sp_validation doesn't read the `r50*` columns today, so fixing ngmix won't break it. cs_util converters should land before sp_validation switches to importing them.
+
+---
+
+## 5. Open questions for Lucy / Martin
+
+1. **Primary on-disk convention:** standardize on **r50 as primary** (matching UNIONS-3500 I) while keeping `T`? Or keep `T` as the working column with r50 derived?
+2. **Drop vs keep duplicates** (`T_psfo_ngmix`≡`Tpsf`, `r50_psfo_ngmix`≡`r50psf`): anything outside sp_validation `src/` (notebooks, older catalogues) bound to the `_psfo_ngmix` names? If so, alias rather than delete.
+3. **cs_util as the home** for the converter web — agreed? Naming preference (`T_to_r50` vs `T_to_halflight`)?
+4. **PSF-leakage refit:** were current/in-flight scale-dependent leakage results produced with the buggy `T_to_fwhm`, and should they be regenerated once fixed? How much does `α(size)` move?
+5. **`r50_err` for the galaxy:** is the error-propagation form acceptable, or prefer the full pars covariance?
+6. **Catalogue versioning:** does changing on-disk `r50*` semantics warrant a format version bump / changelog so downstream users know v2.0 pre- vs post-fix differ?
diff --git a/.felt/review-ngmix-v2-pr740/weights-report.md b/.felt/review-ngmix-v2-pr740/weights-report.md
new file mode 100644
index 000000000..17edea424
--- /dev/null
+++ b/.felt/review-ngmix-v2-pr740/weights-report.md
@@ -0,0 +1,136 @@
+# ngmix v2.0 weight / inverse-variance handling — decision-ready report
+
+**Scope:** PR #741 (branch `ngmix_v2.0`), the v2.0 refactor of noise/weight handling in `ngmix.py`, against the v1 baseline (`origin/develop`), and its relationship to Issue #604 (Weight Handling). All file:line refs are in the worktree (`ngmix_v2.0` head + review cleanups). Load-bearing claims verified against source **and confirmed empirically** (see §2a).
+
+**Headline:** v2.0 introduced two real, test-verifiable regressions in the ngmix weight map (dead noise estimator + lost mask binarization). The minimal v1-restore belongs in #741; the proper inverse-variance fix from #604 is a clean, mostly-config follow-up PR that should land separately.
+
+---
+
+## 1. Precise regression characterization
+
+The entire live weight path is `prepare_ngmix_weights` (`ngmix.py:871–908`):
+
+```python
+weight_map = np.copy(weight) # 894 FSCALE-rescaled Megapipe 0/1 mask
+weight_map[flag != 0] = 0.0 # 895
+sig_noise = sigma_mad(gal) # 897 noise scalar over the WHOLE stamp
+...
+weight_map *= 1 / sig_noise ** 2 # 906
+```
+
+v1 (`origin/develop`, `do_ngmix_metacal`), in order:
+
+```python
+weight_map = np.copy(weights[n_e])
+weight_map[np.where(flags[n_e] != 0)] = 0.0
+weight_map[weight_map != 0] = 1 # binarize to exactly 0/1 <-- DROPPED in v2
+...
+sig_noise = get_noise(...) # object-free windowed estimator <-- ORPHANED in v2
+...
+weight_map *= 1 / sig_noise**2
+```
+
+Two coupled regressions, both verified:
+
+**R1 — Noise estimator regressed from object-free to whole-stamp.** v1 called the windowed `get_noise` (masks the object via a model-Gaussian window, re-estimates `sigma_mad` on object-free pixels). v2 uses bare `sigma_mad(gal)` over the full, flux-contaminated stamp. `get_noise` still exists (`ngmix.py:826`, body byte-identical to v1) but **has zero call sites in `src/`** — confirmed dead. v2's `sig_noise` is biased high by the galaxy's own flux, in a **size/flux-dependent** way. Dominant numerical effect for today's inputs.
+
+**R2 — Lost the mask binarization.** v1 collapsed the weight to exactly 0/1 before scaling, so the final weight was a clean uniform `1/sig_noise²` per unmasked pixel. v2 dropped that line, so line 906 multiplies the *raw rescaled mask* by `1/sig_noise²`.
+
+What `weight` contains: traced from `ngmix_runner.py` input #4 → vignetmaker `RUN_2` → single-exposure weight images = Megapipe weight maps, which Gwyn confirmed are **0/1 masks** (#604). The only transform before line 894 is the FSCALE rescale (`rescale_epoch_fluxes`: `weight * 1/Fscale**2`) — identical to v1.
+
+Consequences of R2:
+- **For today's 0/1 masks:** each unmasked pixel carries `Fscale⁻² / sig_noise²`. FSCALE is applied to the weight *twice* and to the noise *once* → per-pixel ivar uniformly off by `1/Fscale²` per epoch vs v1. A latent scale bug, varies epoch-to-epoch.
+- **Latent hazard (dangerous):** if a *real* inverse-variance map is fed in (THELI, or the #604 BACKGROUND_RMS path), v2 multiplies a genuine `1/var_pix` by another `1/sig_noise²` → **variance counted twice**. v1's binarization made the code robust to this; v2 is not.
+
+### Downstream direction of the effect
+`weight_map` → `Observation.weight` = per-pixel inverse-variance in the ngmix χ².
+- Inflated `sig_noise` → too-low weight → under-weighted likelihood → **over-estimated** `g_err`/`T_err`/`flux_err`, **under-estimated** `s2n`. Because whole-stamp `sigma_mad` grows with brighter/larger galaxies, this is a **size/flux-dependent miscalibration** — the signature of a multiplicative shear bias `m`, consistent with the prior old-path finding `m ~ -2.8e-2`.
+- **Multi-epoch PSF averaging** weights each epoch by `obs.weight.sum()` → now driven by whole-stamp MAD and FSCALE; reported `PSFo` g/T shift accordingly.
+
+---
+
+## 2. Test-verifiability: **YES — clean red→green**
+
+`make_data` (`simulate.py:91`) injects the exact truth inverse-variance: `weights = im*0 + 1/noise**2`, `flags = 0`. That `1/noise²` is the oracle. A unit-level test on `prepare_ngmix_weights` (no ngmix fit needed — fast, deterministic) asserts the returned weight equals the injected inverse-variance:
+
+```python
+def test_weight_map_recovers_injected_inverse_variance():
+ noise = 1e-3
+ gals, psfs, _, weights, flags, jacobs = make_data(
+ rng=np.random.RandomState(123), shear=(0.0, 0.0),
+ noise=noise, n_epochs=1, img_size=201)
+ _, weight_map, _ = prepare_ngmix_weights(
+ gals[0], weights[0], flags[0], np.random.RandomState(0))
+ truth_ivar = 1.0 / noise ** 2
+ recovered = np.median(weight_map[weight_map > 0])
+ npt.assert_allclose(recovered, truth_ivar, rtol=0.10)
+```
+
+**Coverage gap today:** `test_ngmix.py`'s only weight-path test (`test_metacal_is_reproducible_with_fixed_seed`) asserts `run(42)==run(42)` — a determinism check that passes for *any* deterministic normalization, correct or not. The regression is exercised but never checked against truth.
+
+**Caveats:** `make_data` injects *homoscedastic* noise — the test catches the double-normalization + scalar bias, but to prove the fix respects *spatially-varying* ivar (#604's point), `make_data` needs a per-pixel RMS option (follow-up, not a blocker).
+
+### 2a. EMPIRICAL CONFIRMATION (observed, this session)
+
+Ran the check against the actual code (dev container, galsim+numpy; no ngmix fit needed):
+
+| noise | truth ivar | sigma_mad(gal) | recovered median | ratio (correct = 1.0) |
+|---|---|---|---|---|
+| 1e-3 | 1e6 | 1.065e-3 | 8.81e11 | **8.81e5** |
+| 1e-4 | 1e8 | 1.167e-4 | 7.35e15 | **7.35e7** |
+
+The recovered weight is off by ≈`1/noise²` — exactly the R2 double-count (real ivar × another `1/sig_noise²`). `sigma_mad(gal)` runs ~6% above truth noise (R1 flux contamination, small for a compact galaxy in a 201² stamp). The data matches the by-reading analysis to ~6 orders of magnitude. **The regression is real and observed, not inferred.**
+
+---
+
+## 3. Recommended baseline: SExtractor BACKGROUND_RMS — concrete change plan
+
+The existing machinery is already generic: SExtractor's check-image handler is list-driven and image-agnostic; vignetmaker's ME loop loops over `ME_IMAGE_PATTERN` and reads `get_data(0)` — a single-HDU BACKGROUND_RMS check-image flows through the same path as BACKGROUND with **zero code change**.
+
+| Step | What | File(s) | Cost |
+|---|---|---|---|
+| 1 | `CHECKIMAGE = BACKGROUND, BACKGROUND_RMS` | `example/cfis/config_exp_*.ini` | **config only** |
+| 2 | Add `background_rms` to vignetmaker `ME_IMAGE_DIR`/`ME_IMAGE_PATTERN` | `config_tile_MiViSmVi.ini` | **config only** |
+| 3 | Wire one **optional** input (RMS sqlite) through runner + `Vignet` | `ngmix_runner.py`, `ngmix.py` `Vignet`/`Ngmix.__init__` | small code |
+| 4 | **Core:** build per-pixel ivar from RMS, gated on mask | `ngmix.py` `prepare_ngmix_weights` et al. | real work |
+| 5 | Keep `sigma_mad`/`get_noise` as **fallback** when no RMS map present | `ngmix.py` (branch on `bkg_rms is not None`) | small code |
+
+**Core (Step 4) sketch** — Megapipe weight stays the *mask*; RMS supplies the *variance* (the two roles were conflated before):
+
+```python
+weight_map = np.copy(weight) # Megapipe mask (0 where masked)
+weight_map[flag != 0] = 0.0
+mask = weight_map != 0
+ivar = np.zeros_like(gal)
+ivar[mask] = 1.0 / bkg_rms[mask] ** 2 # per-pixel inverse variance
+sig_noise = np.median(bkg_rms[mask]) if mask.any() else sigma_mad(gal)
+return gal_masked, ivar, noise_img
+```
+
+RMS rescaling: BACKGROUND_RMS is in image counts, scales by `Fscale`; `1/rms²` scales by `1/Fscale²` — identical to the existing weight rescale, slots into `rescale_epoch_fluxes`.
+
+**Effort: S–M, ~4–8 h.** Config + plumbing ~1–2 h; core rewrite + fallback ~2–3 h; example-pipeline before/after on shear/s2n/flags ~2–3 h.
+
+**Honest read on aguinot's "almost free":** true for the plumbing (steps 1–2 are pure config), undersells the core. Delivering the data is cheap; *using it correctly* (changing what the weight means, getting Fscale + mask roles right, validating no fit regression) is the earned part.
+
+**Risks:** RMS units/rescaling; zeros/NaNs in RMS at masked/off-image pixels → `1/rms²` blowup (gate on mask, guard `rms<=0`); double-counting the mask; positional input-list shift breaking configs (mitigated by optional-input design).
+
+---
+
+## 4. Scoping: **separate PRs.**
+
+- **In #741 (now):** the minimal v1-restore — reinstate `weight_map[weight_map != 0] = 1` and call `get_noise(...)` instead of `sigma_mad(gal)` — plus the red→green unit test. Two-line change fixing a regression *this PR introduced*; restores known-good behavior; restoring the binarization makes the fallback robust to real ivar maps again. Does not depend on #604.
+- **#604 as its own PR (next):** the BACKGROUND_RMS baseline changes what the weight map *means*, touches new I/O, needs its own before/after validation. A distinct review surface; bundling into #741 would mix "fix the regression I just introduced" with "implement a 3-year-old feature request."
+
+The red→green test validates both the #741 minimal fix and the #604 baseline; it travels with #741 now and proves out #604 later.
+
+---
+
+## 5. Open questions for Martin (science call)
+
+1. **Baseline vs THELI.** Confirm in-house SExtractor-BACKGROUND_RMS is the baseline, with THELI weight images slotting in later as another `ME_IMAGE_PATTERN` source (not blocking).
+2. **Optional vs required RMS input.** Recommend optional with `sigma_mad`/`get_noise` fallback. Confirm.
+3. **#741 minimal fix:** restore the windowed `get_noise` (reintroduces the gal-guess machinery the PR may have deliberately stripped), or accept bare `sigma_mad(gal)` until #604 lands? R1 is the dominant effect, so this matters even short-term.
+4. **Does the RMS map feed anything beyond the weight + noise-fill scale** (e.g. background subtraction)? Recommend weight + noise-fill only.
+5. **Scope of config edits** for the #604 PR: cfis only, or cfis + cfis_simu + canfar batch?
+6. **Validation bar to merge #604:** smoke before/after on the example tile, or a full sim m/c calibration before merge?
diff --git a/.felt/review-pr741-fresh-pass/review-pr741-fresh-pass.md b/.felt/review-pr741-fresh-pass/review-pr741-fresh-pass.md
new file mode 100644
index 000000000..ba3f2c2a9
--- /dev/null
+++ b/.felt/review-pr741-fresh-pass/review-pr741-fresh-pass.md
@@ -0,0 +1,21 @@
+---
+id: 01KTT34NW92KPHH9GDDXAGZZ89
+name: 'Fresh-pass review: PR #741 (post-integration)'
+tags:
+ - shapepipe
+ - ngmix
+ - review
+created-at: 2026-06-11T03:00:58.633703716+02:00
+outcome: 'Fresh full-diff review of #741 at b2dcd793 (empirical, against ngmix 2.4.0): scientific core verified sound (seed-reproducibility holds; injected shear recovered to 0.2% with response correction; r50/ivar-weights math checks out) but two empirically demonstrated blockers found in the v2.0 rewrite''s robustness shell — TILE_ID metadata key truncates the post-process CCD scan (~80% of epochs silently lost to multi-epoch bookkeeping) and failed galaxy fits crash the whole tile at save (KeyError, v1''s per-object BootGalFailure contract lost). Plus: one failed PSF epoch kills the object despite ignore_failed_psf; mcal_flags hardcoded 0 (dead quality column); cfis_simu configs broken against the positional-WCS contract. Science note: with psf=''fitgauss'' the reconvolved metacal PSF is round by construction, so g1/g2_psfo are identically zero — PSF-leakage regressions on them are degenerate. Fixes + review part 3 posted to the PR same night.'
+---
+
+Second-look review requested by Cail after the constitution-sweep integration round, run by a fresh agent with no investment in the branch. Full findings live in PR #741's "Review — part 3" comment; this fiber is the pointer plus what matters beyond the PR.
+
+## Why the blockers hid
+
+Both blockers survive a green unit-test suite and a one-tile smoke run with easy objects: the TILE_ID truncation only bites multi-CCD bookkeeping breadth (the first 8 CCDs still work), and the compile_results KeyError needs a hard-to-fit object (easy smoke objects all converge). Lesson for the test suite: structural tests catch import/config rot; these needed *adversarial empirical* tests — forced-failure objects, metadata-polluted sqlite fixtures. The regression tests added with the fixes are exactly that shape.
+
+## Science threads left deliberately open
+
+- **fixnoise homoscedasticity** (#604 thread): metacal noise-symmetrization currently draws at median(bkg_rms) while weights are per-pixel — second-order within a 51px stamp, one-line to change, Martin's call.
+- **g1/g2_psfo ≡ 0 under psf='fitgauss'** (*_psfo thread): keep/rename/drop decision affects any downstream PSF-leakage regression; relevant to the tutorial.
diff --git a/.felt/shapepipe.md b/.felt/shapepipe.md
index 044d7a3b1..502e65d3e 100644
--- a/.felt/shapepipe.md
+++ b/.felt/shapepipe.md
@@ -1,40 +1,54 @@
---
-name: ShapePipe — project knowledge & active threads
+id: 01KTCHX00NMQ1VDPGXRYJ6RGZR
+name: ShapePipe maintenance & PRs
tags:
- shapepipe
+ - portolan
created-at: 2026-04-27T11:26:38.71538657+02:00
-outcome: 'Root of ShapePipe''s felt store: the stack division, repo conventions, and the why behind in-flight infra/cleanup threads.'
+outcome: 'Root: collaboration with Martin on ShapePipe — PRs, infra, future ngmix and Fabian work'
---
-This is the root of ShapePipe's felt store — shared notes on architecture
-decisions, conventions, and in-flight work, for the team and AI agents alike.
-ShapePipe is the UNIONS galaxy shape-measurement pipeline; `CLAUDE.md` covers the
-build / container / CI overview, and the fibers here carry the *why*. Start here,
-then follow the links.
+ShapePipe is the UNIONS shape-measurement pipeline. I'm not the primary
+maintainer (that's Martin Kilbinger); my role is collaborator helping
+clean up infra, surface bugs, and keep the merge queue moving while
+Martin focuses on science threads.
-## Stack division
+## Working agreement with Martin
-ShapePipe **produces** shear catalogues; `sp_validation` / `cosmo_val`
-**consume** and validate them; `cs_util` holds code shared across both. A concern
-about *validating* catalogues belongs downstream, not in ShapePipe.
+Surfaced over a 2026-04-27 walking conversation. Captured in
+[[shapepipe/prs-in-flight]] and the per-thread fibers below.
+
+- I review and patch his PRs; he reviews mine. Bugs found during review
+ go to a dedicated PR rather than getting bundled into his feature
+ branch (per `feedback_separate_infra_prs`).
+- v2.0 was merged fast (it was ready). The skaha base it brought in is
+ the active source of pain → see [[shapepipe/docker-uv-revert]].
+- I file the issues; Claude usually drafts the PRs in my voice.
+ Disclosure on Claude-only review per
+ `feedback_claude_only_review_disclosure`.
+
+## Active threads (refreshed 2026-06-10)
+
+- **PR #737** (MPI on candide + containerized SLURM scripts) — **MERGED 2026-06-10** (rebased onto develop, both CI runs green).
+- **PR #738** (versioned docs + switcher) — **MERGED 2026-06-10** after independent review; sfarrens offered post-hoc comments. Stable root refreshes on next master push. See [[docs-versioning]].
+- **PR #739** (machine-specific cluster docs tree) — awaiting Martin's review; check rebase state after the #737/#738 merges. See [[docs-cluster-tree]].
+- **PR #741** (ngmix v2.0, CI mirror of #740) — Martin left 10 inline comments Jun 5, no verdict yet; shear recovery verified unbiased to ~1e-4 in m. See [[ngmix-weights-ivar]] (PR-ready regression fix + #604 ivar plan) and [[ngmix-size-columns]] (honest r50 spec).
+- **Martin's PRs #704 (contributors) & #699 (coverage mask)** — Cail's review requested; both conflicting, need Martin's rebase first.
+- **[[ci-green-on-develop]]** — conda fully removed (#733 merged); remaining follow-ups tracked there.
+
+## Earlier threads (superseded)
+
+- [[shapepipe/docker-uv-revert]] / [[shapepipe/prs-in-flight]] — the #719-era conda removal; landed via #733, current state in [[ci-green-on-develop]].
+- **[[shapepipe/ngmix-update]]** — became the #740/#741 ngmix v2.0 review thread.
+- **[[shapepipe/fabian-coord-bug]]** — port Fabian's 1-line coord propagation fix; still pending his image-sim code reaching github.
## Conventions specific to this repo
-- **Rho-statistics are obsolete inside ShapePipe.** PSF-systematics validation
- moved downstream to `sp_validation` / `cosmo_val` (via `shear_psf_leakage`);
- the stile/treecorr rho code was removed in #715. But the **meanshapes /
- ellipticity focal-plane plots** (`mccd_plots_runner`) are *deliberately kept* —
- they are a general PSF/star-catalogue diagnostic, not rho-stats, and feed
- catalogue-paper figures. Don't delete that path along with rho-stats; see
- [[shapepipe/cleanup-rhostats-jobscripts]] for where the boundary actually sits.
-- Run the pipeline through the container; use `python3.12` explicitly inside it.
-- **ngmix** is pinned to a fork branch until fixes land upstream — don't bump
- that dependency line. [[ngmix-update]] tracks the path back to upstream.
-
-## Active threads
-
-- **[[shapepipe/ci-green-on-develop]]** / **[[shapepipe/test-suite]]** — a
- tiered, in-image test suite and trustworthy CI on `develop`.
-- **[[docker-uv-revert]]** — slim Python base + uv lockfile, dropping conda.
-- **[[shapepipe/mpi-hybrid]]** — running hybrid MPI through the container on candide.
-- **[[ngmix-update]]** — replacing the pinned ngmix fork with upstream.
+- Container runs through `app` (apptainer wrapper); use `python3.12`
+ inside the shapepipe container (see `reference_containers`).
+- ShapePipe produces; `sp_validation` consumes; `cs_util` is shared (see
+ `project_stack_division`).
+- Rho stats are obsolete here — sp_validation/cosmo_val took over (see
+ `project_rho_stats_obsolete`).
+- Royal "we" in PR/issue voice; specific findings attributed to Claude
+ by name (see `feedback_writing_voice_on_cails_behalf`).
diff --git a/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md b/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md
index 629d6c23d..2c5e4ebec 100644
--- a/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md
+++ b/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md
@@ -1,4 +1,5 @@
---
+id: 01KTCHWZX8GRCMJHGCRARBGFS8
name: CI silently broken on develop; install_shapepipe conda.sh lookup
status: closed
tags:
diff --git a/.felt/shapepipe/ci-green-on-develop/ci-green-on-develop.md b/.felt/shapepipe/ci-green-on-develop/ci-green-on-develop.md
index 3b304539f..3ae6c6bc3 100644
--- a/.felt/shapepipe/ci-green-on-develop/ci-green-on-develop.md
+++ b/.felt/shapepipe/ci-green-on-develop/ci-green-on-develop.md
@@ -1,4 +1,5 @@
---
+id: 01KTCHWZX9Q1MG2FB20N5Y52TD
name: 'ShapePipe CI: green & trustworthy test suite on develop'
status: open
tags:
@@ -6,7 +7,7 @@ tags:
- ci
- constitution
created-at: 2026-05-29T12:34:31.800189806+02:00
-outcome: 'Conda removed from ShapePipe (#733 merged); develop is green including image publish. The container (slim Python 3.12 + apt + uv lockfile) is the single source of truth and CI runs pytest inside the dev image. ci-release.yml + doc-tests.yml + install_shapepipe + environment*.yml deleted; cd.yml builds docs in the image; install docs rewritten. The conda path had rotted into a multi-layer install failure — deleted rather than repaired, so the production astropy bug is moot. Remaining for a clean/tight/shared dev environment: (1) commit a shared .felt in the repo + curate which fibers go public; (2) make repo CLAUDE.md a real onboarding doc; (3) port cluster job scripts off conda; (4) #729 dependabot actions (recreating); (5) verify docs deploy on master; (6) strengthen the thin test suite (#708).'
+outcome: 'Near-realized. Ground truth 2026-06-11: develop green incl. image publish; #708 closed (test scaffolding landed on develop: 5 structural test files + property/synthetic-FITS tiers); candide PBS scripts modernized to SLURM+apptainer (#737); CLAUDE.md already a solid onboarding doc (minor gaps: CONTRIBUTING pointer, first-PR walkthrough); docs deploy infra correct post-#738 — /latest/ live, switcher published, ROOT still serves 2022 v1.0.1 until the next master push (the one remaining trigger). Conda survives only in out-of-scope CANFAR/CC-IN2P3 scripts (job_sp_canfar.bash + init_run_exclusive_canfar.sh have ACTIVE conda paths; cc_{mpi,smp}.sh use ccenv anaconda) — needs its own cluster-aware pass to fully realize ''no conda anywhere''.'
---
## Desired State
diff --git a/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md b/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md
index 4083ae1b8..a7fda558f 100644
--- a/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md
+++ b/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md
@@ -1,4 +1,5 @@
---
+id: 01KTCHWZXE6Z6MG1P21F804RZ2
name: 'ShapePipe cleanup: remove obsolete rho-stats/stile; modernize candide job scripts'
status: closed
tags:
@@ -6,27 +7,25 @@ tags:
- cleanup
- constitution
created-at: 2026-05-30T21:45:50.977369486+02:00
-closed-at: 2026-05-31T12:53:30.382233194+02:00
+closed-at: 2026-05-30T22:30:55.745891097+02:00
outcome: |-
- Resolved as one shipped PR + one corrected mis-scope.
-
- D1 (rho-stats removal) was a STALE PREMISE: the rho-stats/stile/treecorr code was
- already surgically removed from develop in #715 (merged 2026-04-23). What remained
- in `mccd_plots_runner.py` / `mccd_plot_utilities.py` is pure meanshapes/ellipticity
- plotting — NOT rho-stats — and Martin explicitly asked to keep it on #715 ("Let's
- keep meanshapes, this is very useful... can be run on merged star and PSF catalogues").
- PR #736 was opened then CLOSED (not merged): deleting meanshapes would contradict
- Martin and risk a catalogue-paper figure path. `stile` was already gone everywhere.
- Lesson: verify the premise against current develop before cutting the branch.
-
- D2 (candide PBS scripts) SHIPPED as PR #737 — OPEN, CI green, mergeable, awaiting
- Martin's review. candide_smp.sh / candide_mpi.sh now run via `apptainer exec` against
- ghcr.io/cosmostat/shapepipe:develop-runtime (no conda); host-clone bind-mounted at the
- same path so $SPDIR-relative configs resolve identically in/out of container; MPI uses
- the hybrid host-mpiexec pattern. Tested on c03=candide: SMP runs the example pipeline
- end-to-end with 0 errors; MPI hybrid needs a real multi-node allocation to verify e2e.
- canfar + ccin2p3 scripts deliberately untouched (different clusters, can't verify here)
- and noted in the PR. Also fixed a stale config path and propagated the real exit code.
+ Mixed. PR #737 (candide scripts) good and open; PR #736 (rho-stats) CLOSED as
+ an over-cut. The constitution's premise was stale: it treated mccd_plots_runner
+ as "the rho-stats path," but the authorized rho-stats removal (Martin #657
+ "option 1, delete") had ALREADY shipped in #715 (merged 2026-04-23). By develop,
+ mccd_plots_runner/mccd_plot_utilities held NO rho/stile/treecorr — pure
+ meanshapes (plot_meanshapes: focal-plane ellipticities/sizes/residuals + 1D
+ histograms). On #715 Martin EXPLICITLY said keep meanshapes ("very useful...
+ maybe Fabian's 2D plots in the catalogue paper"). #736 deleted exactly that path
+ → net-zero rho cleanup, entirely over-reach against Martin's instruction. Closed
+ with explanation (2026-05-31), not trimmed — nothing left to salvage.
+ (2) PR #737 stands: candide_smp.sh / candide_mpi.sh run via apptainer + runtime
+ image, no conda; SMP verified end-to-end on c03 (0 errors), MPI hybrid pattern
+ written but needs a real allocation to verify. canfar + ccin2p3 left untouched.
+ Scope findings that held up: `stile` already vestigial; `random_cat` correctly
+ KEPT (general LSS random generator, not rho-stats). LESSON: check what prior
+ merged PRs already did before acting on a cleanup constitution — verify the
+ premise against current `develop`, not the constitution's snapshot.
shuttle:
enabled: true
kind: oneshot
@@ -34,9 +33,10 @@ shuttle:
project_dir: /automnt/n17data/cdaley/unions/shapepipe
agent: claude-opus
session:
- id: f1758ecc-bf5f-452c-9f92-6393adebe65e
+ id: 7c02b521-17bd-44dc-ad10-709520d6d2bf
agent: claude-opus
- dispatched_at: 2026-05-31T10:51:28.745315935Z
+ dispatched_at: 2026-05-30T20:29:24.913621662Z
+tempered: true
---
## Desired State
@@ -137,7 +137,30 @@ green on each, neither merged, canfar untouched-and-noted.
don't touch); the broader test-suite work (separate, done); any
scientific-algorithm change.
-## Open Questions
-
-- Is `random_cat` truly rho-stats-only, or does any non-rho config/use depend on
- it? Confirm before deleting it (vs. just `mccd_plots_runner`).
+## Resolution
+
+**`random_cat` is NOT rho-stats-only — kept.** Empirically: `random_cat_package`
+generates a random catalogue of points within the survey mask (healpix output,
+`config_Rc.ini`, authored by Martin). That is a general clustering / LSS tool
+(Landy-Szalay-style randoms), independent of PSF systematics. Rho statistics are
+auto/cross-correlations of PSF-ellipticity residuals at *star* positions and need
+no random catalogue. So the conditional in Deliverable 1 ("remove after confirming
+random_cat is rho-only") failed its precondition → random_cat stays. Removing it
+would overreach what Martin called obsolete; flagged in PR #736 for a follow-up if
+he does want it gone.
+
+**`stile` was already vestigial.** Zero references in `src/`, `example/`, `docs/`,
+`pyproject.toml`, `uv.lock` — never a declared dep, never imported. Nothing to
+remove; the absence is noted in PR #736 so it isn't read as an oversight.
+
+**The "rho-stats path" was really PSF-diagnostic plotting.** `mccd_plot_utilities`
+imports only matplotlib/mccd/numpy/astropy — it computes mean shapes and ellipticity
+histograms, no `treecorr`/`stile` correlation functions. The docstring's "rho
+statistics plot" was aspirational. Maps cleanly to Martin's "rho stats within
+shapepipe" = the PSF-leakage diagnostics now owned by `shear_psf_leakage`.
+
+**MPI verification gap.** `candide_smp.sh` is verified end-to-end through the
+container on c03. `candide_mpi.sh` uses the documented hybrid Apptainer pattern
+(host `mpiexec` launches container ranks) but can't be verified on a login node —
+`mpiexec` inside the container hangs without a PMIx allocation. ABI note for the
+reviewer: image ships OpenMPI 4.1.4, candide modules are now OpenMPI 5.0.x.
diff --git a/.felt/shapepipe/docs-cluster-tree/docs-cluster-tree.md b/.felt/shapepipe/docs-cluster-tree/docs-cluster-tree.md
index c387c0980..bf05714e0 100644
--- a/.felt/shapepipe/docs-cluster-tree/docs-cluster-tree.md
+++ b/.felt/shapepipe/docs-cluster-tree/docs-cluster-tree.md
@@ -1,4 +1,5 @@
---
+id: 01KTCHWZY2NPWM4RQ7VAZK7WFE
name: Machine-specific cluster docs tree + freshness pass
tags:
- shapepipe
diff --git a/.felt/shapepipe/docs-versioning/docs-versioning.md b/.felt/shapepipe/docs-versioning/docs-versioning.md
index 7246d684a..18909ea5b 100644
--- a/.felt/shapepipe/docs-versioning/docs-versioning.md
+++ b/.felt/shapepipe/docs-versioning/docs-versioning.md
@@ -1,11 +1,14 @@
---
+id: 01KTCHWZYMXBPT7S0NXJ7AT2CD
name: Versioned docs site with a version switcher
+status: closed
tags:
- shapepipe
- docs
- ci
created-at: 2026-05-31T20:39:06.459699229+02:00
-outcome: 'PR #738 (off develop): docs site now versioned — stable@root (non-destructive to existing URLs), develop@/latest/, tags@/vX.Y.Z/, switcher.json drives the dropdown. Fixes that the site published only from master (~78 commits behind) so container.md/canfar.md were never published. CI green incl. real dev-image build + PR-preview artifact.'
+closed-at: 2026-06-10T17:44:55.028992495+02:00
+outcome: 'Shipped: PR #738 merged 2026-06-10 (independent review pass, sfarrens offered post-hoc comments). master→root, develop→/latest/, tags→//, per-ref builds, switcher.json at root. Stable root refreshes on next master push.'
---
## Why
diff --git a/.felt/shapepipe/ngmix-size-columns/ngmix-size-columns.md b/.felt/shapepipe/ngmix-size-columns/ngmix-size-columns.md
new file mode 100644
index 000000000..3f5fd7c22
--- /dev/null
+++ b/.felt/shapepipe/ngmix-size-columns/ngmix-size-columns.md
@@ -0,0 +1,76 @@
+---
+id: 01KTCQPE4M6S1E4V9WMPCAFCKT
+name: 'ngmix size columns: honest r50 + cs_util converter web'
+status: closed
+tags:
+ - constitution
+ - shapepipe
+ - ngmix
+created-at: 2026-06-05T22:30:50.004535516+02:00
+outcome: 'Delivered end-to-end and FULLY MERGED 2026-06-11: cs_util#65-67 (v0.2.1 on PyPI with cs_util.size), shapepipe#743 (honest r50, merged into ngmix_v2.0), shear_psf_leakage#23 (cs-util cap relaxation), sp_validation#198 (T_to_fwhm fix + galaxy smoke test) — the whole dependency chain closed in one night. The T_to_fwhm fix moves α(size) leakage results; regenerate where it mattered. Deferred (stated in #743): wiring honest r50/r50psf into make_cat so paper-convention r_h reaches the science catalogue.'
+shuttle:
+ kind: oneshot
+ host: candide
+ agent: claude-fable
+---
+
+# ngmix size columns: honest r50 + cs_util converter web
+
+Spun out of the [[review-ngmix-v2-pr740]] review. The full investigation — code ground-truth table,
+the UNIONS-paper convention, the sp_validation consumption map, and the bonus downstream bug — lives in
+**`.felt/review-ngmix-v2-pr740/size-report.md`**. Read it first; it is the detailed spec.
+
+The v2.0 ngmix rewrite added five `r50*` columns, **none of which is a half-light radius**: galaxy
+`r50` = `pars[4]` = `T` = 2σ² (an *area*, a bit-for-bit copy of the `T` column); PSF `r50psf` =
+`√(T/2)` = σ (a length, but off by the `1.1774` factor). So the same name root means an area on the
+galaxy side and a length on the PSF side. The official catalogue paper (**UNIONS-3500 WL I**,
+arXiv:2605.13549) reports the half-light radius `r_h` (= r50) as the **primary** size for both galaxy
+and PSF, with `T` derived. Cail approved fixing this at the source.
+
+## Desired State
+
+**1. ngmix writer emits honest, self-consistent size columns** (`ngmix.py`, on `ngmix_v2.0`):
+- Galaxy `r50 = 1.1774·√(T/2)` (= `√(ln2·T)`) with propagated error `r50_err = (1.1774/(2√2))·T_err/√T`
+ — replacing the current raw `pars[4]`/`pars_err[4]`.
+- PSF `r50psf`/`r50_psfo_ngmix`/`r50_err_psfo_ngmix` = existing σ values × `1.17741` (true half-light
+ radii, not σ).
+- Keep all `T*` columns as the (correctly-named) DES areas. After this, every "r50" column is a length
+ meaning the same thing on both sides, matching the paper.
+- **De-duplicate:** `T_psfo_ngmix` ≡ `Tpsf` and `r50_psfo_ngmix` ≡ `r50psf`. Default = retire the
+ redundant names; fall back to documenting them as aliases if anything outside `sp_validation/src`
+ (notebooks, older catalogues) depends on the `_psfo_ngmix` names. Fix the `T_err_psfo_ngmix`-with-no-
+ `Tpsf_err` asymmetry while here.
+
+**2. `cs_util` holds the conversion web** — the single source of truth, since `sp_validation/galaxy.py`
+already does `from cs_util import cfis` and cs_util is the shared producer/consumer layer:
+- `T_to_r50(T) = 1.1774·√(T/2)`, `r50_to_T(r50) = 2·(r50/1.1774)²`, `sigma_to_fwhm(s) = 2.355·s`,
+ and the **corrected** `T_to_fwhm(T) = 2.355·√(T/2)`. With tests.
+
+**3. Fix the downstream PSF-leakage bug in sp_validation.** `galaxy.py:T_to_fwhm` is `T/1.17741*2.355`
+— dimensionally wrong (treats the area `T` as σ; `2.355/1.17741 = 2.000`). It builds the `fwhm_PSF`
+regressor for the scale-dependent PSF-leakage fit (`run_object.py:510`), so it distorts the PSF-size
+axis and biases `α(size)`. Switch it to `cs_util.sigma_to_fwhm` on the corrected PSF size (or import
+the fixed `cs_util.T_to_fwhm`) and retire the local function. **This is the change that actually moves
+a science number** — flag it prominently for Cail (it touches the pure_eb / leakage analysis).
+
+**Scoping:** the ngmix-writer change (1) and the sp_validation fix (3) are independent — sp_validation
+reads none of the `r50*` columns today, so fixing ngmix won't break it. Land cs_util (2) before
+sp_validation (3) switches to importing it. **Merge/PR-creation stay Cail's gesture**; surface the open
+calls (r50-as-primary convention, drop-vs-alias, catalogue version bump, whether in-flight leakage
+results need regenerating) in this fiber's report for Cail's eventual reply — **do not post to Martin.**
+
+## Context
+
+- **Detailed spec:** `.felt/review-ngmix-v2-pr740/size-report.md` (ground-truth table, paper citations,
+ consumption map, the 6 open questions).
+- **Repos (all on candide):** shapepipe `/automnt/n17data/cdaley/unions/shapepipe` (branch `ngmix_v2.0`);
+ sp_validation `/automnt/n17data/cdaley/unions/pure_eb/code/sp_validation/src/sp_validation/`
+ (`galaxy.py`, `calibration.py`, `run_object.py`, `notebooks/extract_info.py`); cs_util (shared —
+ locate the checkout; `sp_validation/galaxy.py` imports it). Cross-repo, so likely cross-PR.
+- **Coordinate with [[ngmix-weights-ivar]]:** both edit `ngmix.py` but different functions (size-column
+ writer vs `prepare_ngmix_weights`) — low conflict; each on its own branch off `ngmix_v2.0` (after the
+ review-cleanup commit `bd60dc8e` lands).
+- **Convention:** UNIONS-3500 WL I (arXiv:2605.13549) → `r_h`/r50 primary, `T = 2σ²` derived (Eq. 1).
+ Arithmetic verified: `√(2·ln2) = 1.17741`.
+- **Agent:** Cail's call — the cs_util/sp_validation API + naming is taste-y (leans Claude); the ngmix
+ column edits are mechanical.
diff --git a/.felt/shapepipe/ngmix-size-columns/report.html b/.felt/shapepipe/ngmix-size-columns/report.html
new file mode 100644
index 000000000..f1004ac28
--- /dev/null
+++ b/.felt/shapepipe/ngmix-size-columns/report.html
@@ -0,0 +1,207 @@
+
+
+
+
+
+Report — ngmix-size-columns
+
+
+
+
+
+
+
+
+
Honest r50 columns, a shared size web, and the leakage-axis fix.
+
+
+
+
+
§ Current state
+
All three branches landed and fresh-eyes reviewed; PR creation is yours.
+
+
Every piece of the constitution's desired state is implemented, tested, and pushed.
+ A second session independently reviewed all three diffs: re-derived both error propagations
+ (galaxy r50·T_err/2T and PSF √(2ln2)·T_err/(4σ) are the same formula —
+ consistent), confirmed zero remaining consumers of the retired _psfo_ngmix size
+ columns (only make_cat, repointed), confirmed the dropped pixel_size
+ kwarg on sigma_to_fwhm has no callers, re-ran the tests (shapepipe size tests pass
+ in the dev container; cs_util 24/24; T_to_fwhm(2.0) → 2.35482 reproduced in the
+ sp_validation container against the branch), and verified all three push SHAs. No code defects
+ found. Three branches across three repos, in dependency order:
New cs_util/size.py: the conversion web (T ↔ σ ↔ r50 ↔ FWHM) with exact constants
+ and 6 tests (closed forms, round trips, FWHM = 2·r50). Full suite 24/24 green.
+ On my fork — you have no push rights on CosmoStat/cs_util.
Retires the dimensionally wrong local T_to_fwhm; galaxy.py re-exports
+ T_to_fwhm/sigma_to_fwhm from cs_util.size, so all
+ star-import consumers (extract_info.py, the notebooks' size_to_fwhm
+ dispatch) keep working unchanged. Verified in the sp_validation container:
+ T_to_fwhm(2.0) → 2.35482 (old code gave 4.0003).
+
+
+
+
+ The science-number change: the sp_validation fix corrects the
+ fwhm_PSF regressor feeding the scale-dependent PSF-leakage fit
+ (run_object.py:510, size_PSF_col). Any α(PSF-size) trend or per-bin
+ leakage coefficient computed with the old function is biased by a monotonic-but-nonlinear
+ distortion of the size axis. Spatially-constant leakage is unaffected. In-flight pure_eb /
+ leakage results produced with the old code should be regenerated after this merges.
+
+
+
Merge order: cs_util first (sp_validation's import fails loudly without it —
+ intentional fail-fast), then sp_validation; the shapepipe branch is independent and can merge any
+ time. No PRs opened — that's your gesture. cs_util needs a release (or a git-ref pin in
+ sp_validation) before the sp_validation branch is safe to merge.
+
+
+ CI will not police the merge order. sp_validation's CI is already green on the
+ pushed branch (run 27248208943) — but vacuously: the image installs released
+ cs-util 0.1.9 (no size module), and the test suite never imports
+ galaxy.py (tests cover cosmo_val/cosmology/survey/plots
+ only). Merging sp_validation before the cs_util release would pass CI and then raise
+ ImportError at runtime, in extract_info.py / the notebooks. The
+ ordering is enforced by discipline, not automation.
+
+
+
+
+
+
+
§ Findings
+
What surfaced beyond the spec.
+
+
The old PSF r50 error was also a factor-2 over-estimate
+
Beyond the missing 1.1774 factor, r50_err_PSFo = T_err/(2σ) was wrong as an error
+ on σ: for σ = √(T/2), dσ/dT = 1/(4σ), not 1/(2σ). (The size-report table had marked this
+ "correct error-prop of σ" — it wasn't.) Both verified by finite differences; the new code carries
+ r50_err = √(2ln2)·T_err/(4σ).
+
+
Remaining duplicates, deliberately left
+
g1/g2_psfo_ngmix ≡ g1/g2_psf in the ngmix output (both read from
+ g_PSFo) — same disease, different organ; out of this fiber's size scope. Downstream,
+ make_cat writes both NGMIX_Tpsf_* and NGMIX_T_PSFo_* from the
+ now-identical source. Candidates for a follow-up dedupe pass.
+
+
shapepipe keeps a local constant, for now
+
ngmix.py defines SIGMA_TO_R50 = √(2 ln 2) locally rather than importing
+ cs_util.size — shapepipe pins released cs_util (>=0.1.9), and making the
+ branch depend on an unreleased fork would block its CI. Once cs_util ships the size module, a
+ one-line follow-up can switch the import.
+
+
Exact constants over truncated literals
+
cs_util uses √(2ln2) / 2√(2ln2) exactly; sp_validation's old 2.355 and
+ 1.17741 literals are retired with the functions. FWHM values shift at the 4th decimal
+ (2.355 → 2.35482) — cosmetically visible if anyone diffs old catalogs against new.
+
+
Pre-existing test failure, not mine
+
test_metacal_is_reproducible_with_fixed_seed fails in the local
+ shapepipe-dev container — reproduced by the review session on clean
+ ngmix_v2.0 with the identical error (ngmix.fitting has no
+ Fitter: the local container carries an older ngmix than the branch expects). The
+ other tests, including the two new size tests, pass against the branch code. Note shapepipe's
+ deploy-image.yml triggers only on PRs/pushes targeting
+ develop/main/master — so a PR of this branch into
+ ngmix_v2.0 won't run CI by itself; the suite gets arbitrated when the change rides
+ PR #740's chain toward develop.
+
+
+
+
+
+
§ Open questions
+
Your calls, none blocking.
+
+
+
Drop-vs-alias — resolved in favor of drop, verify you agree. I searched
+ sp_validation (src, notebooks, .ipynb) and all of shapepipe: the only consumer of the size
+ _psfo_ngmix names was make_cat.py, now repointed. Zero external hits, so
+ I dropped rather than aliased, per the constitution's default.
+
Catalogue version bump: on-disk r50* semantics change between
+ v2.0 pre-/post-fix files. Does this warrant a format-version note/changelog entry, and where?
+
Leakage regeneration: were any current/in-flight scale-dependent leakage
+ results produced with the buggy T_to_fwhm? How much does α(size) move? (Worth a
+ before/after on one catalog when convenient.)
+
cs_util path to merge: branch lives on the cailmdaley/cs_util fork.
+ PR to CosmoStat/cs_util + release (0.3?), then bump pins in sp_validation (and optionally switch
+ shapepipe's local constant to the import).
+
Galaxy r50_err form: implemented as first-order propagation from
+ T_err (r50_err = r50·T_err/2T), per spec. If you'd rather use the full
+ pars covariance, that's a writer-side swap.
+
+
+
+
+
+
diff --git a/.felt/shapepipe/ngmix-weights-ivar/ngmix-weights-ivar.md b/.felt/shapepipe/ngmix-weights-ivar/ngmix-weights-ivar.md
new file mode 100644
index 000000000..bd72a223b
--- /dev/null
+++ b/.felt/shapepipe/ngmix-weights-ivar/ngmix-weights-ivar.md
@@ -0,0 +1,124 @@
+---
+id: 01KTCQPE3JGEYN7NQS8HW1AT6B
+name: 'ngmix weight map: fix v2.0 regressions + inverse-variance (#604)'
+status: closed
+tags:
+ - constitution
+ - shapepipe
+ - ngmix
+created-at: 2026-06-05T22:30:49.970813955+02:00
+closed-at: 2026-06-11T02:37:33.556855912+02:00
+outcome: 'Realized, fresh-eyes reviewed (no blockers), and delivered to PR #741 (CosmoStat/shapepipe, ngmix_v2.0): per-pixel 1/RMS² inverse-variance weights gated by weight/flag/valid-RMS masks with scalar sigma_mad fallback — feat f466c987, mask binarization 6ddea5b2, NaN-edge guard 4aa3b2a1, integration test pinning spatially-varying RMS through the full observation chain 158986a4, config wiring 7a44abda + all-or-nothing semantics documented 0bc6016e. Martin''s #604 weight-map thread answered with the commits. BKG_RMS_VIGNET_PATH is all-or-nothing per tile (missing file → FileNotFoundError; fallback only when option absent). Container gotcha: test from a scratch-wf worktree needs --bind /automnt/n17data and /automnt-prefixed paths (plain /n17data symlink invisible inside the sandbox); dev sandbox ships ngmix 1.3.6 so test_metacal_is_reproducible_with_fixed_seed only passes in CI''s v2 image.'
+shuttle:
+ enabled: true
+ kind: oneshot
+ host: candide
+ agent: codex
+---
+
+# ngmix weight map: fix v2.0 regressions + inverse-variance (#604)
+
+Spun out of the [[review-ngmix-v2-pr740]] review. The full investigation, with file:line
+evidence, the empirical confirmation, the concrete change plan, effort, and risks, lives in
+**`.felt/review-ngmix-v2-pr740/weights-report.md`** — read it first; it is the detailed spec.
+
+The v2.0 ngmix rewrite (PR #741, branch `ngmix_v2.0`) introduced two coupled regressions in
+`prepare_ngmix_weights` (`ngmix.py:871`), **empirically confirmed** this session (fed `make_data`'s
+truth inverse-variance `1/noise²` → recovered `8.8e11`, off by ≈`1/noise²`):
+
+- **R1** — noise estimate regressed from v1's object-free windowed `get_noise` (still present at
+ `:826`, now **dead**) to flux-contaminated whole-stamp `sigma_mad(gal)`. Size/flux-dependent bias
+ on the likelihood weighting → biased `g_err`/`T_err`/`s2n`; the fingerprint of a multiplicative
+ shear bias (cf. old-path `m≈-2.8e-2`). Dominant today.
+- **R2** — lost the v1 binarization `weight_map[weight_map != 0] = 1`, so the raw weight is multiplied
+ by `1/sig_noise²`: a per-epoch `1/Fscale²` error today, and a latent double-count the moment a real
+ inverse-variance map is fed in.
+
+Neither v1 nor v2 is *correct* in absolute terms — both approximate per-pixel inverse-variance with a
+per-stamp scalar on a 0/1 mask. #604 ("Weight Handling", open since 2023) is the real fix: feed ngmix
+genuine inverse-variance maps.
+
+## Desired State
+
+**Two PRs, in this order:**
+
+1. **Minimal regression fix on `ngmix_v2.0` (#741).** Restore the v1 binarization (R2) and add a
+ **red→green unit test** on `prepare_ngmix_weights` using `make_data`'s truth ivar (the test sketch
+ is in the report; it needs no ngmix fit, so it's fast and deterministic). This is a clean, small
+ change that removes the double-count hazard.
+ - **Open call (default = defer R1):** restoring the windowed `get_noise` for R1 reintroduces the
+ gal-guess machinery this PR may have deliberately stripped. Default to fixing R2 + the test now,
+ and let the proper inverse-variance path (PR 2) retire whole-stamp `sigma_mad` wholesale — which
+ resolves R1 without reviving the old machinery. If R1 proves to matter for the interim, surface
+ it; **do not block on it, and do not post to Martin** — these science calls go into this fiber's
+ report for Cail to fold into the eventual #741 reply.
+
+2. **SExtractor `BACKGROUND_RMS` baseline — separate PR, closes #604.** The in-house path (THELI
+ weights are an external product and slot in later as another `ME_IMAGE_PATTERN` source). Follow the
+ report's change plan: `CHECKIMAGE = BACKGROUND, BACKGROUND_RMS` (config), cut the RMS stamp via
+ vignetmaker's existing list-driven ME loop (config), wire one **optional** RMS input through
+ `ngmix_runner`/`Vignet`, and rewrite `prepare_ngmix_weights` to build per-pixel `1/RMS²` gated on
+ the mask, with `sigma_mad`/`get_noise` retained as the **fallback** when no RMS map is present.
+ Validate with a before/after on the example tile (shear/s2n/flags). Effort ~4–8h.
+
+**Quality bar:** every claim verified by running, not inferring (dev image + ngmix 2.4.0 are ready —
+see Context). The red→green test must actually go red on current code and green after. **Merge and
+PR-creation stay Cail's gesture** — prepare branches, commits, and a PR-ready description; surface the
+science decisions in this fiber's `report.html`/outcome rather than pushing to Martin.
+
+## Context
+
+- **Detailed spec:** `.felt/review-ngmix-v2-pr740/weights-report.md` (change plan, file list, risks,
+ the 6 science questions for Martin). This constitution is the pointer; the report is the substance.
+- **R2 checkpoint:** `/tmp/pr740-wt` commit `953a52a3` restores
+ `weight_map[weight_map != 0] = 1` in `prepare_ngmix_weights` and adds
+ `test_weight_map_recovers_injected_inverse_variance`. Observed red before the fix:
+ recovered `8.80916e11` vs truth `1e6`; observed green after the fix: recovered
+ `8.80916e5` vs truth `1e6` with `rtol=0.15`. The tolerance is intentionally loose
+ enough to leave R1 visible: whole-stamp `sigma_mad(gal)` still overestimates the
+ compact simulated stamp's noise by about 6%, producing a ~12% inverse-variance
+ shortfall. This commit fixes R2 only; it does not claim R1 is solved.
+- **BACKGROUND_RMS checkpoint:** `/tmp/pr740-wt` commit `03bf12b6` adds the
+ first #604 implementation slice. `ngmix_runner` now takes optional
+ `BKG_RMS_VIGNET_PATH` (with `{file_number_string}` substitution) without
+ changing the existing six-input pipeline contract. `Vignet` opens the
+ optional RMS sqlite, `prepare_postage_stamps` rescales RMS by `FSCALE`, and
+ `prepare_ngmix_weights(..., bkg_rms=...)` uses per-pixel `1/RMS²` on pixels
+ where the Megapipe weight mask, flag mask, and RMS validity mask are all good.
+ Verification observed in the dev image:
+ `test_weight_map_recovers_injected_inverse_variance`,
+ `test_background_rms_builds_per_pixel_inverse_variance`, and
+ `test_rescale_epoch_fluxes_scales_background_rms_like_image_counts` all
+ passed. `py_compile` and `git diff --check` were also clean.
+- **BACKGROUND_RMS config checkpoint:** `/tmp/pr740-wt` commit `ee87b5bd`
+ wires the active CFIS and CFIS simulation example configs into that code
+ path. The exposure SExtractor configs request `CHECKIMAGE = BACKGROUND,
+ BACKGROUND_RMS`; the multi-epoch vignetmaker configs add
+ `background_rms` beside `background`; and the ngmix templates set
+ `BKG_RMS_VIGNET_PATH` to the expected
+ `background_rms_vignet{file_number_string}.sqlite` output. Verification:
+ the three targeted RMS/weight tests pass in the dev image; all touched
+ example configs parse with `ConfigParser(interpolation=None)`; `git diff
+ --check` is clean.
+- **Optional RMS input-contract guard:** `/tmp/pr740-wt` commit `22e31e51`
+ adds `test_ngmix_accepts_optional_background_rms_vignet`, which
+ instantiates `Ngmix` with the seventh `BACKGROUND_RMS` sqlite input and
+ verifies `Vignet` opens it. Verification observed in the dev image: the
+ focused RMS/weight pytest selection passed, `py_compile` passed for touched
+ ngmix files, and `git diff --check` was clean.
+- **Current test caveat:** the targeted unit test passes in the dev image. The full
+ `src/shapepipe/tests/test_ngmix.py` file currently still fails in the existing
+ metacal smoke test because the container's `ngmix.fitting` has no `Fitter`
+ attribute; that failure is independent of the new R2 unit path.
+- **Repo:** `/automnt/n17data/cdaley/unions/shapepipe`, branch `ngmix_v2.0` (= #741 head). Base the
+ work on `ngmix_v2.0` **after** the review-cleanup commit lands (it removes dead code and is currently
+ staged at `/tmp/pr740-wt` commit `bd60dc8e`, pending Cail's push). Coordinate with
+ [[ngmix-size-columns]] — it also edits `ngmix.py`, but a different function (the size-column writer
+ vs `prepare_ngmix_weights`); low conflict, but each work-stream gets its own branch.
+- **Test environment:** dev-image sandbox at `/n17data/cdaley/containers/shapepipe-dev` with **ngmix
+ 2.4.0** installed (the published `:develop` image ships the old `aguinot` fork v1.3.6 — wrong for
+ this branch). Run via `CONTAINER_PATH=/n17data/cdaley/containers/shapepipe-dev PYTHONPATH=/src app python3.12 -m pytest …`.
+ The `--writable` sandbox needs bind mountpoints to exist (`mkdir` them); see this session's history.
+- **Issue #604:** megapipe weights are 0/1 masks (Gwyn confirmed); aguinot's BACKGROUND_RMS proposal is
+ the basis for PR 2. Cail confirmed the SExtractor-RMS baseline over THELI.
+- **Agent:** Codex (Cail's call) — gritty, well-defined cross-file implementation.
diff --git a/.felt/shapepipe/np-str0-numpy2/np-str0-numpy2.md b/.felt/shapepipe/np-str0-numpy2/np-str0-numpy2.md
new file mode 100644
index 000000000..d23de432a
--- /dev/null
+++ b/.felt/shapepipe/np-str0-numpy2/np-str0-numpy2.md
@@ -0,0 +1,40 @@
+---
+id: 01KTK3WGKFCGD2VQAQSABC3TBJ
+name: np.str0 fix (NumPy 2 aftershock)
+status: closed
+tags:
+ - shapepipe
+ - numpy
+ - bugfix
+created-at: 2026-06-08T09:59:18.639370578+02:00
+closed-at: 2026-06-10T17:13:38.507495758+02:00
+outcome: 'np.str0 removed from file_io.py on develop (verified 2026-06-10: only np.str_ remains); NumPy 2 string-column saves work. Regression guard idea (NAME column in roundtrip test) noted in body if ever needed.'
+---
+
+Fabian Hervas-Peters hit `module 'numpy' has no attribute 'str0'` running the
+new container against `src/shapepipe/pipeline/file_io.py`. Root cause: the
+container now ships **NumPy 2.4.4** (`pyproject` pins `numpy>=2.0`), and
+`np.str0` — a deprecated alias for `np.str_` — was *removed* in NumPy 2.0.
+
+The offending line was `_get_fits_col_type`'s string branch:
+`elif type(col_data[0]) in [str, np.str_, np.str0]:`. The fix is to drop the
+dead alias — `np.str_` already covers it, so nothing is lost.
+
+**Why the test suite never caught it.** The list literal `[str, np.str_, np.str0]`
+is only *evaluated* when that `elif` is actually reached, i.e. only when a column
+whose first element is a string flows through `save_as_fits`. The existing
+`test_fits_catalogue_table_roundtrips` saved only a `float64` and an `int16`
+column — both short-circuit on earlier branches and never touch the string path.
+So the `AttributeError` stayed dormant until Fabian saved a catalogue with a
+string column. The regression guard is a one-line addition: a `NAME` string
+column in the roundtrip test, which now exercises the `"A"` (FITS char) dispatch.
+
+Interestingly, `ngmix_v2.0` had *already* removed `np.str0` independently, so
+merging develop into it was a clean no-op on `file_io.py` and only carried over
+the improved test.
+
+General lesson: removed-NumPy-2 aliases (`np.str0`, `np.bool0`, `np.int0`,
+`np.object0`, `np.unicode_`, …) hide in rarely-reached branches and survive
+import — they only blow up at the call site. Worth a grep sweep when migrating a
+codebase to NumPy 2. A sweep of `src/shapepipe/` on 2026-06-08 found `str0` was
+the only remaining offender.
diff --git a/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md b/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md
index b9bbe8849..fca6d5ad4 100644
--- a/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md
+++ b/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md
@@ -1,23 +1,14 @@
---
+id: 01KTCHWZZ923H1H5DR6AS3Q5H6
name: Smoke test must work in read-only mode
+status: closed
tags:
- shapepipe
- docker
- infra
created-at: 2026-05-28T10:32:25.53742271+02:00
-outcome: |-
- `shapepipe_run -c /app/example/config.ini` fails on read-only SIF
- because the example config uses relative `OUTPUT_DIR = ./example/output`,
- which resolves under `WORKDIR=/app` — read-only in apptainer/SIF. Fix:
- add `scripts/sh/shapepipe_run_example.sh` wrapper that mktemp's a
- workdir, copies `/app/example/` into it, cd's, and execs `shapepipe_run`.
- Dockerfile's existing auto-symlink rule (`scripts/*/*.sh` →
- `/usr/local/bin/`) makes it usable as `shapepipe_run_example`
- on PATH. CI smoke step updated to call it under
- `docker run --read-only --tmpfs /tmp:rw` (emulates SIF). Drive-by:
- tightened `.gitignore` from `*shapepipe_run_*` (catches the wrapper)
- to `example/output/shapepipe_run_*`. Submitted as #731 from branch
- `chore/smoke-test-read-only`; awaiting CI.
+closed-at: 2026-06-10T17:13:38.485047103+02:00
+outcome: 'shapepipe_run_example wrapper (copy example to /tmp) landed via PR #731, merged to develop; CI smoke steps use it. Read-only SIF runs work.'
---
## The gap
diff --git a/.felt/shapepipe/test-suite/test-suite.md b/.felt/shapepipe/test-suite/test-suite.md
index 37e3d46ca..ffb9aab3a 100644
--- a/.felt/shapepipe/test-suite/test-suite.md
+++ b/.felt/shapepipe/test-suite/test-suite.md
@@ -1,4 +1,5 @@
---
+id: 01KTCHWZZJ1H3D0HYETGFVMVZH
name: 'ShapePipe test suite: tiered, in-image, property-based'
status: closed
tags:
@@ -15,6 +16,7 @@ shuttle:
host: c03
project_dir: /automnt/n17data/cdaley/unions/shapepipe
agent: codex
+tempered: true
---
## Desired State
diff --git a/.felt/shapepipe/v2-run-plan/v2-run-plan.md b/.felt/shapepipe/v2-run-plan/v2-run-plan.md
index 537c85a96..b5a96e2b8 100644
--- a/.felt/shapepipe/v2-run-plan/v2-run-plan.md
+++ b/.felt/shapepipe/v2-run-plan/v2-run-plan.md
@@ -1,4 +1,5 @@
---
+id: 01KTCHWZZKM6TYB7WS8J1TQWVY
name: v2.0 ShapePipe run — workflow wishlist
tags:
- shapepipe
diff --git a/.felt/sqlitedict-pickle-smell/sqlitedict-pickle-smell.md b/.felt/sqlitedict-pickle-smell/sqlitedict-pickle-smell.md
index f035953f5..c2fc64feb 100644
--- a/.felt/sqlitedict-pickle-smell/sqlitedict-pickle-smell.md
+++ b/.felt/sqlitedict-pickle-smell/sqlitedict-pickle-smell.md
@@ -1,4 +1,5 @@
---
+id: 01KTCHX00N4580BRMCH0FNADXV
name: sqlitedict pickle-by-default is a known smell
status: open
tags:
diff --git a/README.rst b/README.rst
index da8fe0701..e27b4028d 100644
--- a/README.rst
+++ b/README.rst
@@ -1,7 +1,7 @@
ShapePipe
=========
-|CI| |CD| |python39| |release|
+|CI| |CD| |python312| |release|
.. |CI| image:: https://github.com/CosmoStat/shapepipe/workflows/CI/badge.svg
:target: https://github.com/CosmoStat/shapepipe/actions?query=workflow%3ACI
@@ -9,14 +9,48 @@ ShapePipe
.. |CD| image:: https://github.com/CosmoStat/shapepipe/actions/workflows/pages/pages-build-deployment/badge.svg
:target: https://github.com/CosmoStat/shapepipe/actions/workflows/pages/pages-build-deployment
-.. |python39| image:: https://img.shields.io/badge/python-3.9-green.svg
- :target: https://www.python.org/‰
+.. |python312| image:: https://img.shields.io/badge/python-3.12-green.svg
+ :target: https://www.python.org/
.. |release| image:: https://img.shields.io/github/v/release/CosmoStat/shapepipe
:target: https://github.com/CosmoStat/shapepipe/releases/latest
ShapePipe is a galaxy shape measurement pipeline developed within the
-CosmoStat lab at CEA Paris-Saclay.
+CosmoStat lab at CEA Paris-Saclay. It runs the full chain from raw survey
+images to calibrated shear catalogues — object detection, PSF modelling, and
+shape measurement — and produced the first UNIONS cosmic-shear release.
-See the `documentation `_ for details
-on how to install and run ShapePipe.
+Quickstart
+----------
+
+ShapePipe ships as a container image, so you can run the bundled example
+pipeline — a single CFIS tile through the full chain — without installing
+anything:
+
+.. code-block:: bash
+
+ # Apptainer (HPC, no root needed):
+ apptainer exec docker://ghcr.io/cosmostat/shapepipe:develop-runtime shapepipe_run_example
+
+ # ...or Docker:
+ docker run --rm ghcr.io/cosmostat/shapepipe:develop-runtime shapepipe_run_example
+
+The image is published on every push to the `GitHub Container Registry
+`_:
+``:develop`` tracks the integration branch, release tags (e.g. ``:v1.1.0``) a
+stable cut, and the ``-runtime`` suffix selects the slim batch image over the
+full interactive one.
+
+Documentation
+-------------
+
+Full documentation lives at https://cosmostat.github.io/shapepipe. Good places
+to start:
+
+- `Installation `_ — getting ShapePipe onto your machine or cluster.
+- `Basic execution `_ and `configuration `_ — running ``shapepipe_run`` and writing pipeline configs.
+- `Container workflow `_ — the two image targets and the ``pyproject.toml`` / ``uv.lock`` / ``Dockerfile`` layers.
+- `Running on a cluster `_ — pulling the image and submitting jobs, with worked candide (SLURM) and CANFAR examples.
+
+If you use ShapePipe in academic work, please cite Guinot et al. (2022) and
+Farrens et al. (2022).
diff --git a/docs/source/basic_execution.md b/docs/source/basic_execution.md
index 9e7ca63b4..d9aa6de07 100644
--- a/docs/source/basic_execution.md
+++ b/docs/source/basic_execution.md
@@ -7,7 +7,7 @@ option:
```{seealso}
:class: margin
-The `shapepipe` environment will need to be built and activated in order to run this script (see [Installation](installation.md)).
+ShapePipe runs inside its container — there is no environment to activate. See [Installation](installation.md).
```
```bash
shapepipe_run --help
@@ -37,11 +37,33 @@ shapepipe_run -c
## Running the Pipeline with MPI
ShapePipe can also use [mpi4py](https://mpi4py.readthedocs.io/en/stable/)
-for managing parallel processes on clusters with multiple nodes.
-The `shapepipe_run` script can be run with MPI as follows
+to spread work across multiple nodes of a cluster. Set `MODE = mpi` in the
+`[EXECUTION]` section of the config and launch with an MPI runner:
```bash
-mpiexec -n shapepipe_run
+mpiexec -n shapepipe_run -c
```
-where `` is the number of cores to allocate to the run.
+where `` is the number of MPI processes to start.
+
+### Through the container (the supported way on a cluster)
+
+On a cluster you run ShapePipe from the published image as a standard Apptainer
+*hybrid* MPI job: the **host** `mpirun`/`mpiexec` launches one container rank per
+slot, and the OpenMPI bundled in the image wires the ranks together.
+
+```bash
+# one-time: pull the runtime image
+apptainer pull shapepipe.sif docker://ghcr.io/cosmostat/shapepipe:develop-runtime
+
+# load a host MPI in the same family as the image's OpenMPI (5.0.x), then launch
+module load openmpi
+mpirun -n \
+ apptainer exec --bind "$PWD:$PWD" shapepipe.sif \
+ shapepipe_run -c
+```
+
+The image ships **OpenMPI 5.0.x** so that its PMIx matches modern cluster
+launchers. The host and container MPI must be compatible: if you see *N* copies
+of `rank 0 of 1` instead of one *N*-rank job, load a host OpenMPI in the 5.0.x
+family. See `example/pbs/candide_mpi.sh` for a complete SLURM batch script.
diff --git a/docs/source/canfar.md b/docs/source/canfar.md
deleted file mode 100644
index 1f490c504..000000000
--- a/docs/source/canfar.md
+++ /dev/null
@@ -1,94 +0,0 @@
-# Running `shapepipe` on the canfar science portal
-
-## Introduction
-
-## Steps from testing to parallel running
-
-Before starting a batch remote session job on a large number of images (step 5.),
-it is recommended to perform some or all of the testing steps (1. - 4.).
-
-
-1. Run the basic `shapepipe` runner script to test (one or several) modules in question, specified by a given config file, on one image.
- This step has to be run in the image run directory. The command is
- ```bash
- shapepipe_run -c config.ini
- ```
-
-2. Run the job script to test the job management, on one image.
- This step has to be run in the image run directory. The command is
- ```bash
- job_sp_canfar -j JOB [OPTIONS]
- ```
-
-3. Run the pipeline script to test the processing step(s), on one image.
- This step has to be run in the patch base directory.
-
- 1. First, run in dry mode:
- ```bash
- init_run_exclusive_canfar.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp] -n
- ```
- 2. Next, perform a real run with
- ```bash
- init_run_exclusive_canfar.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp] -n
- ```
-
-4. Run remote session script to test job submission using docker images, on one image.
- This step has to be run in the patch base directory.
- 1. First, run in dry mode=2, to display curl command, with
- ```bash
- curl_canfar_local.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp] -n 2
- ```
-
- 2. Next, run in dry mode=1, to use curl command without processing:
- ```bash
- curl_canfar_local.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp] -n 1
- ```
- 3. Then, perform a real run, to use curl with processing:
- ```bash
- curl_canfar_local.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp]
- ```
-
-5. Full run: Call remote session script and docker image with collection of images
- ```bash
- curl_canfar_local.sh -j JOB -f path_IDs -p [psfex|mccd] -k [tile|exp]
- ```
- with `path_IDs` being a text file with one image ID per line.
-
-## Monitoring
-
-
-### Status and output of submitted job
-
-Monitoring of the currently active remote session can be performed using the session IDs `session_IDs.txt` written by the
-remote session script `curl_canfar_local.sh`. In the patch main directory, run
-```bash
-curl_canfar_monitor.sh events
-```
-to display the remotely started docker image status, and
-```bash
-curl_canfar_monitor.sh logs
-```
-to print `stdout` of the remotely run pipeline script.
-
-### Number of submitted running jobs
-
-The script
-```bash
-stats_headless_canfar.py
-```
-returns the number of actively running headless jobs.
-
-
-## Post-hoc summary
-
-In the patch main directory, run
-```bash
-summary_run PATCH
-```
-to print a summary with missing image IDs per job and module.
-
-## Deleting jobs
-
-```bash
- for id in `cat session_IDs.txt`; do echo $id; curl -X DELETE -E /arc/home/kilbinger/.ssl/cadcproxy.pem https://ws-uv.canfar.net/skaha/v0/session/$id; done
- ```
diff --git a/docs/source/clusters.md b/docs/source/clusters.md
new file mode 100644
index 000000000..41d32a0bf
--- /dev/null
+++ b/docs/source/clusters.md
@@ -0,0 +1,101 @@
+# Running on a Cluster
+
+ShapePipe runs the same way on every cluster: **through the container**. You
+pull the slim `runtime` image once, bind-mount your clone, and run
+`shapepipe_run` (or the CANFAR submission tooling) inside it — there is no
+environment to install or activate on the host. This page covers the shared
+pattern, then the specifics for each supported machine.
+
+For what is *inside* the image and how it is built, see
+[Container Workflow](container.md).
+
+## The pattern
+
+Three things hold on any cluster:
+
+- **The container is the unit of execution.** Pull the runtime image to a SIF
+ (`apptainer pull … docker://ghcr.io/cosmostat/shapepipe:-runtime`) and run
+ the pipeline inside it. Nothing else is installed on the host.
+- **Bind-mount your clone at the same path.** The config files reference their
+ location for the input and output directories; bind-mounting the host clone at
+ the *identical* path inside the container makes those paths resolve the same in
+ and out of the container.
+- **Keep SIFs and Apptainer's scratch off a quota-limited `$HOME`.** A pull under
+ a tight home quota fails with `disk quota exceeded`; point `APPTAINER_CACHEDIR`
+ (and the SIF itself) at a roomy data partition.
+
+MPI jobs add one constraint: the container's MPI must speak the same PMIx wire
+protocol as the host launcher (see the candide section below).
+
+## candide (SLURM)
+
+candide uses **SLURM** (`sbatch`; the old `qsub`/PBS commands are gone). The repo
+ships ready job scripts in `example/pbs/` — `candide_smp.sh` (single node,
+parallelised with joblib) and `candide_mpi.sh` (multi-node, hybrid MPI). To run
+the bundled single-tile example end to end:
+
+```bash
+# Keep the SIF and Apptainer's scratch off the quota-limited $HOME.
+export DATA=/n17data/$USER # adjust to your data partition
+export APPTAINER_CACHEDIR=$DATA/.apptainer
+
+# Pull the runtime image (~850 MB).
+apptainer pull "$DATA/shapepipe-runtime.sif" \
+ docker://ghcr.io/cosmostat/shapepipe:develop-runtime
+
+# Submit. SPDIR is your clone, bind-mounted at the same path inside the
+# container; SP_IMAGE is the SIF. The same script serves the example and a real
+# run — point the config inside it at your own pipeline.
+SP_IMAGE="$DATA/shapepipe-runtime.sif" SPDIR="/path/to/shapepipe" \
+ sbatch example/pbs/candide_smp.sh
+```
+
+Partitions are `comp` (2-day limit) and `compl` (5-day). Both job scripts read
+`SP_IMAGE` (the SIF) and `SPDIR` (the clone) from the environment, so the only
+thing that changes between the example and a real run is the config the script
+points at.
+
+For multi-node MPI, use `candide_mpi.sh`. It `module load`s a host OpenMPI in the
+5.0.x series to match the PMIx that the image's OpenMPI speaks; a mismatched host
+MPI (e.g. OpenMPI 4) silently degrades the job to *N* independent "rank 0 of 1"
+processes instead of one *N*-rank job. The script's header comments carry the
+detail.
+
+Adapting these scripts to another SLURM cluster is mostly the `#SBATCH`
+directives and the `module load` line — the `apptainer exec` invocation carries
+over unchanged.
+
+## CANFAR
+
+CANFAR submission does not go through a batch scheduler. Instead you submit
+container jobs to CANFAR's headless system with the `canfar_submit_job` console
+script (backed by the `canfar` library), and watch them with `canfar_monitor` /
+`canfar_monitor_log`. Pipeline steps are **bit-coded** through `-j` (the same
+scheme as `scripts/sh/job_sp_canfar.bash`), the PSF model is chosen with
+`-p psfex|mccd`, and `-V` selects the image version:
+
+```bash
+# Submit pipeline step(s) for the configured tiles (bit-coded -j).
+canfar_submit_job -j 1 -p psfex -V 1.1
+
+# Monitor sessions/jobs and stream logs.
+canfar_monitor
+canfar_monitor_log
+```
+
+The full production run — input preparation, the per-step `-j` table, and
+post-processing — is documented in the
+[CANFAR production walkthrough](pipeline_canfar.md).
+
+```{note}
+The CANFAR production submission scripts (`scripts/sh/job_sp_canfar*.bash`) still
+run under the pre-container environment and are slated for the same
+container-first cleanup the candide scripts received. Treat the walkthrough as
+the current-but-evolving production procedure.
+```
+
+## ccin2p3
+
+ccin2p3 is not yet containerised. The `example/pbs/cc_{smp,mpi}.sh` scripts target
+the pre-container environment; a container-first path mirroring candide is future
+work.
diff --git a/docs/source/configuration.md b/docs/source/configuration.md
index 3336123c1..772a9558f 100644
--- a/docs/source/configuration.md
+++ b/docs/source/configuration.md
@@ -1,6 +1,6 @@
# Configuration
-The pipeline requires a configuration file (by default called `conifg.ini`)
+The pipeline requires a configuration file (by default called `config.ini`)
in order to be run. Example configuration files are provided in the
[example](https://github.com/CosmoStat/shapepipe/tree/develop/example)
directory.
@@ -114,7 +114,7 @@ INPUT_DIR = last:psfex_runner
OUTPUT_DIR = /home/username/my_output_dir
FILE_PATTERN = psf
FILE_EXT = fits
-NUMBERING_LIST = -001, -002
+NUMBER_LIST = -001, -002
```
ShapePipe will look for the specific files `psf-001.fits` and `psf-002.fits`
diff --git a/docs/source/container.md b/docs/source/container.md
index 2511a7f3e..174cde9dc 100644
--- a/docs/source/container.md
+++ b/docs/source/container.md
@@ -2,7 +2,8 @@
ShapePipe ships as a container image. This page covers the two image
targets and the three configuration files (`pyproject.toml`, `uv.lock`,
-`Dockerfile`) that determine what's inside.
+`Dockerfile`) that determine what's inside. For running the image on a batch
+cluster (candide, CANFAR), see [Running on a cluster](clusters.md).
## Two image targets
diff --git a/docs/source/contributing.md b/docs/source/contributing.md
index 99cf85a09..814bf4852 100644
--- a/docs/source/contributing.md
+++ b/docs/source/contributing.md
@@ -2,7 +2,7 @@
ShapePipe is a fully open-source project and we welcome contributions.
-Pleas read our
+Please read our
[contribution guidelines](https://github.com/CosmoStat/shapepipe/blob/develop/CONTRIBUTING.md).
for details on how to contribute to the development of this package.
diff --git a/docs/source/dependencies.md b/docs/source/dependencies.md
index 64c6db0ea..9378bfe00 100644
--- a/docs/source/dependencies.md
+++ b/docs/source/dependencies.md
@@ -1,32 +1,65 @@
# Dependencies
-All third-party software packages required by ShapePipe are installed
-automatically (see [Installation](installation.md)). Note that all packages
-are pinned to a given version for each release of ShapePipe to maximise the
-reproducibility of the results. Below we list the main packages used.
+ShapePipe's dependencies are installed automatically — the recommended path is
+the [container](container.md), which bundles all of them (see also
+[Installation](installation.md)). `pyproject.toml` is the authoritative list: it
+declares the **abstract minimum versions** ShapePipe is compatible with, while
+`uv.lock` pins the **exact** versions that ship in a given build. The tables
+below describe the main packages; for the complete, current set always refer to
+`pyproject.toml`.
## Python Dependencies
-| Package Name | References |
-|--------------|---------------------------------------------------|
+Scientific stack:
+
+| Package | References |
+|---------|------------|
| [Astropy](https://www.astropy.org/) | {cite:p}`astropy:2013,astropy:2018` |
| [GalSim](https://github.com/GalSim-developers/GalSim) | {cite:p}`rowe:15` |
-| [Joblib](https://joblib.readthedocs.io/en/latest/) | {cite:p}`joblib:20` |
-| [Matplotlib](https://matplotlib.org/) | {cite:p}`hunter:07` |
+| [ngmix](https://github.com/aguinot/ngmix) | {cite:p}`sheldon:15` |
| [MCCD](https://github.com/CosmoStat/mccd) | {cite:p}`liaudat:21` |
| [ModOpt](https://cea-cosmic.github.io/ModOpt/) | {cite:p}`farrens:20` |
-| [mpi4py](https://mpi4py.readthedocs.io/en/stable/) | {cite:p}`dalcin:05,dalcin:08,dalcin:11` |
-| [NGMIX](https://github.com/esheldon/ngmix) | {cite:p}`sheldon:15` |
+| [python-pysap](https://github.com/CEA-COSMIC/pysap) | |
| [Numpy](https://numpy.org/) | {cite:p}`harris:20` |
+| [Numba](https://numba.pydata.org/) | |
| [Pandas](https://pandas.pydata.org/) | {cite:p}`pandas:10,pandas:20` |
+| [Matplotlib](https://matplotlib.org/) | {cite:p}`hunter:07` |
+| [Joblib](https://joblib.readthedocs.io/en/latest/) | {cite:p}`joblib:20` |
+| [mpi4py](https://mpi4py.readthedocs.io/en/stable/) | {cite:p}`dalcin:05,dalcin:08,dalcin:11` |
+| [reproject](https://reproject.readthedocs.io/) | |
+| [h5py](https://www.h5py.org/) | |
| [sf_tools](https://github.com/sfarrens/sf_tools) | |
-| [sqlitedict](https://github.com/RaRe-Technologies/sqlitedict) | |
-## Executable Dependencies
+Data access & infrastructure (CANFAR / UNIONS):
+
+| Package | Purpose |
+|---------|---------|
+| [vos](https://github.com/opencadc/vostools) | CADC / CANFAR VOSpace access |
+| [skaha](https://github.com/shinybrar/skaha) | CANFAR Science Platform sessions |
+| canfar | CANFAR container-job submission |
+| [astroquery](https://astroquery.readthedocs.io/) | external catalogue queries |
+| [cs_util](https://github.com/CosmoStat/cs_util) | shared CosmoStat utilities |
+| [sqlitedict](https://github.com/RaRe-Technologies/sqlitedict) | on-disk pipeline state |
+
+```{note}
+`ngmix` is pinned to the
+[`aguinot/ngmix@stable_version`](https://github.com/aguinot/ngmix/tree/stable_version)
+fork until the fixes land upstream — do not modernise that line (see the note in
+`pyproject.toml`).
+```
-| Package Name | References |
-|---------------|------------|
-| [CDSclient](http://cdsarc.u-strasbg.fr/doc/cdsclient.html) | |
+## System Dependencies
+
+The container also provides the non-Python tools ShapePipe calls, all from Debian
+packages (no source builds), plus the MPI stack:
+
+| Package | References |
+|---------|------------|
+| [Source Extractor](https://www.astromatic.net/software/sextractor/) | {cite:p}`bertin:96` |
| [PSFEx](https://www.astromatic.net/software/psfex/) | {cite:p}`bertin:11` |
-| [SExtractor](https://www.astromatic.net/software/sextractor/) | {cite:p}`bertin:96` |
| [WeightWatcher](https://www.astromatic.net/software/weightwatcher/) | {cite:p}`marmo:08` |
+| OpenMPI (5.0.x) | |
+
+Python dependencies themselves are managed with [uv](https://docs.astral.sh/uv/);
+see [Container Workflow](container.md) for how `pyproject.toml`, `uv.lock`, and
+the `Dockerfile` fit together.
diff --git a/docs/source/module_develop.md b/docs/source/module_develop.md
index 3df8fe88c..b4ef3566f 100644
--- a/docs/source/module_develop.md
+++ b/docs/source/module_develop.md
@@ -7,7 +7,7 @@ This page provides details on how new modules can be implemented in ShapePipe.
Every ShapePipe module requires a *module package*, which is simply a directory
with the module name followed by `_package`, e.g. for a module called
`my_new_module` you would create a new directory called `my_new_module_package`
-and put it in `shapepipe/modules`. Inside this directory you will need to
+and put it in `src/shapepipe/modules`. Inside this directory you will need to
include a Python file (ideally named after your module, e.g.
`my_new_module.py`) and a `__init__.py` file with the following content.
diff --git a/docs/source/pipeline_canfar.md b/docs/source/pipeline_canfar.md
index 39111b05f..7e8bb29c5 100644
--- a/docs/source/pipeline_canfar.md
+++ b/docs/source/pipeline_canfar.md
@@ -457,6 +457,16 @@ ln -s ..//unions_shapepipe_comprehensive_struct_2024_v1.6.c.hdf5 unions_shapepip
calibrate_comprehensive
+```{note}
+The matched-star-catalogue and coverage-mask diagnostics in the next two
+sections have largely moved out of ShapePipe into
+[`sp_validation`](https://github.com/CosmoStat/sp_validation) / `cosmo_val`.
+Several helpers referenced below — `merge_psf_cat.py`, `download_headers`,
+`extract_field_corners`, `build_coverage_map`, `plot_coverage_map` — are no
+longer shipped in this repository; see `sp_validation` for their current
+equivalents. (`get_ccds_with_psf` and `summary_run` are still part of ShapePipe.)
+```
+
### Create matched star catalogue
For diagnostics, a catalogue with multi-epoch shapes measured by ngmix matched with the validation star catalogue is used.
diff --git a/docs/source/pipeline_v2.0.md b/docs/source/pipeline_v2.0.md
deleted file mode 100644
index bdcd07d08..000000000
--- a/docs/source/pipeline_v2.0.md
+++ /dev/null
@@ -1,292 +0,0 @@
-# Running `ShapePipe` processing and post-processing pipelines on CANFAR
-
-Documentation to create ShapePipe output products for catalogues v2.x.
-
-## Initialise directory structure
-
-```bash
-init_run_v2.0.sh
-```
-sets up the directory structure. This will be
-
-v2.0/
-├── tiles/
-│ ├── 301/
-│ │ ├── 301.278/
-│ │ ├── 301.279/
-│ │ └── ...
-├── exp/
-│ ├── 21/
-│ │ ├── 21163916
-│ │ └── ...
-├── cfis -> /arc/home/kilbinger/shapepipe/example/cfis
-├── tile_numbers -> /arc/home/kilbinger/shapepipe/auxdir/CFIS/tiles_202604/tiles_r.txt
-└── debug/
-
-
-### Interactive job from the terminal for a single tile
-
-Run bit-coded jobs
-```bash
-run_job_canfar_v2.0.sh -e ID -j
-```
-with job processing tiles:
-- 1: download tiles
-- 2: uncompress tile weights
-- 4: find exposures
-then exposures:
-- 8: download exposures
-- 16: split exposures into single-CCD HDUs
-- 32: mask exposures
-- 64: process stars (selection, PSF movel)
-then back to tiles:
-- 128: select objects (using external catalogue)
-- 256: create object postage stamps
-
-## CANFAR Setup
-
-### CANFAR Login
-
-Login to the canfar system with
-
-```bash
-canfar auth login
-```
-
-This can be done from at notebook or terminal within the canfar science portal,
-or any remote terminal that has the canfar library installed.
-
-Check authentication status with
-
-```bash
-canfar auth list
-```
-
-If not on "default", run
-
-```bash
-canfar auth switch default
-```
-
-
-## From previous setup
-
-### Merge all final catalogues
-
-The last step of `ShapePipe` processing is, per patch, to merget all final catalogues. This is done via a python script, as follows.
-First, change to parent directory `/path/to/version` and run the following command for all patches
-
-```bash
-patchnum=`tr $patch P ''`
-create_final_cat.py -m final_cat_$patch.hdf5 -i . -p $patch/cfis/final_cat.param \
- -P $patchnum -o $patch/n_tiles_final.txt -v
-```
-
-## Additional `ShapePipe` processing
-
-
-### Create star Catalogue
-
-We can additionaly create a combined star catalogue, with star shapes projecte from detector to world coordinates.
-This is useful for validation and galaxy-PSF/star correlation diagnostics.
-
-#### Combine all PSF runs
-
-In each patch directory /path/to/version/$patch, run
-
-```bash
-combine_runs.bash -p $psf -c psf
-```
-
-to create a single output directory of PSF files (symbolic links).
-
-Optionally, to create and plot results for this patch only:
-
-```bash
-shapepipe_run -c $SP_CONFIG/config_Ms_$psf.ini
-shapepipe_run -c $SP_CONFIG/config_Pl_$psf.ini
-```
-
-#### Convert star catalogue to wCS
-
-Convert all input validation PSF files and create directories per patch `P?`.
-Create files `validation_psf_conv--.fits` (for the v1.4 setup only one file):
-
-```bash
-cd /path/to/version
-mkdir stat_car
-cd star_cat
-```
-
-For each patch run
-
-```bash
-convert_psf_pix2world.py -i .. -P $patchnum -v
-```
-
-Combine previously created files as links within one ShapePipe run directory (for the v1.4 setup only one link).
-First (and optiohnal), create a subdir for a run and link to the input patches:
-
-```bash
-cd /path/to/version/star_cat
-mkdir v1.6
-ln -s ../P1
-ln -s ../P2
-...
-```
-
-Next, create links to all `validation_conv` runs:
-
-```bash
-combine_runs.bash -p psfex -c psf_conv
-```
-
-Merge all converted star catalogues and create `final-starcat.fits`:
-
-```bash
-export SP_RUN=`pwd`
-shapepipe_run -c ~/shapepipe/example/cfis/config_Ms_psfex_conv.ini
-```
-
-Rename to general PSF and star catalogue used for all ("a") sub-versions:
-
-
-```bash
-cp output/run_sp_Ms/merge_starcat_runner/output/full_starcat-0000000.fits \
- unions_shapepipe_psf_2024_v1.6.a.fits
-```
-
-The FITS file `CATTYPE` (newer version) should be `validation_psf_conf`.
-
-## Post-processing
-
-The following post-processing steps are performed with the library `sp_validation`.
-
-### Extract Information
-
-First, we extract all information from the final catalogue, per patch. We copy
-the parameter file and set links to the catalogues and `ShapePipe` config directory.
-
-```bash
-cd /path/to/version/$patch
-cp ~/astro/repositories/github/sp_validation/notebooks/params.py .
-ln -s /path/to/final_cat_$patchnum.hdf5 # not relative path ../final_cat_P$patchnum.hdf5 !
-ln -s output/run_sp_MsPl/mccd_merge_starcat_runner/output/full_starcat-0000000.fits
-ln -s ~/astro/repositories/github/shapepipe/example/cfis
-```
-
-Then edit `params.py`: Set patch name; set `wrap_ra` for P2.
-
-Now we can run the script, recommended via job submission on candide. For large patches,
-this requies a job with a large memory, e.g. with `mem=380000`
-
-
-```bash
-[squeue] python ~/astro/repositories/github/sp_validation/notebooks/extract_info.py
-```
-
-This creates a patch-wise comprehensive catalogue.
-
-### Create global comprehensive catalogues
-
-```bash
-cd /patch/to/version
-[squeue] python ~/astro/repositories/github/sp_validation/scripts/create_joint_comprehensive_cat.py \
- -v v1.6.c -v -p P1+P2+P3+P4+P5+P6+P7+P8+P9
-```
-
-This creates the file `unions_shapepipe_comprehensive_2024_v1.6.c.hdf5`.
-
-
-### Apply structural masks
-
-First, edit the Python script `~/astro/repositories/github/sp_validation/notebooks/demo_apply_hsp_masks.py`
-to match catalogue name. Check the coverage mask input file (see below).
-Run the script to apply the healsparse structural masks:
-
-```bash
-[squeue] python ~/astro/repositories/github/sp_validation/notebooks/demo_apply_hsp_masks.py
-```
-
-This creates the file `unions_shapepipe_comprehensive_struct_2024_v1.6.c.hdf5`.
-
-
-### Define sample, calibrate catalogue
-
-We are close to finally perform the last post-processing step, which is the calibration. First, the final galaxy sample
-in question needs to be defined, with masks and cuts to apply from a `yaml` config file. A number of pre-defined files
-can be found in `~/astro/repositories/github/sp_validation/calibration`.
-
-For example, to create `v1.6.6`, the steps are:
-
-```bash
-cd /path/to/version
-mkdir -p v1.6.6
-cd v1.6.6
-ln -s ~/astro/repositories/github/sp_validation/calibration/mask_v1.X.6.yaml config_mask.yaml
-ln -s ..//unions_shapepipe_comprehensive_struct_2024_v1.6.c.hdf5 unions_shapepipe_comprehensive_struct_2024_v1.X.c.hdf5
-[squeue] python ~/astro/repositories/github/sp_validation/calibrate_comprehensive_cat.py
-```
-
-calibrate_comprehensive
-
-
-### Create matched star catalogue
-
-For diagnostics, a catalogue with multi-epoch shapes measured by ngmix matched with the validation star catalogue is used.
-This is created as follows:
-
-```bash
-cd /path/to/version
-merge_psf_cat.py [-V v1.6|-P P1+P2+...] -v
-```
-
-This creates the joint catalogue unions_shapepipe_star_2024_v1.6.a.fits .
-
-### Create coverage mask
-
-First, on canfar, move to the directory that has the patch subdirectories.
-
-```bash
-cd /path/to/version
-```
-
-#### Get exposure numbers
-
-If the file `$patch/exp_numbers.txt` does not exist for a given patch, create it with the summary program
-
-```bash
-summary_run $patch 1
-```
-
-Now, create the list of CCDs that have PSF information with
-
-```bash
-get_ccds_with_psf -v -V v1.6
-```
-
-Next, download exposures headers
-
-```bash
-download_headers -i ccds_with_psfs_v1.6.txt -o headers_v1.6 -v
-```
-
-From the headers, the CCD corner coordinates are extracted with
-```bash
-extract_field_corners -i headers_v1.6 -v
-```
-
-Then, build the healsparse coverage mask file as
-```bash
-build_coverage_map
-```
-
-Use `plot_coverage_map` to create plots of the coverage mask.
-
-## Extra Utilities
-
-### Run in Terminal in Parallel
-
-```bash
-cat IDs.txt | xargs -I {} -P 16 bash -c 'init_run_exclusive_canfar.sh -j 512 -e {}'
-```
diff --git a/docs/source/post_processing.md b/docs/source/post_processing.md
index 577ad57b3..e8a3768f4 100644
--- a/docs/source/post_processing.md
+++ b/docs/source/post_processing.md
@@ -16,7 +16,16 @@ If main ShapePipe processing happened at the old canfar VM system (e.g. CFIS v0
---
-The following steps are required for pre-v1.4 runs performed on the canfar VM system.
+```{note}
+This page documents the **legacy** post-processing used for pre-v1.4 runs on the
+canfar VM system. PSF validation and the scale-dependent diagnostics
+(ρ-statistics) now live in
+[`sp_validation`](https://github.com/CosmoStat/sp_validation) rather than in
+ShapePipe; the steps below are retained for reference and for reproducing older
+runs.
+```
+
+The following steps were used for pre-v1.4 runs performed on the canfar VM system.
1. Optional: Split output into sub-samples
@@ -48,30 +57,25 @@ The following steps are required for pre-v1.4 runs performed on the canfar VM sy
This script creates a new combined psf run in the ShapePipe `output` directory, by identifying all psf validation files
and creating symbolic links. The run log file is updated.
- 3. Merge individual psf validation files into one catalogue. Create plots of the PSF and their residuals in the focal plane,
- as a diagnostic of the overall PSF model.
- As a scale-dependend test, which propagates directly to the shear correlation function, the rho statistics are computed,
- see {cite:p}`rowe:10` and {cite:p}`jarvis:16`,
+ 3. Merge the individual PSF validation files into one catalogue, and create
+ plots of the PSF and its residuals in the focal plane as a diagnostic of
+ the overall PSF model:
```bash
shapepipe_run -c /path/to/shapepipe/example/cfis/config_MsPl_PSF.ini
- ```
-
- 4. Prepare output directory
-
- Create links to all 'final_cat' result files with
- ```bash
- prepare_tiles_for_final
```
- The corresponding output directory that is created is `output/run_sp_combined/make_catalog_runner/output`.
- On success, it contains links to all `final_cat` output catalogues
+ The scale-dependent PSF diagnostics that propagate to the shear correlation
+ function (the ρ-statistics, {cite:p}`rowe:10`, {cite:p}`jarvis:16`) are no
+ longer computed here — they have moved to
+ [`sp_validation`](https://github.com/CosmoStat/sp_validation).
- 5. Merge final output files
-
- Create a single main shape catalog:
+ 4. Merge final output files
+
+ Create a single main shape catalogue:
```bash
merge_final_cat -i -p -v
```
- Choose as input directory `input_dir` the output of step C. A default
- parameter file `` is `/path/to/shapepipe/example/cfis/final_cat.param`.
+ Choose as input directory `input_dir` the `make_cat` output of the runs
+ being combined. A default parameter file `` is
+ `/path/to/shapepipe/example/cfis/final_cat.param`.
On success, the file `./final_cat.npy` is created. Depending on the number of
input tiles, this file can be several tens of Gb large.
diff --git a/docs/source/random_cat.md b/docs/source/random_cat.md
index 724ff6e32..a930d40ed 100644
--- a/docs/source/random_cat.md
+++ b/docs/source/random_cat.md
@@ -6,6 +6,15 @@ masks, and combined randoms and masks for a selection of tiles.
The masked regions are obtained on input from ShapePipe pixel mask ("pipeline flag")
files.
+```{note}
+Parts of this procedure use the legacy canfar-VM / `vos` retrieval workflow (see
+[VOSpace retrieval](vos_retrieve.md)) and the obsolete `prepare_tiles_for_final`
+helper, which is no longer shipped. The `random_cat` module itself is current;
+the input-staging and joint-mask steps now overlap with
+[`sp_validation`](https://github.com/CosmoStat/sp_validation). The steps are
+retained for reference.
+```
+
## Set up
### ID file and shell variables
@@ -64,7 +73,7 @@ while read p; do tar xvf pipeline_flag_$p.tgz; done