diff --git a/.felt/dependabot-pr-triage/dependabot-pr-triage.md b/.felt/dependabot-pr-triage/dependabot-pr-triage.md index 804ea499e..48a81c4f8 100644 --- a/.felt/dependabot-pr-triage/dependabot-pr-triage.md +++ b/.felt/dependabot-pr-triage/dependabot-pr-triage.md @@ -1,4 +1,5 @@ --- +id: 01KTCHWZSHQXRQHTHH9Q7Q4MT7 name: Triage open dependabot uv.lock PRs status: closed tags: @@ -84,3 +85,7 @@ Update this fiber's `outcome:` to reflect what landed (e.g. "5 merged, 1 deferre ## Skills Activate `felt` (you'll touch the fiber). The `shuttle` skill is already loaded by virtue of being the dispatched worker. + +## Follow-on: PR #742 (2026-06-11) + +PR #742 ("chore(deps): bump the lockfile-minor-patch group, 4 updates": mpi4py 4.1.1→4.1.2, numpy 2.4.4→2.4.6, snakemake 9.20.0→9.21.0, black) arrived after the original batch. CI green (`build-test-publish` passes). Not merged — pyproject.toml modified alongside uv.lock: lower bounds tightened (`mpi4py>=4.0→4.1.2`, `numpy>=2.0→2.4.6`, `snakemake>=9.20.0→9.21.0`). Tightening `>=` floors is a policy call, not a mechanical bump. Comment posted at https://github.com/CosmoStat/shapepipe/pull/742#issuecomment-4675762821 asking maintainers to confirm intent before merging. diff --git a/.felt/docker-multistage/docker-multistage.md b/.felt/docker-multistage/docker-multistage.md index 94951a708..8b5ed1076 100644 --- a/.felt/docker-multistage/docker-multistage.md +++ b/.felt/docker-multistage/docker-multistage.md @@ -1,4 +1,5 @@ --- +id: 01KTCHWZV7263MYPNB8DT1FXPX name: 'Docker multi-stage: runtime + dev targets' tags: - shapepipe diff --git a/.felt/docker-uv-revert/docker-uv-revert.md b/.felt/docker-uv-revert/docker-uv-revert.md index dd138a97f..a1f4a16d0 100644 --- a/.felt/docker-uv-revert/docker-uv-revert.md +++ b/.felt/docker-uv-revert/docker-uv-revert.md @@ -1,16 +1,18 @@ --- +id: 01KTCHWZW8DT8VRYQJ4CEPDRGB name: 'Docker: revert skaha→python base, switch to uv lockfile' -status: active +status: closed tags: - shapepipe - docker - infra created-at: 2026-04-27T11:26:45.677512058+02:00 -outcome: 'PR #719 (chore: switch Dockerfile to slim Python + uv lockfile) opened and CI-green on first try (3m31s); ready for review. Drops conda double-install, makes pyproject SSOT + uv.lock the pinned manifest, switches WeightWatcher from sed-patched source build to Debian''s pre-patched 1.12+dfsg-3 package, adds binary smoke tests to deploy-image.yml.' +closed-at: 2026-06-10T17:14:45.965931602+02:00 +outcome: 'Superseded: conda removal landed via #733 (merged); the #719-era docker-uv work is absorbed into ci-green-on-develop, which tracks remaining follow-ups.' decisions: base: label: Base image - rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. The canfar deployment concern is satisfied as long as the slim image works on canfar. + rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. Martin's canfar concern is satisfied as long as the slim image works on canfar. default: python-slim options: python-slim: @@ -50,7 +52,7 @@ decisions: label: uv + pyproject + uv.lock; uv sync --frozen in Dockerfile modernize: label: Modernize package versions - rationale: 'We determined which versions MUST stay pinned: only ngmix (pinned to a stable_version fork branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.' + rationale: 'We determined which versions MUST stay pinned: only ngmix (Axel''s stable_version branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.' default: stay-current options: stay-conservative: @@ -58,7 +60,7 @@ decisions: excluded: true excluded_reason: Drift between pyproject signal and lockfile reality; loses the chance to surface numpy-2/pandas-3 incompatibilities at PR time when CI is fast stay-current: - label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to its stable_version fork branch + label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to Axel's stable_version branch insights: ci-fast: claim: 'First CI run on PR #719 went green in 3m31s. uv installed 238 packages in 322ms — everything resolved to prebuilt wheels, no source compilation of galsim/mpi4py/python-pysap/etc. Massive speedup vs. previous build.' @@ -97,10 +99,11 @@ The `--frozen` flag is the discipline mechanism: a stale lockfile cannot ship. ## Followups - Watch CI on #719. The slim-base apt list is conjectural — galsim/mpi4py/python-pysap pull a lot of system deps and we may need to add more (`libatlas-base-dev`, `libblas-dev`, etc). -- If CI needs anything beyond what's in the apt block, that's worth noting for next time. -- After this lands, PRs #708 and #714 may need a small rebase. -- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — floated as a possibility, but slim should work on canfar. +- If CI needs anything beyond what's in the apt block, that's the surface that benefits from a [[shapepipe/prs-in-flight]] note for next time. +- After this lands, [[shapepipe/prs-in-flight]] PRs #708 and #714 may need a small rebase. +- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — Martin floated it but we agreed slim should work on canfar. ## Connections - [[shapepipe]] — root +- [[shapepipe/prs-in-flight]] — touches the testing-scaffold xfail set and the develop-bugs PR diff --git a/.felt/fabian-coord-bug/fabian-coord-bug.md b/.felt/fabian-coord-bug/fabian-coord-bug.md new file mode 100644 index 000000000..bcbbc1bb2 --- /dev/null +++ b/.felt/fabian-coord-bug/fabian-coord-bug.md @@ -0,0 +1,11 @@ +--- +id: 01KTCHWZWEXTJ337APFSH4NF6X +name: Fabian's coord-propagation bug + image-sim code on github +tags: + - shapepipe + - bug + - collaboration + - future +created-at: 2026-04-27T11:26:52.878118978+02:00 +outcome: 'Fabian: 1-line fix in shapepipe needs porting; first need him to put image-sim code/configs on github so it''s testable. Beg if necessary.' +--- diff --git a/.felt/ngmix-update/ngmix-update.md b/.felt/ngmix-update/ngmix-update.md index 723871e65..bd11d135f 100644 --- a/.felt/ngmix-update/ngmix-update.md +++ b/.felt/ngmix-update/ngmix-update.md @@ -1,9 +1,10 @@ --- -name: ngmix library upgrade + wrapper sync +id: 01KTCHWZWY48FVCEDX6DGE6CTY +name: ngmix library upgrade + Lucy wrapper sync tags: - shapepipe - ngmix - future created-at: 2026-04-27T11:26:51.026191639+02:00 -outcome: 'Replace the pinned ngmix fork (a stable_version branch carrying not-yet-upstreamed fixes) with upstream ngmix once those land; reconcile the wrapper afterward.' +outcome: 'Future: replace Axel''s stable_version fork with upstream ngmix; reconcile with Lucy''s cleaned-up wrapper from her visit' --- diff --git a/.felt/prs-in-flight/prs-in-flight.md b/.felt/prs-in-flight/prs-in-flight.md new file mode 100644 index 000000000..364db869f --- /dev/null +++ b/.felt/prs-in-flight/prs-in-flight.md @@ -0,0 +1,78 @@ +--- +id: 01KTCHWZWYYTDSYHAP4FPE7V3V +name: PRs in flight after v2 merge +tags: + - shapepipe + - pr +created-at: 2026-04-27T11:26:49.300097608+02:00 +outcome: 'Post-v2 + post-propagation: infra stream now landed (#718 setuptools, #719 uv-lockfile, #728 dependabot+SHA-pin), supply-chain hygiene done (20 → 0 alerts). Issue #712 empirically verified resolved against current `:develop` (all 11 packages in Martin''s May 18 list import in both read-only and writable sandbox modes); comment posted, awaiting Martin reply before closing. Science PRs still open: #714 develop-bugs (closes #709 + #711 only — #712 closes separately), #708 testing-scaffold (mine); #725 centroid shift (Axel), several older Martin PRs (#704 #703 #699 #660 #650 #636), #670 lbaumo file_io. Next thread: merge #714.' +insights: + 714-already-redundant: + claim: 'Surprise from rebasing #714: its Dockerfile commit (cf304f8f, adding astroquery/numba/fitsio + setuptools<81 pin) was *already* redundant on current develop — the v2 merge silently put astroquery/numba/fitsio into pyproject and the v2 Dockerfile installs them via ''pip install -e ".[fitsio]"'' at the end. setuptools<81 went away via #718. So ''rebase to drop the obsolete commit'' wasn''t waiting on #719 — it was already obsolete the moment v2 merged. Worth checking sooner next time before assuming a fix is still load-bearing.' + xfail-mostly-fixable: + claim: 'Most #708 xfails are about to be resolved: canfar_monitor IndentationError (4 xfails) and summary_run -h (1 xfail) are fixed in #714; astroquery/numba/fitsio import xfails (5 modules) resolve in #719 because uv sync installs them from pyproject. Only stile/treecorr corr2 (4 modules) is a separate issue requiring stile removal or upstream patch.' + dependabot-policy: + claim: 'shapepipe now ships `.github/dependabot.yml` (#728) with 14-day cooldown, monthly grouped lockfile PRs, github-actions ecosystem opted in, and SHA-pinned actions across all four workflows. Reasoning lives in the file itself + the #728 PR body. Companion fiber [[shapepipe/sqlitedict-pickle-smell]] tracks the single dismissed alert.' + 712-empirically-resolved: + claim: 'Issue #712 is empirically resolved against current `ghcr.io/cosmostat/shapepipe:develop` (dev target, post-#728). Both the original packages (astroquery, numba, fitsio) and Martin''s May 18 follow-up list (scipy, joblib, importlib_metadata, tqdm, LSSTDESC.Coord, pyyaml, astropy_iers_data, pyerfa) import cleanly in both read-only and writable sandbox modes, as do the three originally-flagged runner modules. Pyproject confirms astroquery/numba/joblib/tqdm are core deps; the rest are transitives of astropy/mccd/modopt/galsim; fitsio is gated in both runtime (`--extra jupyter --extra fitsio`) and dev (`--extra dev`) targets. Comment posted; awaiting Martin reply before closing. Likely root cause of the May 18 report: cached/older image.' +decisions: + setuptools-pin: + label: drop setuptools<81 pin + default: merged + options: + merged: + label: 'Already merged as #718 (c9e71df8) — small one-liner, agreed in transcript' +--- + +Snapshot of CosmoStat/shapepipe PR state, maintained as a living index. + +## Open — infra + +(All infra PRs landed. The dependabot stream is resolved; supply-chain +posture set; SHA-pins in place. See [[shapepipe/sqlitedict-pickle-smell]] +for the one open security-fiber.) + +## Open — issues (mine) + +| # | What | Status | +|---|---|---| +| #712 | Dockerfile missing runtime deps | Empirically resolved against current `:develop` ([comment](https://github.com/CosmoStat/shapepipe/issues/712#issuecomment-4562085977)). Both original list (astroquery/numba/fitsio) and Martin's May 18 follow-up (scipy/joblib/importlib_metadata/tqdm/LSSTDESC.Coord/pyyaml/astropy_iers_data/pyerfa) import cleanly in read-only + writable sandbox modes. Awaiting Martin reply before closing. | +| #711 | summary_run -h crashes | Fixed by #714 (auto-closes on merge) | +| #709 | canfar_monitor IndentationError | Fixed by #714 (auto-closes on merge) | + +## Open — mine (science / fixes) + +| # | Branch | What | Status | +|---|---|---|---| +| #731 | `chore/smoke-test-read-only` | smoke-test in read-only mode | Open. Adds `shapepipe_run_example` wrapper; CI now runs the entry-point smoke under `docker --read-only --tmpfs /tmp:rw`. See [[shapepipe/smoke-test-read-only]]. | +| #714 | `fix/develop-bugs` | small develop bugs (#709, #711) | Open. Originally a multi-bug fix; the Dockerfile portion got absorbed into #719. Worth checking what's still load-bearing here vs already-fixed-upstream. | +| #708 | `chore/testing-scaffold` | Tier 0–2 test scaffolding | Open. Some xfails should have flipped to xpass after the v2 + uv-lockfile work; needs a rebase + xfail-list audit. | + +## Open — others' PRs awaiting attention + +| # | Author | What | +|---|---|---| +| #741 | martinkilbinger / lbaumo | **Ngmix v2.0** — upstream ngmix 2.4.0 + Lucy's new classes. Canonical PR (CI mirror; fork PR #740 closed by Martin). OPEN, mergeable, CI green. Two-part review + next-steps triage delivered; 11 findings open (5 cut-and-dry, 5 decisions, 1 resume), 2 are merge-gates (weight-norm, `*_psfo`). See [[review-ngmix-v2-pr740]]. | +| #725 | aguinot | Fix centroid shift (overlaps #741 centroid work — cross-ref/supersede?) | +| #704 | martinkilbinger | Contributors | +| #703 | martinkilbinger | V1.3.x | +| #699 | martinkilbinger | Coverage mask | +| #670 | lbaumo | file_io handles sextractor header | +| #660 | martinkilbinger | Existing output directory | +| #650 | martinkilbinger | Third-party catalogue for tile objects | +| #636 | martinkilbinger | Rho statistics: flexible training/test split | + +## Recently closed + +- **#728** `chore/dependabot-config` — dependabot.yml + SHA-pin all actions. Merged 2026-05-28. +- **#727, #726, #724, #722, #721, #720** — dependabot security bumps for idna/urllib3/gitpython/mistune/jupyter-server/jupyterlab. All squash-merged 2026-05-28 (see [[shapepipe/dependabot-pr-triage]]). +- **#719** `chore/uv-lockfile` — merged 2026-05-05 (Martin). +- **#718** `chore/drop-setuptools-pin` — merged. +- **v2.0 PR** — merged. Source of the skaha/conda situation that #719 unwound. + +## Connections + +- [[shapepipe]] — root +- [[shapepipe/docker-uv-revert]] — drove #719 +- [[shapepipe/dependabot-pr-triage]] — drove the 6 security-bump merges (closed) +- [[shapepipe/sqlitedict-pickle-smell]] — future-work fiber for the one dismissed alert diff --git a/.felt/review-ngmix-v2-pr740/deep-dive-report.html b/.felt/review-ngmix-v2-pr740/deep-dive-report.html new file mode 100644 index 000000000..267991b77 --- /dev/null +++ b/.felt/review-ngmix-v2-pr740/deep-dive-report.html @@ -0,0 +1,199 @@ + + + + + +Deep dive — ngmix v2.0 weights & size + + + + + + +
+ +
+ 2026 · 06 · 05 + ·review-ngmix-v2-pr740 + ·deep dive — weights & size +
+ +

Two hard problems, solved.

+ +

A fan-out of six investigators + two synthesizers chased the two science-shaped + findings from the #741 review to ground. Both came back high-confidence, with verified arithmetic + and — for the weight regression — an observed empirical confirmation. Each problem + now has a concrete recommendation and a clean scoping plan. Full reports: + weights-report.md and size-report.md in this fiber.

+ +
+ + +
+
§ Problem A · weights / inverse-variance
+

v2.0 has two real, coupled weight-map regressions.

+ +

The live weight path is prepare_ngmix_weights (ngmix.py:871). Against + the v1 baseline (develop), v2.0 dropped two things:

+ + + + + +
Regr.What changedEffect
R1Noise estimate regressed from the object-free windowed get_noise (still in the file at :826, now dead — zero callers) to whole-stamp sigma_mad(gal), contaminated by galaxy flux.Size/flux-dependent mis-weighting of the χ² → biased g_err/T_err/s2n; the signature of a multiplicative shear bias m (cf. old-path m~-2.8e-2). Dominant today.
R2Lost the binarization weight_map[weight_map != 0] = 1, so line 906 multiplies the raw weight by 1/sig_noise².For 0/1 masks: per-epoch 1/Fscale² scale error. Latent hazard: feed a real inverse-variance map (THELI / BACKGROUND_RMS) and v2.0 double-counts the variance.
+ +

Observed, not inferred

+

Fed make_data's truth inverse-variance map (1/noise²) through the actual + function in the dev container:

+
+ noise=1e-3  →  truth ivar 1e6  |  recovered 8.81e11  |  ratio 8.81e5 (correct = 1.0)
+ noise=1e-4  →  truth ivar 1e8  |  recovered 7.35e15  |  ratio 7.35e7 (correct = 1.0) +
+

The weight is off by ≈1/noise² — exactly the R2 double-count. The regression is a + clean red→green test target (make_data injects the oracle; no ngmix + fit needed). Today's only weight-path test asserts run(42)==run(42) — determinism, + never truth. Coverage gap.

+ +
+
recommendation — split into two PRs
+

In #741 (now): minimal v1-restore — reinstate the + binarization + call get_noise instead of sigma_mad(gal) — plus the + red→green unit test. Two lines; fixes a regression this PR introduced; restores robustness to + real ivar maps.

+

Separate PR (next): the in-house SExtractor + BACKGROUND_RMS baseline you steered toward (THELI is external). Mostly + config — the check-image & vignetmaker ME loops are already list-driven, so delivering the + RMS stamps is near-free; the earned work is rewriting prepare_ngmix_weights to use + per-pixel 1/RMS² with sigma_mad as fallback, + before/after validation. + Effort S–M, ~4–8 h. Closes #604.

+
+ +
+ Open questions for Martin (6) +
    +
  1. Confirm SExtractor-BACKGROUND_RMS is the baseline, THELI a later add-on (non-blocking)?
  2. +
  3. RMS input optional (with sigma_mad/get_noise fallback) or required? (rec: optional)
  4. +
  5. #741 minimal fix: restore windowed get_noise (reintroduces the gal-guess machinery the PR may have deliberately stripped), or accept bare sigma_mad until #604? R1 is dominant, so it matters even short-term.
  6. +
  7. Does the RMS map feed anything beyond weight + noise-fill (e.g. background subtraction)? (rec: weight + noise-fill only)
  8. +
  9. Config-edit scope for #604: cfis only, or cfis + cfis_simu + canfar batch?
  10. +
  11. Merge bar for #604: smoke before/after on the example tile, or a full sim m/c calibration?
  12. +
+
+
+ +
+ + +
+
§ Problem B · r50 / T / σ size naming
+

The v2.0 “r50” columns aren’t half-light radii.

+ +

All five r50* columns are new in v2.0 (v1 wrote sizes only as + T). None is a true half-light radius:

+ +
+ galaxy r50 (l.409) = pars[4] = T = 2σ²  — an area, bit-for-bit a copy of T
+ PSF r50psf (l.411) = √(T/2) = σ  — a length, but off by ×1.1774
+ true half-light radius = 1.1774·σ = 1.1774·√(T/2)  — no column equals this +
+ +

So the same name root means an area on the galaxy side and a length + on the PSF side. Meanwhile the official catalogue paper (UNIONS-3500 WL I, + arXiv:2605.13549) reports the half-light radius r_h (= r50) as the primary + size for both galaxy and PSF, with the size cut written as the dimensionless ratio + 0.707 ≤ r_h/r_h,psf ≤ 3; T is given only as the derived DES area (Eq. 1). + The paper wants exactly the quantity these columns claim but fail to provide.

+ +

sp_validation reads none of the r50* columns (only the + dimensionless ratio T/Tpsf, which is robust) — so the mislabel harms nothing today, + but it's a live foot-gun for any future consumer and contradicts the paper.

+ +
+
recommendation — transform at the source
+

Make the ngmix writer emit honest columns: galaxy + r50 = 1.1774·√(T/2) with propagated error; PSF r50psf = existing σ × + 1.1774. Keep all T* columns. Then every "r50" is a length meaning the same thing on + both sides, matching the paper. De-duplicate (T_psfo_ngmixTpsf, + r50_psfo_ngmixr50psf).

+

Put the conversion web in cs_util once + (T_to_r50, r50_to_T, sigma_to_fwhm, corrected + T_to_fwhm) — the shared producer/consumer layer. Beats a pure rename (throws away + the paper's headline quantity) and a read-side-only fix (leaves the foot-gun on disk).

+
+ +

Bonus find: a real downstream bug in sp_validation

+

galaxy.py:T_to_fwhm is T/1.17741*2.355 — dimensionally wrong + (2.355/1.17741 = 2.000; it treats the area T as if it were σ). Its own + MKDEBUG comment already doubts it. It builds the fwhm_PSF regressor for the + scale-dependent PSF-leakage fit (run_object.py:510), so it distorts + the PSF-size axis and biases α(size). The correct form is + 2.355·√(T/2). A clean cs_util surface lets this be fixed and the trap retired.

+ +
+ Open questions for Lucy / Martin (6) +
    +
  1. Standardize on r50-as-primary (per UNIONS-3500 I) while keeping T, or keep T as working column with r50 derived?
  2. +
  3. Drop or alias the duplicate columns — is anything outside sp_validation src/ (notebooks, older catalogues) bound to the _psfo_ngmix names?
  4. +
  5. cs_util as the home for the converter web — agreed? Naming?
  6. +
  7. Were in-flight scale-dependent leakage results produced with the buggy T_to_fwhm, and how much does α(size) move once fixed?
  8. +
  9. Galaxy r50_err: simple propagation, or from the full pars covariance?
  10. +
  11. Catalogue-format version bump / changelog for the r50* semantics change?
  12. +
+
+
+ +
+ +
+
§ Where this leaves us
+

Two work-streams, both scoped.

+ +
+ +
+ + diff --git a/.felt/review-ngmix-v2-pr740/draft-review.md b/.felt/review-ngmix-v2-pr740/draft-review.md new file mode 100644 index 000000000..31cd8852e --- /dev/null +++ b/.felt/review-ngmix-v2-pr740/draft-review.md @@ -0,0 +1,117 @@ +# Draft review — PR #740 "Ngmix v2.0" (staging, NOT posted) + +Staged for the call with Cail. Posting/pushing happens *with* him, in his voice. +Authorship convention: "— Claude on behalf of Cail" unless he says otherwise. + +--- + +## Reviewer checklist (Martin's template) + +| Box | Verdict | +|---|---| +| Targets `develop` | ✅ head `ngmix_v2.0` → base `develop` | +| Assigned to developer | ⬜ (Cail to check on GitHub) | +| Appropriate labels | ☑️ `enhancement` only — fine; could add a `dependencies` label given the ngmix bump | +| Projects/milestones | ⬜ (Cail to check) | +| Clear description | ⚠️ Summary is terse (4 bullets). For a +3894/−3410, 67-file PR that bumps a core dep and rewrites the shape-measurement module, a fuller description would help — esp. the centroid-bias result. | +| "closes #" links | ❌ No issue links. #725 (Axel's centroid-shift PR) overlaps this work — should this PR close/supersede it? Worth an explicit cross-ref. | +| Code/doc style | ⚠️ Mostly good. Pockets of debug cruft (see line-level). | +| Docs updated | ❌ No `docs/` changes. The ngmix pin note in `CLAUDE.md` ("don't modernize this line") is now stale — this PR *is* the modernization. Needs updating on/with merge. | +| **CI passing** | ✅ **GREEN** (full suite). Got here by pushing a same-repo mirror (PR #741, branch `ngmix_v2.0` on origin) and merging `develop` into it (clean except `uv.lock`, regenerated via `uv lock`) to pull in the modernized CI. `build-test-publish` ran every step green: images build (new `uv.lock` + ngmix 2.4.0 resolve), binary smokes, entry point, **pytest suite** (incl. develop's `test_ngmix.py`), example pipeline; publish correctly skipped on the PR. Original fork PR #740 got no CI (Actions approval gate). | The modern test-running workflow (`pull_request → develop`) doesn't queue because GitHub evaluates the trigger from the *head branch's* workflow file, which predates the modernization. `ci-release.yml` runs the full suite only on `pull_request → main/master`. **To run the full suite + example pipeline, the branch must have `develop` merged in** (pulls in modern CI + resolves drift). Original #740 got no CI at all because fork PRs don't trigger our Actions without maintainer approval. | +| API docs built | ➖ n/a-ish (no public API doc surface changed materially) | +| All files checked + comments | ✅ done in this review | + +--- + +## Science-quality read (the part the checklist misses) + +### 1. ngmix 2.4.0 API correctness — ✅ verified empirically +Installed ngmix **2.4.0** (from `esheldon/ngmix@v2.4.0`, the PR's own source) over the +container stack and ran the **actual module code**. All 13 ngmix symbols the module +uses resolve in 2.4.0 (`fitting.Fitter`, `guessers.TPSFFluxAndPriorGuesser`/`TFluxGuesser`, +`runners.PSFRunner`/`Runner`, `metacal.MetacalBootstrapper`, `joint_prior.PriorSimpleSep`, +`priors.*`, `observation.*`, `Jacobian`). The metacal/bootstrap path runs end-to-end. + +### 2. Shear recovery — ✅ unbiased +Ran `do_ngmix_metacal` on 200 galsim galaxies (sub-pixel shifted, the centroid regime), +recovered shear via the metacal response: +- **m = +2×10⁻⁴ ± 3×10⁻⁴** (multiplicative bias consistent with zero) +- **c₂ = −1×10⁻⁵** (additive consistent with zero), R₁₁≈R₂₂≈0.924, 200/200 fits OK. +- Author's own `centroid_bias_v2.py` (noiseless, 60 trials, ±-cancellation): **m = +3.7×10⁻⁴ ± 0.5×10⁻⁴ (1σ)**, c₂ = (−0.3 ± 0.15)×10⁻⁵, R₁₁≈0.9234. +- **Read:** few×10⁻⁴ residual m — small, likely within UNIONS requirements, but the noiseless limit resolves a marginal *positive* residual (≈7σ given tiny noiseless bars). Not the centroid fix (fix-off gives the same) → metacal/fitter higher-order. Worth stating the m-requirement we're holding this to. + +### 3. The centroid fix — present, correct in form, but **necessity not demonstrated in idealized sim** +The fix (`make_ngmix_observation`, `ngmix.py:987-1004`) re-centers the galaxy Jacobian on +the HSM `FindAdaptiveMom` centroid so the centroid prior (centered at the Jacobian origin) +doesn't bias the fit. Sound idea. **But** running the same ensemble with the recenter +disabled gives the *same* unbiased m (both ~3×10⁻⁴) — i.e. in this small-shift (±½ px), +high-S/N regime the fix is harmless but not measurably necessary. **Question for the call:** +in what regime does the bias the fix targets actually appear? The author's centroid scripts +(`centroid_bias.py`, configs bug/fix/v2) presumably demonstrate it — is the before/after +result captured anywhere we can point to? This is the single most important thing to pin down. + +### 4. `r50` vs `T` labeling — ⚠️ likely mislabel, worth clarifying +In `compile_results` (`ngmix.py:415-417`): `output_dict[name]['r50'] = results[...]['pars'][4]`. +For an ngmix `gauss` model, `pars[4]` is **T** (≈2σ², arcsec²), *not* a half-light radius — +and `output_dict[name]['T']` stores the same value. And `r50_psfo = sqrt(T_psf/2)` is the +Gaussian **σ**, not r50 (true Gaussian r50 = 1.1774σ). So columns named `r50*` actually carry +T / σ. Downstream (sp_validation) needs to know what these columns mean. Is "r50" intentional +shorthand, or should these be true half-light radii? (Constitution flagged "r50 as the size +guess changed from T" — but the *guesser* still uses `T=0.25`; the change is in output column +naming, not the guess. Worth confirming Martin's intent.) + +--- + +## Code-review pass (multi-agent, high-effort) — findings + +Ran a 6-angle code-review over the ngmix delta. CI is green, so none of these break the +test suite — but the suite is shallow on these paths. Ranked. + +### Correctness — confirmed +1. **`get_guess` ImportError — shipped script dead on arrival.** `scripts/validation/centroid/centroid_bias.py:37` does `from ...ngmix import get_guess, get_noise`, but this PR removed `get_guess` from the module. The script ImportErrors on load. (Only `centroid_bias.py`; `test_centroid_shift.py` imports just `get_noise`.) Fix the import, restore `get_guess`, or drop the v1 script. +2. **Reproducibility break — unseeded RNG.** `ngmix.py:948-949` `prepare_ngmix_weights` uses global `np.random.randn` for the noise image + masked-pixel fill, while the module deliberately switched to a seeded `self._rng` (`:266`). Same tile + same seed → different masked-pixel values and metacal noise → non-reproducible shears. Thread `rng` into `prepare_ngmix_weights`/`make_ngmix_observation`. +3. **`*_psfo` columns now carry the metacal-RECONVOLVED PSF, not the original PSFEx/MCCD PSF.** `average_multiepoch_psf` reads `obsdict['noshear'][n_e].psf.meta['result']` — the reconvolved metacal PSF — and feeds `g_PSFo`/`T_PSFo`/`r50_PSFo`. But `compile_results` documents these as "the original image psf from psfex or mccd," and PSF-leakage / ρ-stat / null tests consume them as the input PSF. The old code fit the raw pre-metacal PSF separately. Confirm intent; the comment is now wrong either way. +4. **`CHECK_EXISTING_DIR` resume silently dropped.** New `process()` never reads `self._check_existing_dir` or calls `get_last_id` (both now dead). A configured resume reprocesses from scratch and overwrites — a documented feature silently no-ops. + +### Correctness — plausible, needs Martin/Lucy intent (raise as questions) +5. **Weight-map normalization changed.** Old `prepare_ngmix_weights` binarized the mask (`weight[weight!=0]=1`) then applied a flat `1/sig_noise²`. New keeps the per-pixel (already FSCALE-rescaled) weight AND multiplies by `1/sigma_mad(gal)²`, with `sigma_mad` taken over the *object-containing* stamp (flux-contaminated, vs the old windowed object-free `get_noise`). Looks like double inverse-variance + a biased noise estimate → shifts the fit's χ² weighting, errors, s2n, and PSF-averaging weights vs the validated v1. Intentional? +6. **Per-type PSF columns collapsed.** `Tpsf`/`g1_psf`/`g2_psf` are now identical across all 5 metacal HDUs (all from `T_PSFo`/`g_PSFo`); old code stored per-type values. May be fine if the metacal target PSF is type-independent — but if any response/selection step differences them it now gets exactly zero. +7. **Galaxy zero-pixel guard narrowed (`any`→`all`).** `prepare_postage_stamps:737` now skips an epoch only if `np.all(gal_vign==0)`; the old code skipped if *any* pixel was 0. Partial CCD-edge stamps (a band of exact-zero pixels) now enter the fit, relying on flags/weights to catch them. Intentional loosening or a regression? +8. **PSF fit uses the galaxy joint prior + galaxy `FLUX_AUTO` as the PSF flux guess** (`do_ngmix_metacal`). Matches the upstream ngmix example's prior reuse, but the galaxy-flux guess for the PSF is odd; possible PSF-fit convergence/quality impact. Lower confidence. + +### Altitude / structure +9. **Runner input-contract mismatch (4 runners).** `sextractor_runner`, `read_ext_sexcat_runner`, `psfex_interp_runner`, `vignetmaker_runner` moved the WCS/exp-numbers paths from named config options (`LOG_WCS`/`ME_LOG_WCS`/`ME_DOT_PSF_DIR`) to positional `input_file_list[i]` — but, unlike `ngmix_runner`, their `@module_runner` decorators were **not** updated. The real input contract now lives entirely in each config's `FILE_PATTERN`/`FILE_EXT` ordering; the shipped v2.0 example configs are correct, but a hand-written/v1-style config silently mis-maps inputs or IndexErrors with no schema to catch it. `ngmix_runner` is the model — extend the other four decorators to match. +10. **`r50`/`T` mislabel (units).** `compile_results:415-417` stores `pars[4]` (= `T`, arcsec²) under `r50`/`r50_err`, and `r50_PSFo = sqrt(T_psf/2)` is the Gaussian σ, not the half-light radius (`r50 = 1.1774σ`). sp_validation consumes these size columns. Intentional shorthand or a mislabel? + +### Cleanups (bundle into one comment) +- `save_results`: `ngmix.py:547` non-`f` string → error prints literal `{ext_name}`; leftover `print(...)` debug at `:564-565`; the `np.append`-per-batch FITS append is O(n²) over a tile. +- Dead code: unreachable `print` after `return` in `MegaCamFlip` (`:293`); unused `sextractor_e1e2` (`:1131`, with a copy-pasted wrong docstring); `# I still don't know how to handle this` (`:63`); commented `#self._gal_vignet_path` block (`:240-244`); `# MKDEBUG` markers; redundant `self._f_wcs_path`. +- `51 * 51` hardcoded masked-fraction denominator (`:766`) → use `v_flag_tmp.size`; near-term goal is configurable stamp sizes, so this will silently go wrong. +- Duplication: `scripts/python/fitting.py` and `scripts/jupyter/test_centroid_shift.py` each redefine their own `make_data`/`get_prior`, duplicating `testing/simulate.py` + the module `get_prior` added in this same PR (defeats simulate.py's "stable across branches" purpose). `fitting.py` is a stray upstream-ngmix example (`fitting_bd_empsf.py` header, `espy`/`images` refs) — likely an unintended add; delete or make it import the canonical helpers. +- Minor efficiency: doubled `galsim.Image(gal, scale=1.0)` in `make_ngmix_observation`; repeated uncached `SqliteDict[str(obj_id)]` deserialization per object/epoch; `get_exp_output_dirs` re-globs identical dirs when a runner repeats in `ME_IMAGE_EXP_RUNNERS`. + +--- + +## Line-level (earlier staged notes — superseded by the code-review pass above for ngmix.py) + +- **`pyproject.toml`** — ngmix now `>=2.4` with `[tool.uv.sources] ngmix = git esheldon/ngmix tag v2.4.0`. Why pin to a git tag rather than the PyPI release `ngmix==2.4.0`? PyPI would be more reproducible and drop the git dependency. (PyPI was unreachable from here to confirm 2.4.0 is published — Cail can check.) +- **`ngmix.py:63`** — stray `# I still don't know how to handle this` above `Tile_cat`. +- **`ngmix.py:254`,`:528`-ish** — `# MKDEBUG` markers and `print(...)` debug statements (e.g. `:293` unreachable `print('rotating megapipe image')` after a `return`; `:564-565` `print("output_dict"...)` in an error branch). Clean up before merge. +- **`ngmix.py:766`** — masked-fraction cut hardcodes `51 * 51`; if vignet size ever changes this silently breaks. Use `flag_vign.size`. +- **`scripts/python/fitting.py`** — looks like a lightly-edited copy of esheldon's `fitting_bd_empsf.py` example; docstring still references `fracdev` / `fitting_bd_empsf.py` and a bulge+disk fit that no longer applies. Either trim to match or drop — is it meant to ship? +- **`scripts/jupyter/test_centroid_shift.py`** — jupyter-exported (cell markers `# In[14]:`, commented-out code, inline-duplicated `make_data`). Now that `shapepipe.testing.simulate.make_data` exists, this could import it instead of carrying its own copy. Ship as-is or tidy? +- **`src/shapepipe/testing/simulate.py`** — 👍 clean, dependency-light (galsim+numpy only), gives exactly the cheap validation path. Nice addition. +- **Supporting runners** (`psfex_interp_runner`, `vignetmaker_runner`, `ngmix_runner`) — the `$SP_EXP`-tree resolver (`get_exp_output_dirs`) with config-gated fallback to the v1 run-log path is clean and backward-compatible. `f_wcs_path` now threaded via `input_file_list` consistently. No concerns. + +--- + +## Suggested PR-level summary comment (draft) +> Reviewed the diff and exercised it on candide against **ngmix 2.4.0** (installed from the +> PR's own source). The 2.x API is used correctly and the metacal path runs end-to-end; on +> simulated galaxies the v2.0 fitter recovers shear with **m ≈ 2×10⁻⁴** (consistent with zero). +> Two things before I can tick the checklist: (1) **CI hasn't run** — it's a fork PR, needs an +> "Approve and run" or a push to origin; (2) I couldn't reproduce a centroid bias in the +> idealized sim with the fix *off*, so I'd like to understand the regime where the centroid fix +> matters — is the before/after result from `centroid_bias.py` captured somewhere? Plus a few +> cleanups (debug prints, the `r50`/`T` column naming, stale `CLAUDE.md` ngmix note). Details inline. +> — Claude on behalf of Cail diff --git a/.felt/review-ngmix-v2-pr740/report.html b/.felt/review-ngmix-v2-pr740/report.html new file mode 100644 index 000000000..3004d77f9 --- /dev/null +++ b/.felt/review-ngmix-v2-pr740/report.html @@ -0,0 +1,248 @@ + + + + + +Report — review-ngmix-v2-pr740 + + + + + + +
+ +
+ + 2026 · 06 · 05 + · + review-ngmix-v2-pr740 + · + round 3 · martin responded +
+ +

Ngmix v2.0: where it stands & how to land it.

+ +

The two-part review is delivered, CI is green, and Martin has consolidated the + work onto PR #741 — closing the fork PR #740. This morning (06-05) + Martin replied to six of the eleven findings: he greenlit two cleanups, explained two + changes as intentional, pointed the weight-map concern at a long-standing issue, and poked Lucy + on the size-column naming — explicitly asking us to check the r50/T + arithmetic. I ran that check and confirmed the bug. Five findings — including + the second merge-gate, *_psfo — still have no response. This report carries the + updated dispositions and the sharpened sequence.

+ +
+ +
+
§ Current state
+

#741 is the PR of record — green, mergeable, and Martin has started answering.

+ +

Topology. Martin closed #740 (his fork branch + martinkilbinger:ngmix_v2.0) on 06-03 and consolidated onto #741 — + the CI mirror on CosmoStat:ngmix_v2.0. #741 is the canonical PR: + OPEN, MERGEABLE, + CI green (build-test-publish pass, full suite). It carries + the three part-1 fixes (seeded-RNG thread, scripts-import smoke, v1 + centroid_bias.py deletion) and the develop merge. No code has + changed since the part-2 review — every finding still sits at its anchored line.

+ +

Martin engaged this morning (06-05, 07:14–07:33Z) — six inline replies across + the findings, his first real pass on the science:

+ + +

I ran Martin's check. The model is gauss (both PSF and galaxy), + and the code itself settles the question: line 911 builds + galsim.Gaussian(sigma=np.sqrt(pars[4]/2)), which is only valid if + pars[4] = T = 2σ². So yes, pars[4] is T — and the + r50 columns are worse off than the question implies:

+
galaxy r50 (l.415) = pars[4] = T = 2σ²  — an area, arcsec²
+ PSF r50_PSFo (l.676) = sqrt(T/2) = σ  — the std-dev, arcsec
+ true half-light radius = 1.1774·σ  — neither column equals this
+

The two r50 columns are on different scales (T vs σ) and neither + is a half-light radius. Martin's instinct was right; the fix is a single honest + transformation, ideally the one moved into cs_util.

+ +

Still unanswered (5): 293 dead print, 1140 + sextractor_e1e2, the 4 runner decorators, fitting.py, + 1068 PSF prior/flux — and crucially 1045 + *_psfo reconvolved-PSF, the second merge-gate, which is the one open item + that genuinely changes what downstream null-tests read.

+
+ +
+ +
+
§ Triage
+

The eleven findings, by what unblocks them.

+ +

Martin has now spoken to six; five are still open. The column marks his + morning disposition. The action cut is still by kind:

+ +

A · cut-and-dry mechanical, no science — push as one cleanup + commit.   B · decision needs Martin/Lucy intent; some move + downstream science.   C · resume dead feature — decide + rewire vs remove.

+ + + + + + + + + + + + + + + + + +
LocFindingKindMartin ✓Action
293Unreachable print after return in the megapipe-rotate branch.ADelete the line.
1140sextractor_e1e2 is unused and its docstring is copy-pasted from another function.ADrop it, or fix the docstring if kept.
766Masked-fraction cut divides by a literal 51 * 51; breaks when stamp size becomes configurable (on the roadmap).Aagreed — make configurable; use the STAMP_SIZE/VIGNET_SIZE config convention.Add STAMP_SIZE config (default 51); divide by it, not the literal.
sextractor_
runner:74
+3
WCS moved to a positional input_file_list[-1] in 4 runners (sextractor, read_ext_sexcat, psfex_interp, vignetmaker) but their @module_runner decorators weren't updated (only ngmix_runner was). Shipped configs are correct, so not a live bug — but the decorator is the input contract.ADeclare the WCS input in each decorator's file_pattern/file_ext.
fitting.pyNear-verbatim copy of esheldon's ngmix example; not wired to the pipeline, and duplicates the new testing/simulate.py helpers it ships alongside.AConfirm it's not meant to ship → drop, or trim to import the canonical helpers.
1045*_psfo now carries the metacal-reconvolved PSF. g/T/r50_PSFo read from obsdict['noshear'].psf.meta (reconvolved), and the per-type PSF columns collapse to one value. PSF-leakage / ρ-stat / null tests consume these as the input PSF; the column doc still says "original image psf."B— unanswered
2nd merge-gate
Confirm null-test consumers tolerate reconvolved-PSF columns, or restore the raw pre-metacal PSF fit. Fix the doc either way.
949Weight-map normalization changed vs v1. sigma_mad(gal) is taken over the whole (flux-contaminated) stamp where v1 used the windowed, object-free get_noise (still defined, now unused); and the per-pixel weight map is then multiplied by 1/sig_noise² — reads as double inverse-variance. Shifts χ²-weighting, reported errors/s2n, and PSF-averaging weights.B→ #604 — known weight-handling issue (0/1 masks → need real inv-var). Contextualizes, doesn't resolve the v1-regression sub-finding.Distinguish the two: (a) the long-game proper-variance fix lives in #604; (b) the whole-stamp sigma_mad + double 1/sig² vs v1's windowed estimate is a this-PR question.
415 / 676r50 columns aren't half-light radii. Verified: pars[4] is T (l.911 uses Gaussian(σ=sqrt(pars[4]/2))). Galaxy r50 = T = 2σ² (area); PSF r50_PSFo = sqrt(T/2) = σ. Different scales; neither is the true r50 = 1.1774·σ. sp_validation consumes these size columns.Bverified + poked Lucy. Martin wants the transform un-hard-coded (move sp_validation:galaxy.pycs_util).One honest transform in cs_util: galaxy 1.1774·sqrt(pars[4]/2), PSF 1.1774·sqrt(T/2) — or rename to T/sigma. Cail's call as consumer.
737Zero-pixel guard loosened anyall. v1 skipped an epoch if any pixel was exactly 0; now only if the whole stamp is 0. Partial CCD-edge stamps now enter the fit.Bintended"any==0 passed close to no postage stamps, too restrictive" (commit d01503d0).Resolved as deliberate. Residual: is a fractional-zero threshold wanted, or is all==0 fine? Low priority.
1068PSF fit reuses the galaxy joint prior + galaxy FLUX_AUTO as the PSF flux guess. Mirrors the upstream example, but a galaxy-flux guess for the PSF is unusual. Lower confidence.BSanity-check PSF-fit convergence; confirm or adjust the guess.
254 / 257
/ 464
CHECK_EXISTING_DIR resume is dead. _check_existing_dir is set but never read in process(); get_last_id is defined but never called. A configured resume silently reprocesses from scratch and overwrites.Cremove"a hack to resume interrupted runs… can be removed now, implemented better in future if needed."Remove the option + get_last_id + the MKDEBUG marker. Folds into the Bucket-A commit.
+ +

Tally: 6 answered (2 greenlit cleanups, 1 verified, 1 intended, 1 → #604, 1 RNG ack) · 5 still open, of which 1045 *_psfo is the lone unanswered merge-gate.

+
+ +
+ +
+
§ Recommended sequence
+

How to get #741 mergeable.

+ +
    +
  1. Land the cleanups + Martin's greenlit items (Bucket A+C). One commit on + #741: drop the unreachable print (293) and dead sextractor_e1e2 + (1140), fix the four runner decorators, drop/trim fitting.py, remove the + CHECK_EXISTING_DIR resume + get_last_id (254, Martin: remove) + and the MKDEBUG marker, and add the STAMP_SIZE config + replacing the literal 51*51 (766, Martin: agreed). Pure mechanical, keeps CI green, + narrows the review surface to the two real science questions. (Cail's push offer stands.)
  2. + +
  3. Resolve the size columns (415/676). Verified bug — the fix is one honest + transform. Either land it in cs_util as Martin suggested (galaxy + 1.1774·sqrt(pars[4]/2), PSF 1.1774·sqrt(T/2)) or rename the columns to + T/sigma. Cail decides as the sp_validation consumer; Lucy's been poked.
  4. + +
  5. The lone unanswered merge-gate: *_psfo (1045). Martin/Lucy + still need to confirm that the null-test / ρ-stat consumers tolerate the metacal-reconvolved + PSF in g/T/r50_PSFo, or restore the raw pre-metacal fit. Doc fix either way. This + is the one open item that changes what downstream reads.
  6. + +
  7. Weight-norm (949) — split the thread. The proper-variance redesign is + tracked in #604 (post-merge, + long game). But ask separately whether the whole-stamp sigma_mad + double + 1/sig² is an intended v2.0 change or a regression from v1's windowed estimate — a + this-PR question #604 doesn't cover.
  8. + +
  9. Loose ends: PSF prior/flux guess (1068) sanity-check; zero-pixel guard + (737) resolved as intended. Merge #741 — Cail/Martin's gesture, never the + worker's — then delete the stale fork branch behind closed #740.
  10. +
+ +

The honest read on "mergeable today": CI is green and nothing here breaks the + suite. With Martin's morning pass, the merge-gate count drops to one clear item — + *_psfo (1045), which silently changes the PSF columns sp_validation / null-tests + consume — plus the size-column fix (415/676), now verified and mechanical once the convention is + chosen. Weight-norm (949) is reframed: the design fix is #604's (post-merge), leaving only the + narrower "is the sigma_mad change a regression?" question. Everything else can merge-then-iterate.

+ +

Housekeeping still open (from + the part-2 summary, none blocking): no closes # link (overlaps Axel's #725 centroid + work — cross-ref or supersede?); no docs update; the CLAUDE.md "don't modernize + ngmix" note is now stale; the pyproject ngmix pin should land on a tagged + esheldon/ngmix release rather than a fork branch.

+
+ +
+ +
+
§ Open questions
+

For Cail.

+ +
+ +
+ + diff --git a/.felt/review-ngmix-v2-pr740/review-ngmix-v2-pr740.md b/.felt/review-ngmix-v2-pr740/review-ngmix-v2-pr740.md new file mode 100644 index 000000000..c44543eb8 --- /dev/null +++ b/.felt/review-ngmix-v2-pr740/review-ngmix-v2-pr740.md @@ -0,0 +1,75 @@ +--- +id: 01KTCHWZX873VG28AA663A3ZQE +name: 'Review + work: ngmix v2.0 (PR #740)' +status: closed +tags: + - constitution + - shapepipe + - ngmix + - review +created-at: 2026-06-01T11:50:17.967872934+02:00 +closed-at: 2026-06-05T23:59:38.502Z +outcome: 'INTERACTIVE — live with Cail. Bucket-A cleanups PUSHED to ngmix_v2.0/#741 (dd4f656a..bd60dc8e: 293/1140/254-resume/766 + fitting.py deleted); CI running. Dev image built (/n17data/cdaley/containers/shapepipe-dev) + ngmix 2.4.0 installed → full suite + example pipeline now runnable. Ran a 6+2-agent WORKFLOW on the two hard problems → two decision-ready reports in fiber: weights-report.md, size-report.md, deep-dive-report.html (sent to Cail). WEIGHTS (#604+949): found TWO coupled regressions in prepare_ngmix_weights (ngmix.py:871) — R1 noise estimator regressed from object-free windowed get_noise (now DEAD at :826) to flux-contaminated whole-stamp sigma_mad; R2 lost the v1 binarization → double-counts a real inverse-variance map. EMPIRICALLY CONFIRMED (truth ivar 1e6 → recovered 8.8e11, ratio≈1/noise²). Clean red→green test target via make_data. Rec: SPLIT — minimal v1-restore (reinstate binarization + get_noise) + test in #741; SExtractor BACKGROUND_RMS baseline as a SEPARATE PR (Codex shuttle, ~4-8h, closes #604). SIZE (r50/T): galaxy r50=pars[4]=T (area), PSF r50psf=σ; neither is 1.1774σ; all 5 r50* cols new in v2.0. UNIONS-3500 WL I (arXiv:2605.13549) reports half-light radius r_h as PRIMARY. Rec: TRANSFORM at source (honest r50 in ngmix) + cs_util converter web; bonus find — sp_validation galaxy.py:T_to_fwhm is dimensionally wrong (feeds the scale-dependent PSF-leakage fit). Both work-streams now DISPATCHED as shuttles (Cail''s go): [[ngmix-weights-ivar]] (Codex — minimal #741 regression fix w/ red→green test + the SExtractor BACKGROUND_RMS baseline as a separate PR closing #604) and [[ngmix-size-columns]] (claude-opus — honest r50 at the ngmix source + cs_util converter web + the sp_validation T_to_fwhm leakage fix). Workers prepare branches/commits/PR-descriptions + reports; merge/PR-creation stays Cail''s. STILL OPEN for the eventual #741 reply (not in these streams): *_psfo reconvolved-PSF merge-gate (1045) + the runner-decorator contract (testable via the example pipeline). PLAN: let the two shuttles finish → reassess → write the Martin reply. No reply drafted yet.' +horizon: now +shuttle: + enabled: true + kind: oneshot + interactive: true + host: candide + project_dir: /automnt/n17data/cdaley/unions/shapepipe + agent: claude-opus +--- + +# Review + work: ngmix v2.0 (PR #740) + +Martin requested Cail's review on **[CosmoStat/shapepipe #740 "Ngmix v2.0"](https://github.com/CosmoStat/shapepipe/pull/740)** (head `ngmix_v2.0` → base `develop`, +3894/−3410 across 67 files). This is the PR that **realizes [[ngmix-update]]**: replace Axel's `stable_version` ngmix fork with **upstream ngmix 2.4.0** and adopt **@lbaumo's (Lucy Baumont's) new ngmix classes + interface**, reconciling the cleaned-up wrapper from her visit. Track it in [[prs-in-flight]]; it sits in the broader Martin collaboration under [[shapepipe]]. + +## Desired State + +PR #740 has a **thorough, scientifically-grounded review**, and the collaboration moves forward: + +1. **A structured review exists**, worked against Martin's reviewer checklist (targets `develop` ✓, labels, clear description, "closes #" links, code/doc style, docs updated, **CI passing**, all changed files read with comments) — *and* against a **science-quality bar** that the checklist doesn't capture (below). +2. **Comments are posted to GitHub** in Cail's voice (signed "— Claude on behalf of Cail" per the authorship convention, or Cail's own voice as he chooses on the call) — line-level where it's a specific code point, PR-level for the broader read. +3. Where Cail decides, **fixes/suggestions are pushed** (as PR review suggestions or a branch off `ngmix_v2.0`) or **requested** from Martin/Lucy. **Merge/approve is Cail's gesture, never the worker's.** + +### The science-quality bar (what the checklist misses) + +The diff is dominated by `src/shapepipe/modules/ngmix_package/ngmix.py` (~1252 lines changed) — the module overhaul. The review must actually reason about the science, not just the diff mechanics: + +- **ngmix 2.4.0 API correctness.** The 1.x→2.x ngmix API changed substantially (observation/fitter/bootstrapper interfaces). Does the new code use the 2.4.0 API correctly — priors, PSF fitting, the metacal/bootstrapping path, guesses? @lbaumo's new classes: are the abstractions sound, and does the interface match how shapepipe calls them (`ngmix_runner.py`)? +- **The centroid-bias fix.** A large part of this PR is centroid-bias work (`scripts/validation/centroid/centroid_bias{,_v2}.py`, the bug/fix/v2 configs, `test_centroid_shift.py`). Centroid shift biases shapes — this matters for UNIONS shear. **Does the fix actually remove the bias?** Look for the before/after evidence; if it's not in the PR, that's the key thing to ask for (or to *run* on candide). +- **`r50` as the galaxy-size guess** (changed from `T`). Reasonable? Convergence/robustness implications for the fitter guesses. +- **New `src/shapepipe/testing/simulate.py` + `scripts/python/fitting.py`** — what do they test/simulate, and do they give a cheap way to validate the v2.0 fitter independent of a full pipeline run? + +### Run it on candide (the repo is right here) + +`project_dir` is the live shapepipe checkout. Don't review the diff cold — exercise it: + +- `gh pr checkout 740` (or fetch `martin/ngmix_v2.0`) into a worktree so `develop`/`docs-rework` stays clean. +- Run the **ngmix module / its tests**; run `test_centroid_shift.py` and the centroid configs if feasible; check the v2.0 fitter produces sane output on a small input. Use the repo's `CLAUDE.md` for build/run conventions and the `uv` lockfile. +- Confirm **CI status** via `gh pr checks 740`. +- Capture what you ran and what you saw — that empirical check is the most valuable part of the review (the curious eye; don't report a result you didn't observe). + +## Interactive — how this runs + +`claude-opus`, **interactive**. Do the full autonomous prep first: read the diff (`gh pr diff 740`), read the changed `ngmix.py` / runner / Lucy's classes closely, run what you can on candide, and assemble (a) the checklist pass and (b) the science-quality read, with specific line-level notes staged as *draft* comments. **Then hold and wait for Cail to attach** — he'll talk through the science, sharpen the comments, and decide what's posted, what's pushed, and whether to approve. Posting to GitHub and any push happen *with* him, in his voice; the worker drafts, Cail directs. Do not close or `kill` until Cail signals done. + +## Context + +- **Repo:** `/automnt/n17data/cdaley/unions/shapepipe` on candide. Remotes: `origin` = CosmoStat/shapepipe, `martin` = martinkilbinger/shapepipe-1. `gh` is authenticated here. +- **PR #740** by Martin Kilbinger, opened 2026-06-01; head `ngmix_v2.0`, base `develop`; OPEN. Reviewer checklist is in the PR body. +- **@lbaumo = Lucy Baumont** — her ngmix classes/interface are central; her wrapper cleanup came out of her visit (see [[ngmix-update]]). +- **Prior art / related:** [[ngmix-update]] (the future-intent this realizes), [[prs-in-flight]] (the PR tracker — add #740 and its disposition), [[shapepipe]] (the Martin collaboration root). The repo `CLAUDE.md` carries build/test/run conventions — read it. +- **Authorship:** GitHub comments in Cail's voice, signed "— Claude on behalf of Cail" unless Cail asks otherwise on the call. + +## Round 2 (2026-06-05): next-steps report from Martin's responses — SUPERSEDES the merge-comment scope for THIS round + +**Cail's ask (night of 2026-06-05):** read Martin's comments/responses to the standing two-part review (now on #741), do a *further code-review pass*, and produce a **report of recommended next steps** given where Martin has landed. + +**Desired State (round 2):** +- Read every Martin (`martinkilbinger`) response across **#740** and **#741** — inline replies, review threads, and any new commits/force-pushes since the part-2 review (2026-06-03). Note the `# MKDEBUG` markers still in the code. (As of dispatch, Martin's engagement is light: one COMMENTED review + one inline reply at `ngmix.py:951` "Good. This probably was previous code before we changed to a random seed per tile." Verify whether more has landed.) +- **Further code-review pass** over the ngmix module + the 4 runners whose decorator contracts part-2 flagged (`sextractor_runner` et al.): for each of the 11 line-anchored findings, determine whether Martin's v2.0 now addresses it or it still stands. +- Produce a **next-steps report**: `report.html` in the fiber dir **and** a posted PR summary comment on #741 (Cail's voice / "— Claude on behalf of Cail"). For each open item: `{resolved by Martin | still open | needs a decision}`, owner (`Cail / Martin / Lucy`), and the **recommended sequence to get #740/#741 mergeable**. +- **REPORT ONLY — do NOT push new commits to the ngmix branch this round.** Pure analysis + PR comment. Keeps it collision-free while other shapepipe/pure_eb work is in flight. + +**Exit:** autonomous — produce the report + PR comment, then close for Cail's review. diff --git a/.felt/review-ngmix-v2-pr740/size-report.md b/.felt/review-ngmix-v2-pr740/size-report.md new file mode 100644 index 000000000..e43e997c3 --- /dev/null +++ b/.felt/review-ngmix-v2-pr740/size-report.md @@ -0,0 +1,94 @@ +# ngmix v2.0 size-column naming: ground truth, official convention, recommendation + +**Scope:** the five new `r50*` size columns added in ngmix v2.0 (PR #741), what they actually contain, what the official UNIONS papers report, and how to make the producer (shapepipe) → shared (cs_util) → consumer (sp_validation) stack honest and consistent. + +**Verification:** code claims verified in the worktree; conversion arithmetic verified numerically; paper convention from arXiv:2605.13549 and arXiv:2204.04798; downstream consumption from the sp_validation source tree. + +**Headline:** the v2.0 `r50` columns are mislabeled — galaxy `r50` == the area `T`; PSF `r50psf` == `σ` (not `1.1774σ`). Cleanest fix: make the ngmix writer emit honest, correctly-converted columns at the source (true half-light radius `r50 = 1.1774·√(T/2)` for both galaxy and PSF, alongside the existing `T`), rather than a rename or a downstream-only transform. A *separate* load-bearing bug lives in sp_validation's `T_to_fwhm`, which a clean r50/σ surface lets us retire. + +--- + +## 1. Code ground truth — what each size column actually holds + +In ngmix `gauss`, `pars = [cen1, cen2, g1, g2, T, flux]` and **`pars[4] = T = 2·σ²` is an AREA** (arcsec²). The half-light radius is `r50 = 1.1774·σ = 1.1774·√(T/2)`. The `σ = √(T/2)` mapping is confirmed twice in-file (`galsim.Gaussian(sigma=np.sqrt(guess[4]/2))`; `r50_psfo = np.sqrt(max(psf_res['T_psf'],0)/2)`). + +| Column (FITS) | Source | Value in σ | True r50 = 1.1774σ? | Status | +|---|---|---|---|---| +| `T`, `T_err` | `results[…]["T"]` | `2σ²` (area) | — | Correctly named area | +| `r50`, `r50_err` | `pars[4]`, `pars_err[4]` | **`2σ²` (area)** | **No** | **MISLABELED: == `T`/`T_err` bit-for-bit. Neither √ nor ×1.1774 applied.** | +| `Tpsf` | `T_PSFo` | `2σ_psf²` (area) | — | Correctly named area | +| `T_psfo_ngmix`, `T_err_psfo_ngmix` | `T_PSFo`, `T_err_PSFo` | `2σ_psf²` | — | Correct; `T_psfo_ngmix` **duplicates** `Tpsf` | +| `r50psf` | `r50_PSFo` = `√(T_psf/2)` | **`σ_psf` (length)** | **No — it's σ, off by 1.1774×** | Genuine length, missing the factor | +| `r50_psfo_ngmix` | `r50_PSFo` | `σ_psf` | No — it's σ | **Duplicates `r50psf`** | +| `r50_err_psfo_ngmix` | `T_psf_err/(2·r50_psfo)` | `σ_psf_err` | No — error on σ | ~~Correct error-prop of σ~~ **also a factor-2 over-estimate: dσ/dT = 1/(4σ), not 1/(2σ)** (caught in [[shapepipe/ngmix-size-columns]] implementation); NaN when σ=0 | + +**Headline hazard:** the same name root `r50` means an **area** on the galaxy side and a **length** on the PSF side. A user ratioing galaxy `r50` against `r50psf` divides an area by a length. **Zero columns in the file are a true half-light radius** — every "r50" is either `T` (galaxy) or `σ` (PSF). + +**Arithmetic check (verified):** `√(2·ln2) = 1.17741`; `2.355/1.17741 = 2.000` (why the downstream `T_to_fwhm` is dimensionally wrong — §4). + +**v1 provenance:** `origin/develop` wrote sizes **only as `T`**. **All five `r50*` columns are NEW in v2.0**, and the galaxy two were wired to the area `pars[4]` — introduced as mislabeled duplicates. + +--- + +## 2. Official convention + downstream consumption + +**UNIONS-3500 WL I — A Galaxy Shape Catalogue (arXiv:2605.13549), the current shape-catalogue paper:** +- Reports the **half-light radius `r_h` (= ngmix `r50`)** as the **PRIMARY** size for **both galaxy and PSF**. +- Size cut is the dimensionless **`0.707 ≤ r_h/r_h,psf ≤ 3`**, applied *inside Metacalibration* (enters the selection response). +- Resolution factor `R = r_PSF/(r_PSF + r_h)` — written in `r`, not `T`. +- **`T` is derived only**, via Eq. 1 `T = (r_h²/ln2)·(1+g1²+g2²)/(1−g1²−g2²)`, "to relate r_h to the DES size definition." Round-object case `T = r_h²/ln2 = 2σ²` — identical to standard ngmix `T = 2σ²`. + +**Guinot et al. 2022 (arXiv:2204.04798):** mixed — galaxy cut in `T` (`T_gal/T_PSF > 0.5`), PSF size `T = 2σ²`, but ngmix fit param + a prior in `r50` (arcsec). + +**sp_validation:** treats galaxy/PSF size only as the dimensionless **ratio** `T/Tpsf` (size cut, DES-weight & leakage binning) — robust regardless of absolute meaning — plus one place where PSF `T` is converted to FWHM for the leakage fit (§4). **It reads NONE of the five `r50*` columns** (0 hits in `src/`). + +**Takeaway:** the official paper wants `r_h` (= r50) as the headline size — exactly the quantity the v2.0 `r50*` columns *claim* to provide but don't. Standardize on **r50 = half-light radius as primary, T = 2σ² as the derived DES area.** + +--- + +## 3. Recommendation — TRANSFORM at the source, thin cs_util surface + +Three options: a **pure RENAME** (`r50`→`T`/`σ`) is honest but discards the chance to ship the paper's headline quantity; a **cs_util-only TRANSFORM** (fix on read) leaves a foot-gun baked into every FITS file; **TRANSFORM at the source** (recommended) makes the writer emit honest, paper-consistent columns with cs_util holding the convention once. + +### 3a. ngmix.py — every size column honest and on the same scale +Keep all `T*` columns unchanged (correct areas). Change the radius columns to true half-light radii: +- **Galaxy:** `r50 = 1.1774·√(T/2)` (= `√(ln2·T)`); error `r50_err = (1.1774/(2√2))·T_err/√T`. *(Currently append the area `pars[4]`/`pars_err[4]`.)* +- **PSF:** multiply existing σ values by `1.17741` so `r50psf`/`r50_psfo_ngmix`/`r50_err_psfo_ngmix` are true half-light radii, not σ. + +After this, **every "r50" column is a length meaning the same thing (1.1774σ)** on both sides, matching UNIONS-3500 I Eq. 1 for round objects. + +### 3b. De-duplicate +`T_psfo_ngmix` duplicates `Tpsf`; `r50_psfo_ngmix` duplicates `r50psf`. Retire the redundant names (or document as aliases). Cleanest end state — one name per quantity: galaxy `{T, T_err, r50, r50_err}`, PSF `{Tpsf, Tpsf_err, r50psf, r50psf_err}`. Also fix the asymmetry: a `T_err_psfo_ngmix` exists but no `Tpsf_err` twin. + +### 3c. cs_util — single source of truth for the size web +Expose thin, correctly-named, tested converters (home: `cs_util`, since `sp_validation/galaxy.py` already does `from cs_util import cfis`, and cs_util is the shared layer): + +``` +T_to_r50(T) = 1.1774 * sqrt(T / 2) # = sqrt(ln2 * T) +r50_to_T(r50) = 2 * (r50 / 1.1774)**2 +sigma_to_fwhm(s) = 2.355 * s # already correct in galaxy.py +T_to_fwhm(T) = 2.355 * sqrt(T / 2) # CORRECTED (see §4) +``` + +--- + +## 4. Migration impact on sp_validation + +**No change for the size-ratio paths.** Galaxy `T` and PSF `Tpsf` enter only as the dimensionless ratio `T/Tpsf` (size cut, `size_ratio` binning) — robust by construction. None of the five `r50*` columns is read in sp_validation `src/` (0 hits), so the mislabel harms no current computation. + +**One genuine, load-bearing fix — the PSF-leakage size regressor.** `fwhm_PSF = T_to_fwhm(T_PSFo)` (built in `notebooks/extract_info.py`) is the third regressor in the scale-dependent PSF-leakage fit (`run_object.py:510`, `size_PSF_col` at `calibration.py:540`). The current `galaxy.py:T_to_fwhm` is `T/1.17741*2.355` and its own `MKDEBUG` comment doubts it ("Why is FWHM not quadratic in T???"). It is **wrong**: `T = 2σ²` is an area, so `FWHM = 2.355·σ = 2.355·√(T/2)` — a √ is required, not a linear factor (`2.355/1.17741 = 2.000` shows it treats T as if already σ). Result: a monotonic-but-nonlinear distortion of the PSF-size axis, biasing any `α(PSF-size)` trend and per-bin leakage coefficients. Spatially-constant leakage unaffected. + +**Migration:** point the `fwhm_PSF` builder at the new corrected PSF radius column via `cs_util.sigma_to_fwhm` (or import the fixed `cs_util.T_to_fwhm`), retire the local `galaxy.py:T_to_fwhm`. This is the change that actually moves a science number; the column rename/transform is hygiene + paper-consistency that also removes the foot-gun. + +**Sequencing:** the catalogue-writer change (§3a/3b) and the sp_validation regressor fix (§4) are independent — sp_validation doesn't read the `r50*` columns today, so fixing ngmix won't break it. cs_util converters should land before sp_validation switches to importing them. + +--- + +## 5. Open questions for Lucy / Martin + +1. **Primary on-disk convention:** standardize on **r50 as primary** (matching UNIONS-3500 I) while keeping `T`? Or keep `T` as the working column with r50 derived? +2. **Drop vs keep duplicates** (`T_psfo_ngmix`≡`Tpsf`, `r50_psfo_ngmix`≡`r50psf`): anything outside sp_validation `src/` (notebooks, older catalogues) bound to the `_psfo_ngmix` names? If so, alias rather than delete. +3. **cs_util as the home** for the converter web — agreed? Naming preference (`T_to_r50` vs `T_to_halflight`)? +4. **PSF-leakage refit:** were current/in-flight scale-dependent leakage results produced with the buggy `T_to_fwhm`, and should they be regenerated once fixed? How much does `α(size)` move? +5. **`r50_err` for the galaxy:** is the error-propagation form acceptable, or prefer the full pars covariance? +6. **Catalogue versioning:** does changing on-disk `r50*` semantics warrant a format version bump / changelog so downstream users know v2.0 pre- vs post-fix differ? diff --git a/.felt/review-ngmix-v2-pr740/weights-report.md b/.felt/review-ngmix-v2-pr740/weights-report.md new file mode 100644 index 000000000..17edea424 --- /dev/null +++ b/.felt/review-ngmix-v2-pr740/weights-report.md @@ -0,0 +1,136 @@ +# ngmix v2.0 weight / inverse-variance handling — decision-ready report + +**Scope:** PR #741 (branch `ngmix_v2.0`), the v2.0 refactor of noise/weight handling in `ngmix.py`, against the v1 baseline (`origin/develop`), and its relationship to Issue #604 (Weight Handling). All file:line refs are in the worktree (`ngmix_v2.0` head + review cleanups). Load-bearing claims verified against source **and confirmed empirically** (see §2a). + +**Headline:** v2.0 introduced two real, test-verifiable regressions in the ngmix weight map (dead noise estimator + lost mask binarization). The minimal v1-restore belongs in #741; the proper inverse-variance fix from #604 is a clean, mostly-config follow-up PR that should land separately. + +--- + +## 1. Precise regression characterization + +The entire live weight path is `prepare_ngmix_weights` (`ngmix.py:871–908`): + +```python +weight_map = np.copy(weight) # 894 FSCALE-rescaled Megapipe 0/1 mask +weight_map[flag != 0] = 0.0 # 895 +sig_noise = sigma_mad(gal) # 897 noise scalar over the WHOLE stamp +... +weight_map *= 1 / sig_noise ** 2 # 906 +``` + +v1 (`origin/develop`, `do_ngmix_metacal`), in order: + +```python +weight_map = np.copy(weights[n_e]) +weight_map[np.where(flags[n_e] != 0)] = 0.0 +weight_map[weight_map != 0] = 1 # binarize to exactly 0/1 <-- DROPPED in v2 +... +sig_noise = get_noise(...) # object-free windowed estimator <-- ORPHANED in v2 +... +weight_map *= 1 / sig_noise**2 +``` + +Two coupled regressions, both verified: + +**R1 — Noise estimator regressed from object-free to whole-stamp.** v1 called the windowed `get_noise` (masks the object via a model-Gaussian window, re-estimates `sigma_mad` on object-free pixels). v2 uses bare `sigma_mad(gal)` over the full, flux-contaminated stamp. `get_noise` still exists (`ngmix.py:826`, body byte-identical to v1) but **has zero call sites in `src/`** — confirmed dead. v2's `sig_noise` is biased high by the galaxy's own flux, in a **size/flux-dependent** way. Dominant numerical effect for today's inputs. + +**R2 — Lost the mask binarization.** v1 collapsed the weight to exactly 0/1 before scaling, so the final weight was a clean uniform `1/sig_noise²` per unmasked pixel. v2 dropped that line, so line 906 multiplies the *raw rescaled mask* by `1/sig_noise²`. + +What `weight` contains: traced from `ngmix_runner.py` input #4 → vignetmaker `RUN_2` → single-exposure weight images = Megapipe weight maps, which Gwyn confirmed are **0/1 masks** (#604). The only transform before line 894 is the FSCALE rescale (`rescale_epoch_fluxes`: `weight * 1/Fscale**2`) — identical to v1. + +Consequences of R2: +- **For today's 0/1 masks:** each unmasked pixel carries `Fscale⁻² / sig_noise²`. FSCALE is applied to the weight *twice* and to the noise *once* → per-pixel ivar uniformly off by `1/Fscale²` per epoch vs v1. A latent scale bug, varies epoch-to-epoch. +- **Latent hazard (dangerous):** if a *real* inverse-variance map is fed in (THELI, or the #604 BACKGROUND_RMS path), v2 multiplies a genuine `1/var_pix` by another `1/sig_noise²` → **variance counted twice**. v1's binarization made the code robust to this; v2 is not. + +### Downstream direction of the effect +`weight_map` → `Observation.weight` = per-pixel inverse-variance in the ngmix χ². +- Inflated `sig_noise` → too-low weight → under-weighted likelihood → **over-estimated** `g_err`/`T_err`/`flux_err`, **under-estimated** `s2n`. Because whole-stamp `sigma_mad` grows with brighter/larger galaxies, this is a **size/flux-dependent miscalibration** — the signature of a multiplicative shear bias `m`, consistent with the prior old-path finding `m ~ -2.8e-2`. +- **Multi-epoch PSF averaging** weights each epoch by `obs.weight.sum()` → now driven by whole-stamp MAD and FSCALE; reported `PSFo` g/T shift accordingly. + +--- + +## 2. Test-verifiability: **YES — clean red→green** + +`make_data` (`simulate.py:91`) injects the exact truth inverse-variance: `weights = im*0 + 1/noise**2`, `flags = 0`. That `1/noise²` is the oracle. A unit-level test on `prepare_ngmix_weights` (no ngmix fit needed — fast, deterministic) asserts the returned weight equals the injected inverse-variance: + +```python +def test_weight_map_recovers_injected_inverse_variance(): + noise = 1e-3 + gals, psfs, _, weights, flags, jacobs = make_data( + rng=np.random.RandomState(123), shear=(0.0, 0.0), + noise=noise, n_epochs=1, img_size=201) + _, weight_map, _ = prepare_ngmix_weights( + gals[0], weights[0], flags[0], np.random.RandomState(0)) + truth_ivar = 1.0 / noise ** 2 + recovered = np.median(weight_map[weight_map > 0]) + npt.assert_allclose(recovered, truth_ivar, rtol=0.10) +``` + +**Coverage gap today:** `test_ngmix.py`'s only weight-path test (`test_metacal_is_reproducible_with_fixed_seed`) asserts `run(42)==run(42)` — a determinism check that passes for *any* deterministic normalization, correct or not. The regression is exercised but never checked against truth. + +**Caveats:** `make_data` injects *homoscedastic* noise — the test catches the double-normalization + scalar bias, but to prove the fix respects *spatially-varying* ivar (#604's point), `make_data` needs a per-pixel RMS option (follow-up, not a blocker). + +### 2a. EMPIRICAL CONFIRMATION (observed, this session) + +Ran the check against the actual code (dev container, galsim+numpy; no ngmix fit needed): + +| noise | truth ivar | sigma_mad(gal) | recovered median | ratio (correct = 1.0) | +|---|---|---|---|---| +| 1e-3 | 1e6 | 1.065e-3 | 8.81e11 | **8.81e5** | +| 1e-4 | 1e8 | 1.167e-4 | 7.35e15 | **7.35e7** | + +The recovered weight is off by ≈`1/noise²` — exactly the R2 double-count (real ivar × another `1/sig_noise²`). `sigma_mad(gal)` runs ~6% above truth noise (R1 flux contamination, small for a compact galaxy in a 201² stamp). The data matches the by-reading analysis to ~6 orders of magnitude. **The regression is real and observed, not inferred.** + +--- + +## 3. Recommended baseline: SExtractor BACKGROUND_RMS — concrete change plan + +The existing machinery is already generic: SExtractor's check-image handler is list-driven and image-agnostic; vignetmaker's ME loop loops over `ME_IMAGE_PATTERN` and reads `get_data(0)` — a single-HDU BACKGROUND_RMS check-image flows through the same path as BACKGROUND with **zero code change**. + +| Step | What | File(s) | Cost | +|---|---|---|---| +| 1 | `CHECKIMAGE = BACKGROUND, BACKGROUND_RMS` | `example/cfis/config_exp_*.ini` | **config only** | +| 2 | Add `background_rms` to vignetmaker `ME_IMAGE_DIR`/`ME_IMAGE_PATTERN` | `config_tile_MiViSmVi.ini` | **config only** | +| 3 | Wire one **optional** input (RMS sqlite) through runner + `Vignet` | `ngmix_runner.py`, `ngmix.py` `Vignet`/`Ngmix.__init__` | small code | +| 4 | **Core:** build per-pixel ivar from RMS, gated on mask | `ngmix.py` `prepare_ngmix_weights` et al. | real work | +| 5 | Keep `sigma_mad`/`get_noise` as **fallback** when no RMS map present | `ngmix.py` (branch on `bkg_rms is not None`) | small code | + +**Core (Step 4) sketch** — Megapipe weight stays the *mask*; RMS supplies the *variance* (the two roles were conflated before): + +```python +weight_map = np.copy(weight) # Megapipe mask (0 where masked) +weight_map[flag != 0] = 0.0 +mask = weight_map != 0 +ivar = np.zeros_like(gal) +ivar[mask] = 1.0 / bkg_rms[mask] ** 2 # per-pixel inverse variance +sig_noise = np.median(bkg_rms[mask]) if mask.any() else sigma_mad(gal) +return gal_masked, ivar, noise_img +``` + +RMS rescaling: BACKGROUND_RMS is in image counts, scales by `Fscale`; `1/rms²` scales by `1/Fscale²` — identical to the existing weight rescale, slots into `rescale_epoch_fluxes`. + +**Effort: S–M, ~4–8 h.** Config + plumbing ~1–2 h; core rewrite + fallback ~2–3 h; example-pipeline before/after on shear/s2n/flags ~2–3 h. + +**Honest read on aguinot's "almost free":** true for the plumbing (steps 1–2 are pure config), undersells the core. Delivering the data is cheap; *using it correctly* (changing what the weight means, getting Fscale + mask roles right, validating no fit regression) is the earned part. + +**Risks:** RMS units/rescaling; zeros/NaNs in RMS at masked/off-image pixels → `1/rms²` blowup (gate on mask, guard `rms<=0`); double-counting the mask; positional input-list shift breaking configs (mitigated by optional-input design). + +--- + +## 4. Scoping: **separate PRs.** + +- **In #741 (now):** the minimal v1-restore — reinstate `weight_map[weight_map != 0] = 1` and call `get_noise(...)` instead of `sigma_mad(gal)` — plus the red→green unit test. Two-line change fixing a regression *this PR introduced*; restores known-good behavior; restoring the binarization makes the fallback robust to real ivar maps again. Does not depend on #604. +- **#604 as its own PR (next):** the BACKGROUND_RMS baseline changes what the weight map *means*, touches new I/O, needs its own before/after validation. A distinct review surface; bundling into #741 would mix "fix the regression I just introduced" with "implement a 3-year-old feature request." + +The red→green test validates both the #741 minimal fix and the #604 baseline; it travels with #741 now and proves out #604 later. + +--- + +## 5. Open questions for Martin (science call) + +1. **Baseline vs THELI.** Confirm in-house SExtractor-BACKGROUND_RMS is the baseline, with THELI weight images slotting in later as another `ME_IMAGE_PATTERN` source (not blocking). +2. **Optional vs required RMS input.** Recommend optional with `sigma_mad`/`get_noise` fallback. Confirm. +3. **#741 minimal fix:** restore the windowed `get_noise` (reintroduces the gal-guess machinery the PR may have deliberately stripped), or accept bare `sigma_mad(gal)` until #604 lands? R1 is the dominant effect, so this matters even short-term. +4. **Does the RMS map feed anything beyond the weight + noise-fill scale** (e.g. background subtraction)? Recommend weight + noise-fill only. +5. **Scope of config edits** for the #604 PR: cfis only, or cfis + cfis_simu + canfar batch? +6. **Validation bar to merge #604:** smoke before/after on the example tile, or a full sim m/c calibration before merge? diff --git a/.felt/review-pr741-fresh-pass/review-pr741-fresh-pass.md b/.felt/review-pr741-fresh-pass/review-pr741-fresh-pass.md new file mode 100644 index 000000000..ba3f2c2a9 --- /dev/null +++ b/.felt/review-pr741-fresh-pass/review-pr741-fresh-pass.md @@ -0,0 +1,21 @@ +--- +id: 01KTT34NW92KPHH9GDDXAGZZ89 +name: 'Fresh-pass review: PR #741 (post-integration)' +tags: + - shapepipe + - ngmix + - review +created-at: 2026-06-11T03:00:58.633703716+02:00 +outcome: 'Fresh full-diff review of #741 at b2dcd793 (empirical, against ngmix 2.4.0): scientific core verified sound (seed-reproducibility holds; injected shear recovered to 0.2% with response correction; r50/ivar-weights math checks out) but two empirically demonstrated blockers found in the v2.0 rewrite''s robustness shell — TILE_ID metadata key truncates the post-process CCD scan (~80% of epochs silently lost to multi-epoch bookkeeping) and failed galaxy fits crash the whole tile at save (KeyError, v1''s per-object BootGalFailure contract lost). Plus: one failed PSF epoch kills the object despite ignore_failed_psf; mcal_flags hardcoded 0 (dead quality column); cfis_simu configs broken against the positional-WCS contract. Science note: with psf=''fitgauss'' the reconvolved metacal PSF is round by construction, so g1/g2_psfo are identically zero — PSF-leakage regressions on them are degenerate. Fixes + review part 3 posted to the PR same night.' +--- + +Second-look review requested by Cail after the constitution-sweep integration round, run by a fresh agent with no investment in the branch. Full findings live in PR #741's "Review — part 3" comment; this fiber is the pointer plus what matters beyond the PR. + +## Why the blockers hid + +Both blockers survive a green unit-test suite and a one-tile smoke run with easy objects: the TILE_ID truncation only bites multi-CCD bookkeeping breadth (the first 8 CCDs still work), and the compile_results KeyError needs a hard-to-fit object (easy smoke objects all converge). Lesson for the test suite: structural tests catch import/config rot; these needed *adversarial empirical* tests — forced-failure objects, metadata-polluted sqlite fixtures. The regression tests added with the fixes are exactly that shape. + +## Science threads left deliberately open + +- **fixnoise homoscedasticity** (#604 thread): metacal noise-symmetrization currently draws at median(bkg_rms) while weights are per-pixel — second-order within a 51px stamp, one-line to change, Martin's call. +- **g1/g2_psfo ≡ 0 under psf='fitgauss'** (*_psfo thread): keep/rename/drop decision affects any downstream PSF-leakage regression; relevant to the tutorial. diff --git a/.felt/shapepipe.md b/.felt/shapepipe.md index 044d7a3b1..502e65d3e 100644 --- a/.felt/shapepipe.md +++ b/.felt/shapepipe.md @@ -1,40 +1,54 @@ --- -name: ShapePipe — project knowledge & active threads +id: 01KTCHX00NMQ1VDPGXRYJ6RGZR +name: ShapePipe maintenance & PRs tags: - shapepipe + - portolan created-at: 2026-04-27T11:26:38.71538657+02:00 -outcome: 'Root of ShapePipe''s felt store: the stack division, repo conventions, and the why behind in-flight infra/cleanup threads.' +outcome: 'Root: collaboration with Martin on ShapePipe — PRs, infra, future ngmix and Fabian work' --- -This is the root of ShapePipe's felt store — shared notes on architecture -decisions, conventions, and in-flight work, for the team and AI agents alike. -ShapePipe is the UNIONS galaxy shape-measurement pipeline; `CLAUDE.md` covers the -build / container / CI overview, and the fibers here carry the *why*. Start here, -then follow the links. +ShapePipe is the UNIONS shape-measurement pipeline. I'm not the primary +maintainer (that's Martin Kilbinger); my role is collaborator helping +clean up infra, surface bugs, and keep the merge queue moving while +Martin focuses on science threads. -## Stack division +## Working agreement with Martin -ShapePipe **produces** shear catalogues; `sp_validation` / `cosmo_val` -**consume** and validate them; `cs_util` holds code shared across both. A concern -about *validating* catalogues belongs downstream, not in ShapePipe. +Surfaced over a 2026-04-27 walking conversation. Captured in +[[shapepipe/prs-in-flight]] and the per-thread fibers below. + +- I review and patch his PRs; he reviews mine. Bugs found during review + go to a dedicated PR rather than getting bundled into his feature + branch (per `feedback_separate_infra_prs`). +- v2.0 was merged fast (it was ready). The skaha base it brought in is + the active source of pain → see [[shapepipe/docker-uv-revert]]. +- I file the issues; Claude usually drafts the PRs in my voice. + Disclosure on Claude-only review per + `feedback_claude_only_review_disclosure`. + +## Active threads (refreshed 2026-06-10) + +- **PR #737** (MPI on candide + containerized SLURM scripts) — **MERGED 2026-06-10** (rebased onto develop, both CI runs green). +- **PR #738** (versioned docs + switcher) — **MERGED 2026-06-10** after independent review; sfarrens offered post-hoc comments. Stable root refreshes on next master push. See [[docs-versioning]]. +- **PR #739** (machine-specific cluster docs tree) — awaiting Martin's review; check rebase state after the #737/#738 merges. See [[docs-cluster-tree]]. +- **PR #741** (ngmix v2.0, CI mirror of #740) — Martin left 10 inline comments Jun 5, no verdict yet; shear recovery verified unbiased to ~1e-4 in m. See [[ngmix-weights-ivar]] (PR-ready regression fix + #604 ivar plan) and [[ngmix-size-columns]] (honest r50 spec). +- **Martin's PRs #704 (contributors) & #699 (coverage mask)** — Cail's review requested; both conflicting, need Martin's rebase first. +- **[[ci-green-on-develop]]** — conda fully removed (#733 merged); remaining follow-ups tracked there. + +## Earlier threads (superseded) + +- [[shapepipe/docker-uv-revert]] / [[shapepipe/prs-in-flight]] — the #719-era conda removal; landed via #733, current state in [[ci-green-on-develop]]. +- **[[shapepipe/ngmix-update]]** — became the #740/#741 ngmix v2.0 review thread. +- **[[shapepipe/fabian-coord-bug]]** — port Fabian's 1-line coord propagation fix; still pending his image-sim code reaching github. ## Conventions specific to this repo -- **Rho-statistics are obsolete inside ShapePipe.** PSF-systematics validation - moved downstream to `sp_validation` / `cosmo_val` (via `shear_psf_leakage`); - the stile/treecorr rho code was removed in #715. But the **meanshapes / - ellipticity focal-plane plots** (`mccd_plots_runner`) are *deliberately kept* — - they are a general PSF/star-catalogue diagnostic, not rho-stats, and feed - catalogue-paper figures. Don't delete that path along with rho-stats; see - [[shapepipe/cleanup-rhostats-jobscripts]] for where the boundary actually sits. -- Run the pipeline through the container; use `python3.12` explicitly inside it. -- **ngmix** is pinned to a fork branch until fixes land upstream — don't bump - that dependency line. [[ngmix-update]] tracks the path back to upstream. - -## Active threads - -- **[[shapepipe/ci-green-on-develop]]** / **[[shapepipe/test-suite]]** — a - tiered, in-image test suite and trustworthy CI on `develop`. -- **[[docker-uv-revert]]** — slim Python base + uv lockfile, dropping conda. -- **[[shapepipe/mpi-hybrid]]** — running hybrid MPI through the container on candide. -- **[[ngmix-update]]** — replacing the pinned ngmix fork with upstream. +- Container runs through `app` (apptainer wrapper); use `python3.12` + inside the shapepipe container (see `reference_containers`). +- ShapePipe produces; `sp_validation` consumes; `cs_util` is shared (see + `project_stack_division`). +- Rho stats are obsolete here — sp_validation/cosmo_val took over (see + `project_rho_stats_obsolete`). +- Royal "we" in PR/issue voice; specific findings attributed to Claude + by name (see `feedback_writing_voice_on_cails_behalf`). diff --git a/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md b/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md index 629d6c23d..2c5e4ebec 100644 --- a/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md +++ b/.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md @@ -1,4 +1,5 @@ --- +id: 01KTCHWZX8GRCMJHGCRARBGFS8 name: CI silently broken on develop; install_shapepipe conda.sh lookup status: closed tags: diff --git a/.felt/shapepipe/ci-green-on-develop/ci-green-on-develop.md b/.felt/shapepipe/ci-green-on-develop/ci-green-on-develop.md index 3b304539f..3ae6c6bc3 100644 --- a/.felt/shapepipe/ci-green-on-develop/ci-green-on-develop.md +++ b/.felt/shapepipe/ci-green-on-develop/ci-green-on-develop.md @@ -1,4 +1,5 @@ --- +id: 01KTCHWZX9Q1MG2FB20N5Y52TD name: 'ShapePipe CI: green & trustworthy test suite on develop' status: open tags: @@ -6,7 +7,7 @@ tags: - ci - constitution created-at: 2026-05-29T12:34:31.800189806+02:00 -outcome: 'Conda removed from ShapePipe (#733 merged); develop is green including image publish. The container (slim Python 3.12 + apt + uv lockfile) is the single source of truth and CI runs pytest inside the dev image. ci-release.yml + doc-tests.yml + install_shapepipe + environment*.yml deleted; cd.yml builds docs in the image; install docs rewritten. The conda path had rotted into a multi-layer install failure — deleted rather than repaired, so the production astropy bug is moot. Remaining for a clean/tight/shared dev environment: (1) commit a shared .felt in the repo + curate which fibers go public; (2) make repo CLAUDE.md a real onboarding doc; (3) port cluster job scripts off conda; (4) #729 dependabot actions (recreating); (5) verify docs deploy on master; (6) strengthen the thin test suite (#708).' +outcome: 'Near-realized. Ground truth 2026-06-11: develop green incl. image publish; #708 closed (test scaffolding landed on develop: 5 structural test files + property/synthetic-FITS tiers); candide PBS scripts modernized to SLURM+apptainer (#737); CLAUDE.md already a solid onboarding doc (minor gaps: CONTRIBUTING pointer, first-PR walkthrough); docs deploy infra correct post-#738 — /latest/ live, switcher published, ROOT still serves 2022 v1.0.1 until the next master push (the one remaining trigger). Conda survives only in out-of-scope CANFAR/CC-IN2P3 scripts (job_sp_canfar.bash + init_run_exclusive_canfar.sh have ACTIVE conda paths; cc_{mpi,smp}.sh use ccenv anaconda) — needs its own cluster-aware pass to fully realize ''no conda anywhere''.' --- ## Desired State diff --git a/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md b/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md index 4083ae1b8..a7fda558f 100644 --- a/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md +++ b/.felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md @@ -1,4 +1,5 @@ --- +id: 01KTCHWZXE6Z6MG1P21F804RZ2 name: 'ShapePipe cleanup: remove obsolete rho-stats/stile; modernize candide job scripts' status: closed tags: @@ -6,27 +7,25 @@ tags: - cleanup - constitution created-at: 2026-05-30T21:45:50.977369486+02:00 -closed-at: 2026-05-31T12:53:30.382233194+02:00 +closed-at: 2026-05-30T22:30:55.745891097+02:00 outcome: |- - Resolved as one shipped PR + one corrected mis-scope. - - D1 (rho-stats removal) was a STALE PREMISE: the rho-stats/stile/treecorr code was - already surgically removed from develop in #715 (merged 2026-04-23). What remained - in `mccd_plots_runner.py` / `mccd_plot_utilities.py` is pure meanshapes/ellipticity - plotting — NOT rho-stats — and Martin explicitly asked to keep it on #715 ("Let's - keep meanshapes, this is very useful... can be run on merged star and PSF catalogues"). - PR #736 was opened then CLOSED (not merged): deleting meanshapes would contradict - Martin and risk a catalogue-paper figure path. `stile` was already gone everywhere. - Lesson: verify the premise against current develop before cutting the branch. - - D2 (candide PBS scripts) SHIPPED as PR #737 — OPEN, CI green, mergeable, awaiting - Martin's review. candide_smp.sh / candide_mpi.sh now run via `apptainer exec` against - ghcr.io/cosmostat/shapepipe:develop-runtime (no conda); host-clone bind-mounted at the - same path so $SPDIR-relative configs resolve identically in/out of container; MPI uses - the hybrid host-mpiexec pattern. Tested on c03=candide: SMP runs the example pipeline - end-to-end with 0 errors; MPI hybrid needs a real multi-node allocation to verify e2e. - canfar + ccin2p3 scripts deliberately untouched (different clusters, can't verify here) - and noted in the PR. Also fixed a stale config path and propagated the real exit code. + Mixed. PR #737 (candide scripts) good and open; PR #736 (rho-stats) CLOSED as + an over-cut. The constitution's premise was stale: it treated mccd_plots_runner + as "the rho-stats path," but the authorized rho-stats removal (Martin #657 + "option 1, delete") had ALREADY shipped in #715 (merged 2026-04-23). By develop, + mccd_plots_runner/mccd_plot_utilities held NO rho/stile/treecorr — pure + meanshapes (plot_meanshapes: focal-plane ellipticities/sizes/residuals + 1D + histograms). On #715 Martin EXPLICITLY said keep meanshapes ("very useful... + maybe Fabian's 2D plots in the catalogue paper"). #736 deleted exactly that path + → net-zero rho cleanup, entirely over-reach against Martin's instruction. Closed + with explanation (2026-05-31), not trimmed — nothing left to salvage. + (2) PR #737 stands: candide_smp.sh / candide_mpi.sh run via apptainer + runtime + image, no conda; SMP verified end-to-end on c03 (0 errors), MPI hybrid pattern + written but needs a real allocation to verify. canfar + ccin2p3 left untouched. + Scope findings that held up: `stile` already vestigial; `random_cat` correctly + KEPT (general LSS random generator, not rho-stats). LESSON: check what prior + merged PRs already did before acting on a cleanup constitution — verify the + premise against current `develop`, not the constitution's snapshot. shuttle: enabled: true kind: oneshot @@ -34,9 +33,10 @@ shuttle: project_dir: /automnt/n17data/cdaley/unions/shapepipe agent: claude-opus session: - id: f1758ecc-bf5f-452c-9f92-6393adebe65e + id: 7c02b521-17bd-44dc-ad10-709520d6d2bf agent: claude-opus - dispatched_at: 2026-05-31T10:51:28.745315935Z + dispatched_at: 2026-05-30T20:29:24.913621662Z +tempered: true --- ## Desired State @@ -137,7 +137,30 @@ green on each, neither merged, canfar untouched-and-noted. don't touch); the broader test-suite work (separate, done); any scientific-algorithm change. -## Open Questions - -- Is `random_cat` truly rho-stats-only, or does any non-rho config/use depend on - it? Confirm before deleting it (vs. just `mccd_plots_runner`). +## Resolution + +**`random_cat` is NOT rho-stats-only — kept.** Empirically: `random_cat_package` +generates a random catalogue of points within the survey mask (healpix output, +`config_Rc.ini`, authored by Martin). That is a general clustering / LSS tool +(Landy-Szalay-style randoms), independent of PSF systematics. Rho statistics are +auto/cross-correlations of PSF-ellipticity residuals at *star* positions and need +no random catalogue. So the conditional in Deliverable 1 ("remove after confirming +random_cat is rho-only") failed its precondition → random_cat stays. Removing it +would overreach what Martin called obsolete; flagged in PR #736 for a follow-up if +he does want it gone. + +**`stile` was already vestigial.** Zero references in `src/`, `example/`, `docs/`, +`pyproject.toml`, `uv.lock` — never a declared dep, never imported. Nothing to +remove; the absence is noted in PR #736 so it isn't read as an oversight. + +**The "rho-stats path" was really PSF-diagnostic plotting.** `mccd_plot_utilities` +imports only matplotlib/mccd/numpy/astropy — it computes mean shapes and ellipticity +histograms, no `treecorr`/`stile` correlation functions. The docstring's "rho +statistics plot" was aspirational. Maps cleanly to Martin's "rho stats within +shapepipe" = the PSF-leakage diagnostics now owned by `shear_psf_leakage`. + +**MPI verification gap.** `candide_smp.sh` is verified end-to-end through the +container on c03. `candide_mpi.sh` uses the documented hybrid Apptainer pattern +(host `mpiexec` launches container ranks) but can't be verified on a login node — +`mpiexec` inside the container hangs without a PMIx allocation. ABI note for the +reviewer: image ships OpenMPI 4.1.4, candide modules are now OpenMPI 5.0.x. diff --git a/.felt/shapepipe/docs-cluster-tree/docs-cluster-tree.md b/.felt/shapepipe/docs-cluster-tree/docs-cluster-tree.md index c387c0980..bf05714e0 100644 --- a/.felt/shapepipe/docs-cluster-tree/docs-cluster-tree.md +++ b/.felt/shapepipe/docs-cluster-tree/docs-cluster-tree.md @@ -1,4 +1,5 @@ --- +id: 01KTCHWZY2NPWM4RQ7VAZK7WFE name: Machine-specific cluster docs tree + freshness pass tags: - shapepipe diff --git a/.felt/shapepipe/docs-versioning/docs-versioning.md b/.felt/shapepipe/docs-versioning/docs-versioning.md index 7246d684a..18909ea5b 100644 --- a/.felt/shapepipe/docs-versioning/docs-versioning.md +++ b/.felt/shapepipe/docs-versioning/docs-versioning.md @@ -1,11 +1,14 @@ --- +id: 01KTCHWZYMXBPT7S0NXJ7AT2CD name: Versioned docs site with a version switcher +status: closed tags: - shapepipe - docs - ci created-at: 2026-05-31T20:39:06.459699229+02:00 -outcome: 'PR #738 (off develop): docs site now versioned — stable@root (non-destructive to existing URLs), develop@/latest/, tags@/vX.Y.Z/, switcher.json drives the dropdown. Fixes that the site published only from master (~78 commits behind) so container.md/canfar.md were never published. CI green incl. real dev-image build + PR-preview artifact.' +closed-at: 2026-06-10T17:44:55.028992495+02:00 +outcome: 'Shipped: PR #738 merged 2026-06-10 (independent review pass, sfarrens offered post-hoc comments). master→root, develop→/latest/, tags→//, per-ref builds, switcher.json at root. Stable root refreshes on next master push.' --- ## Why diff --git a/.felt/shapepipe/ngmix-size-columns/ngmix-size-columns.md b/.felt/shapepipe/ngmix-size-columns/ngmix-size-columns.md new file mode 100644 index 000000000..3f5fd7c22 --- /dev/null +++ b/.felt/shapepipe/ngmix-size-columns/ngmix-size-columns.md @@ -0,0 +1,76 @@ +--- +id: 01KTCQPE4M6S1E4V9WMPCAFCKT +name: 'ngmix size columns: honest r50 + cs_util converter web' +status: closed +tags: + - constitution + - shapepipe + - ngmix +created-at: 2026-06-05T22:30:50.004535516+02:00 +outcome: 'Delivered end-to-end and FULLY MERGED 2026-06-11: cs_util#65-67 (v0.2.1 on PyPI with cs_util.size), shapepipe#743 (honest r50, merged into ngmix_v2.0), shear_psf_leakage#23 (cs-util cap relaxation), sp_validation#198 (T_to_fwhm fix + galaxy smoke test) — the whole dependency chain closed in one night. The T_to_fwhm fix moves α(size) leakage results; regenerate where it mattered. Deferred (stated in #743): wiring honest r50/r50psf into make_cat so paper-convention r_h reaches the science catalogue.' +shuttle: + kind: oneshot + host: candide + agent: claude-fable +--- + +# ngmix size columns: honest r50 + cs_util converter web + +Spun out of the [[review-ngmix-v2-pr740]] review. The full investigation — code ground-truth table, +the UNIONS-paper convention, the sp_validation consumption map, and the bonus downstream bug — lives in +**`.felt/review-ngmix-v2-pr740/size-report.md`**. Read it first; it is the detailed spec. + +The v2.0 ngmix rewrite added five `r50*` columns, **none of which is a half-light radius**: galaxy +`r50` = `pars[4]` = `T` = 2σ² (an *area*, a bit-for-bit copy of the `T` column); PSF `r50psf` = +`√(T/2)` = σ (a length, but off by the `1.1774` factor). So the same name root means an area on the +galaxy side and a length on the PSF side. The official catalogue paper (**UNIONS-3500 WL I**, +arXiv:2605.13549) reports the half-light radius `r_h` (= r50) as the **primary** size for both galaxy +and PSF, with `T` derived. Cail approved fixing this at the source. + +## Desired State + +**1. ngmix writer emits honest, self-consistent size columns** (`ngmix.py`, on `ngmix_v2.0`): +- Galaxy `r50 = 1.1774·√(T/2)` (= `√(ln2·T)`) with propagated error `r50_err = (1.1774/(2√2))·T_err/√T` + — replacing the current raw `pars[4]`/`pars_err[4]`. +- PSF `r50psf`/`r50_psfo_ngmix`/`r50_err_psfo_ngmix` = existing σ values × `1.17741` (true half-light + radii, not σ). +- Keep all `T*` columns as the (correctly-named) DES areas. After this, every "r50" column is a length + meaning the same thing on both sides, matching the paper. +- **De-duplicate:** `T_psfo_ngmix` ≡ `Tpsf` and `r50_psfo_ngmix` ≡ `r50psf`. Default = retire the + redundant names; fall back to documenting them as aliases if anything outside `sp_validation/src` + (notebooks, older catalogues) depends on the `_psfo_ngmix` names. Fix the `T_err_psfo_ngmix`-with-no- + `Tpsf_err` asymmetry while here. + +**2. `cs_util` holds the conversion web** — the single source of truth, since `sp_validation/galaxy.py` +already does `from cs_util import cfis` and cs_util is the shared producer/consumer layer: +- `T_to_r50(T) = 1.1774·√(T/2)`, `r50_to_T(r50) = 2·(r50/1.1774)²`, `sigma_to_fwhm(s) = 2.355·s`, + and the **corrected** `T_to_fwhm(T) = 2.355·√(T/2)`. With tests. + +**3. Fix the downstream PSF-leakage bug in sp_validation.** `galaxy.py:T_to_fwhm` is `T/1.17741*2.355` +— dimensionally wrong (treats the area `T` as σ; `2.355/1.17741 = 2.000`). It builds the `fwhm_PSF` +regressor for the scale-dependent PSF-leakage fit (`run_object.py:510`), so it distorts the PSF-size +axis and biases `α(size)`. Switch it to `cs_util.sigma_to_fwhm` on the corrected PSF size (or import +the fixed `cs_util.T_to_fwhm`) and retire the local function. **This is the change that actually moves +a science number** — flag it prominently for Cail (it touches the pure_eb / leakage analysis). + +**Scoping:** the ngmix-writer change (1) and the sp_validation fix (3) are independent — sp_validation +reads none of the `r50*` columns today, so fixing ngmix won't break it. Land cs_util (2) before +sp_validation (3) switches to importing it. **Merge/PR-creation stay Cail's gesture**; surface the open +calls (r50-as-primary convention, drop-vs-alias, catalogue version bump, whether in-flight leakage +results need regenerating) in this fiber's report for Cail's eventual reply — **do not post to Martin.** + +## Context + +- **Detailed spec:** `.felt/review-ngmix-v2-pr740/size-report.md` (ground-truth table, paper citations, + consumption map, the 6 open questions). +- **Repos (all on candide):** shapepipe `/automnt/n17data/cdaley/unions/shapepipe` (branch `ngmix_v2.0`); + sp_validation `/automnt/n17data/cdaley/unions/pure_eb/code/sp_validation/src/sp_validation/` + (`galaxy.py`, `calibration.py`, `run_object.py`, `notebooks/extract_info.py`); cs_util (shared — + locate the checkout; `sp_validation/galaxy.py` imports it). Cross-repo, so likely cross-PR. +- **Coordinate with [[ngmix-weights-ivar]]:** both edit `ngmix.py` but different functions (size-column + writer vs `prepare_ngmix_weights`) — low conflict; each on its own branch off `ngmix_v2.0` (after the + review-cleanup commit `bd60dc8e` lands). +- **Convention:** UNIONS-3500 WL I (arXiv:2605.13549) → `r_h`/r50 primary, `T = 2σ²` derived (Eq. 1). + Arithmetic verified: `√(2·ln2) = 1.17741`. +- **Agent:** Cail's call — the cs_util/sp_validation API + naming is taste-y (leans Claude); the ngmix + column edits are mechanical. diff --git a/.felt/shapepipe/ngmix-size-columns/report.html b/.felt/shapepipe/ngmix-size-columns/report.html new file mode 100644 index 000000000..f1004ac28 --- /dev/null +++ b/.felt/shapepipe/ngmix-size-columns/report.html @@ -0,0 +1,207 @@ + + + + + +Report — ngmix-size-columns + + + + + + +
+ +
+ + 2026 · 06 · 10 + · + shapepipe/ngmix-size-columns +
+ +

Honest r50 columns, a shared size web, and the leakage-axis fix.

+ +
+ +
+
§ Current state
+

All three branches landed and fresh-eyes reviewed; PR creation is yours.

+ +

Every piece of the constitution's desired state is implemented, tested, and pushed. + A second session independently reviewed all three diffs: re-derived both error propagations + (galaxy r50·T_err/2T and PSF √(2ln2)·T_err/(4σ) are the same formula — + consistent), confirmed zero remaining consumers of the retired _psfo_ngmix size + columns (only make_cat, repointed), confirmed the dropped pixel_size + kwarg on sigma_to_fwhm has no callers, re-ran the tests (shapepipe size tests pass + in the dev container; cs_util 24/24; T_to_fwhm(2.0) → 2.35482 reproduced in the + sp_validation container against the branch), and verified all three push SHAs. No code defects + found. Three branches across three repos, in dependency order:

+ + + + + + + + + + + + + + + + + + +
RepoBranchWhat it does
CosmoStat/shapepipefix/ngmix-size-columns
(off ngmix_v2.0, commit d6531b1b)
Galaxy r50 = √(ln2·T) with propagated error; PSF r50psf now a true + half-light radius (×1.17741); duplicates T_psfo_ngmix/r50_psfo_ngmix/their + errors retired; Tpsf_err and r50psf_err added; make_cat + repointed at Tpsf (its output column names unchanged → zero downstream breakage). + Two new tests pin the semantics.
cailmdaley/cs_util (fork)feat/size-conversions
(off develop, commit fc9580b)
New cs_util/size.py: the conversion web (T ↔ σ ↔ r50 ↔ FWHM) with exact constants + and 6 tests (closed forms, round trips, FWHM = 2·r50). Full suite 24/24 green. + On my fork — you have no push rights on CosmoStat/cs_util.
CosmoStat/sp_validationfix/psf-leakage-fwhm
(off develop, commit bfac233)
Retires the dimensionally wrong local T_to_fwhm; galaxy.py re-exports + T_to_fwhm/sigma_to_fwhm from cs_util.size, so all + star-import consumers (extract_info.py, the notebooks' size_to_fwhm + dispatch) keep working unchanged. Verified in the sp_validation container: + T_to_fwhm(2.0) → 2.35482 (old code gave 4.0003).
+ +
+ The science-number change: the sp_validation fix corrects the + fwhm_PSF regressor feeding the scale-dependent PSF-leakage fit + (run_object.py:510, size_PSF_col). Any α(PSF-size) trend or per-bin + leakage coefficient computed with the old function is biased by a monotonic-but-nonlinear + distortion of the size axis. Spatially-constant leakage is unaffected. In-flight pure_eb / + leakage results produced with the old code should be regenerated after this merges. +
+ +

Merge order: cs_util first (sp_validation's import fails loudly without it — + intentional fail-fast), then sp_validation; the shapepipe branch is independent and can merge any + time. No PRs opened — that's your gesture. cs_util needs a release (or a git-ref pin in + sp_validation) before the sp_validation branch is safe to merge.

+ +
+ CI will not police the merge order. sp_validation's CI is already green on the + pushed branch (run 27248208943) — but vacuously: the image installs released + cs-util 0.1.9 (no size module), and the test suite never imports + galaxy.py (tests cover cosmo_val/cosmology/survey/plots + only). Merging sp_validation before the cs_util release would pass CI and then raise + ImportError at runtime, in extract_info.py / the notebooks. The + ordering is enforced by discipline, not automation. +
+
+ +
+ +
+
§ Findings
+

What surfaced beyond the spec.

+ +

The old PSF r50 error was also a factor-2 over-estimate

+

Beyond the missing 1.1774 factor, r50_err_PSFo = T_err/(2σ) was wrong as an error + on σ: for σ = √(T/2), dσ/dT = 1/(4σ), not 1/(2σ). (The size-report table had marked this + "correct error-prop of σ" — it wasn't.) Both verified by finite differences; the new code carries + r50_err = √(2ln2)·T_err/(4σ).

+ +

Remaining duplicates, deliberately left

+

g1/g2_psfo_ngmixg1/g2_psf in the ngmix output (both read from + g_PSFo) — same disease, different organ; out of this fiber's size scope. Downstream, + make_cat writes both NGMIX_Tpsf_* and NGMIX_T_PSFo_* from the + now-identical source. Candidates for a follow-up dedupe pass.

+ +

shapepipe keeps a local constant, for now

+

ngmix.py defines SIGMA_TO_R50 = √(2 ln 2) locally rather than importing + cs_util.size — shapepipe pins released cs_util (>=0.1.9), and making the + branch depend on an unreleased fork would block its CI. Once cs_util ships the size module, a + one-line follow-up can switch the import.

+ +

Exact constants over truncated literals

+

cs_util uses √(2ln2) / 2√(2ln2) exactly; sp_validation's old 2.355 and + 1.17741 literals are retired with the functions. FWHM values shift at the 4th decimal + (2.355 → 2.35482) — cosmetically visible if anyone diffs old catalogs against new.

+ +

Pre-existing test failure, not mine

+

test_metacal_is_reproducible_with_fixed_seed fails in the local + shapepipe-dev container — reproduced by the review session on clean + ngmix_v2.0 with the identical error (ngmix.fitting has no + Fitter: the local container carries an older ngmix than the branch expects). The + other tests, including the two new size tests, pass against the branch code. Note shapepipe's + deploy-image.yml triggers only on PRs/pushes targeting + develop/main/master — so a PR of this branch into + ngmix_v2.0 won't run CI by itself; the suite gets arbitrated when the change rides + PR #740's chain toward develop.

+
+ +
+ +
+
§ Open questions
+

Your calls, none blocking.

+ +
    +
  1. Drop-vs-alias — resolved in favor of drop, verify you agree. I searched + sp_validation (src, notebooks, .ipynb) and all of shapepipe: the only consumer of the size + _psfo_ngmix names was make_cat.py, now repointed. Zero external hits, so + I dropped rather than aliased, per the constitution's default.
  2. +
  3. Catalogue version bump: on-disk r50* semantics change between + v2.0 pre-/post-fix files. Does this warrant a format-version note/changelog entry, and where?
  4. +
  5. Leakage regeneration: were any current/in-flight scale-dependent leakage + results produced with the buggy T_to_fwhm? How much does α(size) move? (Worth a + before/after on one catalog when convenient.)
  6. +
  7. cs_util path to merge: branch lives on the cailmdaley/cs_util fork. + PR to CosmoStat/cs_util + release (0.3?), then bump pins in sp_validation (and optionally switch + shapepipe's local constant to the import).
  8. +
  9. Galaxy r50_err form: implemented as first-order propagation from + T_err (r50_err = r50·T_err/2T), per spec. If you'd rather use the full + pars covariance, that's a writer-side swap.
  10. +
+
+ +
+ + diff --git a/.felt/shapepipe/ngmix-weights-ivar/ngmix-weights-ivar.md b/.felt/shapepipe/ngmix-weights-ivar/ngmix-weights-ivar.md new file mode 100644 index 000000000..bd72a223b --- /dev/null +++ b/.felt/shapepipe/ngmix-weights-ivar/ngmix-weights-ivar.md @@ -0,0 +1,124 @@ +--- +id: 01KTCQPE3JGEYN7NQS8HW1AT6B +name: 'ngmix weight map: fix v2.0 regressions + inverse-variance (#604)' +status: closed +tags: + - constitution + - shapepipe + - ngmix +created-at: 2026-06-05T22:30:49.970813955+02:00 +closed-at: 2026-06-11T02:37:33.556855912+02:00 +outcome: 'Realized, fresh-eyes reviewed (no blockers), and delivered to PR #741 (CosmoStat/shapepipe, ngmix_v2.0): per-pixel 1/RMS² inverse-variance weights gated by weight/flag/valid-RMS masks with scalar sigma_mad fallback — feat f466c987, mask binarization 6ddea5b2, NaN-edge guard 4aa3b2a1, integration test pinning spatially-varying RMS through the full observation chain 158986a4, config wiring 7a44abda + all-or-nothing semantics documented 0bc6016e. Martin''s #604 weight-map thread answered with the commits. BKG_RMS_VIGNET_PATH is all-or-nothing per tile (missing file → FileNotFoundError; fallback only when option absent). Container gotcha: test from a scratch-wf worktree needs --bind /automnt/n17data and /automnt-prefixed paths (plain /n17data symlink invisible inside the sandbox); dev sandbox ships ngmix 1.3.6 so test_metacal_is_reproducible_with_fixed_seed only passes in CI''s v2 image.' +shuttle: + enabled: true + kind: oneshot + host: candide + agent: codex +--- + +# ngmix weight map: fix v2.0 regressions + inverse-variance (#604) + +Spun out of the [[review-ngmix-v2-pr740]] review. The full investigation, with file:line +evidence, the empirical confirmation, the concrete change plan, effort, and risks, lives in +**`.felt/review-ngmix-v2-pr740/weights-report.md`** — read it first; it is the detailed spec. + +The v2.0 ngmix rewrite (PR #741, branch `ngmix_v2.0`) introduced two coupled regressions in +`prepare_ngmix_weights` (`ngmix.py:871`), **empirically confirmed** this session (fed `make_data`'s +truth inverse-variance `1/noise²` → recovered `8.8e11`, off by ≈`1/noise²`): + +- **R1** — noise estimate regressed from v1's object-free windowed `get_noise` (still present at + `:826`, now **dead**) to flux-contaminated whole-stamp `sigma_mad(gal)`. Size/flux-dependent bias + on the likelihood weighting → biased `g_err`/`T_err`/`s2n`; the fingerprint of a multiplicative + shear bias (cf. old-path `m≈-2.8e-2`). Dominant today. +- **R2** — lost the v1 binarization `weight_map[weight_map != 0] = 1`, so the raw weight is multiplied + by `1/sig_noise²`: a per-epoch `1/Fscale²` error today, and a latent double-count the moment a real + inverse-variance map is fed in. + +Neither v1 nor v2 is *correct* in absolute terms — both approximate per-pixel inverse-variance with a +per-stamp scalar on a 0/1 mask. #604 ("Weight Handling", open since 2023) is the real fix: feed ngmix +genuine inverse-variance maps. + +## Desired State + +**Two PRs, in this order:** + +1. **Minimal regression fix on `ngmix_v2.0` (#741).** Restore the v1 binarization (R2) and add a + **red→green unit test** on `prepare_ngmix_weights` using `make_data`'s truth ivar (the test sketch + is in the report; it needs no ngmix fit, so it's fast and deterministic). This is a clean, small + change that removes the double-count hazard. + - **Open call (default = defer R1):** restoring the windowed `get_noise` for R1 reintroduces the + gal-guess machinery this PR may have deliberately stripped. Default to fixing R2 + the test now, + and let the proper inverse-variance path (PR 2) retire whole-stamp `sigma_mad` wholesale — which + resolves R1 without reviving the old machinery. If R1 proves to matter for the interim, surface + it; **do not block on it, and do not post to Martin** — these science calls go into this fiber's + report for Cail to fold into the eventual #741 reply. + +2. **SExtractor `BACKGROUND_RMS` baseline — separate PR, closes #604.** The in-house path (THELI + weights are an external product and slot in later as another `ME_IMAGE_PATTERN` source). Follow the + report's change plan: `CHECKIMAGE = BACKGROUND, BACKGROUND_RMS` (config), cut the RMS stamp via + vignetmaker's existing list-driven ME loop (config), wire one **optional** RMS input through + `ngmix_runner`/`Vignet`, and rewrite `prepare_ngmix_weights` to build per-pixel `1/RMS²` gated on + the mask, with `sigma_mad`/`get_noise` retained as the **fallback** when no RMS map is present. + Validate with a before/after on the example tile (shear/s2n/flags). Effort ~4–8h. + +**Quality bar:** every claim verified by running, not inferring (dev image + ngmix 2.4.0 are ready — +see Context). The red→green test must actually go red on current code and green after. **Merge and +PR-creation stay Cail's gesture** — prepare branches, commits, and a PR-ready description; surface the +science decisions in this fiber's `report.html`/outcome rather than pushing to Martin. + +## Context + +- **Detailed spec:** `.felt/review-ngmix-v2-pr740/weights-report.md` (change plan, file list, risks, + the 6 science questions for Martin). This constitution is the pointer; the report is the substance. +- **R2 checkpoint:** `/tmp/pr740-wt` commit `953a52a3` restores + `weight_map[weight_map != 0] = 1` in `prepare_ngmix_weights` and adds + `test_weight_map_recovers_injected_inverse_variance`. Observed red before the fix: + recovered `8.80916e11` vs truth `1e6`; observed green after the fix: recovered + `8.80916e5` vs truth `1e6` with `rtol=0.15`. The tolerance is intentionally loose + enough to leave R1 visible: whole-stamp `sigma_mad(gal)` still overestimates the + compact simulated stamp's noise by about 6%, producing a ~12% inverse-variance + shortfall. This commit fixes R2 only; it does not claim R1 is solved. +- **BACKGROUND_RMS checkpoint:** `/tmp/pr740-wt` commit `03bf12b6` adds the + first #604 implementation slice. `ngmix_runner` now takes optional + `BKG_RMS_VIGNET_PATH` (with `{file_number_string}` substitution) without + changing the existing six-input pipeline contract. `Vignet` opens the + optional RMS sqlite, `prepare_postage_stamps` rescales RMS by `FSCALE`, and + `prepare_ngmix_weights(..., bkg_rms=...)` uses per-pixel `1/RMS²` on pixels + where the Megapipe weight mask, flag mask, and RMS validity mask are all good. + Verification observed in the dev image: + `test_weight_map_recovers_injected_inverse_variance`, + `test_background_rms_builds_per_pixel_inverse_variance`, and + `test_rescale_epoch_fluxes_scales_background_rms_like_image_counts` all + passed. `py_compile` and `git diff --check` were also clean. +- **BACKGROUND_RMS config checkpoint:** `/tmp/pr740-wt` commit `ee87b5bd` + wires the active CFIS and CFIS simulation example configs into that code + path. The exposure SExtractor configs request `CHECKIMAGE = BACKGROUND, + BACKGROUND_RMS`; the multi-epoch vignetmaker configs add + `background_rms` beside `background`; and the ngmix templates set + `BKG_RMS_VIGNET_PATH` to the expected + `background_rms_vignet{file_number_string}.sqlite` output. Verification: + the three targeted RMS/weight tests pass in the dev image; all touched + example configs parse with `ConfigParser(interpolation=None)`; `git diff + --check` is clean. +- **Optional RMS input-contract guard:** `/tmp/pr740-wt` commit `22e31e51` + adds `test_ngmix_accepts_optional_background_rms_vignet`, which + instantiates `Ngmix` with the seventh `BACKGROUND_RMS` sqlite input and + verifies `Vignet` opens it. Verification observed in the dev image: the + focused RMS/weight pytest selection passed, `py_compile` passed for touched + ngmix files, and `git diff --check` was clean. +- **Current test caveat:** the targeted unit test passes in the dev image. The full + `src/shapepipe/tests/test_ngmix.py` file currently still fails in the existing + metacal smoke test because the container's `ngmix.fitting` has no `Fitter` + attribute; that failure is independent of the new R2 unit path. +- **Repo:** `/automnt/n17data/cdaley/unions/shapepipe`, branch `ngmix_v2.0` (= #741 head). Base the + work on `ngmix_v2.0` **after** the review-cleanup commit lands (it removes dead code and is currently + staged at `/tmp/pr740-wt` commit `bd60dc8e`, pending Cail's push). Coordinate with + [[ngmix-size-columns]] — it also edits `ngmix.py`, but a different function (the size-column writer + vs `prepare_ngmix_weights`); low conflict, but each work-stream gets its own branch. +- **Test environment:** dev-image sandbox at `/n17data/cdaley/containers/shapepipe-dev` with **ngmix + 2.4.0** installed (the published `:develop` image ships the old `aguinot` fork v1.3.6 — wrong for + this branch). Run via `CONTAINER_PATH=/n17data/cdaley/containers/shapepipe-dev PYTHONPATH=/src app python3.12 -m pytest …`. + The `--writable` sandbox needs bind mountpoints to exist (`mkdir` them); see this session's history. +- **Issue #604:** megapipe weights are 0/1 masks (Gwyn confirmed); aguinot's BACKGROUND_RMS proposal is + the basis for PR 2. Cail confirmed the SExtractor-RMS baseline over THELI. +- **Agent:** Codex (Cail's call) — gritty, well-defined cross-file implementation. diff --git a/.felt/shapepipe/np-str0-numpy2/np-str0-numpy2.md b/.felt/shapepipe/np-str0-numpy2/np-str0-numpy2.md new file mode 100644 index 000000000..d23de432a --- /dev/null +++ b/.felt/shapepipe/np-str0-numpy2/np-str0-numpy2.md @@ -0,0 +1,40 @@ +--- +id: 01KTK3WGKFCGD2VQAQSABC3TBJ +name: np.str0 fix (NumPy 2 aftershock) +status: closed +tags: + - shapepipe + - numpy + - bugfix +created-at: 2026-06-08T09:59:18.639370578+02:00 +closed-at: 2026-06-10T17:13:38.507495758+02:00 +outcome: 'np.str0 removed from file_io.py on develop (verified 2026-06-10: only np.str_ remains); NumPy 2 string-column saves work. Regression guard idea (NAME column in roundtrip test) noted in body if ever needed.' +--- + +Fabian Hervas-Peters hit `module 'numpy' has no attribute 'str0'` running the +new container against `src/shapepipe/pipeline/file_io.py`. Root cause: the +container now ships **NumPy 2.4.4** (`pyproject` pins `numpy>=2.0`), and +`np.str0` — a deprecated alias for `np.str_` — was *removed* in NumPy 2.0. + +The offending line was `_get_fits_col_type`'s string branch: +`elif type(col_data[0]) in [str, np.str_, np.str0]:`. The fix is to drop the +dead alias — `np.str_` already covers it, so nothing is lost. + +**Why the test suite never caught it.** The list literal `[str, np.str_, np.str0]` +is only *evaluated* when that `elif` is actually reached, i.e. only when a column +whose first element is a string flows through `save_as_fits`. The existing +`test_fits_catalogue_table_roundtrips` saved only a `float64` and an `int16` +column — both short-circuit on earlier branches and never touch the string path. +So the `AttributeError` stayed dormant until Fabian saved a catalogue with a +string column. The regression guard is a one-line addition: a `NAME` string +column in the roundtrip test, which now exercises the `"A"` (FITS char) dispatch. + +Interestingly, `ngmix_v2.0` had *already* removed `np.str0` independently, so +merging develop into it was a clean no-op on `file_io.py` and only carried over +the improved test. + +General lesson: removed-NumPy-2 aliases (`np.str0`, `np.bool0`, `np.int0`, +`np.object0`, `np.unicode_`, …) hide in rarely-reached branches and survive +import — they only blow up at the call site. Worth a grep sweep when migrating a +codebase to NumPy 2. A sweep of `src/shapepipe/` on 2026-06-08 found `str0` was +the only remaining offender. diff --git a/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md b/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md index b9bbe8849..fca6d5ad4 100644 --- a/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md +++ b/.felt/shapepipe/smoke-test-read-only/smoke-test-read-only.md @@ -1,23 +1,14 @@ --- +id: 01KTCHWZZ923H1H5DR6AS3Q5H6 name: Smoke test must work in read-only mode +status: closed tags: - shapepipe - docker - infra created-at: 2026-05-28T10:32:25.53742271+02:00 -outcome: |- - `shapepipe_run -c /app/example/config.ini` fails on read-only SIF - because the example config uses relative `OUTPUT_DIR = ./example/output`, - which resolves under `WORKDIR=/app` — read-only in apptainer/SIF. Fix: - add `scripts/sh/shapepipe_run_example.sh` wrapper that mktemp's a - workdir, copies `/app/example/` into it, cd's, and execs `shapepipe_run`. - Dockerfile's existing auto-symlink rule (`scripts/*/*.sh` → - `/usr/local/bin/`) makes it usable as `shapepipe_run_example` - on PATH. CI smoke step updated to call it under - `docker run --read-only --tmpfs /tmp:rw` (emulates SIF). Drive-by: - tightened `.gitignore` from `*shapepipe_run_*` (catches the wrapper) - to `example/output/shapepipe_run_*`. Submitted as #731 from branch - `chore/smoke-test-read-only`; awaiting CI. +closed-at: 2026-06-10T17:13:38.485047103+02:00 +outcome: 'shapepipe_run_example wrapper (copy example to /tmp) landed via PR #731, merged to develop; CI smoke steps use it. Read-only SIF runs work.' --- ## The gap diff --git a/.felt/shapepipe/test-suite/test-suite.md b/.felt/shapepipe/test-suite/test-suite.md index 37e3d46ca..ffb9aab3a 100644 --- a/.felt/shapepipe/test-suite/test-suite.md +++ b/.felt/shapepipe/test-suite/test-suite.md @@ -1,4 +1,5 @@ --- +id: 01KTCHWZZJ1H3D0HYETGFVMVZH name: 'ShapePipe test suite: tiered, in-image, property-based' status: closed tags: @@ -15,6 +16,7 @@ shuttle: host: c03 project_dir: /automnt/n17data/cdaley/unions/shapepipe agent: codex +tempered: true --- ## Desired State diff --git a/.felt/shapepipe/v2-run-plan/v2-run-plan.md b/.felt/shapepipe/v2-run-plan/v2-run-plan.md index 537c85a96..b5a96e2b8 100644 --- a/.felt/shapepipe/v2-run-plan/v2-run-plan.md +++ b/.felt/shapepipe/v2-run-plan/v2-run-plan.md @@ -1,4 +1,5 @@ --- +id: 01KTCHWZZKM6TYB7WS8J1TQWVY name: v2.0 ShapePipe run — workflow wishlist tags: - shapepipe diff --git a/.felt/sqlitedict-pickle-smell/sqlitedict-pickle-smell.md b/.felt/sqlitedict-pickle-smell/sqlitedict-pickle-smell.md index f035953f5..c2fc64feb 100644 --- a/.felt/sqlitedict-pickle-smell/sqlitedict-pickle-smell.md +++ b/.felt/sqlitedict-pickle-smell/sqlitedict-pickle-smell.md @@ -1,4 +1,5 @@ --- +id: 01KTCHX00N4580BRMCH0FNADXV name: sqlitedict pickle-by-default is a known smell status: open tags: diff --git a/README.rst b/README.rst index da8fe0701..e27b4028d 100644 --- a/README.rst +++ b/README.rst @@ -1,7 +1,7 @@ ShapePipe ========= -|CI| |CD| |python39| |release| +|CI| |CD| |python312| |release| .. |CI| image:: https://github.com/CosmoStat/shapepipe/workflows/CI/badge.svg :target: https://github.com/CosmoStat/shapepipe/actions?query=workflow%3ACI @@ -9,14 +9,48 @@ ShapePipe .. |CD| image:: https://github.com/CosmoStat/shapepipe/actions/workflows/pages/pages-build-deployment/badge.svg :target: https://github.com/CosmoStat/shapepipe/actions/workflows/pages/pages-build-deployment -.. |python39| image:: https://img.shields.io/badge/python-3.9-green.svg - :target: https://www.python.org/‰ +.. |python312| image:: https://img.shields.io/badge/python-3.12-green.svg + :target: https://www.python.org/ .. |release| image:: https://img.shields.io/github/v/release/CosmoStat/shapepipe :target: https://github.com/CosmoStat/shapepipe/releases/latest ShapePipe is a galaxy shape measurement pipeline developed within the -CosmoStat lab at CEA Paris-Saclay. +CosmoStat lab at CEA Paris-Saclay. It runs the full chain from raw survey +images to calibrated shear catalogues — object detection, PSF modelling, and +shape measurement — and produced the first UNIONS cosmic-shear release. -See the `documentation `_ for details -on how to install and run ShapePipe. +Quickstart +---------- + +ShapePipe ships as a container image, so you can run the bundled example +pipeline — a single CFIS tile through the full chain — without installing +anything: + +.. code-block:: bash + + # Apptainer (HPC, no root needed): + apptainer exec docker://ghcr.io/cosmostat/shapepipe:develop-runtime shapepipe_run_example + + # ...or Docker: + docker run --rm ghcr.io/cosmostat/shapepipe:develop-runtime shapepipe_run_example + +The image is published on every push to the `GitHub Container Registry +`_: +``:develop`` tracks the integration branch, release tags (e.g. ``:v1.1.0``) a +stable cut, and the ``-runtime`` suffix selects the slim batch image over the +full interactive one. + +Documentation +------------- + +Full documentation lives at https://cosmostat.github.io/shapepipe. Good places +to start: + +- `Installation `_ — getting ShapePipe onto your machine or cluster. +- `Basic execution `_ and `configuration `_ — running ``shapepipe_run`` and writing pipeline configs. +- `Container workflow `_ — the two image targets and the ``pyproject.toml`` / ``uv.lock`` / ``Dockerfile`` layers. +- `Running on a cluster `_ — pulling the image and submitting jobs, with worked candide (SLURM) and CANFAR examples. + +If you use ShapePipe in academic work, please cite Guinot et al. (2022) and +Farrens et al. (2022). diff --git a/docs/source/basic_execution.md b/docs/source/basic_execution.md index 9e7ca63b4..d9aa6de07 100644 --- a/docs/source/basic_execution.md +++ b/docs/source/basic_execution.md @@ -7,7 +7,7 @@ option: ```{seealso} :class: margin -The `shapepipe` environment will need to be built and activated in order to run this script (see [Installation](installation.md)). +ShapePipe runs inside its container — there is no environment to activate. See [Installation](installation.md). ``` ```bash shapepipe_run --help @@ -37,11 +37,33 @@ shapepipe_run -c ## Running the Pipeline with MPI ShapePipe can also use [mpi4py](https://mpi4py.readthedocs.io/en/stable/) -for managing parallel processes on clusters with multiple nodes. -The `shapepipe_run` script can be run with MPI as follows +to spread work across multiple nodes of a cluster. Set `MODE = mpi` in the +`[EXECUTION]` section of the config and launch with an MPI runner: ```bash -mpiexec -n shapepipe_run +mpiexec -n shapepipe_run -c ``` -where `` is the number of cores to allocate to the run. +where `` is the number of MPI processes to start. + +### Through the container (the supported way on a cluster) + +On a cluster you run ShapePipe from the published image as a standard Apptainer +*hybrid* MPI job: the **host** `mpirun`/`mpiexec` launches one container rank per +slot, and the OpenMPI bundled in the image wires the ranks together. + +```bash +# one-time: pull the runtime image +apptainer pull shapepipe.sif docker://ghcr.io/cosmostat/shapepipe:develop-runtime + +# load a host MPI in the same family as the image's OpenMPI (5.0.x), then launch +module load openmpi +mpirun -n \ + apptainer exec --bind "$PWD:$PWD" shapepipe.sif \ + shapepipe_run -c +``` + +The image ships **OpenMPI 5.0.x** so that its PMIx matches modern cluster +launchers. The host and container MPI must be compatible: if you see *N* copies +of `rank 0 of 1` instead of one *N*-rank job, load a host OpenMPI in the 5.0.x +family. See `example/pbs/candide_mpi.sh` for a complete SLURM batch script. diff --git a/docs/source/canfar.md b/docs/source/canfar.md deleted file mode 100644 index 1f490c504..000000000 --- a/docs/source/canfar.md +++ /dev/null @@ -1,94 +0,0 @@ -# Running `shapepipe` on the canfar science portal - -## Introduction - -## Steps from testing to parallel running - -Before starting a batch remote session job on a large number of images (step 5.), -it is recommended to perform some or all of the testing steps (1. - 4.). - - -1. Run the basic `shapepipe` runner script to test (one or several) modules in question, specified by a given config file, on one image. - This step has to be run in the image run directory. The command is - ```bash - shapepipe_run -c config.ini - ``` - -2. Run the job script to test the job management, on one image. - This step has to be run in the image run directory. The command is - ```bash - job_sp_canfar -j JOB [OPTIONS] - ``` - -3. Run the pipeline script to test the processing step(s), on one image. - This step has to be run in the patch base directory. - - 1. First, run in dry mode: - ```bash - init_run_exclusive_canfar.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp] -n - ``` - 2. Next, perform a real run with - ```bash - init_run_exclusive_canfar.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp] -n - ``` - -4. Run remote session script to test job submission using docker images, on one image. - This step has to be run in the patch base directory. - 1. First, run in dry mode=2, to display curl command, with - ```bash - curl_canfar_local.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp] -n 2 - ``` - - 2. Next, run in dry mode=1, to use curl command without processing: - ```bash - curl_canfar_local.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp] -n 1 - ``` - 3. Then, perform a real run, to use curl with processing: - ```bash - curl_canfar_local.sh -j JOB -e ID -p [psfex|mccd] -k [tile|exp] - ``` - -5. Full run: Call remote session script and docker image with collection of images - ```bash - curl_canfar_local.sh -j JOB -f path_IDs -p [psfex|mccd] -k [tile|exp] - ``` - with `path_IDs` being a text file with one image ID per line. - -## Monitoring - - -### Status and output of submitted job - -Monitoring of the currently active remote session can be performed using the session IDs `session_IDs.txt` written by the -remote session script `curl_canfar_local.sh`. In the patch main directory, run -```bash -curl_canfar_monitor.sh events -``` -to display the remotely started docker image status, and -```bash -curl_canfar_monitor.sh logs -``` -to print `stdout` of the remotely run pipeline script. - -### Number of submitted running jobs - -The script -```bash -stats_headless_canfar.py -``` -returns the number of actively running headless jobs. - - -## Post-hoc summary - -In the patch main directory, run -```bash -summary_run PATCH -``` -to print a summary with missing image IDs per job and module. - -## Deleting jobs - -```bash - for id in `cat session_IDs.txt`; do echo $id; curl -X DELETE -E /arc/home/kilbinger/.ssl/cadcproxy.pem https://ws-uv.canfar.net/skaha/v0/session/$id; done - ``` diff --git a/docs/source/clusters.md b/docs/source/clusters.md new file mode 100644 index 000000000..41d32a0bf --- /dev/null +++ b/docs/source/clusters.md @@ -0,0 +1,101 @@ +# Running on a Cluster + +ShapePipe runs the same way on every cluster: **through the container**. You +pull the slim `runtime` image once, bind-mount your clone, and run +`shapepipe_run` (or the CANFAR submission tooling) inside it — there is no +environment to install or activate on the host. This page covers the shared +pattern, then the specifics for each supported machine. + +For what is *inside* the image and how it is built, see +[Container Workflow](container.md). + +## The pattern + +Three things hold on any cluster: + +- **The container is the unit of execution.** Pull the runtime image to a SIF + (`apptainer pull … docker://ghcr.io/cosmostat/shapepipe:-runtime`) and run + the pipeline inside it. Nothing else is installed on the host. +- **Bind-mount your clone at the same path.** The config files reference their + location for the input and output directories; bind-mounting the host clone at + the *identical* path inside the container makes those paths resolve the same in + and out of the container. +- **Keep SIFs and Apptainer's scratch off a quota-limited `$HOME`.** A pull under + a tight home quota fails with `disk quota exceeded`; point `APPTAINER_CACHEDIR` + (and the SIF itself) at a roomy data partition. + +MPI jobs add one constraint: the container's MPI must speak the same PMIx wire +protocol as the host launcher (see the candide section below). + +## candide (SLURM) + +candide uses **SLURM** (`sbatch`; the old `qsub`/PBS commands are gone). The repo +ships ready job scripts in `example/pbs/` — `candide_smp.sh` (single node, +parallelised with joblib) and `candide_mpi.sh` (multi-node, hybrid MPI). To run +the bundled single-tile example end to end: + +```bash +# Keep the SIF and Apptainer's scratch off the quota-limited $HOME. +export DATA=/n17data/$USER # adjust to your data partition +export APPTAINER_CACHEDIR=$DATA/.apptainer + +# Pull the runtime image (~850 MB). +apptainer pull "$DATA/shapepipe-runtime.sif" \ + docker://ghcr.io/cosmostat/shapepipe:develop-runtime + +# Submit. SPDIR is your clone, bind-mounted at the same path inside the +# container; SP_IMAGE is the SIF. The same script serves the example and a real +# run — point the config inside it at your own pipeline. +SP_IMAGE="$DATA/shapepipe-runtime.sif" SPDIR="/path/to/shapepipe" \ + sbatch example/pbs/candide_smp.sh +``` + +Partitions are `comp` (2-day limit) and `compl` (5-day). Both job scripts read +`SP_IMAGE` (the SIF) and `SPDIR` (the clone) from the environment, so the only +thing that changes between the example and a real run is the config the script +points at. + +For multi-node MPI, use `candide_mpi.sh`. It `module load`s a host OpenMPI in the +5.0.x series to match the PMIx that the image's OpenMPI speaks; a mismatched host +MPI (e.g. OpenMPI 4) silently degrades the job to *N* independent "rank 0 of 1" +processes instead of one *N*-rank job. The script's header comments carry the +detail. + +Adapting these scripts to another SLURM cluster is mostly the `#SBATCH` +directives and the `module load` line — the `apptainer exec` invocation carries +over unchanged. + +## CANFAR + +CANFAR submission does not go through a batch scheduler. Instead you submit +container jobs to CANFAR's headless system with the `canfar_submit_job` console +script (backed by the `canfar` library), and watch them with `canfar_monitor` / +`canfar_monitor_log`. Pipeline steps are **bit-coded** through `-j` (the same +scheme as `scripts/sh/job_sp_canfar.bash`), the PSF model is chosen with +`-p psfex|mccd`, and `-V` selects the image version: + +```bash +# Submit pipeline step(s) for the configured tiles (bit-coded -j). +canfar_submit_job -j 1 -p psfex -V 1.1 + +# Monitor sessions/jobs and stream logs. +canfar_monitor +canfar_monitor_log +``` + +The full production run — input preparation, the per-step `-j` table, and +post-processing — is documented in the +[CANFAR production walkthrough](pipeline_canfar.md). + +```{note} +The CANFAR production submission scripts (`scripts/sh/job_sp_canfar*.bash`) still +run under the pre-container environment and are slated for the same +container-first cleanup the candide scripts received. Treat the walkthrough as +the current-but-evolving production procedure. +``` + +## ccin2p3 + +ccin2p3 is not yet containerised. The `example/pbs/cc_{smp,mpi}.sh` scripts target +the pre-container environment; a container-first path mirroring candide is future +work. diff --git a/docs/source/configuration.md b/docs/source/configuration.md index 3336123c1..772a9558f 100644 --- a/docs/source/configuration.md +++ b/docs/source/configuration.md @@ -1,6 +1,6 @@ # Configuration -The pipeline requires a configuration file (by default called `conifg.ini`) +The pipeline requires a configuration file (by default called `config.ini`) in order to be run. Example configuration files are provided in the [example](https://github.com/CosmoStat/shapepipe/tree/develop/example) directory. @@ -114,7 +114,7 @@ INPUT_DIR = last:psfex_runner OUTPUT_DIR = /home/username/my_output_dir FILE_PATTERN = psf FILE_EXT = fits -NUMBERING_LIST = -001, -002 +NUMBER_LIST = -001, -002 ``` ShapePipe will look for the specific files `psf-001.fits` and `psf-002.fits` diff --git a/docs/source/container.md b/docs/source/container.md index 2511a7f3e..174cde9dc 100644 --- a/docs/source/container.md +++ b/docs/source/container.md @@ -2,7 +2,8 @@ ShapePipe ships as a container image. This page covers the two image targets and the three configuration files (`pyproject.toml`, `uv.lock`, -`Dockerfile`) that determine what's inside. +`Dockerfile`) that determine what's inside. For running the image on a batch +cluster (candide, CANFAR), see [Running on a cluster](clusters.md). ## Two image targets diff --git a/docs/source/contributing.md b/docs/source/contributing.md index 99cf85a09..814bf4852 100644 --- a/docs/source/contributing.md +++ b/docs/source/contributing.md @@ -2,7 +2,7 @@ ShapePipe is a fully open-source project and we welcome contributions. -Pleas read our +Please read our [contribution guidelines](https://github.com/CosmoStat/shapepipe/blob/develop/CONTRIBUTING.md). for details on how to contribute to the development of this package. diff --git a/docs/source/dependencies.md b/docs/source/dependencies.md index 64c6db0ea..9378bfe00 100644 --- a/docs/source/dependencies.md +++ b/docs/source/dependencies.md @@ -1,32 +1,65 @@ # Dependencies -All third-party software packages required by ShapePipe are installed -automatically (see [Installation](installation.md)). Note that all packages -are pinned to a given version for each release of ShapePipe to maximise the -reproducibility of the results. Below we list the main packages used. +ShapePipe's dependencies are installed automatically — the recommended path is +the [container](container.md), which bundles all of them (see also +[Installation](installation.md)). `pyproject.toml` is the authoritative list: it +declares the **abstract minimum versions** ShapePipe is compatible with, while +`uv.lock` pins the **exact** versions that ship in a given build. The tables +below describe the main packages; for the complete, current set always refer to +`pyproject.toml`. ## Python Dependencies -| Package Name | References | -|--------------|---------------------------------------------------| +Scientific stack: + +| Package | References | +|---------|------------| | [Astropy](https://www.astropy.org/) | {cite:p}`astropy:2013,astropy:2018` | | [GalSim](https://github.com/GalSim-developers/GalSim) | {cite:p}`rowe:15` | -| [Joblib](https://joblib.readthedocs.io/en/latest/) | {cite:p}`joblib:20` | -| [Matplotlib](https://matplotlib.org/) | {cite:p}`hunter:07` | +| [ngmix](https://github.com/aguinot/ngmix) | {cite:p}`sheldon:15` | | [MCCD](https://github.com/CosmoStat/mccd) | {cite:p}`liaudat:21` | | [ModOpt](https://cea-cosmic.github.io/ModOpt/) | {cite:p}`farrens:20` | -| [mpi4py](https://mpi4py.readthedocs.io/en/stable/) | {cite:p}`dalcin:05,dalcin:08,dalcin:11` | -| [NGMIX](https://github.com/esheldon/ngmix) | {cite:p}`sheldon:15` | +| [python-pysap](https://github.com/CEA-COSMIC/pysap) | | | [Numpy](https://numpy.org/) | {cite:p}`harris:20` | +| [Numba](https://numba.pydata.org/) | | | [Pandas](https://pandas.pydata.org/) | {cite:p}`pandas:10,pandas:20` | +| [Matplotlib](https://matplotlib.org/) | {cite:p}`hunter:07` | +| [Joblib](https://joblib.readthedocs.io/en/latest/) | {cite:p}`joblib:20` | +| [mpi4py](https://mpi4py.readthedocs.io/en/stable/) | {cite:p}`dalcin:05,dalcin:08,dalcin:11` | +| [reproject](https://reproject.readthedocs.io/) | | +| [h5py](https://www.h5py.org/) | | | [sf_tools](https://github.com/sfarrens/sf_tools) | | -| [sqlitedict](https://github.com/RaRe-Technologies/sqlitedict) | | -## Executable Dependencies +Data access & infrastructure (CANFAR / UNIONS): + +| Package | Purpose | +|---------|---------| +| [vos](https://github.com/opencadc/vostools) | CADC / CANFAR VOSpace access | +| [skaha](https://github.com/shinybrar/skaha) | CANFAR Science Platform sessions | +| canfar | CANFAR container-job submission | +| [astroquery](https://astroquery.readthedocs.io/) | external catalogue queries | +| [cs_util](https://github.com/CosmoStat/cs_util) | shared CosmoStat utilities | +| [sqlitedict](https://github.com/RaRe-Technologies/sqlitedict) | on-disk pipeline state | + +```{note} +`ngmix` is pinned to the +[`aguinot/ngmix@stable_version`](https://github.com/aguinot/ngmix/tree/stable_version) +fork until the fixes land upstream — do not modernise that line (see the note in +`pyproject.toml`). +``` -| Package Name | References | -|---------------|------------| -| [CDSclient](http://cdsarc.u-strasbg.fr/doc/cdsclient.html) | | +## System Dependencies + +The container also provides the non-Python tools ShapePipe calls, all from Debian +packages (no source builds), plus the MPI stack: + +| Package | References | +|---------|------------| +| [Source Extractor](https://www.astromatic.net/software/sextractor/) | {cite:p}`bertin:96` | | [PSFEx](https://www.astromatic.net/software/psfex/) | {cite:p}`bertin:11` | -| [SExtractor](https://www.astromatic.net/software/sextractor/) | {cite:p}`bertin:96` | | [WeightWatcher](https://www.astromatic.net/software/weightwatcher/) | {cite:p}`marmo:08` | +| OpenMPI (5.0.x) | | + +Python dependencies themselves are managed with [uv](https://docs.astral.sh/uv/); +see [Container Workflow](container.md) for how `pyproject.toml`, `uv.lock`, and +the `Dockerfile` fit together. diff --git a/docs/source/module_develop.md b/docs/source/module_develop.md index 3df8fe88c..b4ef3566f 100644 --- a/docs/source/module_develop.md +++ b/docs/source/module_develop.md @@ -7,7 +7,7 @@ This page provides details on how new modules can be implemented in ShapePipe. Every ShapePipe module requires a *module package*, which is simply a directory with the module name followed by `_package`, e.g. for a module called `my_new_module` you would create a new directory called `my_new_module_package` -and put it in `shapepipe/modules`. Inside this directory you will need to +and put it in `src/shapepipe/modules`. Inside this directory you will need to include a Python file (ideally named after your module, e.g. `my_new_module.py`) and a `__init__.py` file with the following content. diff --git a/docs/source/pipeline_canfar.md b/docs/source/pipeline_canfar.md index 39111b05f..7e8bb29c5 100644 --- a/docs/source/pipeline_canfar.md +++ b/docs/source/pipeline_canfar.md @@ -457,6 +457,16 @@ ln -s ..//unions_shapepipe_comprehensive_struct_2024_v1.6.c.hdf5 unions_shapepip calibrate_comprehensive +```{note} +The matched-star-catalogue and coverage-mask diagnostics in the next two +sections have largely moved out of ShapePipe into +[`sp_validation`](https://github.com/CosmoStat/sp_validation) / `cosmo_val`. +Several helpers referenced below — `merge_psf_cat.py`, `download_headers`, +`extract_field_corners`, `build_coverage_map`, `plot_coverage_map` — are no +longer shipped in this repository; see `sp_validation` for their current +equivalents. (`get_ccds_with_psf` and `summary_run` are still part of ShapePipe.) +``` + ### Create matched star catalogue For diagnostics, a catalogue with multi-epoch shapes measured by ngmix matched with the validation star catalogue is used. diff --git a/docs/source/pipeline_v2.0.md b/docs/source/pipeline_v2.0.md deleted file mode 100644 index bdcd07d08..000000000 --- a/docs/source/pipeline_v2.0.md +++ /dev/null @@ -1,292 +0,0 @@ -# Running `ShapePipe` processing and post-processing pipelines on CANFAR - -Documentation to create ShapePipe output products for catalogues v2.x. - -## Initialise directory structure - -```bash -init_run_v2.0.sh -``` -sets up the directory structure. This will be - -v2.0/ -├── tiles/ -│ ├── 301/ -│ │ ├── 301.278/ -│ │ ├── 301.279/ -│ │ └── ... -├── exp/ -│ ├── 21/ -│ │ ├── 21163916 -│ │ └── ... -├── cfis -> /arc/home/kilbinger/shapepipe/example/cfis -├── tile_numbers -> /arc/home/kilbinger/shapepipe/auxdir/CFIS/tiles_202604/tiles_r.txt -└── debug/ - - -### Interactive job from the terminal for a single tile - -Run bit-coded jobs -```bash -run_job_canfar_v2.0.sh -e ID -j -``` -with job processing tiles: -- 1: download tiles -- 2: uncompress tile weights -- 4: find exposures -then exposures: -- 8: download exposures -- 16: split exposures into single-CCD HDUs -- 32: mask exposures -- 64: process stars (selection, PSF movel) -then back to tiles: -- 128: select objects (using external catalogue) -- 256: create object postage stamps - -## CANFAR Setup - -### CANFAR Login - -Login to the canfar system with - -```bash -canfar auth login -``` - -This can be done from at notebook or terminal within the canfar science portal, -or any remote terminal that has the canfar library installed. - -Check authentication status with - -```bash -canfar auth list -``` - -If not on "default", run - -```bash -canfar auth switch default -``` - - -## From previous setup - -### Merge all final catalogues - -The last step of `ShapePipe` processing is, per patch, to merget all final catalogues. This is done via a python script, as follows. -First, change to parent directory `/path/to/version` and run the following command for all patches - -```bash -patchnum=`tr $patch P ''` -create_final_cat.py -m final_cat_$patch.hdf5 -i . -p $patch/cfis/final_cat.param \ - -P $patchnum -o $patch/n_tiles_final.txt -v -``` - -## Additional `ShapePipe` processing - - -### Create star Catalogue - -We can additionaly create a combined star catalogue, with star shapes projecte from detector to world coordinates. -This is useful for validation and galaxy-PSF/star correlation diagnostics. - -#### Combine all PSF runs - -In each patch directory /path/to/version/$patch, run - -```bash -combine_runs.bash -p $psf -c psf -``` - -to create a single output directory of PSF files (symbolic links). - -Optionally, to create and plot results for this patch only: - -```bash -shapepipe_run -c $SP_CONFIG/config_Ms_$psf.ini -shapepipe_run -c $SP_CONFIG/config_Pl_$psf.ini -``` - -#### Convert star catalogue to wCS - -Convert all input validation PSF files and create directories per patch `P?`. -Create files `validation_psf_conv--.fits` (for the v1.4 setup only one file): - -```bash -cd /path/to/version -mkdir stat_car -cd star_cat -``` - -For each patch run - -```bash -convert_psf_pix2world.py -i .. -P $patchnum -v -``` - -Combine previously created files as links within one ShapePipe run directory (for the v1.4 setup only one link). -First (and optiohnal), create a subdir for a run and link to the input patches: - -```bash -cd /path/to/version/star_cat -mkdir v1.6 -ln -s ../P1 -ln -s ../P2 -... -``` - -Next, create links to all `validation_conv` runs: - -```bash -combine_runs.bash -p psfex -c psf_conv -``` - -Merge all converted star catalogues and create `final-starcat.fits`: - -```bash -export SP_RUN=`pwd` -shapepipe_run -c ~/shapepipe/example/cfis/config_Ms_psfex_conv.ini -``` - -Rename to general PSF and star catalogue used for all ("a") sub-versions: - - -```bash -cp output/run_sp_Ms/merge_starcat_runner/output/full_starcat-0000000.fits \ - unions_shapepipe_psf_2024_v1.6.a.fits -``` - -The FITS file `CATTYPE` (newer version) should be `validation_psf_conf`. - -## Post-processing - -The following post-processing steps are performed with the library `sp_validation`. - -### Extract Information - -First, we extract all information from the final catalogue, per patch. We copy -the parameter file and set links to the catalogues and `ShapePipe` config directory. - -```bash -cd /path/to/version/$patch -cp ~/astro/repositories/github/sp_validation/notebooks/params.py . -ln -s /path/to/final_cat_$patchnum.hdf5 # not relative path ../final_cat_P$patchnum.hdf5 ! -ln -s output/run_sp_MsPl/mccd_merge_starcat_runner/output/full_starcat-0000000.fits -ln -s ~/astro/repositories/github/shapepipe/example/cfis -``` - -Then edit `params.py`: Set patch name; set `wrap_ra` for P2. - -Now we can run the script, recommended via job submission on candide. For large patches, -this requies a job with a large memory, e.g. with `mem=380000` - - -```bash -[squeue] python ~/astro/repositories/github/sp_validation/notebooks/extract_info.py -``` - -This creates a patch-wise comprehensive catalogue. - -### Create global comprehensive catalogues - -```bash -cd /patch/to/version -[squeue] python ~/astro/repositories/github/sp_validation/scripts/create_joint_comprehensive_cat.py \ - -v v1.6.c -v -p P1+P2+P3+P4+P5+P6+P7+P8+P9 -``` - -This creates the file `unions_shapepipe_comprehensive_2024_v1.6.c.hdf5`. - - -### Apply structural masks - -First, edit the Python script `~/astro/repositories/github/sp_validation/notebooks/demo_apply_hsp_masks.py` -to match catalogue name. Check the coverage mask input file (see below). -Run the script to apply the healsparse structural masks: - -```bash -[squeue] python ~/astro/repositories/github/sp_validation/notebooks/demo_apply_hsp_masks.py -``` - -This creates the file `unions_shapepipe_comprehensive_struct_2024_v1.6.c.hdf5`. - - -### Define sample, calibrate catalogue - -We are close to finally perform the last post-processing step, which is the calibration. First, the final galaxy sample -in question needs to be defined, with masks and cuts to apply from a `yaml` config file. A number of pre-defined files -can be found in `~/astro/repositories/github/sp_validation/calibration`. - -For example, to create `v1.6.6`, the steps are: - -```bash -cd /path/to/version -mkdir -p v1.6.6 -cd v1.6.6 -ln -s ~/astro/repositories/github/sp_validation/calibration/mask_v1.X.6.yaml config_mask.yaml -ln -s ..//unions_shapepipe_comprehensive_struct_2024_v1.6.c.hdf5 unions_shapepipe_comprehensive_struct_2024_v1.X.c.hdf5 -[squeue] python ~/astro/repositories/github/sp_validation/calibrate_comprehensive_cat.py -``` - -calibrate_comprehensive - - -### Create matched star catalogue - -For diagnostics, a catalogue with multi-epoch shapes measured by ngmix matched with the validation star catalogue is used. -This is created as follows: - -```bash -cd /path/to/version -merge_psf_cat.py [-V v1.6|-P P1+P2+...] -v -``` - -This creates the joint catalogue unions_shapepipe_star_2024_v1.6.a.fits . - -### Create coverage mask - -First, on canfar, move to the directory that has the patch subdirectories. - -```bash -cd /path/to/version -``` - -#### Get exposure numbers - -If the file `$patch/exp_numbers.txt` does not exist for a given patch, create it with the summary program - -```bash -summary_run $patch 1 -``` - -Now, create the list of CCDs that have PSF information with - -```bash -get_ccds_with_psf -v -V v1.6 -``` - -Next, download exposures headers - -```bash -download_headers -i ccds_with_psfs_v1.6.txt -o headers_v1.6 -v -``` - -From the headers, the CCD corner coordinates are extracted with -```bash -extract_field_corners -i headers_v1.6 -v -``` - -Then, build the healsparse coverage mask file as -```bash -build_coverage_map -``` - -Use `plot_coverage_map` to create plots of the coverage mask. - -## Extra Utilities - -### Run in Terminal in Parallel - -```bash -cat IDs.txt | xargs -I {} -P 16 bash -c 'init_run_exclusive_canfar.sh -j 512 -e {}' -``` diff --git a/docs/source/post_processing.md b/docs/source/post_processing.md index 577ad57b3..e8a3768f4 100644 --- a/docs/source/post_processing.md +++ b/docs/source/post_processing.md @@ -16,7 +16,16 @@ If main ShapePipe processing happened at the old canfar VM system (e.g. CFIS v0 --- -The following steps are required for pre-v1.4 runs performed on the canfar VM system. +```{note} +This page documents the **legacy** post-processing used for pre-v1.4 runs on the +canfar VM system. PSF validation and the scale-dependent diagnostics +(ρ-statistics) now live in +[`sp_validation`](https://github.com/CosmoStat/sp_validation) rather than in +ShapePipe; the steps below are retained for reference and for reproducing older +runs. +``` + +The following steps were used for pre-v1.4 runs performed on the canfar VM system. 1. Optional: Split output into sub-samples @@ -48,30 +57,25 @@ The following steps are required for pre-v1.4 runs performed on the canfar VM sy This script creates a new combined psf run in the ShapePipe `output` directory, by identifying all psf validation files and creating symbolic links. The run log file is updated. - 3. Merge individual psf validation files into one catalogue. Create plots of the PSF and their residuals in the focal plane, - as a diagnostic of the overall PSF model. - As a scale-dependend test, which propagates directly to the shear correlation function, the rho statistics are computed, - see {cite:p}`rowe:10` and {cite:p}`jarvis:16`, + 3. Merge the individual PSF validation files into one catalogue, and create + plots of the PSF and its residuals in the focal plane as a diagnostic of + the overall PSF model: ```bash shapepipe_run -c /path/to/shapepipe/example/cfis/config_MsPl_PSF.ini - ``` - - 4. Prepare output directory - - Create links to all 'final_cat' result files with - ```bash - prepare_tiles_for_final ``` - The corresponding output directory that is created is `output/run_sp_combined/make_catalog_runner/output`. - On success, it contains links to all `final_cat` output catalogues + The scale-dependent PSF diagnostics that propagate to the shear correlation + function (the ρ-statistics, {cite:p}`rowe:10`, {cite:p}`jarvis:16`) are no + longer computed here — they have moved to + [`sp_validation`](https://github.com/CosmoStat/sp_validation). - 5. Merge final output files - - Create a single main shape catalog: + 4. Merge final output files + + Create a single main shape catalogue: ```bash merge_final_cat -i -p -v ``` - Choose as input directory `input_dir` the output of step C. A default - parameter file `` is `/path/to/shapepipe/example/cfis/final_cat.param`. + Choose as input directory `input_dir` the `make_cat` output of the runs + being combined. A default parameter file `` is + `/path/to/shapepipe/example/cfis/final_cat.param`. On success, the file `./final_cat.npy` is created. Depending on the number of input tiles, this file can be several tens of Gb large. diff --git a/docs/source/random_cat.md b/docs/source/random_cat.md index 724ff6e32..a930d40ed 100644 --- a/docs/source/random_cat.md +++ b/docs/source/random_cat.md @@ -6,6 +6,15 @@ masks, and combined randoms and masks for a selection of tiles. The masked regions are obtained on input from ShapePipe pixel mask ("pipeline flag") files. +```{note} +Parts of this procedure use the legacy canfar-VM / `vos` retrieval workflow (see +[VOSpace retrieval](vos_retrieve.md)) and the obsolete `prepare_tiles_for_final` +helper, which is no longer shipped. The `random_cat` module itself is current; +the input-staging and joint-mask steps now overlap with +[`sp_validation`](https://github.com/CosmoStat/sp_validation). The steps are +retained for reference. +``` + ## Set up ### ID file and shell variables @@ -64,7 +73,7 @@ while read p; do tar xvf pipeline_flag_$p.tgz; done