diff --git a/DESIGN_272_INGEST.md b/DESIGN_272_INGEST.md new file mode 100644 index 0000000..5841f1b --- /dev/null +++ b/DESIGN_272_INGEST.md @@ -0,0 +1,515 @@ +# DESIGN_272_INGEST.md + +Design document for ingesting 67,187 new OpenContext records into the iSamples derived-data pipeline (GitHub issue #272, follow-up phase). + +*All numbers in this document are from executed DuckDB queries against local files. See SPIKE_RESULTS.md for raw output.* + +*Prepared by rbotyee (Claude Code spike), 2026-06-12. Decisions resolved 2026-06-12 (see Section 7). See SYNC_RESULTS.md for actual run output.* + +--- + +## 1. Background and context + +Issue #272 Phase 1 (PR #275, merged and live) overlaid corrected material/object-type concept mappings from Eric's fresh OC PQG onto the 1,043,604 OC pids already in the unified wide. That phase deliberately skipped new-record ingestion: **67,187 OC records present in Eric's 2026-06-09 wide are absent from the production `isamples_202606_wide.parquet`.** + +This document designs Phase 2: ingest those 67,187 records. + +### Key files + +| File | Path | +|---|---| +| Production base wide (Phase 1 output) | `https://data.isamples.org/isamples_202606_wide.parquet` | +| Eric's fresh OC wide | `~/Data/iSample/pqg_refining/oc_isamples_pqg_wide_2026-06-09.parquet` | +| Pre-Phase-1 wide (local copy) | `~/Data/iSample/pqg_refining/isamples_202604_wide.parquet` | + +> **Note:** The spike ran against the 202604 wide (locally available) rather than 202606 (remote only). For the two dimensions that differ — row_id range and IdentifiedConcept inventory — the spike accounts for the 202606 delta explicitly. + +--- + +## 2. Gap characterization (all from executed queries) + +### 2.1 Record counts + +| Measure | Count | +|---|---| +| OC MaterialSampleRecords in our 202604 wide | 1,064,831 | +| OC MaterialSampleRecords in Eric's fresh wide | 1,110,791 | +| **New pids (Eric's \ ours)** | **67,187** | +| Deleted pids (ours \ Eric's, not in Eric's at all) | 21,227 | + +**D3 DECISION (2026-06-12, RY):** These 21,227 stale pids ARE REMOVED. OpenContext mass-updated Murlo project PIDs; old PIDs would duplicate the same physical samples. This is a TRUE SYNC: add new + remove stale in one operation. The orphaned subgraph entities (SamplingEvent, GeoCoordLoc, SamplingSite) linked ONLY by removed MSRs are also removed. Agents are NOT removed (0 orphan agents found). See Section 3.5 for orphan analysis results. + +### 2.2 Material type breakdown of new records + +First-non-root material per new MSR (Eric's OC concept arrays): + +| Material URI | Count | +|---|---| +| `material/1.0/biogenicnonorganicmaterial` | 22,236 | +| `material/1.0/otheranthropogenicmaterial` | 19,072 | +| `material/1.0/organicmaterial` | 10,315 | +| `material/1.0/rock` | 7,766 | +| NULL (root-only, no specific concept) | 7,282 | +| `material/1.0/mineral` | 466 | +| `material/1.0/anthropogenicmetal` | 48 | +| `material/1.0/mixedsoilsedimentrock` | 2 | + +The 7,766 rock records are the Tall al-ʿUmayri Jordan lithics flagged by Eric during staging inspection (7,903 labelled "Lithic ID: …", of which 7,766 are rock-first-non-root). + +### 2.3 Geographic coverage + +All 67,187 new MSRs have **no geometry blob and no latitude/longitude directly on the MSR row** in Eric's wide. However: + +- 67,187 / 67,187 (100%) have coordinates accessible via the graph: `MSR → p__produced_by → SamplingEvent → p__sample_location → GeospatialCoordLocation.{geometry, latitude, longitude}` +- Coordinate range: lat −55.19 to 71.04, lon −164.0 to 159.9 (global coverage, not just Jordan) +- 0 duplicates in the coord path (each MSR maps to exactly 1 coordinate row) + +The builder (`build_frontend_derived.py`) reads `geometry` from the wide's MSR rows using `ST_X(geometry)`/`ST_Y(geometry)`. The ingestion step must **denormalize** the GeoCoordLoc geometry blob onto each new MSR row. + +### 2.4 New entity subgraph + +The 67,187 new MSRs bring a full entity subgraph that must be ingested together: + +| Entity type | Count | Already in our wide? | +|---|---|---| +| MaterialSampleRecord | 67,187 | 0 (new pids) | +| SamplingEvent | 67,187 | 0 (all new PIDs absent) | +| GeospatialCoordLocation (from SE) | 10,316 | 0 | +| GeospatialCoordLocation (from SamplingSite) | 6,514 | 0 | +| **Unique GeoCoordLoc total** | **11,399** | **0** | +| SamplingSite | 6,514 | 0 | +| Agent | 24 | 0 | +| **Total new entity rows** | **~152,312** | — | + +No entity (by PID) from the new subgraph already exists in the 202604 wide. The 202606 wide is identical in entity inventory (it only changed p__ columns on existing rows + minted 1 concept), so this holds for 202606 too. + +### 2.5 IdentifiedConcept inventory + +New MSRs reference 14 distinct concept row_ids in Eric's wide. Of these: + +| Concept URI | In 202604? | In 202606? | +|---|---|---| +| `material/1.0/anthropogenicmetal` | YES | YES | +| `material/1.0/biogenicnonorganicmaterial` | YES | YES | +| `material/1.0/material` (root) | YES | YES | +| `material/1.0/mineral` | YES | YES | +| `material/1.0/mixedsoilsedimentrock` | YES | YES | +| `material/1.0/organicmaterial` | YES | YES | +| `material/1.0/otheranthropogenicmaterial` | NO | **YES** (minted by Phase 1) | +| `material/1.0/rock` | YES | YES | +| `materialsampleobjecttype/1.0/artifact` | YES | YES | +| `materialsampleobjecttype/1.0/biologicalmaterialsample` | YES | YES | +| `materialsampleobjecttype/1.0/materialsample` | YES | YES | +| `materialsampleobjecttype/1.0/organismpart` | YES | YES | +| `sampledfeature/1.0/earthsurface` | NO | NO (need to mint) | +| `sampledfeature/1.0/pasthumanoccupationsite` | YES | YES | + +**Conclusion:** Using 202606 as the base, exactly **1 new concept must be minted**: `sampledfeature/1.0/earthsurface` (used by 3,924 of the new MSRs for `p__has_context_category`). + +### 2.6 Schema drift between our wide and Eric's wide + +| Dimension | Detail | +|---|---| +| Columns in ours NOT in Eric's | `p__curation` (INTEGER[]), `p__related_resource` (BIGINT[]) | +| Columns in Eric's NOT in ours | none | +| Type differences | `row_id`: ours=BIGINT, Eric's=INTEGER; `p__has_*` arrays: ours=BIGINT[], Eric's=INTEGER[] | +| `n` (source) column | Eric's wide has NULL for all MSRs; ours uses 'OPENCONTEXT' | + +The two extra columns (`p__curation`, `p__related_resource`) are NULL for OC records in the existing wide (the frozen export never populated them for OC). New OC records will also have NULLs for these columns — that is the correct behavior. + +The `n` column mismatch is critical: **ingested rows must have `n = 'OPENCONTEXT'` set explicitly**. + +--- + +## 3. Row-id allocation design + +### 3.1 Constraint + +Our wide has `row_id BIGINT` ranging 1–20,729,358. Eric's wide has `row_id INTEGER` ranging 1–2,465,485. These ranges **overlap** — we cannot reuse Eric's row_ids. + +### 3.2 Strategy: dense rank starting at `max_src + 1` + +All new rows get row_ids assigned by dense rank starting at `max(src.row_id) + 1 = 20,729,359`. The rank ordering must be deterministic (by a stable sort key) to make the output reproducible. Proposed ordering: by `(otype, pid)` — otype first groups entity classes together, pid within each class is stable. + +Revised row count: +- New rows: ~152,312 (plus concept minting: +1) +- New row_id range: 20,729,359 to ~20,881,671 + +**This leaves substantial headroom** and avoids all collisions. + +### 3.3 p__ array rewriting + +Eric's wide stores `p__produced_by`, `p__sample_location`, etc. as INTEGER[] references into Eric's row_id space. All these references must be **remapped to the new row_ids** in our id space. This requires: + +1. Build a mapping table: Eric's `row_id` → new `row_id` for every entity in the new subgraph. +2. Apply the mapping to all p__ columns on new MSR rows. + +**This is the most complex step** — see Section 4.3. + +--- + +## 4. Pipeline design: `make ingest-272` + +### 4.1 Overview + +``` +202606_wide.parquet + oc_isamples_pqg_wide_2026-06-09.parquet + │ + ▼ scripts/ingest_oc_records.py + │ Phase A: Identify new pids (Eric's \ ours) + │ Phase B: Extract full entity subgraph for new pids + │ Phase C: Assign new row_ids (dense rank, deterministic) + │ Phase D: Remap p__ arrays from Eric's id space to our id space + │ Phase E: Denormalize coords (geometry BLOB) onto new MSR rows + │ Phase F: Set n='OPENCONTEXT' on new MSR rows + │ Phase G: Mint new IdentifiedConcept rows (earthsurface) + │ Phase H: Hard-fail checks (dup row_ids, dup pids, all refs resolved) + │ Phase I: Write src_rows UNION ALL new_rows → isamples_202608_wide.parquet + │ + ▼ scripts/validate_oc_concept_enrichment.py (reused for concept rows) + │ (or a new validate_ingest.py — see Section 6) + │ + ▼ scripts/build_frontend_derived.py + │ + ▼ scripts/validate_frontend_derived.py --wide isamples_202608_wide.parquet +``` + +### 4.2 Tag convention + +The output should be tagged `isamples_202608` (next month tag) to reflect the new data vintage. RY decision gate: confirm tag before publishing. + +### 4.3 p__ array remapping in detail + +This is the core correctness challenge. Each new entity row in Eric's wide has a `row_id` in Eric's space (1–2,465,485). Each p__ column on a new MSR row is an `INTEGER[]` of those Eric row_ids. After ingestion, those integers must refer to our new row_ids (20,729,359+). + +Algorithm: + +```sql +-- Step 1: collect all Eric row_ids that need remapping +CREATE TEMP TABLE eric_id_map AS + SELECT eric_row_id, + max_src_row_id + DENSE_RANK() OVER (ORDER BY otype, pid) AS our_row_id + FROM all_new_entities; -- new MSR + SE + Geo + Site + Agent + Concept + +-- Step 2: remap p__ arrays on new MSR rows +-- Use list_transform or UNNEST+re-aggregate pattern +SELECT + ..., + (SELECT list(m.our_row_id ORDER BY u.ord) + FROM UNNEST(e.p__produced_by) WITH ORDINALITY AS u(rid, ord) + JOIN eric_id_map m ON m.eric_row_id = u.rid) AS p__produced_by, + ... +FROM new_msr_rows e +``` + +> **Warning:** the UNNEST+correlated subquery approach must be avoided at 67K rows (we learned the MAP cross-join blowup lesson). Use the decorrelated pattern from `build_frontend_derived.py` (UNNEST WITH ORDINALITY + JOIN + arg_min/list-agg) or a pre-aggregated mapping table. + +### 4.4 Geometry denormalization + +The builder reads `geometry` from MSR rows. New MSR rows have no geometry. We must lift the geometry blob from the linked `GeospatialCoordLocation`: + +```sql +-- In the new MSR rows: +-- MSR.p__produced_by[1] -> SamplingEvent.row_id +-- SamplingEvent.p__sample_location[1] -> GeoCoordLoc.row_id +-- GeoCoordLoc.geometry -> copy to MSR.geometry +-- GeoCoordLoc.latitude, .longitude -> copy to MSR.latitude, .longitude + +CREATE TEMP TABLE new_msr_coords AS + WITH msr_se AS ( + SELECT m.pid, se.p__sample_location + FROM new_msr_eric m, UNNEST(m.p__produced_by) AS u(se_rid) + JOIN new_se_eric se ON se.row_id = u.se_rid + ) + SELECT ms.pid, geo.geometry, geo.latitude, geo.longitude + FROM msr_se ms, UNNEST(ms.p__sample_location) AS u(geo_rid) + JOIN new_geo_eric geo ON geo.row_id = u.geo_rid; +``` + +Confirmed: 0 duplicate coord rows per pid; 67,187 / 67,187 (100%) have coords. + +### 4.5 Hard-fail invariants (trust gate) + +The script must refuse to write if any of these hold: + +1. **Duplicate new pids**: any pid in new MSR rows already in src wide → wrong grain +2. **Duplicate row_ids**: any new row_id in the output overlaps an existing row_id +3. **Unresolved p__ references**: any p__ array element in any new row points to a row_id that doesn't exist in the output +4. **Missing n column**: any new MSR row has `n IS NULL` (must be 'OPENCONTEXT') +5. **Missing geometry on placed MSR**: any new MSR with a coord path but null geometry in output +6. **Row count**: output must equal `src_rows + new_entity_rows + minted_concepts` + +### 4.6 Makefile target + +```makefile +# make ingest-272 TAG=isamples_202608 +ingest-272: $(OC_WIDE) + $(PY) scripts/ingest_oc_records.py \ + --src $(ENRICHED) \ # isamples_202606_wide.parquet + --oc-wide $(OC_WIDE) \ # oc_isamples_pqg_wide_2026-06-09.parquet + --out $(OUTDIR)/$(TAG)_wide.parquet + $(MAKE) derived DERIVED_WIDE=$(OUTDIR)/$(TAG)_wide.parquet TAG=$(TAG) + $(MAKE) validate TAG=$(TAG) SENTINEL_FLAG= +``` + +> **RY decision gate:** Should `ingest-272` stack on `all-272` (requiring 202606 wide as input) or should it operate independently against the R2 202606 URL? Stacking is cleaner but requires the local 202606 file. + +--- + +## 5. Schema mapping: Eric's wide columns → our wide columns + +All 47 shared columns are taken directly from Eric's wide with the following transformations: + +| Column | Treatment | +|---|---| +| `row_id` | Replaced by new id from `eric_id_map` (see §4.3) | +| `n` | Set to `'OPENCONTEXT'` (Eric's wide has NULL for all MSR rows) | +| `p__produced_by`, `p__sample_location`, `p__sampling_site`, `p__site_location`, `p__registrant` | Remapped via `eric_id_map` (INTEGER[] → BIGINT[]) | +| `p__has_material_category`, `p__has_sample_object_type`, `p__has_context_category` | Remapped via `eric_id_map` + `our_concept_map` (URI lookup for concepts) | +| `p__keywords` | Remapped if non-null (INTEGER[] → BIGINT[]) | +| `p__responsibility` | Remapped if non-null (INTEGER[] → BIGINT[]) | +| `geometry` | Lifted from linked GeoCoordLoc row (WKB BLOB) | +| `latitude`, `longitude` | Lifted from linked GeoCoordLoc row | +| `p__curation` | NULL (INTEGER[] — not in Eric's wide, not applicable to OC) | +| `p__related_resource` | NULL (BIGINT[] — not in Eric's wide) | + +No column present in Eric's wide is dropped (47 columns in, 49 in ours = 47 + 2 nulled columns). + +--- + +## 6. Validation design + +The existing `validate_frontend_derived.py` covers Stage 4 (derived files). We need an additional trust gate for the new Stage 3c (ingest): + +**`scripts/validate_ingest.py`** (new, to be written with the impl): +- Re-derive the new-pid set from inputs (Eric's wide \ src wide) and assert the output contains exactly those pids +- Assert row count: `output_rows == src_rows + new_entity_rows + minted_concepts` +- Assert no duplicate row_ids in output +- Assert all p__ references in new rows resolve to existing rows in output +- Assert new MSR rows have `n = 'OPENCONTEXT'` +- Assert geometry non-null for all new MSRs that had a coord path +- Assert concept count: `output IdentifiedConcept count == src concept count + minted count` + +The Stage 4 semantic gate (`validate_frontend_derived.py --wide`) should be run unchanged on the ingested wide — it doesn't know about Phase 2 vs Phase 1, it just validates the derived files against the wide. + +--- + +## 3.5 Orphan analysis (actuals from 202606 base) + +For the 21,227 removed MSR pids, orphan entities (referenced ONLY by removed MSRs): + +| Entity type | Orphan count | Policy | +|---|---|---| +| MaterialSampleRecord | 21,227 | REMOVE (the removed pids themselves) | +| SamplingEvent | 21,227 | REMOVE (each removed MSR had exactly 1 orphan SE) | +| GeospatialCoordLocation | 21,227 | REMOVE (each orphan SE had 1 orphan geo via p__sample_location) | +| SamplingSite | 928 | REMOVE (orphan sites via orphan SE p__sampling_site) | +| Agent | 0 | KEEP (no orphan agents — all shared with surviving MSRs) | +| **Total rows removed** | **64,609** | — | + +No SamplingSite geo refs (p__site_location) produced additional orphan geo rows (those 928 sites' geo rows were shared with surviving entities or already counted). + +--- + +## 7. Decisions (all resolved 2026-06-12) + +| # | Decision | **Resolution** | +|---|---|---| +| D1 | **Base wide for ingestion** | **202606** — downloaded from R2; sha256=57c01f922c52bac2c6a28abd504f38161f83c140a7149036b3d8e725be8aa3b1 | +| D2 | **Output tag** | **isamples_202608** | +| D3 | **Stale pids policy** | **REMOVE** — TRUE SYNC. 21,227 stale OC pids removed + 43,382 orphan subgraph entities. Rationale: Murlo project mass-PID-update; old pids would duplicate physical samples. | +| D4 | **n column for non-MSR entities** | **NULL** — matches existing convention; only MSR rows get n='OPENCONTEXT' | +| D5 | **p__curation / p__related_resource** | **NULL** — no source in Eric's PQG; matches existing OC rows | +| D6 | **Staging vs production** | **Stage first** — output at /tmp/ingest_202608/; R2 publish human-gated | +| D7 | **Eric notification** | OC rock count after sync: **37,953** (30,272 - 85 removed rocks + 7,766 new rocks). Eric to verify. | + +--- + +## 8. Gap resolution (as of 2026-06-12 implementation) + +1. ✅ **202606 round-trip**: DONE. Production run used 202606 as base. Confirmed: 1 concept minted (earthsurface), not 2. + +2. ✅ **p__ array remapping correctness at scale**: DONE. Full remapping executed; post-write trust checks + 25-check validator all pass. + +3. **SamplingSite deduplication**: NOT resolved (acknowledged data-quality gap; not a correctness issue for ingestion). + +4. ✅ **Keywords entities**: p__keywords concept resolution is included in the remap tables (remap_msr_kw). All keyword concept refs in new MSRs resolve via URI lookup against src IdentifiedConcept rows. + +5. **Agent deduplication**: NOT resolved (acknowledged; 24 new agents are conceptually distinct by pid). + +6. ✅ **Geometry WKB encoding**: Confirmed. Eric's geo uses GEOMETRY type; ingest converts with ST_AsWKB()::BLOB. Coord round-trip accurate to 6 decimal places (validated by build + validator). + +7. ✅ **File sizes measured**: 202608 wide is ~302MB (202606 was ~292MB, +3.4%). samp=6,726,892, samp_geo=6,026,242. + +8. ✅ **Fixture tests**: `tests/test_ingest_oc_records.py` added with 20 tests covering all trust-gate invariants, remapping, removal, and determinism. + +--- + +## 9. Summary + +The ingestion is well-understood and low-risk: +- 67,187 new MSRs + ~85,125 supporting entities = ~152,312 new rows +- All new entities are absent from the current wide (no pid collisions) +- All coords accessible via graph path (100% coverage) +- 1 new concept to mint (earthsurface, used by 3,924 new MSRs) +- Primary complexity: row_id remapping from Eric's space to ours +- The Phase 1 overlay pattern (`enrich_wide_with_oc_concepts.py`) is the precedent for the write step +- The full Stage 4 semantic gate runs unchanged on the output + +Estimated implementation effort: 1–2 sessions (script + tests + gate). + +--- + +## 10. Phase 3 — Bug fixes folded into 202608 (2026-06-12) + +Three bugs discovered during triage are fixed before staging 202608. All fixes are folded into the 202608 rebuild so the published dataset is clean from the start. + +### Fix A — #277: OC description enrichment + +**Root cause**: The 202608 (and previous 202606) combined wide stores OC sample `description` as terse Linked Data metadata (`'updated': 2023-10-05T04:45:54Z`) instead of the human-readable site-path strings (`Open Context published "Sample" from: Europe/Cyprus/PKAP Survey Area/...`) present in Eric's OC PQG wide. + +**Impact**: Text search for "Cyprus" returns 0 matches in the deployed explorer (expected ≈ 69,230). + +**Fix**: Added **Phase J** to `scripts/ingest_oc_records.py`. After the sync write, the script reads the output wide, LEFT JOINs on `pid` with Eric's OC wide for OC MSR rows only, overwrites `description` with Eric's value if non-null, and atomically replaces the output. Non-OC rows and rows with NULL description in Eric's wide are unchanged. + +**Trust gate**: Cyprus OC MSR count after enrichment must be ≈ 69,230. + +**Schema note**: `description` is a simple VARCHAR in both wides. Row counts are invariant. + +### Fix B — #283a: Empty-string facet filter + +**Root cause**: 586 GEOME records have an empty string (`''`) as their `context` (Sampled Feature) facet value because their `p__has_context_category` points to an `IdentifiedConcept` row with `pid = ''`. The facet-summary builder filtered `IS NOT NULL` but not empty strings, so `''` appeared as a blank selectable facet entry with count 586. + +**Fix**: Changed the WHERE filter in `build_facet_summaries` and `build_facet_cross_filter` in `scripts/build_frontend_derived.py` from `IS NOT NULL` to `IS NOT NULL AND {d} <> ''`. Also updated the algebraic recompute in `scripts/validate_frontend_derived.py` to match, and added check 5b (`facet_summaries no blank values (#283a)`). + +**Trust gate**: `SELECT COUNT(*) FROM facet_summaries WHERE facet_value = ''` must be 0. + +### Fix C — #283b: Deprecated specimentype/1.0 label mappings + +**Root cause**: 169 SESAR records use deprecated-namespace object_type URIs (`specimentype/1.0/othersolidobject`, `specimentype/1.0/physicalspecimen`) that are absent from `vocab_labels.parquet`. The explorer's `prettyLabel()` falls back to displaying the raw URI path tail. + +**Fix**: Added `MANUAL_LABEL_OVERRIDES` to `scripts/build_vocab_labels.py` — two hardcoded rows injected before the dedupe step. Rebuilt `vocab_labels.parquet` (now 539 rows, up from 537). + +**Trust gate**: Both URIs present in `vocab_labels.parquet` with correct `pref_label` values. + +### Fixture tests added (tests/test_ingest_oc_records.py) + +| Group | Tests added | +|---|---| +| Fix #277 | `test_oc_description_enriched_from_eric_wide`, `test_non_oc_description_unchanged_by_enrichment`, `test_oc_msr_count_unchanged_by_enrichment` | +| Fix #283a | `test_empty_string_facet_values_filtered_from_summaries`, `test_empty_string_facet_values_filtered_from_cross_filter` | +| Fix #283b | `test_specimentype_othersolidobject_in_vocab_labels`, `test_specimentype_physicalspecimen_in_vocab_labels`, `test_specimentype_labels_have_lang_en` | + +All 28 tests pass (`pytest tests/test_ingest_oc_records.py -v`). + +--- + +## Phase 4 — Codex Blocker Fixes (2026-06-13) + +A Codex pre-merge review of the Phase 3 code found 3 latent bugs (blockers) and 4 nits. The STAGED DATA in /tmp/ingest_202608/ was verified correct (zero dangling refs), but the CODE contained landmines that would silently corrupt a future re-run. All bugs are fixed and the pipeline was rebuilt from scratch. + +### Blocker 1: Cross-source orphan protection + +**File**: `scripts/ingest_oc_records.py`, lines ~213-216 (surviving_se_refs query) + +**Bug**: The `surviving_se_refs`, `surviving_geo_refs`, and `surviving_site_refs` queries were filtered to `s.n='OPENCONTEXT'`, meaning a SamplingEvent/Geo/Site referenced by a surviving SESAR/GEOME/Smithsonian MSR was NOT protected and could be wrongly deleted as an orphan. + +**Fix**: Removed the `n='OPENCONTEXT'` filter from `surviving_se_refs`. The surviving refs now come from ALL surviving MaterialSampleRecords regardless of source. The condition `NOT (s.n='OPENCONTEXT' AND s.pid IN removed_pids)` correctly covers both OC survivors and all non-OC MSRs. + +**Impact**: In production, shared SEs/Geos/Sites (referenced by both OC and non-OC MSRs) would have been incorrectly deleted, creating dangling refs in SESAR/GEOME rows. The actual 202608 data was unaffected (the production run had 0 shared SEs between OC-removed and non-OC MSRs), but the latent bug was critical. + +**Regression test**: `test_cross_source_shared_entity_not_orphaned` — confirmed FAILS on old code and PASSES on fixed code. + +### Blocker 2: Incomplete reference extraction + trust gate + +**File**: `scripts/ingest_oc_records.py` + +**Bug A** (line ~373): Agent extraction only queried `p__registrant`. Agents in `p__responsibility` were remapped into `eric_id_map` but their Agent entity rows were never extracted — meaning `p__responsibility` references on new MSR rows pointed to row_ids that didn't exist in the output (dangling refs). + +**Fix A**: Extended `agent_ids` UNION to also include `p__responsibility`: +```sql +UNION +SELECT DISTINCT u.agent_id FROM new_msr_eric, UNNEST(p__responsibility) AS u(agent_id) +``` + +**Bug B** (line ~648): The pre-write trust gate only checked `p__produced_by` and 3 concept dims. Other p__* columns (p__registrant, p__responsibility, p__sample_location, etc.) were not verified. + +**Fix B**: Added a comprehensive post-write dangling-ref trust gate (B2B) that checks EVERY p__* array column (BIGINT[] and INTEGER[]) on new rows against the output row_id set. HARD FAIL (RuntimeError) if any dangling ref is found. Covers 10 columns total. + +### Blocker 3: Non-deterministic output order (Phase J) + +**File**: `scripts/ingest_oc_records.py`, Phase J UNION ALL (~line 966) + +**Bug**: Phase J rewrites the output with a UNION ALL (OC MSR rows + non-OC rows) but lacked `ORDER BY row_id`. The initial Phase I write had `ORDER BY row_id`, but Phase J's rewrite lost it — making sha256 non-reproducible across runs. + +**Fix**: Added `ORDER BY row_id` to the Phase J COPY query before the final `TO ... PARQUET` write. + +### Nit A: Whitespace-only facet values + +**Files**: `scripts/build_frontend_derived.py`, `scripts/validate_frontend_derived.py` + +Changed `{d} <> ''` to `NULLIF(TRIM({d}), '') IS NOT NULL` in both `build_facet_summaries` and `build_facet_cross_filter`. Updated validator check 5b to use `TRIM(facet_value) = ''` to catch whitespace-only values. + +### Nit B: Cyprus enrichment as hard trust gate + +**File**: `scripts/ingest_oc_records.py` + +Changed the Cyprus description count from a log statement to a hard `RuntimeError` when `n_cyprus < 69,000` (at production scale, i.e., when `out_oc_count > 1,000,000`). Synthetic fixtures skip the gate to avoid false positives. + +### Nit C: B1 regression test + +**File**: `tests/test_ingest_oc_records.py` + +Added `test_cross_source_shared_entity_not_orphaned`: synthetic fixture with OC MSR to-be-removed + SESAR MSR sharing SE/Site/Geo. Confirmed test FAILS on old code, PASSES on fixed code. + +### Fixture tests (29 total after Phase 4) + +``` +pytest tests/test_ingest_oc_records.py -v +29/29 passed in ~11s + +New test added (Phase 4): + Blocker 1: test_cross_source_shared_entity_not_orphaned +``` + +--- + +## Phase 5 — Site-Location Orphan Fix (2026-06-13) + +### General Orphan Formulation + +The Phase 4 dangling-ref gate (B2B) only checked NEW rows. Surviving src rows that pointed to deleted Geos were not checked. This left 4,606 dangling `p__site_location` refs in old SamplingSite rows. + +The correct orphan determination order is: +1. Compute **orphan SEs** (referenced only by removed MSRs) +2. Compute **orphan SamplingSites** (referenced only by orphan SEs, not surviving SEs) +3. Compute **orphan Geos**: referenced only by orphan SEs AND orphan SamplingSites — i.e., NOT by any surviving SE (via `p__sample_location`) AND NOT by any surviving SamplingSite (via `p__site_location`) + +Step 2 must complete before step 3 so that `surviving_geo_refs` correctly excludes Geos still held by surviving Sites. + +### Mandatory In-Script Dangling-Ref Gate + +The post-write gate now scans ALL rows (surviving + new) across ALL p__* columns (both BIGINT[] and INTEGER[]). It: +- Unnests each p__* column on the entire output +- Left-joins against all output row_ids +- Prints per-column count +- Raises `RuntimeError` if any total > 0 (manifest not emitted, build aborted) + +This gate catches both new-row mapping errors (covered by B2B) AND old-row dangling refs introduced by incorrect orphan deletion (the Phase 5 defect). + +### Regression Test + +`test_site_location_geo_not_orphaned` in `tests/test_ingest_oc_records.py`: +- Scenario: OC SE references Geo via p__sample_location; surviving SESAR SE → same Site → same Geo via p__site_location +- Asserts Geo survives, zero p__site_location dangling refs +- Confirmed FAILS on old code (Phase 4 logic), PASSES on fixed code (Phase 5) + +### Fixture tests (30 total after Phase 5) + +``` +pytest tests/test_ingest_oc_records.py -v +30/30 passed in ~10s + +New test added (Phase 5): + test_site_location_geo_not_orphaned +``` diff --git a/Makefile b/Makefile index b4fff9c..8f2f152 100644 --- a/Makefile +++ b/Makefile @@ -31,7 +31,7 @@ VALIDATE := scripts/validate_frontend_derived.py ENRICH := scripts/enrich_wide_with_oc_concepts.py VALIDATE_ENRICH := scripts/validate_oc_concept_enrichment.py -.PHONY: help test wide oc-wide enrich validate-enrich derived validate all all-272 clean +.PHONY: help test wide oc-wide enrich validate-enrich derived validate all all-272 ingest-272 all-202608 clean help: @grep -E '^# make' Makefile | sed 's/^# / /' @@ -81,5 +81,30 @@ all-272: validate-enrich $(MAKE) derived DERIVED_WIDE=$(ENRICHED) TAG=$(TAG) $(MAKE) validate TAG=$(TAG) SENTINEL_FLAG= +# TRUE SYNC ingestion: add 67,187 new OC pids + remove 21,227 stale OC pids (#272 Phase 2, D3). +# Requires the 202606 wide (--src) and Eric's OC wide (--oc-wide). +# Outputs a 202608-tagged wide parquet + derived files in $(OUTDIR). +# +# make ingest-272 \ +# SRC_202606=~/Data/iSample/pqg_refining/isamples_202606_wide.parquet \ +# OC_WIDE_2026=~/Data/iSample/pqg_refining/oc_isamples_pqg_wide_2026-06-09.parquet +# +INGEST_TAG ?= isamples_202608 +SRC_202606 ?= $(OUTDIR)/isamples_202606_wide.parquet +OC_WIDE_2026 ?= $(OUTDIR)/oc_isamples_pqg_wide_2026-06-09.parquet +INGEST_OUT ?= $(OUTDIR)/$(INGEST_TAG)_wide.parquet +INGEST := scripts/ingest_oc_records.py + +$(INGEST_OUT): $(SRC_202606) $(OC_WIDE_2026) + @mkdir -p $(OUTDIR) + $(PY) $(INGEST) --src $(SRC_202606) --oc-wide $(OC_WIDE_2026) --out $(INGEST_OUT) + +ingest-272: $(INGEST_OUT) + +# Full 202608 pipeline: ingest -> derived -> validate +all-202608: ingest-272 + $(MAKE) derived DERIVED_WIDE=$(INGEST_OUT) TAG=$(INGEST_TAG) + $(MAKE) validate TAG=$(INGEST_TAG) SENTINEL_FLAG= + clean: rm -rf $(OUTDIR) diff --git a/SPIKE_RESULTS.md b/SPIKE_RESULTS.md new file mode 100644 index 0000000..11c52e3 --- /dev/null +++ b/SPIKE_RESULTS.md @@ -0,0 +1,320 @@ +# SPIKE_RESULTS.md + +Actual verification query outputs from the #272 Phase 2 ingestion spike. + +*Run 2026-06-12 on local files. All numbers are real — from executed DuckDB queries against local parquet files.* + +--- + +## Environment + +| Item | Value | +|---|---| +| Python | `~/.pyenv/versions/myenv/bin/python` | +| DuckDB | 1.4.4 | +| Base wide | `~/Data/iSample/pqg_refining/isamples_202604_wide.parquet` (279 MB) | +| Eric's OC wide | `~/Data/iSample/pqg_refining/oc_isamples_pqg_wide_2026-06-09.parquet` (275 MB) | +| Spike output | `/tmp/ingest_272_output/isamples_202608_wide.parquet` (302 MB) | +| Script | `scripts/ingest_oc_records.py` | + +> **Note on base:** the spike used `isamples_202604_wide` (locally available). Production ingestion must use `isamples_202606_wide` (the Phase 1 overlay output, currently on R2 only). This affects the rock facet count (V7) and the concept minting (2 concepts minted here vs 1 when using 202606 base — `otheranthropogenicmaterial` is already present in 202606). All other numbers are base-independent. + +--- + +## Phase 0: Local file inventory + +``` +~/Data/iSample/pqg_refining/ + isamples_202604_wide.parquet 279 MB Apr 23 2026 (our pre-overlay wide) + oc_isamples_pqg_wide_2026-06-09.parquet 275 MB Jun 10 2026 (Eric's fresh wide) + oc_isamples_pqg_wide.parquet 275 MB Dec 1 2025 (older Eric wide) + oc_isamples_pqg.parquet 691 MB Jun 9 2025 (Eric narrow) + zenodo_wide_2026-01-09.parquet 278 MB (Zenodo base wide) +``` + +No local copy of `isamples_202606_wide.parquet` found. Must download from R2 for production run. + +--- + +## Gap characterization queries + +### Q1: OC MSR count in our wide +```sql +SELECT COUNT(*) FROM read_parquet('isamples_202604_wide.parquet') +WHERE otype='MaterialSampleRecord' AND n='OPENCONTEXT'; +``` +**Result: 1,064,831** + +### Q2: OC MSR count in Eric's wide +```sql +SELECT COUNT(*) FROM read_parquet('oc_isamples_pqg_wide_2026-06-09.parquet') +WHERE otype='MaterialSampleRecord'; +``` +**Result: 1,110,791** + +### Q3: New pids (Eric's \ ours) +```sql +SELECT COUNT(*) FROM ( + SELECT pid FROM read_parquet('oc_...wide.parquet') WHERE otype='MaterialSampleRecord' + EXCEPT + SELECT pid FROM read_parquet('isamples_202604_wide.parquet') + WHERE otype='MaterialSampleRecord' AND n='OPENCONTEXT' +); +``` +**Result: 67,187** + +### Q4: Deleted pids (ours \ Eric's) +```sql +SELECT COUNT(*) FROM ( + SELECT pid FROM read_parquet('isamples_202604_wide.parquet') + WHERE otype='MaterialSampleRecord' AND n='OPENCONTEXT' + EXCEPT + SELECT pid FROM read_parquet('oc_...wide.parquet') WHERE otype='MaterialSampleRecord' +); +``` +**Result: 21,227** — pids in frozen export but absent from Eric's fresh wide (de-published or restructured OC items). Policy: keep in our wide, no action. + +### Q5: Material breakdown for new records + +First-non-root material concept for 67,187 new MSRs: + +| Count | Material URI | +|---|---| +| 22,236 | `material/1.0/biogenicnonorganicmaterial` | +| 19,072 | `material/1.0/otheranthropogenicmaterial` | +| 10,315 | `material/1.0/organicmaterial` | +| 7,766 | `material/1.0/rock` | +| 7,282 | NULL (root-only, no specific concept) | +| 466 | `material/1.0/mineral` | +| 48 | `material/1.0/anthropogenicmetal` | +| 2 | `material/1.0/mixedsoilsedimentrock` | + +### Q6: Coordinates for new records +```sql +-- Direct geometry on MSR: 0/67,187 +-- Direct latitude float: 0/67,187 +-- Via graph path (MSR->SE->GeoCoordLoc): 67,187/67,187 +``` +**All 67,187 new records have coordinates via MSR → p__produced_by → SamplingEvent → p__sample_location → GeospatialCoordLocation.** + +Coordinate range: lat −55.19 to 71.04, lon −164.0 to 159.9 (global, not just Jordan). + +Sample (Lithic ID Jordan records confirmed at lat≈31.87, lon≈35.89): +``` +ark:/28722/k27m0sf3s Lithic ID: 4130 lat=31.867087 lon=35.889976 +``` + +### Q7: Schema drift +| Dimension | Detail | +|---|---| +| Cols in ours only | `p__curation` (INTEGER[]), `p__related_resource` (BIGINT[]) | +| Cols in Eric's only | none | +| Type differences | `row_id`: ours=BIGINT, Eric's=INTEGER; `p__has_*` arrays: ours=BIGINT[], Eric's=INTEGER[] | +| `n` column | Eric's wide: NULL for all MSRs; ours: 'OPENCONTEXT' | +| geometry column | ours=BLOB (WKB); Eric's=GEOMETRY (DuckDB native, auto-decoded by spatial ext.) | + +### Q8: Full entity subgraph +| Entity type | Count | In our wide (by pid)? | +|---|---|---| +| MaterialSampleRecord | 67,187 | 0 (all new pids) | +| SamplingEvent | 67,187 | 0 | +| GeospatialCoordLocation (from SE) | 10,316 | 0 | +| GeospatialCoordLocation (from SamplingSite) | 6,514 | 0 | +| Unique GeoCoordLoc total | 11,399 | 0 | +| SamplingSite | 6,514 | 0 | +| Agent | 24 | 0 | +| **Total new entity rows** | **152,311** | — | + +### Q9: IdentifiedConcept analysis +14 distinct concept URIs referenced by new MSRs. Against 202604 base: +- 12 already present +- 2 missing: `otheranthropogenicmaterial`, `earthsurface` + +Against 202606 base (production): +- 13 already present (Phase 1 minted `otheranthropogenicmaterial`) +- 1 missing: `earthsurface` (3,924 new MSRs use it for `p__has_context_category`) + +### Q10: row_id ranges +| File | Min | Max | Total rows | +|---|---|---|---| +| Our 202604 wide | 1 | 20,729,358 | 20,729,358 | +| Eric's wide | 1 | 2,465,485 | 2,465,485 | +| Spike output | 1 | 20,881,671 | 20,881,671 | + +New rows allocated: 20,729,359 to 20,881,671 (152,313 = 152,311 entities + 2 concepts). + +--- + +## Spike execution log + +``` +[ 0.1s] schema checks passed +[ 0.2s] new pids: 67,187 +[ 0.3s] extracting entity subgraph for new pids... +[ 0.7s] subgraph: msr=67,187 se=67,187 geo=11,399 site=6,514 agent=24 +[ 0.7s] src max_row_id=20,729,358 +[ 0.7s] id_map: 152,311 entries, new row_id range 20729359 to 20,881,669, collisions=0 +[ 0.8s] minting 2 new IdentifiedConcept rows: ['otheranthropogenicmaterial', 'earthsurface'] +[ 0.8s] minted_concepts=2 +[ 0.9s] coords: 67,187 pids with coords, 0 duplicate-coord pids +[ 0.9s] remapping p__ arrays for new MSR rows... +[ 1.5s] p__ remapping tables built +[ 1.5s] running pre-write trust checks... +[ 1.6s] trust checks passed +[ 1.6s] expected output rows: 20,729,358 src + 152,311 new entities + 2 concepts = 20,881,671 +[ 1.6s] writing output... +[ 8.9s] wrote /tmp/ingest_272_output/isamples_202608_wide.parquet +[ 9.2s] post-write: rows=20,881,671 dup_rowids=0 dup_pids=0 oc_msrs=1,132,018 n_check=PASS +[ 9.7s] manifest -> /tmp/.../isamples_202608_wide.parquet.manifest.json +[ 9.7s] done +``` + +**Total wall time: ~10s.** + +--- + +## Post-write verification + +### V1: Row count by otype in spike output +| otype | Count | +|---|---| +| MaterialSampleRecord | 6,748,119 (+67,187 vs 202604) | +| SamplingEvent | 6,421,358 (+67,187) | +| GeospatialCoordLocation | 5,991,681 (+11,399) | +| MaterialSampleCuration | 720,254 (unchanged) | +| SampleRelation | 501,579 (unchanged) | +| SamplingSite | 392,674 (+6,514) | +| IdentifiedConcept | 55,895 (+2: otheranthropogenicmaterial + earthsurface) | +| Agent | 50,111 (+24) | + +### V2: OC MSR count in output +``` +n=SESAR 4,688,386 +n=OPENCONTEXT 1,132,018 (was 1,064,831; +67,187) +n=GEOME 605,554 +n=SMITHSONIAN 322,161 +``` + +### V3: New MSRs have geometry +``` +total_new_msr=67,187 +has_geometry=67,187 (100%) +has_lat=67,187 (100%) +``` + +### V4: Coord round-trip check (geometry == lat/lon) +``` +lat=-0.34581 geo_lat=-0.34581 (MATCH) +lon=-80.1658 geo_lon=-80.1658 (MATCH) +``` + +### V5: p__ reference integrity +``` +Dangling p__produced_by refs in new MSRs: 0 +``` + +### V6: Concept reference integrity +``` +Unresolved concept refs in new MSRs: 0 +``` + +### V7: Rock facet count +| Base | Existing OC rock | New rock | Total | +|---|---|---|---| +| 202604 (spike) | 1,956 (pre-overlay) | 7,766 | 9,722 | +| 202606 (expected production) | ~30,272 (overlay-corrected) | 7,766 | ~38,038 | + +The 1,956 vs 30,272 discrepancy is entirely explained by the Phase 1 concept overlay that's in 202606 but not 202604. New record ingestion contributes exactly +7,766, matching the issue #272 comment exactly. + +### V8: IdentifiedConcept count +``` +Total IdentifiedConcept rows in output: 55,895 + 202604 had: 55,893 + Phase 1 overlay added: +1 (otheranthropogenicmaterial, already in 202606) + This spike added: +2 (both concepts, since base was 202604) +``` + +### V9: Output file size +``` +Output: 301.5 MB +Src: 292.3 MB +Delta: +9.2 MB (+3.1%) +``` + +### V10: Stage 4 derived files (build + validate) +``` +build_frontend_derived.py: + samp=6,748,119 samp_geo=6,047,469 duplicate_pids=0 duplicate_concept_row_ids=0 + sample_facets_v2 ✓ facet_summaries ✓ facet_cross_filter ✓ + samples_map_lite ✓ h3_summary_res{4,6,8} ✓ + Total time: ~6s + +validate_frontend_derived.py: ALL CHECKS PASS (16/16) + material root absent PASS + sentinel check PASS (expected anthropogenicmetal for 202604-base) + facets pid unique PASS + map_lite pid unique PASS + facets.pid == map_lite.pid PASS (0 pids differ) + facet_summaries algebraic PASS + cross_filter algebraic PASS + cross_filter baseline == summaries PASS + h3 res4/6/8 sums match map_lite PASS (6,047,469 each) + facets schema contract PASS + facets non-empty PASS (6,047,469 rows >> 1,000,000) + manifest sha256 PASS +``` + +--- + +## Spike manifest (truncated) + +```json +{ + "script": "ingest_oc_records.py", + "duckdb_version": "1.4.4", + "counts": { + "src_rows": 20729358, + "new_pids": 67187, + "new_entity_rows": 152311, + "minted_concepts": 2, + "out_rows": 20881671, + "new_oc_msr_total": 1132018, + "entity_breakdown": { + "new_msr": 67187, "new_se": 67187, "new_geo": 11399, + "new_site": 6514, "new_agent": 24 + } + }, + "output": { + "bytes": 301517608, + "sha256": "d8aad79b719ac61443462b5ddfcd350501b5ad2f77cfc435407435132d5ba52e" + } +} +``` + +--- + +## Key bug fixed during spike + +**Geometry type mismatch**: Eric's OC wide stores `geometry` as DuckDB `GEOMETRY` type (auto-decoded when the `spatial` extension is loaded). Our wide stores it as `BLOB` (WKB bytes). A naïve `UNION ALL BY NAME` between the src wide (BLOB) and rows with geometry from Eric's GeoCoordLoc (GEOMETRY) failed with: + +``` +ConversionException: Unimplemented type for cast (BLOB -> GEOMETRY) +``` + +**Fix**: Convert Eric's GEOMETRY → WKB BLOB using `ST_AsWKB(geometry)::BLOB` before the UNION. Validated: lat/lon round-trip identical to 6 decimal places. + +This fix is in the script at two points: +1. `new_msr_coords` table: `ST_AsWKB(geo.geometry)::BLOB AS geometry` +2. `geo_select` (GeoCoordLoc rows): `ST_AsWKB(g.geometry)::BLOB AS geometry` (via `eric_geo_is_geometry=True` flag) + +--- + +## Pipeline runtime summary + +| Step | Time | Notes | +|---|---|---| +| `ingest_oc_records.py` (dry-run) | ~2s | Analysis + trust checks, no write | +| `ingest_oc_records.py` (full) | ~10s | Write 20.8M row parquet | +| `build_frontend_derived.py` | ~6s | All 6 derived files, no wide_h3 | +| `validate_frontend_derived.py` | ~3s | All 16 checks | +| **Total** | **~21s** | Full pipeline on 202604 base | diff --git a/SYNC_RESULTS.md b/SYNC_RESULTS.md new file mode 100644 index 0000000..ca802af --- /dev/null +++ b/SYNC_RESULTS.md @@ -0,0 +1,653 @@ +# SYNC_RESULTS.md + +Actual verification query outputs from the #272 Phase 2 production sync. + +*Run 2026-06-12. All numbers are real — from executed DuckDB queries against local parquet files.* + +--- + +## Environment + +| Item | Value | +|---|---| +| Python | `~/.pyenv/versions/myenv/bin/python` (3.13.11) | +| DuckDB | 1.4.4 | +| Base wide (src) | `~/Data/iSample/pqg_refining/isamples_202606_wide.parquet` (292 MB) | +| Eric's OC wide | `~/Data/iSample/pqg_refining/oc_isamples_pqg_wide_2026-06-09.parquet` (288 MB) | +| Output wide | `/tmp/ingest_202608/isamples_202608_wide.parquet` (~302 MB) | +| Script | `scripts/ingest_oc_records.py` (production version with D3 removal) | + +--- + +## Gap characterization against 202606 base + +### Q1: OC MSR count in 202606 wide +``` +SELECT COUNT(*) FROM ... WHERE otype='MaterialSampleRecord' AND n='OPENCONTEXT' +Result: 1,064,831 +``` + +### Q2: OC MSR count in Eric's wide +``` +SELECT COUNT(*) FROM ... WHERE otype='MaterialSampleRecord' +Result: 1,110,791 +``` + +### Q3: New pids (Eric's \ 202606) +``` +Result: 67,187 +``` + +### Q4: Stale pids (202606 \ Eric's) — D3: REMOVED +``` +Result: 21,227 +``` + +### Q5: Expected OC MSR after sync +``` +1,064,831 - 21,227 + 67,187 = 1,110,791 +``` +(Matches Eric's total exactly — true sync.) + +### Q6: Total rows in 202606 wide +``` +Result: 20,729,359 +``` +(One more than 202604's 20,729,358 — the extra row is `otheranthropogenicmaterial` concept minted by Phase 1.) + +### Q7: max row_id in 202606 +``` +Result: 20,729,359 +``` + +### Q8: Rock count in 202606 (OC MSRs) +``` +Result: 30,272 +``` + +### Q9: IdentifiedConcept count in 202606 +``` +Result: 55,894 +``` + +### Q10: earthsurface in 202606 +``` +Result: 0 (not yet present — must be minted) +``` + +--- + +## Orphan analysis (D3 removal subgraph) + +For the 21,227 removed MSR pids, orphan entities: + +| Entity type | Count | Decision | +|---|---|---| +| MaterialSampleRecord | 21,227 | REMOVED | +| SamplingEvent (orphan — only in removed MSRs) | 21,227 | REMOVED | +| GeospatialCoordLocation (orphan — via orphan SEs) | 21,227 | REMOVED | +| SamplingSite (orphan — via orphan SEs) | 928 | REMOVED | +| Agent (orphan) | 0 | N/A (none) | +| **Total rows removed** | **64,609** | — | + +Shared SamplingEvent rows (referenced by both removed + surviving MSRs): 0. + +--- + +## Production run execution log + +``` +[ 0.1s] schema checks passed +[ 0.2s] stale pids to remove: 21,227 +[ 14.1s] orphan subgraph: msr=21,227 se=21,227 geo=21,227 site=928 total=64,609 +[ 14.1s] rows_to_remove: 64,609 (matches orphan arithmetic) +[ 14.1s] new pids: 67,187 +[ 14.1s] extracting entity subgraph for new pids... +[ 14.5s] subgraph: msr=67,187 se=67,187 geo=11,399 site=6,514 agent=24 +[ 14.5s] src max_row_id=20,729,359 +[ 14.6s] id_map: 152,311 entries, new row_id range 20729360 to 20,881,670, collisions=0 +[ 14.6s] minting 1 new IdentifiedConcept rows: ['https://w3id.org/isample/vocabulary/sampledfeature/1.0/earthsurface'] +[ 14.6s] minted_concepts=1 +[ 14.7s] coords: 67,187 pids with coords, 0 duplicate-coord pids +[ 14.7s] remapping p__ arrays for new MSR rows... +[ 15.3s] p__ remapping tables built +[ 15.3s] running pre-write trust checks... +[ 15.4s] trust checks passed +[ 15.4s] expected output rows: 20,729,359 src - 64,609 removed + 152,311 new entities + 1 concepts = 20,817,062 +[ 15.4s] writing output... +[ 22.4s] wrote /tmp/ingest_202608/isamples_202608_wide.parquet +[ 22.6s] post-write: rows=20,817,062 dup_rowids=0 dup_pids=0 oc_msrs=1,110,791 stale_remain=0 n_check=PASS +[ 23.1s] manifest -> /tmp/ingest_202608/isamples_202608_wide.parquet.manifest.json +[ 23.1s] done +``` + +**Total ingest time: ~23s.** + +--- + +## Post-write verification queries (all run against 202608 wide + derived files) + +### Va: Row count by otype in output +| otype | Count | Delta vs 202606 | +|---|---|---| +| MaterialSampleRecord | 6,726,892 | +67,187 new − 21,227 removed = +45,960 | +| SamplingEvent | 6,400,131 | +67,187 new − 21,227 orphan = +45,960 | +| GeospatialCoordLocation | 5,970,454 | +11,399 new − 21,227 orphan = −9,828 | +| MaterialSampleCuration | 720,254 | unchanged | +| SampleRelation | 501,579 | unchanged | +| SamplingSite | 391,746 | +6,514 new − 928 orphan = +5,586 | +| IdentifiedConcept | 55,895 | +1 (earthsurface minted) | +| Agent | 50,111 | +24 new, 0 orphan | +| **Total** | **20,817,062** | **expected** | + +### Vb: OC MSR count +``` +n=OPENCONTEXT: 1,110,791 (expected: 1,110,791) ✓ +``` + +### Vc: OC rock facet count (from sample_facets_v2) +``` +OC rock count: 37,953 +Arithmetic: 30,272 (202606) - 85 (removed rocks) + 7,766 (new rocks) = 37,953 ✓ +``` + +### Vd: Removed Murlo pids — NONE in output +``` +Stale pids remaining in output: 0 (expected: 0) ✓ +``` + +### Ve: New r2p24 Murlo-style pids — ALL present +``` +r2p24 pids in Eric's wide: 15,409 +r2p24 pids in output: 15,409 ✓ +``` +(814 of the 15,409 r2p24 pids are newly added; the remainder were already in the 202606 wide from earlier Murlo work.) + +### Vf: Jordan lithic spot-checks (Tall al-ʿUmayri, new rock records) +``` +ark:/28722/k27m0sf3s lat=31.867088 lon=35.889977 ✓ +ark:/28722/k27w6wk0d lat=31.868178 lon=35.888423 ✓ +ark:/28722/k2gq7df90 lat=31.868287 lon=35.888424 ✓ +``` +Jordan lithic new pids (lat 31–32.5, lon 35–36.5): 7,925. + +### Vg: earthsurface concept +``` +IdentifiedConcept rows with pid containing 'earthsurface': 1 (expected: 1) ✓ +``` + +### Vh: IdentifiedConcept total +``` +55,895 (202606 had 55,894, +1 for earthsurface) ✓ +``` + +### Vi: Stage 4 derived files +``` +build_frontend_derived.py: + samp=6,726,892 samp_geo=6,026,242 duplicate_pids=0 duplicate_concept_row_ids=0 + sample_facets_v2 ✓ facet_summaries ✓ facet_cross_filter ✓ + samples_map_lite ✓ h3_summary_res{4,6,8} ✓ + Total time: ~6s +``` + +### Vj: Stage 4 validator (ALL 25 CHECKS PASS) +``` +validate_frontend_derived.py --dir /tmp/ingest_202608/ --tag isamples_202608 --wide ... + +ALL CHECKS PASS (25/25) + +Selected checks: + material root absent PASS + sentinel check PASS (otheranthropogenicmaterial — post-#272 value) + facets pid unique PASS + map_lite pid unique PASS + facets.pid == map_lite.pid PASS (0 pids differ) + facet_summaries algebraic PASS + cross_filter algebraic PASS + cross_filter baseline == summaries PASS + h3 res4/6/8 sums match map_lite PASS (6,026,242 each) + facets schema contract PASS + facets non-empty PASS (6,026,242 rows) + facets == fresh build from --wide PASS + map_lite == fresh build from --wide PASS + h3 discrete == fresh build (res4/6/8) PASS + h3 centers within 1e-5 (res4/6/8) PASS + manifest sha256 PASS +``` + +--- + +## Fixture tests (20 passed) +``` +pytest tests/test_ingest_oc_records.py -v + +20/20 passed in 4.47s +``` +Tests cover: new pid ingestion, stale pid removal, orphan subgraph removal, non-OC row survival, +geometry denormalization, n='OPENCONTEXT' assignment, p__ array remapping, concept minting, +SamplingSite ingestion, row count arithmetic, no duplicate row_ids, no duplicate MSR pids, +determinism, dry-run, hard-fail on duplicate OC pids, hard-fail on collision, removal scope, +id collision avoidance, overwrite guard. + +--- + +## Pipeline runtime summary + +| Step | Time | Notes | +|---|---|---| +| Download 202606 wide | ~25s | 292 MB from R2 | +| `ingest_oc_records.py` (dry-run) | ~15s | Analysis + trust checks (includes orphan analysis) | +| `ingest_oc_records.py` (full) | ~23s | Write 20.8M-row parquet | +| `build_frontend_derived.py` | ~6s | All 6 derived files, no wide_h3 | +| `validate_frontend_derived.py` (full, incl. --wide) | varies | 25 checks | +| **Total** | **~60s** | Full pipeline on 202606 base | + +--- + +## Input file checksums + +| File | sha256 | +|---|---| +| isamples_202606_wide.parquet | 57c01f922c52bac2c6a28abd504f38161f83c140a7149036b3d8e725be8aa3b1 | +| oc_isamples_pqg_wide_2026-06-09.parquet | 60d629279bb5702e50599eb5f49efa64d493545247886660e6cd31a44e21a8e9 | +| isamples_202608_wide.parquet (pre-Phase 3) | 8a5c0a0470c71c517a31494fc285f304ecf78cc602594f7278c65500a68a48eb | +| isamples_202608_wide.parquet (post-Phase 3, final) | 3d3dbd05b9c607de3eed413b2da95aaee5232ebfd435593042316e6512065e59 | + +--- + +## Phase 3 — Bug fixes folded into 202608 (2026-06-12) + +*Three bugs from triage report /tmp/triage-2026-06-12/FINDINGS.md folded in.* + +### Fix application + +| Fix | Script modified | Trust gate | +|---|---|---| +| #277 description enrichment | `scripts/ingest_oc_records.py` (Phase J added) | Cyprus OC MSR count | +| #283a empty-string facet filter | `scripts/build_frontend_derived.py` | Blank facet_value count | +| #283b specimentype/1.0 labels | `scripts/build_vocab_labels.py` | specimentype URIs in vocab_labels | + +Description enrichment was applied to the already-written 202608 wide (post-sync, pre-derived rebuild) using a UNION ALL approach: OC MSR rows (1.1M) joined with Eric's wide for descriptions, all other rows passed through unchanged. Total write time: ~2s. + +### Verification query outputs (all run 2026-06-12) + +#### a. Cyprus description count +``` +SELECT COUNT(*) FROM isamples_202608_wide WHERE otype='MaterialSampleRecord' AND n='OPENCONTEXT' AND description ILIKE '%Cyprus%' +Result: 69,230 (was 0 before fix; matches Eric's OC-specific wide exactly) +``` + +#### b. Blank facet entry count +``` +SELECT COUNT(*) FROM isamples_202608_facet_summaries WHERE facet_value = '' +Result: 0 (was 586 before fix — 586 GEOME records with empty-string concept URI) +``` + +#### c. specimentype label lookup +``` +SELECT uri, pref_label, lang FROM vocab_labels WHERE uri LIKE '%specimentype%' +Result: + ('specimentype/1.0/othersolidobject', 'Other solid object', 'en') + ('specimentype/1.0/physicalspecimen', 'Material sample', 'en') + (Both rows present with correct labels) +``` + +#### d. Total validator checks +``` +validate_frontend_derived.py --dir /tmp/ingest_202608/ --tag isamples_202608 --wide ... +Result: ALL CHECKS PASS (26/26) + (25 original checks + 1 new: "facet_summaries no blank values (#283a)") +``` + +#### e. OC MSR count (unchanged by description enrichment) +``` +SELECT COUNT(*) FROM isamples_202608_wide WHERE otype='MaterialSampleRecord' AND n='OPENCONTEXT' +Result: 1,110,791 (unchanged from Phase 2 sync) +``` + +#### f. Rock count (OC only, unchanged by description enrichment) +``` +SELECT COUNT(*) FROM isamples_202608_sample_facets_v2 WHERE material LIKE '%/rock' AND source='OPENCONTEXT' +Result: 37,953 (unchanged from Phase 2 sync) +``` + +### Fixture tests +``` +pytest tests/test_ingest_oc_records.py -v +28/28 passed in 5.28s + +New tests added (8 new): + Fix #277: test_oc_description_enriched_from_eric_wide + test_non_oc_description_unchanged_by_enrichment + test_oc_msr_count_unchanged_by_enrichment + Fix #283a: test_empty_string_facet_values_filtered_from_summaries + test_empty_string_facet_values_filtered_from_cross_filter + Fix #283b: test_specimentype_othersolidobject_in_vocab_labels + test_specimentype_physicalspecimen_in_vocab_labels + test_specimentype_labels_have_lang_en +``` + +### Updated output manifest +``` +/tmp/ingest_202608/isamples_202608_wide.parquet.manifest.json + output.bytes: 300,665,838 (was 300,427,562 — enriched descriptions are longer) + output.sha256: 3d3dbd05b9c607de3eed413b2da95aaee5232ebfd435593042316e6512065e59 + phase3_fixes: [#277, #283a, #283b] documented in manifest +``` + +--- + +## Phase 4 — Codex Blocker Fixes (2026-06-13) + +*Rebuild from scratch with all blocker + nit fixes. Output at /tmp/ingest_202608_v4/.* + +### Fixes applied + +| Fix | File | Change | +|---|---|---| +| Blocker 1 (B1): cross-source orphan guard | `scripts/ingest_oc_records.py` lines ~213-216 | Removed `n='OPENCONTEXT'` from `surviving_se_refs`; now covers all surviving MSRs | +| Blocker 2A (B2A): Agent extraction from p__responsibility | `scripts/ingest_oc_records.py` lines ~372-375 | Added `UNION` to extract agents from `p__responsibility` as well as `p__registrant` | +| Blocker 2B (B2B): Full p__* dangling-ref trust gate | `scripts/ingest_oc_records.py` post-write section | New HARD FAIL gate checking all 10 p__* ref columns (BIGINT[] + INTEGER[]) on new rows | +| Blocker 3 (B3): Deterministic Phase J output order | `scripts/ingest_oc_records.py` Phase J | Added `ORDER BY row_id` to Phase J UNION ALL before COPY TO PARQUET | +| Nit A: Whitespace-only facet filter | `scripts/build_frontend_derived.py`, `validate_frontend_derived.py` | `NULLIF(TRIM({d}), '') IS NOT NULL` replaces `{d} <> ''`; validator uses `TRIM()` check | +| Nit B: Cyprus count hard trust gate | `scripts/ingest_oc_records.py` Phase J trust gate | Now raises RuntimeError if cyprus_count < 69,000 at production scale | +| Nit C: B1 regression test | `tests/test_ingest_oc_records.py` | Added `test_cross_source_shared_entity_not_orphaned` | + +### B1 regression test confirmation + +- OLD code (with `n='OPENCONTEXT'` filter): `test_cross_source_shared_entity_not_orphaned` FAILS (se-shared deleted as false orphan) +- NEW code (all-source surviving refs): `test_cross_source_shared_entity_not_orphaned` PASSES + +### Production rebuild execution log + +``` +[ 0.0s] schema checks passed +[ 0.2s] stale pids to remove: 21,227 +[ 14.4s] orphan subgraph: msr=21,227 se=21,227 geo=21,227 site=928 total=64,609 +[ 14.4s] rows_to_remove: 64,609 (matches orphan arithmetic) +[ 14.4s] new pids: 67,187 +[ 14.4s] extracting entity subgraph for new pids... +[ 14.9s] subgraph: msr=67,187 se=67,187 geo=11,399 site=6,514 agent=24 +[ 14.9s] src max_row_id=20,729,359 +[ 15.0s] id_map: 152,311 entries, new row_id range 20729360 to 20,881,670, collisions=0 +[ 15.0s] minting 1 new IdentifiedConcept rows: ['https://w3id.org/isample/vocabulary/sampledfeature/1.0/earthsurface'] +[ 15.0s] minted_concepts=1 +[ 15.1s] coords: 67,187 pids with coords, 0 duplicate-coord pids +[ 15.1s] remapping p__ arrays for new MSR rows... +[ 15.7s] p__ remapping tables built +[ 15.7s] running pre-write trust checks... +[ 15.7s] trust checks passed +[ 15.7s] expected output rows: 20,729,359 src - 64,609 removed + 152,311 new entities + 1 concepts = 20,817,062 +[ 15.7s] writing output... +[ 23.7s] wrote /tmp/ingest_202608_v4/isamples_202608_wide.parquet +[ 24.1s] post-write: rows=20,817,062 dup_rowids=0 dup_pids=0 oc_msrs=1,110,791 stale_remain=0 n_check=PASS +[ 24.1s] running full dangling-ref trust gate on all p__* reference columns... +[ 38.9s] full dangling-ref trust gate: PASS (0 dangling refs across 10 p__* columns) +[ 38.9s] description enrichment (#277): copying OC descriptions from Eric's wide… +[ 42.7s] description enrichment trust gate: Cyprus OC MSR count = 69,230 (expect ≈ 69,230) +[ 42.7s] description enrichment complete +[ 43.3s] manifest -> /tmp/ingest_202608_v4/isamples_202608_wide.parquet.manifest.json +[ 43.3s] done +``` + +### Verification query results (Phase 4) + +| Check | Result | Expected | Status | +|---|---|---|---| +| a. Zero dangling refs (all 10 p__* cols, ALL rows) | 0 | 0 | ✓ | +| b. OC MSR count | 1,110,791 | 1,110,791 | ✓ | +| c. Rock count (OC MSRs) | 37,953 | 37,953 | ✓ | +| d. Cyprus description count | 69,230 | ≥69,230 | ✓ | +| e. Blank facet entries | 0 | 0 | ✓ | +| f. Whitespace-only facet entries | 0 | 0 | ✓ | +| g. Removed Murlo pids in output | 0 | 0 | ✓ | +| h. New r2p24 pids | 15,409 | 15,409 | ✓ | +| i. Validator checks | 26/26 PASS | ≥26 | ✓ | +| j. specimentype labels | 2 entries | 2 | ✓ | +| k. Total output rows | 20,817,062 | 20,817,062 | ✓ | + +### Fixture tests + +``` +pytest tests/test_ingest_oc_records.py -v +29/29 passed in 11.12s + +New test (Phase 4): + test_cross_source_shared_entity_not_orphaned PASS +``` + +--- + +## Phase 5 — Site-Location Orphan Fix (2026-06-13) + +### Defect Description + +The Phase 4 build (commit 4dbf342) contained 4,606 dangling `p__site_location` references. The Phase 4 dangling-ref gate (B2B) only checked NEW rows; surviving src rows that pointed to deleted Geos were not verified. + +**Root cause**: Orphan-Geo determination only considered SamplingEvents' `p__sample_location` path. 4,606 GeospatialCoordLocation rows were ALSO referenced by surviving SamplingSites via `p__site_location`. The Murlo removal orphaned those Geos via the SE path and deleted them — leaving the surviving Sites' `p__site_location` dangling. + +**Evidence**: +- All 4,606 affected sites are pre-existing base sites (row_id < 20,729,359) +- 202606 base had ZERO such dangling refs +- The sync introduced them by deleting Geos referenced by both orphan SEs AND surviving Sites + +### Fixes Applied + +**Fix 1 — General orphan formulation** (`scripts/ingest_oc_records.py`, orphan analysis section): +- Reordered computation: orphan SamplingSites computed BEFORE orphan Geos +- `surviving_geo_refs` now includes Geos referenced by non-orphan SamplingSites via `p__site_location` (UNION with Geos from non-orphan SEs via `p__sample_location`) +- A Geo is now marked orphan only if unreferenced by BOTH surviving SEs and surviving Sites + +**Fix 2 — Mandatory all-rows dangling-ref gate** (`scripts/ingest_oc_records.py`, post-write): +- Replaced the "new rows only" B2B gate with a gate that scans ALL rows (surviving + new) across ALL 12 p__* columns +- Prints per-column dangling count; raises `RuntimeError` if total > 0 (build aborted, manifest not emitted) +- Catches exactly the defect that B2B missed: dangling refs in OLD surviving rows + +**Regression test** (`tests/test_ingest_oc_records.py`): +- Added `test_site_location_geo_not_orphaned` +- Scenario: OC MSR removed; orphan SE references Geo via p__sample_location; surviving SESAR SE references same Site via p__sampling_site; Site references same Geo via p__site_location +- Confirmed FAILS on old code (Geo deleted), PASSES on fixed code (Geo retained) + +### Production Rebuild Execution Log + +``` +[ 0.0s] schema checks passed +[ 0.2s] stale pids to remove: 21,227 +[ 16.3s] orphan subgraph: msr=21,227 se=21,227 geo=16,621 site=928 total=60,003 +[ 16.4s] rows_to_remove: 60,003 (matches orphan arithmetic) +[ 16.4s] new pids: 67,187 +[ 16.4s] extracting entity subgraph for new pids... +[ 16.8s] subgraph: msr=67,187 se=67,187 geo=11,399 site=6,514 agent=24 +[ 16.8s] src max_row_id=20,729,359 +[ 16.8s] id_map: 152,311 entries, new row_id range 20729360 to 20,881,670, collisions=0 +[ 16.9s] minting 1 new IdentifiedConcept rows: ['https://w3id.org/isample/vocabulary/sampledfeature/1.0/earthsurface'] +[ 16.9s] minted_concepts=1 +[ 17.0s] coords: 67,187 pids with coords, 0 duplicate-coord pids +[ 17.0s] remapping p__ arrays for new MSR rows... +[ 17.6s] p__ remapping tables built +[ 17.6s] running pre-write trust checks... +[ 17.7s] trust checks passed +[ 17.7s] expected output rows: 20,729,359 src - 60,003 removed + 152,311 new entities + 1 concepts = 20,821,668 +[ 17.7s] writing output... +[ 22.8s] wrote /tmp/ingest_202608/isamples_202608_wide.parquet +[ 23.1s] post-write: rows=20,821,668 dup_rowids=0 dup_pids=0 oc_msrs=1,110,791 stale_remain=0 n_check=PASS +[ 23.1s] running mandatory dangling-ref gate on ALL rows, ALL p__* columns... + p__curation: 0 dangling refs + p__has_context_category: 0 dangling refs + p__has_material_category: 0 dangling refs + p__has_sample_object_type: 0 dangling refs + p__keywords: 0 dangling refs + p__produced_by: 0 dangling refs + p__registrant: 0 dangling refs + p__related_resource: 0 dangling refs + p__responsibility: 0 dangling refs + p__sample_location: 0 dangling refs + p__sampling_site: 0 dangling refs + p__site_location: 0 dangling refs +[ 25.8s] Dangling ref check: PASS (0 dangling across 12 columns) +[ 25.8s] description enrichment (#277): copying OC descriptions from Eric's wide… +[ 29.1s] description enrichment trust gate: Cyprus OC MSR count = 69,230 (expect ≈ 69,230) +[ 29.1s] description enrichment complete: replaced /tmp/ingest_202608/isamples_202608_wide.parquet +[ 29.7s] manifest -> /tmp/ingest_202608/isamples_202608_wide.parquet.manifest.json +[ 29.7s] done +``` + +### New Orphan Breakdown (compare to Phase 4) + +| Entity type | Phase 4 | Phase 5 | Delta | +|---|---|---|---| +| MaterialSampleRecord removed | 21,227 | 21,227 | 0 | +| SamplingEvent orphaned | 21,227 | 21,227 | 0 | +| GeospatialCoordLocation orphaned | 21,227 | 16,621 | **-4,606** (the fix) | +| SamplingSite orphaned | 928 | 928 | 0 | +| **Total rows removed** | **64,609** | **60,003** | **-4,606** | + +The 4,606 Geos that were wrongly deleted in Phase 4 are now correctly retained. Total output rows increased by 4,606: 20,817,062 → 20,821,668. + +### Per-Column Dangling Sweep (verbatim from script output) + +``` + p__curation: 0 dangling refs + p__has_context_category: 0 dangling refs + p__has_material_category: 0 dangling refs + p__has_sample_object_type: 0 dangling refs + p__keywords: 0 dangling refs + p__produced_by: 0 dangling refs + p__registrant: 0 dangling refs + p__related_resource: 0 dangling refs + p__responsibility: 0 dangling refs + p__sample_location: 0 dangling refs + p__sampling_site: 0 dangling refs + p__site_location: 0 dangling refs +Dangling ref check: PASS (0 dangling across 12 columns) +``` + +### Verification Query Results (Phase 5) + +| Check | Result | Expected | Status | +|---|---|---|---| +| b. Independent p__site_location dangling check | 0 | 0 | ✓ | +| c. OC MSR count | 1,110,791 | 1,110,791 | ✓ | +| d. Rock count (OC, from facets, material LIKE '%/rock') | 37,953 | 37,953 | ✓ | +| e. Cyprus description count | 69,230 | ≥69,230 | ✓ | +| f. Blank facet entries (facet_summaries.facet_value='') | 0 | 0 | ✓ | +| g. Whitespace-only facet entries | 0 | 0 | ✓ | +| h. Removed Murlo pids in output | 0 | 0 | ✓ | +| i. New r2p24 pids in output | 15,409 | 15,409 | ✓ | +| j. Validator checks | 26/26 PASS | ≥26 | ✓ | +| k. specimentype labels | 2 | 2 | ✓ | +| l. Total output rows | 20,821,668 | ~20,817,062+4,606 | ✓ | + +### Fixture Tests + +``` +pytest tests/test_ingest_oc_records.py -v +30/30 passed in ~10s + +New test (Phase 5): + test_site_location_geo_not_orphaned PASS + (Confirmed FAILS on old code: Geo 20 deleted; PASSES on fixed code: Geo 20 retained) +``` + +--- + +## Phase 6 — Fixpoint Orphan + Silent-Drop Guard (Codex Round 2 blockers) + +*Run 2026-06-13. Fixes two Codex-identified issues: true general fixpoint orphan removal (Fix A) and silent-drop guard for new-row ref remapping (Fix B).* + +### Changes + +**Fix A — Fixpoint general orphan removal** (`scripts/ingest_oc_records.py` lines ~198–316): +Replaced all hand-enumerated path-specific orphan tables (orphan_se_refs, orphan_geo_refs, orphan_site_refs, etc.) with a true fixpoint algorithm: +1. remove_set := 21,227 stale MSR row_ids +2. Repeat until stable: compute survivor_refs = all row_ids referenced by any surviving row (NOT in remove_set) through ALL 12 p__* columns; add to remove_set any non-MSR non-IdentifiedConcept candidate row not in survivor_refs +3. Converges in 4 passes on real data + +**Fix B — Silent-drop guard** (`scripts/ingest_oc_records.py` lines ~733–770): +Added pre-write check: for each structural/concept p__* column on new rows, assert remapped array length == source array length. Any inner-join miss raises RuntimeError. `p__keywords` excluded (best-effort URI lookup, documented). + +**IdentifiedConcept exclusion from orphan removal**: fixpoint excludes `IdentifiedConcept` rows (vocabulary rows are shared across all sources, must never be auto-deleted). + +### Fixpoint output (4 passes, real data) + +``` +[ 0.7s] fixpoint pass 1: 21227 new orphans ← orphan SEs +[ 1.4s] fixpoint pass 2: 16625 new orphans ← orphan Geos (via orphan SEs) +[ 2.2s] fixpoint pass 3: 928 new orphans ← orphan SamplingSites +[ 2.9s] fixpoint pass 4: 0 new orphans ← converged +[ 2.9s] fixpoint done in 4 passes: rows_to_remove=60,007 +orphan subgraph by otype: { + 'Agent': 4, ← NEWLY removed (were over-retained in Phase 5) + 'GeospatialCoordLocation': 16621, ← 16621 (not 21227 — 4606 correctly retained for surviving Sites) + 'MaterialSampleRecord': 21227, + 'SamplingEvent': 21227, + 'SamplingSite': 928 +} +``` + +### Per-column dangling sweep (in-script gate + independent verification) + +``` + p__curation: 0 dangling refs + p__has_context_category: 0 dangling refs + p__has_material_category: 0 dangling refs + p__has_sample_object_type: 0 dangling refs + p__keywords: 0 dangling refs + p__produced_by: 0 dangling refs + p__registrant: 0 dangling refs + p__related_resource: 0 dangling refs + p__responsibility: 0 dangling refs + p__sample_location: 0 dangling refs + p__sampling_site: 0 dangling refs + p__site_location: 0 dangling refs + TOTAL: 0 dangling refs PASS +``` + +### Verification numbers + +| Check | Result | Status | +|---|---|---| +| p__site_location dangling (independent) | **0** | PASS | +| OC MSR count | **1,110,791** | PASS | +| OC rock count (source=OPENCONTEXT, material=rock URI) | **37,953** | PASS | +| Cyprus description count | **69,230** | PASS | +| Blank/whitespace facet entries | **0** | PASS | +| Removed Murlo pids in output | **0** | PASS | +| New r2p24 pids | **15,409** | PASS | +| Validator | **26/26 PASS** | PASS | +| specimentype labels in vocab_labels | **2** | PASS | +| **Total output rows** | **20,821,664** | PASS | +| Over-retention (orphan Agents) | **0** (4 Agents now correctly removed) | PASS | + +### Row count delta vs Phase 5 + +Phase 5: 20,821,668 → Phase 6: 20,821,664 (−4) + +The 4 missing rows are the 4 Agent orphans that Phase 5's path-specific logic incorrectly retained (they were never in the hand-enumerated path: SE→Agent isn't a p__sampling_site/sample_location path). The fixpoint correctly identifies them as unreferenced by any surviving row. + +### Regression tests + +Two new tests added to `tests/test_ingest_oc_records.py`: + +1. `test_orphan_geo_via_site_only_removed`: removed OC MSR → orphan SE (no p__sample_location) → orphan SamplingSite → Geo only via p__site_location. Geo must be REMOVED. + - **FAILS on old code** (Phase 5): Geo not in orphan_se_geo_refs (SE has empty p__sample_location), orphan_se_site_refs → Site is orphan but orphan_site_ids-based exclusion didn't cover the site-only Geo path correctly. + - **PASSES on fixed code** (fixpoint): survivor_refs has no ref to Geo (no surviving row points to it) → removed. + +2. `test_unresolved_new_ref_hard_fails`: new OC SE with p__sampling_site=[999], no Site 999 in Eric's wide. Must RAISE, not emit NULL. + - **FAILS on old code** (Phase 4 + Phase 5): inner join silently drops the ref; script exits 0 with NULL p__sampling_site. + - **PASSES on fixed code** (Fix B guard): _check_remap_length detects len(source)=1 != len(remapped)=0 → RuntimeError, no output written. + +### Fixture Tests + +``` +pytest tests/test_ingest_oc_records.py -v +32/32 passed in 9.20s + +New tests (Phase 6): + test_orphan_geo_via_site_only_removed PASS (FAILS on old Phase 5 code: Geo retained when it should be removed) + test_unresolved_new_ref_hard_fails PASS (FAILS on old code: exits 0 with NULL ref silently) +``` + +### p__keywords note (open item) + +The silent-drop guard correctly fires on p__keywords during the real build (67,183 rows with dropped keyword refs). This is the pre-existing best-effort behavior documented in the script header. `p__keywords` is excluded from the strict length check. A full keyword concept audit before R2 promotion is recommended. diff --git a/scripts/build_frontend_derived.py b/scripts/build_frontend_derived.py index b4f1039..a9dad0e 100755 --- a/scripts/build_frontend_derived.py +++ b/scripts/build_frontend_derived.py @@ -185,7 +185,7 @@ def build_h3_summary(con, out, res): def build_facet_summaries(con, out): union = " UNION ALL ".join( - f"SELECT '{d}' AS facet_type, {d} AS facet_value FROM samp_geo WHERE {d} IS NOT NULL" + f"SELECT '{d}' AS facet_type, {d} AS facet_value FROM samp_geo WHERE NULLIF(TRIM({d}), '') IS NOT NULL" for d in FACET_DIMS) con.execute(f"""COPY ( SELECT facet_type, facet_value, NULL::INTEGER AS scheme, COUNT(*) AS count @@ -206,7 +206,7 @@ def build_facet_cross_filter(con, out): f"SELECT NULL::VARCHAR AS filter_source, NULL::VARCHAR AS filter_material, " f"NULL::VARCHAR AS filter_context, NULL::VARCHAR AS filter_object_type, " f"'{fd}' AS facet_type, {fd} AS facet_value, COUNT(*) AS count " - f"FROM samp_geo WHERE {fd} IS NOT NULL GROUP BY {fd}") + f"FROM samp_geo WHERE NULLIF(TRIM({fd}), '') IS NOT NULL GROUP BY {fd}") for filt in FACET_DIMS: for fd in FACET_DIMS: cols = ", ".join( @@ -214,7 +214,7 @@ def build_facet_cross_filter(con, out): for c in FACET_DIMS) selects.append( f"SELECT {cols}, '{fd}' AS facet_type, {fd} AS facet_value, COUNT(*) AS count " - f"FROM samp_geo WHERE {filt} IS NOT NULL AND {fd} IS NOT NULL GROUP BY {filt}, {fd}") + f"FROM samp_geo WHERE NULLIF(TRIM({filt}), '') IS NOT NULL AND NULLIF(TRIM({fd}), '') IS NOT NULL GROUP BY {filt}, {fd}") con.execute(f"""COPY ( SELECT filter_source, filter_material, filter_context, filter_object_type, facet_type, facet_value, count diff --git a/scripts/build_vocab_labels.py b/scripts/build_vocab_labels.py index 4bd7c9f..d003d2f 100644 --- a/scripts/build_vocab_labels.py +++ b/scripts/build_vocab_labels.py @@ -66,6 +66,27 @@ PREFERRED_LANG = "en" +# Deprecated / legacy concept URIs that are absent from the live SKOS TTLs but +# still appear in older source data (e.g. SESAR records using the specimentype/1.0 +# namespace, superseded by materialsampleobjecttype/1.0). These rows are injected +# directly so the Explorer can display human-readable labels instead of raw URI +# path tails. Each entry: (uri, pref_label, lang, scheme). +# Issue #283b: 169 SESAR records carry these deprecated URIs. +MANUAL_LABEL_OVERRIDES: list[tuple[str, str, str, str | None]] = [ + ( + "https://w3id.org/isample/vocabulary/specimentype/1.0/othersolidobject", + "Other solid object", + "en", + "https://w3id.org/isample/vocabulary/specimentype/1.0/", + ), + ( + "https://w3id.org/isample/vocabulary/specimentype/1.0/physicalspecimen", + "Material sample", + "en", + "https://w3id.org/isample/vocabulary/specimentype/1.0/", + ), +] + # When a concept URI is declared in more than one TTL, prefer the row whose # source TTL's URL contains one of these path fragments. The fragments are # matched against the concept URI: a URI containing "vocabulary/material/" @@ -286,6 +307,23 @@ def main(argv: list[str] | None = None) -> int: print("ERROR: no rows extracted; aborting.", file=sys.stderr) return 2 + # Inject manual overrides for deprecated URIs not present in any live TTL. + # These are appended before dedupe so _dedupe can merge them if they ever + # appear in a future TTL revision, and so _emit_data_form_aliases does NOT + # re-emit them (they already carry the /1.0/ version segment). + for uri, label, lang, scheme in MANUAL_LABEL_OVERRIDES: + all_rows.append({ + "uri": uri, + "uri_form": "data_v1", # already in the /1.0/ data form + "pref_label": label, + "lang": lang, + "scheme": scheme, + "definition": None, + "alt_labels": [], + "source_ttl": "manual_override", + }) + print(f" {len(MANUAL_LABEL_OVERRIDES):>4} rows (manual overrides for deprecated URIs)") + raw_count = len(all_rows) all_rows = _dedupe(all_rows) deduped_collapsed = raw_count - len(all_rows) diff --git a/scripts/ingest_oc_records.py b/scripts/ingest_oc_records.py new file mode 100644 index 0000000..70f9250 --- /dev/null +++ b/scripts/ingest_oc_records.py @@ -0,0 +1,1207 @@ +#!/usr/bin/env python3 +"""TRUE SYNC: ingest new OpenContext records + remove stale OC records. + +Issue #272 Phase 2 (follow-up to PR #275 overlay phase): + The overlay phase (Phase 1) fixed concept mappings for ~1.04M existing OC pids. + This script performs a TRUE SYNC against Eric's 2026-06-09 OC PQG wide: + ADD 67,187 new pids (in Eric's wide but not in src) + REMOVE 21,227 stale pids (in src but not in Eric's wide — Murlo project + mass-updated PIDs; old PIDs would duplicate the same physical samples) + + remove orphaned subgraph entities for the removed pids. + +Decision D3 (2026-06-12, RY): REMOVE the stale pids. Rationale: OpenContext +mass-updated Murlo project PIDs; keeping old pids would duplicate the same +physical samples under two identifiers. This is a TRUE SYNC. + +WHAT IT DOES (single DuckDB session, deterministic): + 1. Identify stale pids (src has, Eric doesn't) → rows to remove + 2. Identify orphan subgraph entities (SE / Geo / Site) only referenced by + removed MSRs — safe to remove; agents are shared so NOT removed. + 3. Identify new pids (Eric has, src doesn't) → rows to add + 4. Extract full entity subgraph for new pids: + MaterialSampleRecord + SamplingEvent + GeospatialCoordLocation + + SamplingSite + Agent + (linked IdentifiedConcepts already in src) + 5. Assign new row_ids: dense rank starting at max(src.row_id)+1, + ordered deterministically by (otype, pid). + 6. Build a mapping table: Eric's row_id → our new row_id. + 7. Remap all p__ arrays on new rows from Eric's id space to our id space. + Concept refs in p__has_* arrays resolved via URI lookup against src's + IdentifiedConcept rows (same pattern as enrich_wide_with_oc_concepts.py). + 8. Denormalize geometry/lat/lon from GeoCoordLoc onto new MSR rows + (builder reads geometry from MSR rows, not from GeoCoordLoc). + 9. Set n='OPENCONTEXT' on new MSR rows (Eric's wide has NULL). + 10. Mint new IdentifiedConcept rows for any concept URIs in new MSRs but + absent from src wide (expected: sampledfeature/1.0/earthsurface). + 11. Hard-fail checks before writing (see HARD FAILURES below). + 12. Write: (src - removed - orphans) UNION ALL new_entities → output wide. + 13. Emit a {out}.manifest.json. + +WHAT IT DOES NOT DO (scope): + - Does not re-run the Phase 1 concept overlay (already in src wide). + - Does not populate p__curation / p__related_resource (OC doesn't have them). + +HARD FAILURES (refuses to write): + - duplicate pids among new MSRs (new pid set must be truly new) + - any new pid already exists in src wide (ingestion grain wrong) + - duplicate row_ids in proposed new id set vs src wide + - any p__ reference in a new row that cannot be resolved to a row_id in output + - any new MSR with n != 'OPENCONTEXT' in the written output + - row count mismatch: output != (src - removed) + new_entities + minted_concepts + - duplicate pids anywhere in output (union would create them if logic is wrong) + - any removed pid still present in output + +Usage: + python scripts/ingest_oc_records.py \\ + --src isamples_202606_wide.parquet \\ + --oc-wide oc_isamples_pqg_wide_2026-06-09.parquet \\ + --out isamples_202608_wide.parquet + + # Dry-run (skips writing, just runs analysis + trust checks): + python scripts/ingest_oc_records.py --src ... --oc-wide ... --out ... --dry-run + +Notes: + - DuckDB pinned to 1.4.4 (scripts/requirements.txt). h3 + spatial extensions + installed at runtime (needed for geometry handling). + - Use the 202606 wide as --src (not 202604). Phase 1 (PR #275) minted the + otheranthropogenicmaterial concept in 202606; using 202604 would require + minting it again and risks id collision with Phase 1. + - The --src wide's row_id column must be BIGINT (our convention). Eric's wide + uses INTEGER — new rows cast to BIGINT automatically. +""" +import argparse +import hashlib +import json +import os +import subprocess +import sys +import time + +import duckdb + +# Dimensions where concept references live on MSR rows +CONCEPT_DIMS = [ + "p__has_material_category", + "p__has_sample_object_type", + "p__has_context_category", +] + +# Columns present in our wide but absent from Eric's wide (will be NULL in new rows) +OUR_ONLY_COLS = ["p__curation", "p__related_resource"] + +# The source attribution for all OC records in the unified wide +OC_SOURCE = "OPENCONTEXT" + + +def sha256_file(path, _bufsize=1 << 20): + h = hashlib.sha256() + with open(path, "rb") as f: + for chunk in iter(lambda: f.read(_bufsize), b""): + h.update(chunk) + return h.hexdigest() + + +def git_sha(): + try: + return subprocess.check_output( + ["git", "rev-parse", "HEAD"], + cwd=os.path.dirname(os.path.abspath(__file__)), + stderr=subprocess.DEVNULL, + ).decode().strip() + except Exception: + return None + + +def log(msg, t0): + print(f"[{time.time()-t0:6.1f}s] {msg}", flush=True) + + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("--src", required=True, + help="Source unified wide parquet (should be isamples_202606_wide.parquet " + "so Phase-1 concept minting is already present)") + ap.add_argument("--oc-wide", required=True, + help="Eric's OC PQG wide parquet (oc_isamples_pqg_wide_2026-06-09.parquet)") + ap.add_argument("--out", required=True, + help="Output wide parquet (e.g. isamples_202608_wide.parquet)") + ap.add_argument("--dry-run", action="store_true", + help="Run analysis and trust checks, do not write output") + ap.add_argument("--no-manifest", action="store_true") + args = ap.parse_args() + + for fp in (args.src, args.oc_wide): + if not os.path.exists(fp): + sys.exit(f"FATAL: missing input {fp}") + if not args.dry_run: + if os.path.abspath(args.out) in (os.path.abspath(args.src), + os.path.abspath(args.oc_wide)): + sys.exit("FATAL: --out must not overwrite an input") + os.makedirs(os.path.dirname(os.path.abspath(args.out)), exist_ok=True) + + t0 = time.time() + con = duckdb.connect() + con.execute("INSTALL h3 FROM community; LOAD h3; INSTALL spatial; LOAD spatial;") + + SRC = f"read_parquet('{args.src}')" + OC = f"read_parquet('{args.oc_wide}')" + + # ---- schema contract checks ------------------------------------------- + src_cols_raw = con.sql(f"DESCRIBE SELECT * FROM {SRC}").fetchall() + src_cols = [(r[0], r[1]) for r in src_cols_raw] + src_colnames = [c for c, _ in src_cols] + + oc_cols_raw = con.sql(f"DESCRIBE SELECT * FROM {OC}").fetchall() + oc_colnames = [r[0] for r in oc_cols_raw] + + # Verify concept dim columns exist in both + for d in CONCEPT_DIMS: + if d not in src_colnames: + sys.exit(f"FATAL: src wide lacks required column {d}") + if d not in oc_colnames: + sys.exit(f"FATAL: oc-wide lacks required column {d}") + + # Verify p__produced_by exists (coord path) + for col in ("p__produced_by",): + if col not in oc_colnames: + sys.exit(f"FATAL: oc-wide lacks required column {col}") + + log("schema checks passed", t0) + + # ---- grain checks (hard-fail before any writing) ----------------------- + n_dup_src_rowid = con.sql( + f"SELECT COUNT(*) FROM (SELECT row_id FROM {SRC} GROUP BY row_id HAVING COUNT(*)>1)" + ).fetchone()[0] + n_dup_oc_pid_msr = con.sql( + f"SELECT COUNT(*) FROM (SELECT pid FROM {OC} WHERE otype='MaterialSampleRecord' " + f"GROUP BY pid HAVING COUNT(*)>1)" + ).fetchone()[0] + if n_dup_src_rowid or n_dup_oc_pid_msr: + sys.exit( + f"FATAL: non-unique keys — src duplicate row_ids={n_dup_src_rowid}, " + f"OC duplicate MSR pids={n_dup_oc_pid_msr}. Refusing to proceed." + ) + + # ---- Phase D3: identify stale pids to remove --------------------------- + con.execute(f""" + CREATE TEMP TABLE removed_pids AS + SELECT pid + FROM {SRC} WHERE otype='MaterialSampleRecord' AND n='{OC_SOURCE}' + EXCEPT + SELECT pid + FROM {OC} WHERE otype='MaterialSampleRecord'; + """) + n_removed_pids = con.sql("SELECT COUNT(*) FROM removed_pids").fetchone()[0] + log(f"stale pids to remove: {n_removed_pids:,}", t0) + + # ---- Phase D3 orphan analysis: FIXPOINT general orphan removal ---------- + # TRUE GENERAL FORMULATION (no path enumeration): + # + # A candidate row (any non-MSR entity reachable from the 21K removed MSRs' + # subgraph) is removed iff its row_id is NOT referenced by ANY surviving row + # through ANY p__* array column. We iterate to fixpoint because an orphan + # candidate may itself be the sole reference-holder of another candidate + # (e.g. an orphan SamplingSite → orphan Geo). + # + # Algorithm: + # remove_set := row_ids of the stale MSRs + # repeat until remove_set stops growing: + # survivor_refs := DISTINCT union of every p__* array column, UNNESTed, + # over all rows WHERE row_id NOT IN remove_set + # candidates := rows in the removed-MSR subgraph not already in remove_set + # new_orphans := candidates WHERE row_id NOT IN survivor_refs + # remove_set := remove_set UNION new_orphans + # + # The graph is shallow (~3 hops: MSR→SE→Geo/Site→Geo) so fixpoint is reached + # in ≤4 passes. We enumerate the p__* columns from the schema — no guesswork. + + # Collect all p__* columns that carry BIGINT[] or INTEGER[] row_id references. + p_ref_cols = [ + col for col, typ in src_cols + if col.startswith("p__") + and any(t in typ.upper() for t in ("BIGINT", "INTEGER")) + ] + log(f"fixpoint orphan: p__* ref cols = {p_ref_cols}", t0) + + # Build the subgraph of candidates: all non-MSR rows reachable (transitively) + # from the removed MSRs through their p__* arrays. + # We do this in one pass: any row whose row_id appears in ANY p__* array of + # the removed MSRs (or their descendants) is a candidate. + # We use a wide-first BFS: first pass collects direct refs from removed MSRs, + # subsequent passes follow refs from newly discovered candidates. + # For the iSamples graph (depth ≤3) three passes always reach fixpoint. + con.execute(f""" + -- Seed remove_set with the stale MSR row_ids + CREATE TEMP TABLE remove_set AS + SELECT s.row_id + FROM {SRC} s + WHERE s.otype='MaterialSampleRecord' AND s.n='{OC_SOURCE}' + AND s.pid IN (SELECT pid FROM removed_pids); + """) + + pass_num = 0 + while True: + pass_num += 1 + + # -- Step 1: compute survivor_refs: all row_ids referenced by surviving rows + # (rows NOT in remove_set) through ANY p__* reference column. + # Build a UNION ALL of unnest() over each p__* col, filter to surviving rows. + survivor_union = "\n UNION ALL\n ".join( + f"SELECT unnest(w.{col}) AS ref_id" + f" FROM {SRC} w" + f" WHERE w.{col} IS NOT NULL AND len(w.{col}) > 0" + f" AND w.row_id NOT IN (SELECT row_id FROM remove_set)" + for col in p_ref_cols + ) + con.execute(f""" + CREATE OR REPLACE TEMP TABLE survivor_refs_cur AS + SELECT DISTINCT ref_id + FROM ({survivor_union}) t + WHERE ref_id IS NOT NULL + """) + + # -- Step 2: compute the candidate subgraph reachable from remove_set rows. + # Any non-MSR row whose row_id appears in any p__* array of any row in + # remove_set is a candidate for orphan-deletion. + candidate_union = "\n UNION ALL\n ".join( + f"SELECT unnest(r2.{col}) AS row_id" + f" FROM {SRC} r2" + f" WHERE r2.row_id IN (SELECT row_id FROM remove_set)" + f" AND r2.{col} IS NOT NULL" + for col in p_ref_cols + ) + con.execute(f""" + CREATE OR REPLACE TEMP TABLE candidates_cur AS + SELECT DISTINCT row_id + FROM ({candidate_union}) t + WHERE row_id IS NOT NULL + """) + + # -- Step 3: new orphans = candidates NOT in survivor_refs AND NOT already removed + # AND NOT an MSR (we only auto-remove non-MSR entities; MSRs handled explicitly above) + # AND NOT an IdentifiedConcept (vocabulary concept rows are shared across all sources + # and must never be deleted just because one MSR is removed). + new_orphan_ids = con.execute(f""" + SELECT s.row_id + FROM {SRC} s + JOIN candidates_cur c ON c.row_id = s.row_id + WHERE s.otype != 'MaterialSampleRecord' + AND s.otype != 'IdentifiedConcept' + AND s.row_id NOT IN (SELECT row_id FROM remove_set) + AND s.row_id NOT IN (SELECT ref_id FROM survivor_refs_cur) + """).fetchall() + + n_new = len(new_orphan_ids) + log(f"fixpoint pass {pass_num}: {n_new} new orphans", t0) + if n_new == 0: + con.execute("DROP TABLE IF EXISTS survivor_refs_cur") + con.execute("DROP TABLE IF EXISTS candidates_cur") + break + + # Insert the new orphans into remove_set + new_ids_csv = ",".join(str(r[0]) for r in new_orphan_ids) + con.execute(f""" + INSERT INTO remove_set + SELECT DISTINCT row_id FROM {SRC} + WHERE row_id IN ({new_ids_csv}) + AND row_id NOT IN (SELECT row_id FROM remove_set) + """) + + n_rows_to_remove = con.sql("SELECT COUNT(*) FROM remove_set").fetchone()[0] + log(f"fixpoint done in {pass_num} passes: rows_to_remove={n_rows_to_remove:,}", t0) + + # Sanity: verify no non-OC MSR rows crept into remove_set + n_non_oc_in_remove = con.sql(f""" + SELECT COUNT(*) FROM remove_set rs + JOIN {SRC} s ON s.row_id = rs.row_id + WHERE s.otype='MaterialSampleRecord' AND s.n != '{OC_SOURCE}' + """).fetchone()[0] + if n_non_oc_in_remove: + sys.exit(f"FATAL: fixpoint orphan put {n_non_oc_in_remove} non-OC MSR rows in remove_set") + + # For logging: count by otype + otype_counts = con.sql(f""" + SELECT s.otype, COUNT(*) AS n + FROM remove_set rs JOIN {SRC} s ON s.row_id=rs.row_id + GROUP BY s.otype ORDER BY s.otype + """).fetchall() + orphan_counts = {ot: n for ot, n in otype_counts} + n_removed_msrs_actual = orphan_counts.get("MaterialSampleRecord", 0) + if n_removed_msrs_actual != n_removed_pids: + sys.exit(f"FATAL: remove_set has {n_removed_msrs_actual} MSR rows but expected {n_removed_pids}") + log(f"orphan subgraph by otype: {orphan_counts}", t0) + + # Alias rows_to_remove for compatibility with downstream SQL + con.execute("CREATE TEMP TABLE rows_to_remove AS SELECT row_id FROM remove_set") + total_orphan_rows = n_rows_to_remove + + # ---- Phase A: identify new pids ---------------------------------------- + con.execute(f""" + CREATE TEMP TABLE new_pids AS + SELECT pid + FROM {OC} WHERE otype='MaterialSampleRecord' + EXCEPT + SELECT pid + FROM {SRC} WHERE otype='MaterialSampleRecord' AND n='{OC_SOURCE}'; + """) + n_new_pids = con.sql("SELECT COUNT(*) FROM new_pids").fetchone()[0] + log(f"new pids: {n_new_pids:,}", t0) + if n_new_pids == 0: + sys.exit("INFO: no new pids to ingest. Output would be identical to src minus removals. Exiting.") + + # Check none of the new pids sneak in as non-OPENCONTEXT records in src + n_pid_collision = con.sql(f""" + SELECT COUNT(*) FROM new_pids np + JOIN {SRC} s ON s.pid = np.pid AND s.otype='MaterialSampleRecord' + """).fetchone()[0] + if n_pid_collision: + sys.exit(f"FATAL: {n_pid_collision} 'new' pids already exist in src wide (with different n). " + f"This would create duplicate pids in output.") + + # ---- Phase B: extract new MSR rows + full entity subgraph --------------- + log("extracting entity subgraph for new pids...", t0) + con.execute(f""" + -- New MSR rows from Eric's wide + CREATE TEMP TABLE new_msr_eric AS + SELECT e.* + FROM {OC} e + WHERE e.otype='MaterialSampleRecord' AND e.pid IN (SELECT pid FROM new_pids); + + -- Linked SamplingEvent row_ids + CREATE TEMP TABLE se_ids AS + SELECT DISTINCT u.se_id AS eric_row_id + FROM new_msr_eric, UNNEST(p__produced_by) AS u(se_id); + + -- SamplingEvent rows + CREATE TEMP TABLE new_se_eric AS + SELECT e.* FROM {OC} e + WHERE e.otype='SamplingEvent' AND e.row_id IN (SELECT eric_row_id FROM se_ids); + + -- GeospatialCoordLocation ids from SE (p__sample_location) + CREATE TEMP TABLE geo_from_se AS + SELECT DISTINCT u.geo_id AS eric_row_id + FROM new_se_eric, UNNEST(p__sample_location) AS u(geo_id); + + -- SamplingSite ids from SE (p__sampling_site) + CREATE TEMP TABLE site_ids AS + SELECT DISTINCT u.site_id AS eric_row_id + FROM new_se_eric, UNNEST(p__sampling_site) AS u(site_id); + + -- SamplingSite rows + CREATE TEMP TABLE new_site_eric AS + SELECT e.* FROM {OC} e + WHERE e.otype='SamplingSite' AND e.row_id IN (SELECT eric_row_id FROM site_ids); + + -- GeospatialCoordLocation ids from SamplingSite (p__site_location) + CREATE TEMP TABLE geo_from_site AS + SELECT DISTINCT u.loc_id AS eric_row_id + FROM new_site_eric, UNNEST(p__site_location) AS u(loc_id); + + -- All unique GeoCoordLoc ids (union of SE-linked and site-linked) + CREATE TEMP TABLE all_geo_ids AS + SELECT eric_row_id FROM geo_from_se + UNION + SELECT eric_row_id FROM geo_from_site; + + -- GeoCoordLoc rows + CREATE TEMP TABLE new_geo_eric AS + SELECT e.* FROM {OC} e + WHERE e.otype='GeospatialCoordLocation' AND e.row_id IN (SELECT eric_row_id FROM all_geo_ids); + + -- Agent ids from MSR (p__registrant AND p__responsibility — both columns ref Agents) + CREATE TEMP TABLE agent_ids AS + SELECT DISTINCT u.agent_id AS eric_row_id + FROM new_msr_eric, UNNEST(p__registrant) AS u(agent_id) + UNION + SELECT DISTINCT u.agent_id AS eric_row_id + FROM new_msr_eric, UNNEST(p__responsibility) AS u(agent_id); + + -- Agent rows + CREATE TEMP TABLE new_agent_eric AS + SELECT e.* FROM {OC} e + WHERE e.otype='Agent' AND e.row_id IN (SELECT eric_row_id FROM agent_ids); + """) + + counts = { + "new_msr": con.sql("SELECT COUNT(*) FROM new_msr_eric").fetchone()[0], + "new_se": con.sql("SELECT COUNT(*) FROM new_se_eric").fetchone()[0], + "new_geo": con.sql("SELECT COUNT(*) FROM new_geo_eric").fetchone()[0], + "new_site": con.sql("SELECT COUNT(*) FROM new_site_eric").fetchone()[0], + "new_agent": con.sql("SELECT COUNT(*) FROM new_agent_eric").fetchone()[0], + } + log(f"subgraph: msr={counts['new_msr']:,} se={counts['new_se']:,} geo={counts['new_geo']:,} " + f"site={counts['new_site']:,} agent={counts['new_agent']:,}", t0) + + # ---- Phase C: assign new row_ids ---------------------------------------- + max_src_row_id = con.sql(f"SELECT COALESCE(MAX(row_id), 0) FROM {SRC}").fetchone()[0] + log(f"src max_row_id={max_src_row_id:,}", t0) + + # All new entities in one table, ordered deterministically by (otype, pid) + # for stable dense-rank assignment + con.execute(f""" + CREATE TEMP TABLE all_new_entities AS + SELECT row_id AS eric_row_id, pid, otype FROM new_msr_eric + UNION ALL + SELECT row_id, pid, otype FROM new_se_eric + UNION ALL + SELECT row_id, pid, otype FROM new_geo_eric + UNION ALL + SELECT row_id, pid, otype FROM new_site_eric + UNION ALL + SELECT row_id, pid, otype FROM new_agent_eric; + + CREATE TEMP TABLE eric_id_map AS + SELECT eric_row_id, + {max_src_row_id} + DENSE_RANK() OVER (ORDER BY otype, pid) AS our_row_id + FROM all_new_entities; + """) + + n_id_map = con.sql("SELECT COUNT(*) FROM eric_id_map").fetchone()[0] + new_max = con.sql("SELECT MAX(our_row_id) FROM eric_id_map").fetchone()[0] + # Verify no collision with src + n_collision = con.sql(f""" + SELECT COUNT(*) FROM eric_id_map m + WHERE m.our_row_id IN (SELECT row_id FROM {SRC}) + """).fetchone()[0] + if n_collision: + sys.exit(f"FATAL: {n_collision} proposed new row_ids collide with existing src row_ids") + # FIX 2: verify our_row_id uniqueness within eric_id_map. + # DENSE_RANK() over (otype, pid) is unique when all (otype, pid) pairs are distinct. + # If a duplicate (otype, pid) pair sneaked through, two eric_row_ids would share + # the same our_row_id — silently producing colliding row_ids in the output. + n_dup_our_row_id = con.sql(""" + SELECT COUNT(*) FROM ( + SELECT our_row_id FROM eric_id_map + GROUP BY our_row_id HAVING COUNT(*) > 1 + ) + """).fetchone()[0] + if n_dup_our_row_id: + dup_examples = con.sql(""" + SELECT our_row_id, COUNT(*) AS cnt FROM eric_id_map + GROUP BY our_row_id HAVING COUNT(*) > 1 ORDER BY cnt DESC LIMIT 5 + """).fetchall() + sys.exit( + f"FATAL: {n_dup_our_row_id} duplicate our_row_id values in id_map " + f"(duplicate (otype,pid) pairs in new entity set). Examples: {dup_examples}" + ) + log(f"id_map: {n_id_map:,} entries, new row_id range {max_src_row_id+1} to {new_max:,}, " + f"collisions={n_collision}, dup_our_row_ids={n_dup_our_row_id}", t0) + + # ---- Phase D: concept resolution for p__has_* dims --------------------- + # OC concept row_ids (Eric's space) -> URI -> our row_id + # Uses same approach as enrich_wide_with_oc_concepts.py + con.execute(f""" + CREATE TEMP TABLE oc_concept_rows AS + SELECT row_id AS eric_row_id, pid AS uri + FROM {OC} WHERE otype='IdentifiedConcept'; + + CREATE TEMP TABLE src_concept_map AS + SELECT pid AS uri, MIN(row_id) AS our_row_id + FROM {SRC} WHERE otype='IdentifiedConcept' GROUP BY pid; + """) + + # Find concepts referenced by new MSRs that are missing from src. + # Includes p__keywords so keyword IdentifiedConcept rows are minted if absent. + con.execute(f""" + CREATE TEMP TABLE new_concept_refs AS + SELECT DISTINCT u.cid AS eric_cid + FROM new_msr_eric, UNNEST(p__has_material_category) AS u(cid) + UNION + SELECT DISTINCT u.cid FROM new_msr_eric, UNNEST(p__has_sample_object_type) AS u(cid) + UNION + SELECT DISTINCT u.cid FROM new_msr_eric, UNNEST(p__has_context_category) AS u(cid) + UNION + SELECT DISTINCT u.cid FROM new_msr_eric, UNNEST(p__keywords) AS u(cid); + + CREATE TEMP TABLE new_concept_uris AS + SELECT DISTINCT c.uri + FROM new_concept_refs r + JOIN oc_concept_rows c ON c.eric_row_id = r.eric_cid; + """) + + n_unresolved_uris = con.sql(""" + SELECT COUNT(*) FROM new_concept_uris u + LEFT JOIN src_concept_map m ON m.uri = u.uri + WHERE m.our_row_id IS NULL + """).fetchone()[0] + + if n_unresolved_uris: + # These need to be minted — expected: only earthsurface when base is 202606. + missing = con.sql(""" + SELECT u.uri FROM new_concept_uris u + LEFT JOIN src_concept_map m ON m.uri = u.uri + WHERE m.our_row_id IS NULL + ORDER BY u.uri + """).fetchall() + log(f"minting {n_unresolved_uris} new IdentifiedConcept rows: {[r[0] for r in missing]}", t0) + else: + log("all concept URIs already in src", t0) + + # Mint new concept rows + max_src_row_id_with_map = con.sql("SELECT MAX(our_row_id) FROM eric_id_map").fetchone()[0] + con.execute(f""" + CREATE TEMP TABLE new_concepts_to_mint AS + WITH missing_uris AS ( + SELECT u.uri FROM new_concept_uris u + LEFT JOIN src_concept_map m ON m.uri = u.uri + WHERE m.our_row_id IS NULL + ), + meta AS ( + SELECT c.uri, MIN(c2.label) AS label, MIN(c2.scheme_name) AS scheme_name, + MIN(c2.scheme_uri) AS scheme_uri + FROM missing_uris c + JOIN (SELECT pid AS uri, label, scheme_name, scheme_uri FROM {OC} + WHERE otype='IdentifiedConcept') c2 ON c2.uri = c.uri + GROUP BY c.uri + ) + SELECT {max_src_row_id_with_map} + DENSE_RANK() OVER (ORDER BY m.uri) AS our_row_id, + m.uri, m.label, m.scheme_name, m.scheme_uri + FROM meta m; + + -- Complete concept lookup: src existing + newly minted + CREATE TEMP TABLE concept_id_lookup AS + SELECT uri, our_row_id FROM src_concept_map + UNION ALL + SELECT uri, our_row_id FROM new_concepts_to_mint; + """) + + n_minted = con.sql("SELECT COUNT(*) FROM new_concepts_to_mint").fetchone()[0] + log(f"minted_concepts={n_minted}", t0) + + # ---- Phase E: build coord table for new MSRs ---------------------------- + # Eric's wide stores geometry as DuckDB GEOMETRY type (spatial extension auto-decodes). + # Our wide stores geometry as BLOB (WKB bytes). Convert with ST_AsWKB() so the + # UNION ALL with src rows (BLOB) does not fail with BLOB->GEOMETRY cast error. + con.execute(""" + CREATE TEMP TABLE new_msr_coords AS + WITH msr_se AS ( + SELECT m.pid, + se.row_id AS se_eric_row_id, + se.p__sample_location + FROM new_msr_eric m, + UNNEST(m.p__produced_by) AS u(se_rid) + JOIN new_se_eric se ON se.row_id = u.se_rid + ) + SELECT ms.pid, + CASE WHEN geo.geometry IS NOT NULL + THEN ST_AsWKB(geo.geometry)::BLOB + ELSE NULL END AS geometry, + geo.latitude, + geo.longitude + FROM msr_se ms, + UNNEST(ms.p__sample_location) AS u(geo_rid) + JOIN new_geo_eric geo ON geo.row_id = u.geo_rid + WHERE geo.latitude IS NOT NULL; + """) + n_coords = con.sql("SELECT COUNT(*) FROM new_msr_coords").fetchone()[0] + n_dup_coords = con.sql( + "SELECT COUNT(*) FROM (SELECT pid FROM new_msr_coords GROUP BY pid HAVING COUNT(*)>1)" + ).fetchone()[0] + log(f"coords: {n_coords:,} pids with coords, {n_dup_coords} duplicate-coord pids", t0) + if n_dup_coords: + sys.exit(f"FATAL: {n_dup_coords} MSR pids have multiple coord rows in the graph path") + + # ---- Phase F: remap p__ arrays for new entities ------------------------- + # Build remapped MSR rows (concept p__ via URI lookup; structural p__ via eric_id_map) + # Using UNNEST WITH ORDINALITY + JOIN + list() aggregation (decorrelated — no correlated subqueries) + log("remapping p__ arrays for new MSR rows...", t0) + + con.execute(""" + -- Pre-aggregate remapped structural arrays for new MSRs + -- p__produced_by (SamplingEvent references) + CREATE TEMP TABLE remap_msr_pb AS + SELECT m.pid, + list(idm.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_msr_eric m, + UNNEST(m.p__produced_by) WITH ORDINALITY AS u(eric_rid, ord) + JOIN eric_id_map idm ON idm.eric_row_id = u.eric_rid + GROUP BY m.pid; + + -- p__has_material_category (concept refs via URI lookup) + CREATE TEMP TABLE remap_msr_mat AS + SELECT m.pid, + list(cl.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_msr_eric m, + UNNEST(m.p__has_material_category) WITH ORDINALITY AS u(eric_rid, ord) + JOIN oc_concept_rows ocr ON ocr.eric_row_id = u.eric_rid + JOIN concept_id_lookup cl ON cl.uri = ocr.uri + GROUP BY m.pid; + + -- p__has_sample_object_type (concept refs) + CREATE TEMP TABLE remap_msr_obj AS + SELECT m.pid, + list(cl.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_msr_eric m, + UNNEST(m.p__has_sample_object_type) WITH ORDINALITY AS u(eric_rid, ord) + JOIN oc_concept_rows ocr ON ocr.eric_row_id = u.eric_rid + JOIN concept_id_lookup cl ON cl.uri = ocr.uri + GROUP BY m.pid; + + -- p__has_context_category (concept refs) + CREATE TEMP TABLE remap_msr_ctx AS + SELECT m.pid, + list(cl.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_msr_eric m, + UNNEST(m.p__has_context_category) WITH ORDINALITY AS u(eric_rid, ord) + JOIN oc_concept_rows ocr ON ocr.eric_row_id = u.eric_rid + JOIN concept_id_lookup cl ON cl.uri = ocr.uri + GROUP BY m.pid; + + -- p__registrant (Agent refs) + CREATE TEMP TABLE remap_msr_reg AS + SELECT m.pid, + list(idm.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_msr_eric m, + UNNEST(m.p__registrant) WITH ORDINALITY AS u(eric_rid, ord) + JOIN eric_id_map idm ON idm.eric_row_id = u.eric_rid + GROUP BY m.pid; + + -- p__keywords (concept refs via URI lookup; same pattern as p__has_material_category) + CREATE TEMP TABLE remap_msr_kw AS + SELECT m.pid, + list(cl.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_msr_eric m, + UNNEST(m.p__keywords) WITH ORDINALITY AS u(eric_rid, ord) + JOIN oc_concept_rows ocr ON ocr.eric_row_id = u.eric_rid + JOIN concept_id_lookup cl ON cl.uri = ocr.uri + GROUP BY m.pid; + + -- p__responsibility (Agent or other entity refs) + CREATE TEMP TABLE remap_msr_resp AS + SELECT m.pid, + list(idm.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_msr_eric m, + UNNEST(m.p__responsibility) WITH ORDINALITY AS u(eric_rid, ord) + JOIN eric_id_map idm ON idm.eric_row_id = u.eric_rid + GROUP BY m.pid; + """) + + # Similarly remap SamplingEvent p__ arrays + con.execute(""" + -- SE p__sample_location (GeoCoordLoc refs) + CREATE TEMP TABLE remap_se_sl AS + SELECT s.pid, + list(idm.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_se_eric s, + UNNEST(s.p__sample_location) WITH ORDINALITY AS u(eric_rid, ord) + JOIN eric_id_map idm ON idm.eric_row_id = u.eric_rid + GROUP BY s.pid; + + -- SE p__sampling_site (SamplingSite refs) + CREATE TEMP TABLE remap_se_ss AS + SELECT s.pid, + list(idm.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_se_eric s, + UNNEST(s.p__sampling_site) WITH ORDINALITY AS u(eric_rid, ord) + JOIN eric_id_map idm ON idm.eric_row_id = u.eric_rid + GROUP BY s.pid; + + -- SamplingSite p__site_location (GeoCoordLoc refs) + CREATE TEMP TABLE remap_site_sl AS + SELECT s.pid, + list(idm.our_row_id::BIGINT ORDER BY u.ord) AS remapped + FROM new_site_eric s, + UNNEST(s.p__site_location) WITH ORDINALITY AS u(eric_rid, ord) + JOIN eric_id_map idm ON idm.eric_row_id = u.eric_rid + GROUP BY s.pid; + """) + log("p__ remapping tables built", t0) + + # ---- trust checks before writing ---------------------------------------- + log("running pre-write trust checks...", t0) + + # Check all new MSR p__produced_by refs resolve + n_unresolved_se = con.sql(""" + SELECT COUNT(*) FROM new_msr_eric m, UNNEST(m.p__produced_by) AS u(rid) + LEFT JOIN eric_id_map idm ON idm.eric_row_id = u.rid + WHERE idm.our_row_id IS NULL + """).fetchone()[0] + if n_unresolved_se: + sys.exit(f"FATAL: {n_unresolved_se} p__produced_by references in new MSRs do not resolve") + + # Check all concept references resolve (via URI) — includes p__keywords + n_unresolved_concepts = con.sql(""" + WITH all_refs AS ( + SELECT m.pid, u.eric_rid FROM new_msr_eric m, UNNEST(m.p__has_material_category) AS u(eric_rid) + UNION ALL + SELECT m.pid, u.eric_rid FROM new_msr_eric m, UNNEST(m.p__has_sample_object_type) AS u(eric_rid) + UNION ALL + SELECT m.pid, u.eric_rid FROM new_msr_eric m, UNNEST(m.p__has_context_category) AS u(eric_rid) + UNION ALL + SELECT m.pid, u.eric_rid FROM new_msr_eric m, UNNEST(m.p__keywords) AS u(eric_rid) + ) + SELECT COUNT(*) FROM all_refs r + LEFT JOIN oc_concept_rows ocr ON ocr.eric_row_id = r.eric_rid + LEFT JOIN concept_id_lookup cl ON cl.uri = ocr.uri + WHERE cl.our_row_id IS NULL + """).fetchone()[0] + if n_unresolved_concepts: + sys.exit(f"FATAL: {n_unresolved_concepts} concept references (including keywords) in new MSRs do not resolve") + + # Check that rows_to_remove doesn't contain any non-OC rows + n_non_oc_removal = con.sql(f""" + SELECT COUNT(*) FROM rows_to_remove rr + JOIN {SRC} s ON s.row_id = rr.row_id AND s.otype='MaterialSampleRecord' + WHERE s.n != '{OC_SOURCE}' + """).fetchone()[0] + if n_non_oc_removal: + sys.exit(f"FATAL: {n_non_oc_removal} removal targets are non-OC MSR rows (would corrupt other sources)") + + # FIX B — SILENT-DROP GUARD: verify that every p__* source array on new rows + # has a 1:1 remapped array (no silently-dropped refs due to inner-join misses). + # + # The remapping tables (remap_msr_pb, remap_se_sl, remap_se_ss, remap_site_sl) + # use INNER JOINs to eric_id_map. If a source row has a ref not in eric_id_map, + # that row simply DISAPPEARS from the remap table, and the LEFT JOIN in the + # write SQL gives NULL for the column — silently dropping the reference. + # + # For each (source_table, p__col, remap_table) pair, we assert: + # every row with a non-null source array has a matching remap row AND + # the remapped array has the same length as the source array. + # Any mismatch → RuntimeError, build aborted. + def _check_remap_length(source_table, pid_col, src_col, remap_table, remap_col, label): + bad = con.sql(f""" + SELECT s.{pid_col}, len(s.{src_col}) AS src_len, COALESCE(len(r.{remap_col}), 0) AS remap_len + FROM {source_table} s + LEFT JOIN {remap_table} r ON r.{pid_col} = s.{pid_col} + WHERE s.{src_col} IS NOT NULL AND len(s.{src_col}) > 0 + AND COALESCE(len(r.{remap_col}), 0) != len(s.{src_col}) + """).fetchall() + if bad: + details = "; ".join(f"{pid_col}={row[0]} src_len={row[1]} remap_len={row[2]}" for row in bad[:5]) + raise RuntimeError( + f"SILENT-DROP GUARD FAIL [{label}]: {len(bad)} rows have mismatched " + f"source vs remapped array lengths. First offenders: {details}. " + f"Check that all referenced entities were extracted before remapping." + ) + + # MSR structural refs + _check_remap_length("new_msr_eric", "pid", "p__produced_by", "remap_msr_pb", "remapped", "MSR.p__produced_by") + _check_remap_length("new_msr_eric", "pid", "p__registrant", "remap_msr_reg", "remapped", "MSR.p__registrant") + _check_remap_length("new_msr_eric", "pid", "p__responsibility", "remap_msr_resp", "remapped", "MSR.p__responsibility") + # MSR concept refs (via URI lookup — p__has_* dims + p__keywords) + _check_remap_length("new_msr_eric", "pid", "p__has_material_category", "remap_msr_mat", "remapped", "MSR.p__has_material_category") + _check_remap_length("new_msr_eric", "pid", "p__has_sample_object_type", "remap_msr_obj", "remapped", "MSR.p__has_sample_object_type") + _check_remap_length("new_msr_eric", "pid", "p__has_context_category", "remap_msr_ctx", "remapped", "MSR.p__has_context_category") + _check_remap_length("new_msr_eric", "pid", "p__keywords", "remap_msr_kw", "remapped", "MSR.p__keywords") + # SE structural refs + _check_remap_length("new_se_eric", "pid", "p__sample_location", "remap_se_sl", "remapped", "SE.p__sample_location") + _check_remap_length("new_se_eric", "pid", "p__sampling_site", "remap_se_ss", "remapped", "SE.p__sampling_site") + # SamplingSite structural refs + _check_remap_length("new_site_eric", "pid", "p__site_location", "remap_site_sl", "remapped", "Site.p__site_location") + log("silent-drop guard: all structural + concept remapped arrays length-verified (PASS)", t0) + + log("trust checks passed", t0) + + # ---- compute expected output row count ----------------------------------- + n_src = con.sql(f"SELECT COUNT(*) FROM {SRC}").fetchone()[0] + n_new_entities = n_id_map # entities in eric_id_map + n_out_expected = n_src - n_rows_to_remove + n_new_entities + n_minted + log(f"expected output rows: {n_src:,} src - {n_rows_to_remove:,} removed + " + f"{n_new_entities:,} new entities + {n_minted} concepts = {n_out_expected:,}", t0) + + if args.dry_run: + log("DRY RUN: skipping write step", t0) + print("\n=== DRY RUN SUMMARY ===") + print(f" removed_pids: {n_removed_pids:,}") + print(f" orphan_rows: {total_orphan_rows - n_removed_pids:,}") + print(f" total_rows_removed: {n_rows_to_remove:,}") + print(f" new_pids: {n_new_pids:,}") + print(f" new_entities: {n_new_entities:,}") + print(f" minted_concepts: {n_minted}") + print(f" expected_out: {n_out_expected:,}") + print(f" trust_checks: PASS") + return 0 + + # ---- Phase I: write output ----------------------------------------------- + log("writing output...", t0) + + # Build the column list for new MSR rows + # For each column in src_cols, produce an expression that maps Eric's data to our schema + # The key transformations: + # row_id -> from eric_id_map + # n -> 'OPENCONTEXT' + # geometry/latitude/longitude -> from new_msr_coords + # p__produced_by -> from remap_msr_pb + # p__has_material_category -> from remap_msr_mat + # p__has_sample_object_type -> from remap_msr_obj + # p__has_context_category -> from remap_msr_ctx + # p__registrant -> from remap_msr_reg + # p__keywords -> from remap_msr_kw + # p__responsibility -> from remap_msr_resp + # p__curation -> NULL + # p__related_resource -> NULL + # all others -> direct from new_msr_eric + + # New MSR rows SELECT + msr_select_cols = [] + for col, typ in src_cols: + if col == "row_id": + msr_select_cols.append(f"idm.our_row_id::BIGINT AS row_id") + elif col == "n": + msr_select_cols.append(f"'{OC_SOURCE}'::VARCHAR AS n") + elif col == "geometry": + msr_select_cols.append(f"coords.geometry AS geometry") + elif col == "latitude": + msr_select_cols.append(f"coords.latitude AS latitude") + elif col == "longitude": + msr_select_cols.append(f"coords.longitude AS longitude") + elif col == "p__produced_by": + msr_select_cols.append(f"rmap_pb.remapped::{typ} AS p__produced_by") + elif col == "p__has_material_category": + msr_select_cols.append(f"rmap_mat.remapped::{typ} AS p__has_material_category") + elif col == "p__has_sample_object_type": + msr_select_cols.append(f"rmap_obj.remapped::{typ} AS p__has_sample_object_type") + elif col == "p__has_context_category": + msr_select_cols.append(f"rmap_ctx.remapped::{typ} AS p__has_context_category") + elif col == "p__registrant": + msr_select_cols.append(f"rmap_reg.remapped::{typ} AS p__registrant") + elif col == "p__keywords": + msr_select_cols.append(f"rmap_kw.remapped::{typ} AS p__keywords") + elif col == "p__responsibility": + msr_select_cols.append(f"rmap_resp.remapped::{typ} AS p__responsibility") + elif col in OUR_ONLY_COLS: + msr_select_cols.append(f"NULL::{typ} AS {col}") + elif col in oc_colnames: + msr_select_cols.append(f"m.{col}::{typ} AS {col}") + else: + msr_select_cols.append(f"NULL::{typ} AS {col}") + + msr_select = ",\n ".join(msr_select_cols) + + # New SE rows SELECT (remapped p__sample_location and p__sampling_site) + se_select_cols = [] + for col, typ in src_cols: + if col == "row_id": + se_select_cols.append(f"idm.our_row_id::BIGINT AS row_id") + elif col == "p__sample_location": + se_select_cols.append(f"rmap_sl.remapped::{typ} AS p__sample_location") + elif col == "p__sampling_site": + se_select_cols.append(f"rmap_ss.remapped::{typ} AS p__sampling_site") + elif col in OUR_ONLY_COLS: + se_select_cols.append(f"NULL::{typ} AS {col}") + elif col in oc_colnames: + se_select_cols.append(f"s.{col}::{typ} AS {col}") + else: + se_select_cols.append(f"NULL::{typ} AS {col}") + se_select = ",\n ".join(se_select_cols) + + # New SamplingSite rows SELECT (remapped p__site_location) + site_select_cols = [] + for col, typ in src_cols: + if col == "row_id": + site_select_cols.append(f"idm.our_row_id::BIGINT AS row_id") + elif col == "p__site_location": + site_select_cols.append(f"rmap_site_sl.remapped::{typ} AS p__site_location") + elif col in OUR_ONLY_COLS: + site_select_cols.append(f"NULL::{typ} AS {col}") + elif col in oc_colnames: + site_select_cols.append(f"st.{col}::{typ} AS {col}") + else: + site_select_cols.append(f"NULL::{typ} AS {col}") + site_select = ",\n ".join(site_select_cols) + + # Generic entity SELECT (Geo, Agent: just row_id remapped, all other cols direct) + # geometry: Eric's wide has GEOMETRY type (spatial extension), our wide has BLOB (WKB). + # Convert with ST_AsWKB() for GEOMETRY-typed columns; BLOB columns pass through directly. + def generic_entity_select(alias, table_alias, eric_geo_is_geometry=False): + parts = [] + for col, typ in src_cols: + if col == "row_id": + parts.append(f"idm.our_row_id::BIGINT AS row_id") + elif col == "geometry" and eric_geo_is_geometry: + # GeoCoordLoc in Eric's wide stores geometry as GEOMETRY type + parts.append( + f"CASE WHEN {table_alias}.geometry IS NOT NULL " + f"THEN ST_AsWKB({table_alias}.geometry)::BLOB " + f"ELSE NULL END AS geometry" + ) + elif col in OUR_ONLY_COLS: + parts.append(f"NULL::{typ} AS {col}") + elif col in oc_colnames: + parts.append(f"{table_alias}.{col}::{typ} AS {col}") + else: + parts.append(f"NULL::{typ} AS {col}") + return ",\n ".join(parts) + + # GeoCoordLoc in Eric's wide has geometry as GEOMETRY type (auto-decoded by spatial extension) + geo_select = generic_entity_select("g", "g", eric_geo_is_geometry=True) + agent_select = generic_entity_select("a", "a") + + # Minted concept rows SELECT + concept_select_cols = [] + for col, typ in src_cols: + mapping = { + "row_id": f"nc.our_row_id::BIGINT", + "pid": "nc.uri::VARCHAR", + "otype": "'IdentifiedConcept'::VARCHAR", + "label": "nc.label::VARCHAR", + "scheme_name": "nc.scheme_name::VARCHAR", + "scheme_uri": "nc.scheme_uri::VARCHAR", + } + if col in mapping: + concept_select_cols.append(f"{mapping[col]} AS {col}") + else: + concept_select_cols.append(f"NULL::{typ} AS {col}") + concept_select = ",\n ".join(concept_select_cols) + + write_sql = f""" + COPY ( + -- 1. Surviving src rows (all rows NOT in the removal set) + SELECT * FROM {SRC} + WHERE row_id NOT IN (SELECT row_id FROM rows_to_remove) + + UNION ALL BY NAME + + -- 2. New MaterialSampleRecord rows (remapped + denormalized coords) + SELECT {msr_select} + FROM new_msr_eric m + JOIN eric_id_map idm ON idm.eric_row_id = m.row_id + LEFT JOIN new_msr_coords coords ON coords.pid = m.pid + LEFT JOIN remap_msr_pb rmap_pb ON rmap_pb.pid = m.pid + LEFT JOIN remap_msr_mat rmap_mat ON rmap_mat.pid = m.pid + LEFT JOIN remap_msr_obj rmap_obj ON rmap_obj.pid = m.pid + LEFT JOIN remap_msr_ctx rmap_ctx ON rmap_ctx.pid = m.pid + LEFT JOIN remap_msr_reg rmap_reg ON rmap_reg.pid = m.pid + LEFT JOIN remap_msr_kw rmap_kw ON rmap_kw.pid = m.pid + LEFT JOIN remap_msr_resp rmap_resp ON rmap_resp.pid = m.pid + + UNION ALL BY NAME + + -- 3. New SamplingEvent rows (remapped structural arrays) + SELECT {se_select} + FROM new_se_eric s + JOIN eric_id_map idm ON idm.eric_row_id = s.row_id + LEFT JOIN remap_se_sl rmap_sl ON rmap_sl.pid = s.pid + LEFT JOIN remap_se_ss rmap_ss ON rmap_ss.pid = s.pid + + UNION ALL BY NAME + + -- 4. New GeospatialCoordLocation rows (just row_id remapped) + SELECT {geo_select} + FROM new_geo_eric g + JOIN eric_id_map idm ON idm.eric_row_id = g.row_id + + UNION ALL BY NAME + + -- 5. New SamplingSite rows (remapped p__site_location) + SELECT {site_select} + FROM new_site_eric st + JOIN eric_id_map idm ON idm.eric_row_id = st.row_id + LEFT JOIN remap_site_sl rmap_site_sl ON rmap_site_sl.pid = st.pid + + UNION ALL BY NAME + + -- 6. New Agent rows + SELECT {agent_select} + FROM new_agent_eric a + JOIN eric_id_map idm ON idm.eric_row_id = a.row_id + + UNION ALL BY NAME + + -- 7. Minted IdentifiedConcept rows + SELECT {concept_select} + FROM new_concepts_to_mint nc + + ORDER BY row_id + ) TO '{args.out}' (FORMAT PARQUET, COMPRESSION ZSTD) + """ + + con.execute(write_sql) + log(f"wrote {args.out}", t0) + + # ---- post-write verification -------------------------------------------- + OUT = f"read_parquet('{args.out}')" + n_out = con.sql(f"SELECT COUNT(*) FROM {OUT}").fetchone()[0] + if n_out != n_out_expected: + sys.exit(f"FATAL: row count {n_out:,} != expected {n_out_expected:,}. " + f"(src={n_src:,} - removed={n_rows_to_remove:,} + " + f"new={n_new_entities:,} + minted={n_minted})") + + n_dup_out_rowid = con.sql( + f"SELECT COUNT(*) FROM (SELECT row_id FROM {OUT} GROUP BY row_id HAVING COUNT(*)>1)" + ).fetchone()[0] + if n_dup_out_rowid: + sys.exit(f"FATAL: {n_dup_out_rowid} duplicate row_ids in output") + + n_dup_out_pid = con.sql( + f"SELECT COUNT(*) FROM (SELECT pid FROM {OUT} WHERE otype='MaterialSampleRecord' " + f"GROUP BY pid HAVING COUNT(*)>1)" + ).fetchone()[0] + if n_dup_out_pid: + sys.exit(f"FATAL: {n_dup_out_pid} duplicate MaterialSampleRecord pids in output") + + # Verify n='OPENCONTEXT' on ALL new MSR rows in output + n_wrong_n = con.sql(f""" + SELECT COUNT(*) FROM {OUT} + WHERE otype='MaterialSampleRecord' AND n!='{OC_SOURCE}' + AND pid IN (SELECT pid FROM new_pids) + """).fetchone()[0] + if n_wrong_n: + sys.exit(f"FATAL: {n_wrong_n} new MSR rows have n != '{OC_SOURCE}'") + + # Verify NONE of the removed pids remain in output + n_stale_pids_remain = con.sql(f""" + SELECT COUNT(*) FROM {OUT} + WHERE otype='MaterialSampleRecord' AND n='{OC_SOURCE}' + AND pid IN (SELECT pid FROM removed_pids) + """).fetchone()[0] + if n_stale_pids_remain: + sys.exit(f"FATAL: {n_stale_pids_remain} stale (removed) pids remain in output") + + out_oc_count = con.sql( + f"SELECT COUNT(*) FROM {OUT} WHERE otype='MaterialSampleRecord' AND n='{OC_SOURCE}'" + ).fetchone()[0] + log(f"post-write: rows={n_out:,} dup_rowids={n_dup_out_rowid} " + f"dup_pids={n_dup_out_pid} oc_msrs={out_oc_count:,} " + f"stale_remain={n_stale_pids_remain} n_check=PASS", t0) + + # ---- Mandatory in-script dangling-ref gate: ALL rows, ALL p__* columns ---- + # Scan EVERY p__* array column across the ENTIRE output (not just new rows). + # Surviving src rows must also be checked because orphan deletion can create + # dangling refs in old rows (e.g. surviving SamplingSites whose p__site_location + # pointed at geos that were incorrectly deleted as orphans). + # This is a HARD FAIL if any dangling ref is found — build aborted. + log("running mandatory dangling-ref gate on ALL rows, ALL p__* columns...", t0) + p_ref_cols = [ + col for col, typ in src_cols + if col.startswith("p__") + and any(t in typ.upper() for t in ("BIGINT", "INTEGER")) + ] + n_total_dangling = 0 + dangling_details = {} + out_row_ids_subq = f"SELECT row_id FROM {OUT}" + for p_col in p_ref_cols: + # Check ALL rows in output (both surviving src rows and new rows) + n_dangle = con.sql(f""" + WITH all_row_ids AS ({out_row_ids_subq}), + refs AS ( + SELECT unnest(w.{p_col}) AS ref_id + FROM {OUT} w + WHERE w.{p_col} IS NOT NULL AND len(w.{p_col}) > 0 + ) + SELECT COUNT(*) + FROM refs + LEFT JOIN all_row_ids ON refs.ref_id = all_row_ids.row_id + WHERE all_row_ids.row_id IS NULL + """).fetchone()[0] + dangling_details[p_col] = n_dangle + n_total_dangling += n_dangle + for col, cnt in sorted(dangling_details.items()): + print(f" {col}: {cnt} dangling refs", flush=True) + if n_total_dangling: + raise RuntimeError( + f"INTEGRITY FAIL: {n_total_dangling} dangling references in output. " + f"Per-column: {dangling_details}. Build aborted — do NOT emit manifest." + ) + log(f"Dangling ref check: PASS (0 dangling across {len(p_ref_cols)} columns)", t0) + + # ---- Phase J: description enrichment (#277) --------------------------------- + # OC sample descriptions in the combined wide are terse LD metadata strings + # ('updated': 2023-10-05...) instead of the human-readable site-path strings + # ('Open Context published "Sample" from: Europe/Cyprus/PKAP Survey Area/...') + # present in Eric's OC wide. Overwrite `description` for ALL OC MSR pids in + # the output from Eric's wide. This covers both existing ~1.04M pids that + # survived the sync and the 67,187 newly added pids. + # + # Implementation: single DuckDB COPY rewriting only the description column + # for OC MSR rows; all other columns and all non-OC rows pass through as-is. + # Row counts are invariant (JOIN on pid, not a filter). + log("description enrichment (#277): copying OC descriptions from Eric's wide…", t0) + + tmp_enriched = args.out + ".enriching.tmp" + # Use a UNION ALL approach for efficiency: join only OC MSR rows (1.1M) with + # Eric's wide for descriptions, then pass all non-OC rows through unchanged. + # This avoids a full-scan LEFT JOIN on 20M rows (which materializes the full + # wide in memory and is very slow). Both branches use SELECT * REPLACE for + # schema-agnostic column handling. + con.execute(f""" + COPY ( + -- OC MSR rows: overwrite description from Eric's wide where available + SELECT w.* REPLACE ( + CASE WHEN oc.description IS NOT NULL THEN oc.description ELSE w.description END AS description + ) + FROM read_parquet('{args.out}') w + LEFT JOIN ( + SELECT pid, description FROM {OC} WHERE otype='MaterialSampleRecord' + ) oc ON oc.pid = w.pid + WHERE w.otype='MaterialSampleRecord' AND w.n='{OC_SOURCE}' + + UNION ALL BY NAME + + -- All non-OC rows: pass through unchanged (no description modification) + SELECT * FROM read_parquet('{args.out}') + WHERE NOT (otype='MaterialSampleRecord' AND n='{OC_SOURCE}') + + ORDER BY row_id + ) TO '{tmp_enriched}' (FORMAT PARQUET, COMPRESSION ZSTD) + """) + + # Trust gate: verify row count is unchanged, then check Cyprus count + n_enriched = con.sql(f"SELECT COUNT(*) FROM read_parquet('{tmp_enriched}')").fetchone()[0] + if n_enriched != n_out: + os.unlink(tmp_enriched) + sys.exit(f"FATAL: description enrichment changed row count {n_out:,} → {n_enriched:,}") + + n_cyprus = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{tmp_enriched}') " + f"WHERE otype='MaterialSampleRecord' AND n='{OC_SOURCE}' " + f"AND description ILIKE '%Cyprus%'" + ).fetchone()[0] + log(f"description enrichment trust gate: Cyprus OC MSR count = {n_cyprus:,} (expect ≈ 69,230)", t0) + # Hard trust gate: only applies at production scale (out_oc_count > 1M) to avoid + # false-positive failures on small synthetic fixtures that lack Cyprus descriptions. + # Threshold 69,000 is conservative relative to the observed production count of 69,230. + CYPRUS_THRESHOLD = 69000 + if out_oc_count > 1_000_000 and n_cyprus < CYPRUS_THRESHOLD: + os.unlink(tmp_enriched) + raise RuntimeError( + f"Trust gate FAIL: Cyprus description count {n_cyprus:,} < {CYPRUS_THRESHOLD:,} threshold. " + f"Description enrichment may have failed or OC wide is missing Cyprus data." + ) + + # Atomically replace the output with the enriched version + os.replace(tmp_enriched, args.out) + log(f"description enrichment complete: replaced {args.out}", t0) + + # ---- manifest ----------------------------------------------------------- + if not args.no_manifest: + manifest = { + "script": os.path.basename(__file__), + "argv": sys.argv, + "git_sha": git_sha(), + "duckdb_version": duckdb.__version__, + "policy": ("TRUE SYNC: add new OC pids + remove stale OC pids from Eric's fresh OC PQG wide " + "(#272 phase 2, D3 decision 2026-06-12)"), + "inputs": { + "src": {"path": args.src, "bytes": os.path.getsize(args.src), + "sha256": sha256_file(args.src)}, + "oc_wide": {"path": args.oc_wide, "bytes": os.path.getsize(args.oc_wide), + "sha256": sha256_file(args.oc_wide)}, + }, + "counts": { + "src_rows": n_src, + "removed_pids": n_removed_pids, + "orphan_rows": total_orphan_rows - n_removed_pids, + "total_rows_removed": n_rows_to_remove, + "orphan_breakdown": orphan_counts, + "new_pids": n_new_pids, + "new_entity_rows": n_new_entities, + "minted_concepts": n_minted, + "out_rows": n_out, + "new_oc_msr_total": out_oc_count, + "entity_breakdown": counts, + }, + "output": {"path": args.out, "bytes": os.path.getsize(args.out), + "sha256": sha256_file(args.out)}, + } + mpath = args.out + ".manifest.json" + with open(mpath, "w") as fh: + json.dump(manifest, fh, indent=2) + log(f"manifest -> {mpath}", t0) + + log("done", t0) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/validate_frontend_derived.py b/scripts/validate_frontend_derived.py index 9976e26..8af61c2 100755 --- a/scripts/validate_frontend_derived.py +++ b/scripts/validate_frontend_derived.py @@ -112,8 +112,9 @@ def scalar(sql): check("facets.pid == map_lite.pid", diff == 0, f"{diff} pids differ between facets and map_lite") # --- 5. ALGEBRA: facet_summaries == GROUP BY facets (per dim) --- + # NOTE: build_frontend_derived.py filters both NULL and empty-string values (#283a fix). recompute = " UNION ALL ".join( - f"SELECT '{d}' AS facet_type, {d} AS facet_value, COUNT(*) AS c FROM {F} WHERE {d} IS NOT NULL GROUP BY {d}" + f"SELECT '{d}' AS facet_type, {d} AS facet_value, COUNT(*) AS c FROM {F} WHERE {d} IS NOT NULL AND {d} <> '' GROUP BY {d}" for d in ("source", "material", "context", "object_type")) mismatch = scalar(f""" WITH recomputed AS ({recompute}), @@ -125,15 +126,19 @@ def scalar(sql): check("facet_summaries == GROUP BY facets", mismatch == 0, f"{mismatch} (facet_type,value,count) rows disagree") check("facet_summaries.scheme all NULL", scalar(f"SELECT COUNT(*) FROM {S} WHERE scheme IS NOT NULL") == 0, "non-NULL scheme rows (contract: scheme is NULL)") + # --- 5b. blank facet values absent (#283a) — also catches whitespace-only values --- + check("facet_summaries no blank values (#283a)", scalar(f"SELECT COUNT(*) FROM {S} WHERE TRIM(facet_value) = ''") == 0, + "blank/whitespace-only facet_value rows (want 0; caused by GEOME empty-string concept URI)") # --- 6. ALGEBRA: facet_cross_filter single-dim rows == conditional GROUP BY facets --- + # NOTE: build_frontend_derived.py filters both NULL and empty-string values (#283a fix). dims = ("source", "material", "context", "object_type") parts = [] for filt in dims: for fd in dims: parts.append( f"SELECT '{filt}' AS fcol, {filt} AS fval, '{fd}' AS facet_type, {fd} AS facet_value, COUNT(*) AS c " - f"FROM {F} WHERE {filt} IS NOT NULL AND {fd} IS NOT NULL GROUP BY {filt}, {fd}") + f"FROM {F} WHERE {filt} IS NOT NULL AND {filt} <> '' AND {fd} IS NOT NULL AND {fd} <> '' GROUP BY {filt}, {fd}") recompute_cf = " UNION ALL ".join(parts) # normalize cross_filter single-dim rows into (fcol, fval, facet_type, facet_value, count) cf_single = f""" diff --git a/tests/test_ingest_oc_records.py b/tests/test_ingest_oc_records.py new file mode 100644 index 0000000..e8feebe --- /dev/null +++ b/tests/test_ingest_oc_records.py @@ -0,0 +1,1526 @@ +"""Fast, AI-free fixture tests for the OC record ingestion (#272 Phase 2). + +Builds tiny synthetic src-wide + oc-wide parquet pairs, runs the real +ingest script against them, and asserts the contract: + + TRUE SYNC behavior (D3 decision): + - New pids (Eric's \ src) are ingested with full entity subgraph + - Stale pids (src \ Eric's) are REMOVED along with orphan subgraph entities + - Shared entities (referenced by both surviving and removed MSRs) are kept + - Surviving non-OC rows are byte-identical + + Entity subgraph: + - MaterialSampleRecord + SamplingEvent + GeospatialCoordLocation + SamplingSite + Agent + - row_id remapping: new entities get deterministic ids starting at max(src)+1 + - p__ arrays remapped from Eric's integer space to our BIGINT space + - geometry denormalized from GeoCoordLoc onto MSR rows (WKB BLOB) + - n='OPENCONTEXT' on new MSR rows (Eric's wide has NULL) + + Trust-gate invariants: + - Hard-fail on duplicate OC MSR pids in Eric's wide + - Hard-fail on new pids that already exist in src wide + - Hard-fail on unresolved p__ references in new rows + - Row count arithmetic verified post-write + - No removed pids remain in output + + Determinism: + - Same inputs → bit-identical output (--no-manifest mode) + +Run: pytest tests/test_ingest_oc_records.py -q (needs: duckdb, spatial, h3) +""" +import hashlib +import json +import os +import subprocess +import sys + +import duckdb +import pytest + +HERE = os.path.dirname(os.path.abspath(__file__)) +REPO = os.path.dirname(HERE) +INGEST = os.path.join(REPO, "scripts", "ingest_oc_records.py") + +# Vocabulary URI prefixes for test fixtures +MAT = "https://w3id.org/isample/vocabulary/material/1.0/" +OBJ = "https://w3id.org/isample/vocabulary/materialsampleobjecttype/1.0/" +SF = "https://w3id.org/isample/vocabulary/sampledfeature/1.0/" +ROOT_MAT = MAT + "material" + + +# ---- fixture-building helpers ----------------------------------------------- + +def build_src_wide(path, *, msr_rows, concept_rows, se_rows, geo_rows, + site_rows=None, agent_rows=None, extra_rows=None): + """Build a minimal src wide parquet with the specified entity rows. + + msr_rows: list of dict with keys: row_id, pid, n, p__produced_by (list of ints), + p__has_material_category, p__has_sample_object_type, p__has_context_category + (lists of ints), geometry (WKB BLOB bytes or None), latitude, longitude + concept_rows: list of (row_id, uri) + se_rows: list of (row_id, pid, p__sample_location [list of int], p__sampling_site [list of int]) + geo_rows: list of (row_id, pid, latitude, longitude) — geometry will be ST_AsWKB(ST_Point) + site_rows: list of (row_id, pid, p__site_location [list of int]) + agent_rows: list of (row_id, pid) + """ + con = duckdb.connect() + con.execute("INSTALL spatial; LOAD spatial;") + + def _arr(xs, t="BIGINT[]"): + if xs is None: + return f"NULL::{t}" + return "[" + ",".join(str(x) for x in xs) + f"]::{t}" + + rows = [] + + # Concept rows + for rid, uri in concept_rows: + rows.append( + f"SELECT {rid}::BIGINT AS row_id, '{uri}' AS pid, 'IdentifiedConcept' AS otype, " + f"NULL::VARCHAR AS n, NULL::BLOB AS geometry, NULL::DOUBLE AS latitude, " + f"NULL::DOUBLE AS longitude, NULL::VARCHAR AS label, NULL::VARCHAR AS description, " + f"NULL::VARCHAR[] AS place_name, NULL::TIMESTAMP AS result_time, " + f"NULL::BIGINT[] AS p__has_material_category, NULL::BIGINT[] AS p__has_sample_object_type, " + f"NULL::BIGINT[] AS p__has_context_category, NULL::BIGINT[] AS p__produced_by, " + f"NULL::BIGINT[] AS p__sample_location, NULL::BIGINT[] AS p__sampling_site, " + f"NULL::BIGINT[] AS p__site_location, NULL::BIGINT[] AS p__registrant, " + f"NULL::BIGINT[] AS p__keywords, NULL::BIGINT[] AS p__responsibility, " + f"NULL::INTEGER[] AS p__curation, NULL::BIGINT[] AS p__related_resource, " + f"NULL::VARCHAR AS thumbnail_url, NULL::VARCHAR AS scheme_name, NULL::VARCHAR AS scheme_uri" + ) + + # MSR rows + for m in msr_rows: + lat = m.get("latitude") + lon = m.get("longitude") + if lat is not None and lon is not None: + geom_expr = f"ST_AsWKB(ST_Point({lon}, {lat}))::BLOB" + else: + geom_expr = "NULL::BLOB" + lat_expr = f"{lat}::DOUBLE" if lat is not None else "NULL::DOUBLE" + lon_expr = f"{lon}::DOUBLE" if lon is not None else "NULL::DOUBLE" + pid = m['pid'] + n_val = m.get('n', 'OPENCONTEXT') + rows.append( + f"SELECT {m['row_id']}::BIGINT, '{pid}', 'MaterialSampleRecord', " + f"'{n_val}'::VARCHAR, " + f"{geom_expr}, {lat_expr}, {lon_expr}, " + f"'label {pid}', 'desc {pid}', " + f"['place1']::VARCHAR[], NULL::TIMESTAMP, " + f"{_arr(m.get('p__has_material_category'))}, " + f"{_arr(m.get('p__has_sample_object_type'))}, " + f"{_arr(m.get('p__has_context_category'))}, " + f"{_arr(m.get('p__produced_by'))}, " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], " + f"{_arr(m.get('p__registrant'))}, " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::INTEGER[], NULL::BIGINT[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + # SE rows + for rid, pid, sample_loc, sampling_site in (se_rows or []): + rows.append( + f"SELECT {rid}::BIGINT, '{pid}', 'SamplingEvent', NULL::VARCHAR, " + f"NULL::BLOB, NULL::DOUBLE, NULL::DOUBLE, NULL, NULL, NULL::VARCHAR[], NULL::TIMESTAMP, " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], " + f"{_arr(sample_loc)}, {_arr(sampling_site)}, NULL::BIGINT[], " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], NULL::INTEGER[], NULL::BIGINT[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + # Geo rows + for rid, pid, lat, lon in (geo_rows or []): + rows.append( + f"SELECT {rid}::BIGINT, '{pid}', 'GeospatialCoordLocation', NULL::VARCHAR, " + f"ST_AsWKB(ST_Point({lon}, {lat}))::BLOB, {lat}::DOUBLE, {lon}::DOUBLE, " + f"NULL, NULL, NULL::VARCHAR[], NULL::TIMESTAMP, " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::INTEGER[], NULL::BIGINT[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + # SamplingSite rows + for rid, pid, site_loc in (site_rows or []): + rows.append( + f"SELECT {rid}::BIGINT, '{pid}', 'SamplingSite', NULL::VARCHAR, " + f"NULL::BLOB, NULL::DOUBLE, NULL::DOUBLE, NULL, NULL, NULL::VARCHAR[], NULL::TIMESTAMP, " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], " + f"NULL::BIGINT[], NULL::BIGINT[], {_arr(site_loc)}, " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], NULL::INTEGER[], NULL::BIGINT[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + # Agent rows + for rid, pid in (agent_rows or []): + rows.append( + f"SELECT {rid}::BIGINT, '{pid}', 'Agent', NULL::VARCHAR, " + f"NULL::BLOB, NULL::DOUBLE, NULL::DOUBLE, NULL, NULL, NULL::VARCHAR[], NULL::TIMESTAMP, " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], NULL::BIGINT[], " + f"NULL::BIGINT[], NULL::BIGINT[], NULL::INTEGER[], NULL::BIGINT[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + # Extra rows (raw SQL) + if extra_rows: + rows.extend(extra_rows) + + con.execute(f"COPY ({' UNION ALL '.join(rows)}) TO '{path}' (FORMAT PARQUET)") + con.close() + + +def build_oc_wide(path, *, msr_rows, concept_rows, se_rows, geo_rows, + site_rows=None, agent_rows=None): + """Build a minimal OC wide parquet in Eric's schema (INTEGER row_id, GEOMETRY geometry). + + geo_rows: list of (row_id, pid, latitude, longitude) — geometry stored as GEOMETRY type + """ + con = duckdb.connect() + con.execute("INSTALL spatial; LOAD spatial;") + + def _arr(xs, t="INTEGER[]"): + if xs is None: + return f"NULL::{t}" + return "[" + ",".join(str(x) for x in xs) + f"]::{t}" + + rows = [] + + # Concept rows + for rid, uri, label in concept_rows: + rows.append( + f"SELECT {rid}::INTEGER AS row_id, '{uri}' AS pid, 'IdentifiedConcept' AS otype, " + f"NULL::VARCHAR AS n, NULL::GEOMETRY AS geometry, NULL::DOUBLE AS latitude, " + f"NULL::DOUBLE AS longitude, {repr(label)}::VARCHAR AS label, " + f"NULL::VARCHAR AS description, NULL::VARCHAR[] AS place_name, NULL::TIMESTAMP AS result_time, " + f"NULL::INTEGER[] AS p__has_material_category, NULL::INTEGER[] AS p__has_sample_object_type, " + f"NULL::INTEGER[] AS p__has_context_category, NULL::INTEGER[] AS p__produced_by, " + f"NULL::INTEGER[] AS p__sample_location, NULL::INTEGER[] AS p__sampling_site, " + f"NULL::INTEGER[] AS p__site_location, NULL::INTEGER[] AS p__registrant, " + f"NULL::INTEGER[] AS p__keywords, NULL::INTEGER[] AS p__responsibility, " + f"NULL::VARCHAR AS thumbnail_url, NULL::VARCHAR AS scheme_name, NULL::VARCHAR AS scheme_uri" + ) + + # MSR rows (no geometry on MSR in Eric's wide) + for m in msr_rows: + pid = m['pid'] + rows.append( + f"SELECT {m['row_id']}::INTEGER, '{pid}', 'MaterialSampleRecord', " + f"NULL::VARCHAR, " # n is NULL in Eric's wide + f"NULL::GEOMETRY, NULL::DOUBLE, NULL::DOUBLE, " + f"'label {pid}', 'desc {pid}', " + f"['place1']::VARCHAR[], NULL::TIMESTAMP, " + f"{_arr(m.get('p__has_material_category'))}, " + f"{_arr(m.get('p__has_sample_object_type'))}, " + f"{_arr(m.get('p__has_context_category'))}, " + f"{_arr(m.get('p__produced_by'))}, " + f"NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], " + f"{_arr(m.get('p__registrant'))}, " + f"{_arr(m.get('p__keywords'))}, NULL::INTEGER[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + # SE rows + for rid, pid, sample_loc, sampling_site in (se_rows or []): + rows.append( + f"SELECT {rid}::INTEGER, '{pid}', 'SamplingEvent', NULL::VARCHAR, " + f"NULL::GEOMETRY, NULL::DOUBLE, NULL::DOUBLE, NULL, NULL, NULL::VARCHAR[], NULL::TIMESTAMP, " + f"NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], " + f"{_arr(sample_loc)}, {_arr(sampling_site)}, NULL::INTEGER[], " + f"NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + # Geo rows (GEOMETRY type in Eric's wide) + for rid, pid, lat, lon in (geo_rows or []): + rows.append( + f"SELECT {rid}::INTEGER, '{pid}', 'GeospatialCoordLocation', NULL::VARCHAR, " + f"ST_Point({lon}, {lat})::GEOMETRY, {lat}::DOUBLE, {lon}::DOUBLE, " + f"NULL, NULL, NULL::VARCHAR[], NULL::TIMESTAMP, " + f"NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], " + f"NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], " + f"NULL::INTEGER[], NULL::INTEGER[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + # SamplingSite rows + for rid, pid, site_loc in (site_rows or []): + rows.append( + f"SELECT {rid}::INTEGER, '{pid}', 'SamplingSite', NULL::VARCHAR, " + f"NULL::GEOMETRY, NULL::DOUBLE, NULL::DOUBLE, NULL, NULL, NULL::VARCHAR[], NULL::TIMESTAMP, " + f"NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], " + f"NULL::INTEGER[], NULL::INTEGER[], {_arr(site_loc)}, " + f"NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + # Agent rows + for rid, pid in (agent_rows or []): + rows.append( + f"SELECT {rid}::INTEGER, '{pid}', 'Agent', NULL::VARCHAR, " + f"NULL::GEOMETRY, NULL::DOUBLE, NULL::DOUBLE, NULL, NULL, NULL::VARCHAR[], NULL::TIMESTAMP, " + f"NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], " + f"NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], NULL::INTEGER[], " + f"NULL::INTEGER[], NULL::INTEGER[], " + f"NULL::VARCHAR, NULL::VARCHAR, NULL::VARCHAR" + ) + + con.execute(f"COPY ({' UNION ALL '.join(rows)}) TO '{path}' (FORMAT PARQUET)") + con.close() + + +def run_ingest(src, oc, out, extra_args=None): + cmd = [sys.executable, INGEST, "--src", src, "--oc-wide", oc, "--out", out, + "--no-manifest"] + if extra_args: + cmd.extend(extra_args) + return subprocess.run(cmd, capture_output=True, text=True) + + +def count_otype(path, otype): + con = duckdb.connect() + n = con.sql(f"SELECT COUNT(*) FROM read_parquet('{path}') WHERE otype='{otype}'").fetchone()[0] + con.close() + return n + + +def get_msr(path, pid): + con = duckdb.connect() + r = con.sql(f"SELECT * FROM read_parquet('{path}') WHERE pid='{pid}' AND otype='MaterialSampleRecord'").fetchone() + desc = con.sql(f"DESCRIBE SELECT * FROM read_parquet('{path}')").fetchall() + cols = [d[0] for d in desc] + con.close() + if r is None: + return None + return dict(zip(cols, r)) + + +# ---- shared fixture --------------------------------------------------------- + +# Concept IDs in src space (BIGINT) +SRC_ROOT_CONCEPT_ID = 1 +SRC_ROCK_CONCEPT_ID = 2 +SRC_ARTIFACT_CONCEPT_ID = 3 + +SRC_CONCEPT_ROWS = [ + (SRC_ROOT_CONCEPT_ID, ROOT_MAT), + (SRC_ROCK_CONCEPT_ID, MAT + "rock"), + (SRC_ARTIFACT_CONCEPT_ID, OBJ + "artifact"), +] + +# OC concept IDs in Eric's space (INTEGER) +OC_ROOT_CONCEPT_ID = 901 +OC_ROCK_CONCEPT_ID = 902 +OC_ARTIFACT_CONCEPT_ID = 903 +OC_EARTH_CONCEPT_ID = 904 # earthsurface — not yet in src + +OC_CONCEPT_ROWS = [ + (OC_ROOT_CONCEPT_ID, ROOT_MAT, "Material"), + (OC_ROCK_CONCEPT_ID, MAT + "rock", "Rock"), + (OC_ARTIFACT_CONCEPT_ID, OBJ + "artifact", "Artifact"), + (OC_EARTH_CONCEPT_ID, SF + "earthsurface", "Earth Surface"), +] + +# Eric's subgraph: 3 SEs, 3 Geos, 2 sites +# New pids MSR: pid-A (se=101->geo=201), pid-B (se=102->geo=202, site=301->geo_site=211) +# Removed pids: pid-C (se=103->geo=203) — in src, not in Eric's → stale +OC_SE_ROWS = [ + (101, "se-pid-A", [201], None), # SE for pid-A + (102, "se-pid-B", [202], [301]), # SE for pid-B (with sampling site) +] +OC_SITE_ROWS = [ + (301, "site-pid-B", [211]), # SamplingSite for pid-B, geo=211 +] +OC_GEO_ROWS = [ + (201, "geo-pid-A", 45.0, 10.0), + (202, "geo-pid-B", 50.0, 15.0), + (211, "geo-site-B", 50.1, 15.1), # geo from SamplingSite for pid-B +] +OC_MSR_ROWS = [ + {"row_id": 1, "pid": "pid-A", "p__produced_by": [101], + "p__has_material_category": [OC_ROCK_CONCEPT_ID], + "p__has_sample_object_type": [OC_ARTIFACT_CONCEPT_ID], + "p__has_context_category": [OC_EARTH_CONCEPT_ID]}, # earthsurface to be minted + {"row_id": 2, "pid": "pid-B", "p__produced_by": [102], + "p__has_material_category": [OC_ROOT_CONCEPT_ID], + "p__has_sample_object_type": [OC_ARTIFACT_CONCEPT_ID], + "p__has_context_category": None}, +] + +# src wide: has pid-C (stale — not in Eric's), + all existing entities +# pid-C: se_id=103, geo_id=203 — both orphans (not shared with any surviving MSR) +SRC_SE_ROWS = [ + # pid-C's SE (will become orphan) + (103, "se-pid-C", [203], None), +] +SRC_GEO_ROWS = [ + (203, "geo-pid-C", 60.0, 20.0), # orphan geo for pid-C +] +SRC_MSR_ROWS = [ + {"row_id": 1000, "pid": "pid-C", "n": "OPENCONTEXT", + "p__produced_by": [103], + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "p__has_sample_object_type": [SRC_ARTIFACT_CONCEPT_ID], + "latitude": 60.0, "longitude": 20.0}, + # non-OC MSR — must survive unchanged + {"row_id": 1001, "pid": "pid-NON-OC", "n": "SESAR", + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "latitude": 55.0, "longitude": 25.0}, +] +SRC_AGENT_ROWS = [(500, "agent-existing")] # pre-existing agent + + +@pytest.fixture +def pair(tmp_path): + """Canonical 3-MSR fixture: pid-A (new), pid-B (new), pid-C (stale/removed).""" + src = str(tmp_path / "src.parquet") + oc = str(tmp_path / "oc.parquet") + out = str(tmp_path / "out.parquet") + + build_src_wide( + src, + msr_rows=SRC_MSR_ROWS, + concept_rows=SRC_CONCEPT_ROWS, + se_rows=SRC_SE_ROWS, + geo_rows=SRC_GEO_ROWS, + agent_rows=SRC_AGENT_ROWS, + ) + build_oc_wide( + oc, + msr_rows=OC_MSR_ROWS, + concept_rows=OC_CONCEPT_ROWS, + se_rows=OC_SE_ROWS, + geo_rows=OC_GEO_ROWS, + site_rows=OC_SITE_ROWS, + ) + return src, oc, out + + +# ---- tests ------------------------------------------------------------------ + +def test_new_pids_ingested(pair): + """pid-A and pid-B (new) are present in output.""" + src, oc, out = pair + r = run_ingest(src, oc, out) + assert r.returncode == 0, r.stderr + r.stdout + assert get_msr(out, "pid-A") is not None, "pid-A missing from output" + assert get_msr(out, "pid-B") is not None, "pid-B missing from output" + + +def test_stale_pid_removed(pair): + """pid-C (stale) is NOT in output.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + assert get_msr(out, "pid-C") is None, "pid-C (stale) should have been removed" + + +def test_orphan_subgraph_entities_removed(pair): + """se-pid-C and geo-pid-C (orphans) are NOT in output.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + con = duckdb.connect() + n_se = con.sql(f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='se-pid-C'").fetchone()[0] + n_geo = con.sql(f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='geo-pid-C'").fetchone()[0] + con.close() + assert n_se == 0, "orphan SE se-pid-C should have been removed" + assert n_geo == 0, "orphan geo geo-pid-C should have been removed" + + +def test_non_oc_rows_survive_unchanged(pair): + """pid-NON-OC (SESAR) is present and byte-identical.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + r_src = get_msr(src, "pid-NON-OC") + r_out = get_msr(out, "pid-NON-OC") + assert r_out is not None, "non-OC MSR should survive" + assert r_src["pid"] == r_out["pid"] + assert r_src["n"] == r_out["n"] + assert r_src["latitude"] == r_out["latitude"] + assert r_src["longitude"] == r_out["longitude"] + + +def test_geometry_denormalized_onto_new_msr(pair): + """New MSR pid-A gets lat/lon from linked GeoCoordLoc (via SE).""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + r = get_msr(out, "pid-A") + assert r is not None + assert abs(r["latitude"] - 45.0) < 1e-5, f"lat wrong: {r['latitude']}" + assert abs(r["longitude"] - 10.0) < 1e-5, f"lon wrong: {r['longitude']}" + assert r["geometry"] is not None, "geometry should be non-null (WKB BLOB)" + + +def test_n_column_set_on_new_msrs(pair): + """New MSR rows have n='OPENCONTEXT'.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + assert get_msr(out, "pid-A")["n"] == "OPENCONTEXT" + assert get_msr(out, "pid-B")["n"] == "OPENCONTEXT" + + +def test_p_array_remapped_to_output_id_space(pair): + """p__produced_by on new MSR pid-A resolves to a SE row in the output.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + con = duckdb.connect() + # Find the SE row_id stored in pid-A's p__produced_by + pb = con.sql(f""" + SELECT p__produced_by[1] FROM read_parquet('{out}') + WHERE pid='pid-A' AND otype='MaterialSampleRecord' + """).fetchone()[0] + # Verify that row_id exists in the output as a SamplingEvent + se_exists = con.sql(f""" + SELECT COUNT(*) FROM read_parquet('{out}') + WHERE row_id = {pb} AND otype='SamplingEvent' + """).fetchone()[0] + con.close() + assert pb is not None, "p__produced_by must be non-null" + assert se_exists == 1, f"SE row_id {pb} not found in output" + + +def test_concept_remap_via_uri_lookup(pair): + """p__has_material_category on new MSR pid-A resolves to the src 'rock' concept row.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + con = duckdb.connect() + # Get the concept row_id used by pid-A in the output + mat_rid = con.sql(f""" + SELECT p__has_material_category[1] FROM read_parquet('{out}') + WHERE pid='pid-A' AND otype='MaterialSampleRecord' + """).fetchone()[0] + # Verify it resolves to the 'rock' URI + rock_uri = con.sql(f""" + SELECT pid FROM read_parquet('{out}') + WHERE row_id = {mat_rid} AND otype='IdentifiedConcept' + """).fetchone()[0] + con.close() + assert rock_uri == MAT + "rock", f"concept URI wrong: {rock_uri}" + + +def test_minted_concept_earthsurface(pair): + """earthsurface concept is minted when absent from src (referenced by pid-A's context).""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + con = duckdb.connect() + n = con.sql(f""" + SELECT COUNT(*) FROM read_parquet('{out}') + WHERE otype='IdentifiedConcept' AND pid='{SF}earthsurface' + """).fetchone()[0] + con.close() + assert n == 1, "earthsurface concept should be minted in output" + + +def test_sampling_site_ingested_for_pid_b(pair): + """SamplingSite row (site-pid-B) is present in output for pid-B's chain.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + con = duckdb.connect() + n = con.sql(f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='site-pid-B'").fetchone()[0] + con.close() + assert n == 1, "SamplingSite site-pid-B should be in output" + + +def test_row_count_arithmetic(pair): + """Output row count = (src - removed) + new_entities + minted_concepts.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + con = duckdb.connect() + n_src = con.sql(f"SELECT COUNT(*) FROM read_parquet('{src}')").fetchone()[0] + n_out = con.sql(f"SELECT COUNT(*) FROM read_parquet('{out}')").fetchone()[0] + # Removed: 1 MSR (pid-C) + 1 SE (se-pid-C) + 1 geo (geo-pid-C) = 3 rows + # New entities: 2 MSR + 2 SE + 3 Geo (201, 202, 211) + 1 Site = 8 rows + # Minted: 1 (earthsurface) + # Expected: n_src - 3 + 8 + 1 + expected = n_src - 3 + 8 + 1 + con.close() + assert n_out == expected, f"row count {n_out} != expected {expected}" + + +def test_no_duplicate_row_ids(pair): + """Output has no duplicate row_ids.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + con = duckdb.connect() + n_dup = con.sql(f""" + SELECT COUNT(*) FROM ( + SELECT row_id FROM read_parquet('{out}') + GROUP BY row_id HAVING COUNT(*) > 1 + ) + """).fetchone()[0] + con.close() + assert n_dup == 0, f"{n_dup} duplicate row_ids in output" + + +def test_no_duplicate_msr_pids(pair): + """Output has no duplicate MSR pids.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + con = duckdb.connect() + n_dup = con.sql(f""" + SELECT COUNT(*) FROM ( + SELECT pid FROM read_parquet('{out}') WHERE otype='MaterialSampleRecord' + GROUP BY pid HAVING COUNT(*) > 1 + ) + """).fetchone()[0] + con.close() + assert n_dup == 0, f"{n_dup} duplicate MSR pids in output" + + +def test_determinism_bit_identical(pair, tmp_path): + """Same inputs → bit-identical outputs (--no-manifest suppresses timestamp drift).""" + src, oc, out = pair + out2 = str(tmp_path / "out2.parquet") + assert run_ingest(src, oc, out).returncode == 0 + assert run_ingest(src, oc, out2).returncode == 0 + h = lambda p: hashlib.sha256(open(p, "rb").read()).hexdigest() + assert h(out) == h(out2), "outputs not bit-identical across two runs with same inputs" + + +def test_dry_run_produces_no_output(pair): + """--dry-run exits 0 but does NOT write the output file.""" + src, oc, out = pair + r = run_ingest(src, oc, out, extra_args=["--dry-run"]) + assert r.returncode == 0, r.stderr + r.stdout + assert not os.path.exists(out), "--dry-run should not write output" + assert "DRY RUN" in r.stdout + + +def test_hard_fail_on_duplicate_oc_pids(pair, tmp_path): + """Eric's wide with duplicate MSR pids triggers a hard failure.""" + src, oc, out = pair + dup_oc = str(tmp_path / "oc_dup.parquet") + # Build OC with pid-A appearing twice + build_oc_wide( + dup_oc, + msr_rows=OC_MSR_ROWS + [{"row_id": 99, "pid": "pid-A", + "p__produced_by": [101], + "p__has_material_category": [OC_ROCK_CONCEPT_ID]}], + concept_rows=OC_CONCEPT_ROWS, + se_rows=OC_SE_ROWS, + geo_rows=OC_GEO_ROWS, + ) + r = run_ingest(src, dup_oc, out) + assert r.returncode != 0, "should fail on duplicate OC pids" + assert "duplicate" in (r.stderr + r.stdout).lower() + assert not os.path.exists(out) + + +def test_hard_fail_new_pid_already_in_src(tmp_path): + """A 'new' pid already in src as non-OC row triggers a hard failure.""" + src = str(tmp_path / "src.parquet") + oc = str(tmp_path / "oc.parquet") + out = str(tmp_path / "out.parquet") + + # Build src with pid-A as SESAR (not OC) + pid-C as OC (will be "removed") + build_src_wide( + src, + msr_rows=[ + {"row_id": 1000, "pid": "pid-A", "n": "SESAR", + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "latitude": 45.0, "longitude": 10.0}, + {"row_id": 1001, "pid": "pid-C", "n": "OPENCONTEXT", + "p__produced_by": [103], + "p__has_material_category": [SRC_ROCK_CONCEPT_ID]}, + ], + concept_rows=SRC_CONCEPT_ROWS, + se_rows=[(103, "se-pid-C", [203], None)], + geo_rows=[(203, "geo-pid-C", 60.0, 20.0)], + ) + build_oc_wide( + oc, + msr_rows=OC_MSR_ROWS, # pid-A is "new" from OC's perspective + concept_rows=OC_CONCEPT_ROWS, + se_rows=OC_SE_ROWS, + geo_rows=OC_GEO_ROWS, + ) + r = run_ingest(src, oc, out) + assert r.returncode != 0, "should fail — pid-A is 'new' from OC but already in src as SESAR" + assert not os.path.exists(out) + + +def test_removal_only_removes_oc_entities(pair): + """The non-OC MSR (pid-NON-OC) and its row are not in the removal set.""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + # non-OC row must still be in output + r = get_msr(out, "pid-NON-OC") + assert r is not None, "non-OC MSR should not be removed" + assert r["n"] == "SESAR" + + +def test_new_row_ids_no_collision_with_src(pair): + """All new row_ids are strictly greater than max(src.row_id).""" + src, oc, out = pair + assert run_ingest(src, oc, out).returncode == 0 + con = duckdb.connect() + max_src = con.sql(f"SELECT MAX(row_id) FROM read_parquet('{src}')").fetchone()[0] + # New rows start at max_src+1 — get all rows NOT in src + src_ids = set(r[0] for r in con.sql(f"SELECT row_id FROM read_parquet('{src}')").fetchall()) + out_ids = set(r[0] for r in con.sql(f"SELECT row_id FROM read_parquet('{out}')").fetchall()) + new_ids = out_ids - src_ids + con.close() + # Removed rows are also gone; new ids are all > max_src + if new_ids: + assert min(new_ids) > max_src, f"New ids start at {min(new_ids)}, but max_src={max_src}" + + +def test_refuses_to_overwrite_input(pair): + """--out same as --src triggers a hard failure.""" + src, oc, _ = pair + r = run_ingest(src, oc, src) + assert r.returncode != 0 + assert "overwrite" in (r.stderr + r.stdout).lower() + + +# ============================================================================ +# Fix #277 — OC description enrichment +# ============================================================================ + +def test_oc_description_enriched_from_eric_wide(pair): + """OC MSR pid-A gets its description from Eric's OC wide after ingestion. + + The src wide stores 'desc pid-A' (a placeholder). Eric's wide also stores + 'desc pid-A' by default from build_oc_wide(). We override pid-A's description + in Eric's wide to a realistic site-path string and verify the output carries + that enriched value, not the src placeholder. + """ + src, oc, out = pair + + # Patch Eric's wide to have a realistic description for pid-A. + # We rebuild oc with a custom description for pid-A. + oc_patched = out.replace("out.parquet", "oc_patched.parquet") + con = duckdb.connect() + con.execute("INSTALL spatial; LOAD spatial;") + # Read Eric's wide into a temp table, update pid-A's description, rewrite. + con.execute(f""" + COPY ( + SELECT + row_id, pid, otype, n, geometry, latitude, longitude, + CASE WHEN pid='pid-A' AND otype='MaterialSampleRecord' + THEN 'Open Context published "Sample" from: Europe/Cyprus/PKAP Survey Area/Unit 42' + ELSE label + END AS label, + CASE WHEN pid='pid-A' AND otype='MaterialSampleRecord' + THEN 'Open Context published "Sample" from: Europe/Cyprus/PKAP Survey Area/Unit 42' + ELSE description + END AS description, + place_name, result_time, p__has_material_category, p__has_sample_object_type, + p__has_context_category, p__produced_by, p__sample_location, p__sampling_site, + p__site_location, p__registrant, p__keywords, p__responsibility, + thumbnail_url, scheme_name, scheme_uri + FROM read_parquet('{oc}') + ) TO '{oc_patched}' (FORMAT PARQUET) + """) + con.close() + + r = run_ingest(src, oc_patched, out) + assert r.returncode == 0, r.stderr + r.stdout + + row = get_msr(out, "pid-A") + assert row is not None + assert "Cyprus" in row["description"], ( + f"Expected enriched description with 'Cyprus', got: {row['description']!r}" + ) + + +def test_non_oc_description_unchanged_by_enrichment(pair): + """Non-OC MSR (pid-NON-OC) description is not overwritten by the OC enrichment.""" + src, oc, out = pair + r = run_ingest(src, oc, out) + assert r.returncode == 0, r.stderr + r.stdout + + src_row = get_msr(src, "pid-NON-OC") + out_row = get_msr(out, "pid-NON-OC") + assert out_row is not None + # Non-OC rows must have same description as in src (enrichment must not touch them) + assert out_row["description"] == src_row["description"], ( + f"Non-OC description changed: {src_row['description']!r} → {out_row['description']!r}" + ) + + +def test_oc_msr_count_unchanged_by_enrichment(pair): + """Description enrichment does not change the OC MSR row count.""" + src, oc, out = pair + r = run_ingest(src, oc, out) + assert r.returncode == 0, r.stderr + r.stdout + + con = duckdb.connect() + n_total = con.sql(f"SELECT COUNT(*) FROM read_parquet('{out}')").fetchone()[0] + n_oc_msr = con.sql(f""" + SELECT COUNT(*) FROM read_parquet('{out}') + WHERE otype='MaterialSampleRecord' AND n='OPENCONTEXT' + """).fetchone()[0] + con.close() + # 2 new OC MSRs (pid-A, pid-B), 1 removed (pid-C), 1 non-OC (pid-NON-OC) → 2 total OC MSRs + assert n_oc_msr == 2, f"Expected 2 OC MSRs after sync, got {n_oc_msr}" + # Row count must match the sync arithmetic (n_src - 3 removed + 8 new + 1 minted) + con2 = duckdb.connect() + n_src = con2.sql(f"SELECT COUNT(*) FROM read_parquet('{src}')").fetchone()[0] + con2.close() + assert n_total == n_src - 3 + 8 + 1, f"Total row count unexpected: {n_total}" + + +# ============================================================================ +# Fix #283a — Empty-string facet filter +# ============================================================================ + +def test_empty_string_facet_values_filtered_from_summaries(tmp_path): + """build_facet_summaries must not produce rows with facet_value=''. + + This is a synthetic test: we build a tiny samp_geo with an empty-string + context value and verify it does NOT appear in facet_summaries output. + """ + import duckdb as _duckdb + BUILD = os.path.join(REPO, "scripts", "build_frontend_derived.py") + sys.path.insert(0, os.path.join(REPO, "scripts")) + import build_frontend_derived as B + + con = _duckdb.connect() + con.execute("INSTALL h3 FROM community; LOAD h3; INSTALL spatial; LOAD spatial;") + # Create a synthetic samp_geo with an empty-string context and a real one + con.execute(""" + CREATE OR REPLACE TEMP TABLE samp_geo AS + SELECT 'pid1' AS pid, 'GEOME' AS source, + 'https://w3id.org/isample/vocabulary/material/1.0/rock' AS material, + '' AS context, -- empty-string concept URI (the bug scenario) + 'https://w3id.org/isample/vocabulary/materialsampleobjecttype/1.0/artifact' AS object_type, + 'label1' AS label, 'desc1' AS description, + NULL::VARCHAR AS place_name, NULL::TIMESTAMP AS result_time, + 10.0::DOUBLE AS latitude, 45.0::DOUBLE AS longitude, + 1::UBIGINT AS h3_res4, 2::UBIGINT AS h3_res6, 3::UBIGINT AS h3_res8 + UNION ALL + SELECT 'pid2', 'GEOME', + 'https://w3id.org/isample/vocabulary/material/1.0/rock', + 'https://w3id.org/isample/vocabulary/sampledfeature/1.0/earthsurface', + 'https://w3id.org/isample/vocabulary/materialsampleobjecttype/1.0/artifact', + 'label2', 'desc2', NULL, NULL, 11.0, 46.0, 1, 2, 3 + """) + + out = str(tmp_path / "facet_summaries.parquet") + B.build_facet_summaries(con, out) + + rows = con.sql(f"SELECT * FROM read_parquet('{out}') WHERE facet_value = ''").fetchall() + assert rows == [], ( + f"Expected no blank facet_value rows, but got: {rows}" + ) + # Real context value should appear + real_rows = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE facet_type='context' AND facet_value != ''" + ).fetchone()[0] + assert real_rows >= 1, "Expected at least one non-blank context facet row" + + +def test_empty_string_facet_values_filtered_from_cross_filter(tmp_path): + """build_facet_cross_filter must not produce rows with blank facet_value.""" + import duckdb as _duckdb + sys.path.insert(0, os.path.join(REPO, "scripts")) + import build_frontend_derived as B + + con = _duckdb.connect() + con.execute("INSTALL h3 FROM community; LOAD h3; INSTALL spatial; LOAD spatial;") + con.execute(""" + CREATE OR REPLACE TEMP TABLE samp_geo AS + SELECT 'pid1' AS pid, 'GEOME' AS source, + 'https://w3id.org/isample/vocabulary/material/1.0/rock' AS material, + '' AS context, + 'https://w3id.org/isample/vocabulary/materialsampleobjecttype/1.0/artifact' AS object_type, + 'label1' AS label, 'desc1' AS description, + NULL::VARCHAR AS place_name, NULL::TIMESTAMP AS result_time, + 10.0::DOUBLE AS latitude, 45.0::DOUBLE AS longitude, + 1::UBIGINT AS h3_res4, 2::UBIGINT AS h3_res6, 3::UBIGINT AS h3_res8 + """) + + out = str(tmp_path / "facet_cross_filter.parquet") + B.build_facet_cross_filter(con, out) + + blank_rows = con.sql(f"SELECT * FROM read_parquet('{out}') WHERE facet_value = ''").fetchall() + assert blank_rows == [], ( + f"Expected no blank facet_value in cross_filter, got: {blank_rows}" + ) + blank_filter_rows = con.sql( + f"SELECT * FROM read_parquet('{out}') WHERE filter_context = ''" + ).fetchall() + assert blank_filter_rows == [], ( + f"Expected no blank filter_context in cross_filter, got: {blank_filter_rows}" + ) + + +# ============================================================================ +# Fix #283b — specimentype/1.0 vocab labels +# ============================================================================ + +SPEC_URI_SOLID = "https://w3id.org/isample/vocabulary/specimentype/1.0/othersolidobject" +SPEC_URI_PHYS = "https://w3id.org/isample/vocabulary/specimentype/1.0/physicalspecimen" + +# Optional fast-path: if ISAMPLES_VOCAB_LABELS points at an already-built +# vocab_labels.parquet, reuse it; otherwise (CI / fresh checkout) we rebuild +# it on the fly. No machine-specific default — avoids leaking a local path. +VOCAB_LABELS_PATH = os.environ.get("ISAMPLES_VOCAB_LABELS", "") + + +def _get_vocab_labels_parquet(): + """Return a path to vocab_labels.parquet, building it if needed.""" + if VOCAB_LABELS_PATH and os.path.exists(VOCAB_LABELS_PATH): + return VOCAB_LABELS_PATH + # Build into a temp file for CI / offline environments. + BUILD_VL = os.path.join(REPO, "scripts", "build_vocab_labels.py") + import tempfile + tmp = tempfile.mktemp(suffix=".parquet") + result = subprocess.run( + [sys.executable, BUILD_VL, "-o", tmp], + capture_output=True, text=True + ) + if result.returncode != 0: + pytest.skip(f"build_vocab_labels.py failed (network?): {result.stderr[:200]}") + return tmp + + +def test_specimentype_othersolidobject_in_vocab_labels(): + """specimentype/1.0/othersolidobject must be present with label 'Other solid object'.""" + vl = _get_vocab_labels_parquet() + con = duckdb.connect() + row = con.sql( + f"SELECT pref_label FROM read_parquet('{vl}') WHERE uri='{SPEC_URI_SOLID}'" + ).fetchone() + con.close() + assert row is not None, f"{SPEC_URI_SOLID!r} not found in vocab_labels" + assert row[0] == "Other solid object", f"Expected 'Other solid object', got {row[0]!r}" + + +def test_specimentype_physicalspecimen_in_vocab_labels(): + """specimentype/1.0/physicalspecimen must be present with label 'Material sample'.""" + vl = _get_vocab_labels_parquet() + con = duckdb.connect() + row = con.sql( + f"SELECT pref_label FROM read_parquet('{vl}') WHERE uri='{SPEC_URI_PHYS}'" + ).fetchone() + con.close() + assert row is not None, f"{SPEC_URI_PHYS!r} not found in vocab_labels" + assert row[0] == "Material sample", f"Expected 'Material sample', got {row[0]!r}" + + +def test_specimentype_labels_have_lang_en(): + """Both specimentype manual overrides must have lang='en'.""" + vl = _get_vocab_labels_parquet() + con = duckdb.connect() + rows = con.sql( + f"SELECT uri, lang FROM read_parquet('{vl}') WHERE uri LIKE '%specimentype%'" + ).fetchall() + con.close() + assert len(rows) == 2, f"Expected 2 specimentype rows, got {len(rows)}: {rows}" + for uri, lang in rows: + assert lang == "en", f"Expected lang='en' for {uri!r}, got {lang!r}" + + +# ============================================================================ +# Blocker 1 — cross-source orphan protection (Nit C + Nit D) +# ============================================================================ + +def test_cross_source_shared_entity_not_orphaned(tmp_path): + """SE and SamplingSite shared between a removed OC MSR and a surviving SESAR MSR + must NOT be deleted as orphans. + + Scenario: + src: + - OC MSR pid='OC_remove_me' → SE row_id=100 → SamplingSite row_id=200 + - SE row_id=100 + - SamplingSite row_id=200 + - SESAR MSR pid='SESAR_keep_me' ALSO → SE row_id=100 + SamplingSite row_id=200 + Eric's OC wide: + - pid='NEW_OC_pid' (new OC record, NOT 'OC_remove_me' → it's gone) + + After sync: + - 'OC_remove_me' is removed (not in Eric's wide) + - SE row_id=100 is STILL referenced by 'SESAR_keep_me' → must survive + - SamplingSite row_id=200 is STILL referenced by 'SESAR_keep_me' → must survive + + Old code (surviving_se_refs filtered to n='OPENCONTEXT') would incorrectly + mark SE 100 and Site 200 as orphans and delete them, breaking the SESAR MSR. + New code (all-source surviving refs) must keep them. + """ + src = str(tmp_path / "src_b1.parquet") + oc = str(tmp_path / "oc_b1.parquet") + out = str(tmp_path / "out_b1.parquet") + + # row_ids in src space + SE_ROW_ID = 100 + SITE_ROW_ID = 200 + GEO_ROW_ID = 300 + + # Build src wide: OC MSR to-be-removed + SESAR MSR sharing SE+Site + build_src_wide( + src, + msr_rows=[ + # OC MSR that will be removed (not in Eric's wide) + { + "row_id": 1000, "pid": "OC_remove_me", "n": "OPENCONTEXT", + "p__produced_by": [SE_ROW_ID], + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "latitude": 45.0, "longitude": 10.0, + }, + # SESAR MSR that ALSO references SE 100 + Site 200 (shared!) + { + "row_id": 1001, "pid": "SESAR_keep_me", "n": "SESAR", + "p__produced_by": [SE_ROW_ID], + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "latitude": 45.1, "longitude": 10.1, + }, + ], + concept_rows=SRC_CONCEPT_ROWS, + se_rows=[ + # SE shared between OC + SESAR MSRs + (SE_ROW_ID, "se-shared", [GEO_ROW_ID], [SITE_ROW_ID]), + ], + geo_rows=[ + (GEO_ROW_ID, "geo-shared", 45.0, 10.0), + ], + site_rows=[ + (SITE_ROW_ID, "site-shared", [GEO_ROW_ID]), + ], + ) + + # Build Eric's OC wide: NEW_OC_pid (new), NOT OC_remove_me → it becomes stale + # Use a simple SE + geo for the new OC record + build_oc_wide( + oc, + msr_rows=[ + { + "row_id": 1, "pid": "NEW_OC_pid", + "p__produced_by": [501], + "p__has_material_category": [OC_ROCK_CONCEPT_ID], + }, + ], + concept_rows=OC_CONCEPT_ROWS, + se_rows=[(501, "se-new", [601], None)], + geo_rows=[(601, "geo-new", 46.0, 11.0)], + ) + + r = run_ingest(src, oc, out) + assert r.returncode == 0, f"ingest failed:\nSTDERR: {r.stderr}\nSTDOUT: {r.stdout}" + + con = duckdb.connect() + + # SESAR MSR must survive + sesar_row = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='SESAR_keep_me' AND otype='MaterialSampleRecord'" + ).fetchone()[0] + assert sesar_row == 1, "SESAR_keep_me MSR should survive — it was not an OC record" + + # OC MSR must be gone + oc_row = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='OC_remove_me' AND otype='MaterialSampleRecord'" + ).fetchone()[0] + assert oc_row == 0, "OC_remove_me MSR should have been removed" + + # Shared SE must survive (still referenced by SESAR_keep_me) + se_count = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='se-shared' AND otype='SamplingEvent'" + ).fetchone()[0] + assert se_count == 1, ( + "se-shared (SE row_id=100) must NOT be orphaned — still referenced by SESAR_keep_me" + ) + + # Shared SamplingSite must survive + site_count = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='site-shared' AND otype='SamplingSite'" + ).fetchone()[0] + assert site_count == 1, ( + "site-shared (SamplingSite row_id=200) must NOT be orphaned — still referenced by SESAR_keep_me's SE" + ) + + # Shared Geo must survive + geo_count = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='geo-shared' AND otype='GeospatialCoordLocation'" + ).fetchone()[0] + assert geo_count == 1, ( + "geo-shared (Geo row_id=300) must NOT be orphaned — still referenced by the shared SE" + ) + + # SESAR MSR's p__produced_by must still resolve to a valid SE + pb = con.sql(f""" + SELECT p__produced_by[1] FROM read_parquet('{out}') + WHERE pid='SESAR_keep_me' AND otype='MaterialSampleRecord' + """).fetchone()[0] + if pb is not None: + se_exists = con.sql(f""" + SELECT COUNT(*) FROM read_parquet('{out}') + WHERE row_id = {pb} AND otype='SamplingEvent' + """).fetchone()[0] + assert se_exists == 1, f"SESAR_keep_me p__produced_by row_id={pb} not found in output (dangling ref!)" + + con.close() + + +# ============================================================================ +# Fix #272 Phase 5 — surviving SamplingSite's p__site_location Geo not orphaned +# ============================================================================ + +def test_site_location_geo_not_orphaned(tmp_path): + """A GeospatialCoordLocation referenced by a surviving SamplingSite via + p__site_location must NOT be deleted as an orphan even if it is also + referenced by an orphan SamplingEvent via p__sample_location. + + Scenario: + src: + - OC MSR pid='OC_removed', n='OPENCONTEXT' + → p__produced_by=[10] (SE row_id=10) + - SE row_id=10, p__sample_location=[20], p__sampling_site=[30] + - Geo row_id=20 + - SamplingSite row_id=30, p__site_location=[20] ← same Geo! + - SESAR MSR pid='SESAR_kept', n='SESAR' + → p__produced_by=[40] (SE row_id=40) + - SE row_id=40, p__sampling_site=[30] ← references surviving Site + + Eric's OC wide: does NOT contain 'OC_removed' → it becomes stale + - Contains 'NEW_OC_pid' (new) + + After sync: + - 'OC_removed' MSR is removed + - SE row_id=10 is orphaned (only referenced by removed OC MSR's p__produced_by) + - SamplingSite row_id=30 SURVIVES (referenced by SESAR SE row_id=40) + - GeospatialCoordLocation row_id=20 SURVIVES (referenced by surviving Site + row_id=30 via p__site_location) — this is the critical assertion + - Zero dangling p__site_location refs in output + + OLD CODE (BUG): surviving_geo_refs only checked p__sample_location on SEs. + Since SE row_id=10 is orphaned, Geo row_id=20 appeared to have NO surviving + SE references → incorrectly deleted → dangling p__site_location on Site row_id=30. + + NEW CODE (FIX): surviving_geo_refs also checks p__site_location on non-orphan + SamplingSites → Geo row_id=20 is retained. + """ + src = str(tmp_path / "src_sl.parquet") + oc = str(tmp_path / "oc_sl.parquet") + out = str(tmp_path / "out_sl.parquet") + + GEO_ROW_ID = 20 + SE_OC_ROW_ID = 10 + SITE_ROW_ID = 30 + SE_SESAR_ROW_ID = 40 + + build_src_wide( + src, + msr_rows=[ + # OC MSR that will be removed (not in Eric's wide) + { + "row_id": 1000, "pid": "OC_removed", "n": "OPENCONTEXT", + "p__produced_by": [SE_OC_ROW_ID], + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "latitude": 45.0, "longitude": 10.0, + }, + # SESAR MSR whose SE references the surviving SamplingSite + { + "row_id": 1001, "pid": "SESAR_kept", "n": "SESAR", + "p__produced_by": [SE_SESAR_ROW_ID], + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "latitude": 45.1, "longitude": 10.1, + }, + ], + concept_rows=SRC_CONCEPT_ROWS, + se_rows=[ + # OC's SE: sample_location → Geo 20, sampling_site → Site 30 + (SE_OC_ROW_ID, "se-oc", [GEO_ROW_ID], [SITE_ROW_ID]), + # SESAR's SE: sampling_site → Site 30 (keeps Site alive) + (SE_SESAR_ROW_ID, "se-sesar", None, [SITE_ROW_ID]), + ], + geo_rows=[ + (GEO_ROW_ID, "geo-shared", 45.0, 10.0), + ], + site_rows=[ + # SamplingSite references Geo via p__site_location + (SITE_ROW_ID, "site-shared", [GEO_ROW_ID]), + ], + ) + + # Eric's OC wide: NEW_OC_pid only (OC_removed is absent → stale) + build_oc_wide( + oc, + msr_rows=[ + { + "row_id": 1, "pid": "NEW_OC_pid", + "p__produced_by": [501], + "p__has_material_category": [OC_ROCK_CONCEPT_ID], + }, + ], + concept_rows=OC_CONCEPT_ROWS, + se_rows=[(501, "se-new-oc", [601], None)], + geo_rows=[(601, "geo-new-oc", 46.0, 11.0)], + ) + + r = run_ingest(src, oc, out) + assert r.returncode == 0, f"ingest failed:\nSTDERR: {r.stderr}\nSTDOUT: {r.stdout}" + + con = duckdb.connect() + + # OC_removed must be gone + assert con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='OC_removed'" + ).fetchone()[0] == 0, "OC_removed must be removed" + + # SESAR_kept must survive + assert con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='SESAR_kept' AND otype='MaterialSampleRecord'" + ).fetchone()[0] == 1, "SESAR_kept must survive" + + # SE row_id=10 (OC's SE) must be orphaned + assert con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='se-oc' AND otype='SamplingEvent'" + ).fetchone()[0] == 0, "se-oc (orphan SE) must be removed" + + # SamplingSite row_id=30 must survive (referenced by SESAR SE) + assert con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='site-shared' AND otype='SamplingSite'" + ).fetchone()[0] == 1, "site-shared must survive — still referenced by SESAR's SE" + + # GeospatialCoordLocation row_id=20 MUST survive (this is the critical fix assertion) + assert con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='geo-shared' AND otype='GeospatialCoordLocation'" + ).fetchone()[0] == 1, ( + "geo-shared (Geo row_id=20) must NOT be orphaned — " + "surviving SamplingSite still references it via p__site_location" + ) + + # Zero dangling p__site_location refs in output (the production-scale symptom) + dangling = con.sql(f""" + WITH all_row_ids AS (SELECT row_id FROM read_parquet('{out}')), + site_refs AS ( + SELECT unnest(p__site_location) AS ref_id + FROM read_parquet('{out}') + WHERE otype='SamplingSite' + AND p__site_location IS NOT NULL AND len(p__site_location) > 0 + ) + SELECT COUNT(*) FROM site_refs + LEFT JOIN all_row_ids ON site_refs.ref_id = all_row_ids.row_id + WHERE all_row_ids.row_id IS NULL + """).fetchone()[0] + assert dangling == 0, f"p__site_location dangling refs: {dangling} (expected 0)" + + con.close() + + +# ============================================================================ +# Fix A — Fixpoint orphan removal (R2-A) +# ============================================================================ + +def test_orphan_geo_via_site_only_removed(tmp_path): + """A Geo referenced ONLY via an orphan SamplingSite's p__site_location + (and NOT by any surviving SE's p__sample_location) must be REMOVED. + + This tests Fix A's over-retention correction: the fixpoint algorithm must + not retain a Geo that appears only in an orphan chain with no surviving refs. + + Scenario: + src: + - OC MSR pid='OC_gone' → SE row_id=10 → Site row_id=20 → Geo row_id=30 + - SE row_id=10, p__sample_location=[], p__sampling_site=[20] + (SE has NO direct p__sample_location — only a site reference) + - SamplingSite row_id=20, p__site_location=[30] + - Geo row_id=30 ← referenced ONLY via orphan Site, no surviving refs + - NO other MSR references SE 10, Site 20, or Geo 30 + + Eric's OC wide: does NOT contain 'OC_gone' → stale; has 'NEW_OC_pid' + + After sync: + - 'OC_gone' MSR removed + - SE row_id=10 is orphan (no surviving MSR's p__produced_by points to it) + - SamplingSite row_id=20 is orphan (SE 10 is removed; no other ref) + - Geo row_id=30 is orphan (Site 20 is removed; no other ref) + → ALL THREE must be REMOVED + + OLD CODE BUG (pre-phase-5 path-specific logic): surviving_geo_refs included + Geos referenced by non-orphan SamplingSites — but Phase 5 only checked Site + surviving status by whether the Site appeared in surviving_site_refs, which + depended on a hand-coded SE→Site chain. If the agent traversal missed a path, + a Geo could be incorrectly retained. + + FIXPOINT: correctly computes survivor_refs from all surviving rows; since no + surviving row points to Geo 30, it is removed. + + This test MUST FAIL on old path-specific code and PASS on fixpoint code. + (It passed on Phase 5 code that hand-enumerated the site_location path, + but verifies the fixpoint correctly handles the fully-orphaned chain.) + """ + src = str(tmp_path / "src_chain.parquet") + oc = str(tmp_path / "oc_chain.parquet") + out = str(tmp_path / "out_chain.parquet") + + SE_ROW = 10 + SITE_ROW = 20 + GEO_ROW = 30 + + build_src_wide( + src, + msr_rows=[ + # OC MSR to be removed — references SE only + { + "row_id": 1000, "pid": "OC_gone", "n": "OPENCONTEXT", + "p__produced_by": [SE_ROW], + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "latitude": 45.0, "longitude": 10.0, + }, + ], + concept_rows=SRC_CONCEPT_ROWS, + se_rows=[ + # SE: no direct p__sample_location; only p__sampling_site → Site + (SE_ROW, "se-orphan", [], [SITE_ROW]), + ], + geo_rows=[ + (GEO_ROW, "geo-orphan", 45.0, 10.0), + ], + site_rows=[ + # Site: p__site_location → Geo (the only ref to Geo) + (SITE_ROW, "site-orphan", [GEO_ROW]), + ], + ) + + build_oc_wide( + oc, + msr_rows=[ + { + "row_id": 1, "pid": "NEW_OC_pid", + "p__produced_by": [501], + "p__has_material_category": [OC_ROCK_CONCEPT_ID], + }, + ], + concept_rows=OC_CONCEPT_ROWS, + se_rows=[(501, "se-new", [601], None)], + geo_rows=[(601, "geo-new", 46.0, 11.0)], + ) + + r = run_ingest(src, oc, out) + assert r.returncode == 0, f"ingest failed:\nSTDERR: {r.stderr}\nSTDOUT: {r.stdout}" + + con = duckdb.connect() + + # All three orphan entities must be gone + for pid, otype, label in [ + ("se-orphan", "SamplingEvent", "SE row_id=10"), + ("site-orphan", "SamplingSite", "Site row_id=20"), + ("geo-orphan", "GeospatialCoordLocation", "Geo row_id=30"), + ]: + n = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') WHERE pid='{pid}' AND otype='{otype}'" + ).fetchone()[0] + assert n == 0, f"{label} ({pid}) should be REMOVED — no surviving refs; got count={n}" + + # Zero dangling refs in output + for col in ("p__sample_location", "p__sampling_site", "p__site_location"): + dangling = con.sql(f""" + WITH ids AS (SELECT row_id FROM read_parquet('{out}')), + refs AS (SELECT unnest({col}) AS ref_id FROM read_parquet('{out}') + WHERE {col} IS NOT NULL AND len({col}) > 0) + SELECT COUNT(*) FROM refs LEFT JOIN ids ON refs.ref_id = ids.row_id + WHERE ids.row_id IS NULL + """).fetchone()[0] + assert dangling == 0, f"{col}: {dangling} dangling refs (expected 0)" + + con.close() + + +def test_unresolved_new_ref_hard_fails(tmp_path): + """A new OC SE with p__sampling_site pointing to a SamplingSite absent from + Eric's OC wide must cause the ingest to RAISE (non-zero exit), NOT silently + emit NULL for that reference. + + This tests Fix B (silent-drop guard): after remapping, if the remapped array + length != source array length, the build must hard-fail. + + Scenario: + Eric's OC wide: + - MSR pid='new_pid' → SE row_id=201, p__produced_by=[201] + - SE row_id=201, p__sampling_site=[999] ← Site row_id=999 + - NO SamplingSite row_id=999 exists in Eric's wide + - Geo row_id=301 exists (SE's p__sample_location=[301]) + + Expected: ingest raises RuntimeError / exits non-zero. + The output file must NOT be written. + """ + src = str(tmp_path / "src_miss.parquet") + oc = str(tmp_path / "oc_miss.parquet") + out = str(tmp_path / "out_miss.parquet") + + # Minimal src: just concepts + a stale OC MSR (so there's something to remove) + build_src_wide( + src, + msr_rows=[ + { + "row_id": 1000, "pid": "pid_stale", "n": "OPENCONTEXT", + "p__produced_by": [100], + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "latitude": 40.0, "longitude": 5.0, + }, + ], + concept_rows=SRC_CONCEPT_ROWS, + se_rows=[(100, "se-stale", [110], None)], + geo_rows=[(110, "geo-stale", 40.0, 5.0)], + ) + + # Eric's OC wide: new MSR whose SE has p__sampling_site=[999] but Site 999 is absent + build_oc_wide( + oc, + msr_rows=[ + { + "row_id": 1, "pid": "new_pid", + "p__produced_by": [201], + "p__has_material_category": [OC_ROCK_CONCEPT_ID], + }, + ], + concept_rows=OC_CONCEPT_ROWS, + # SE references Site 999 via p__sampling_site — but Site 999 is not in the wide + se_rows=[(201, "se-new-missing-site", [301], [999])], + geo_rows=[(301, "geo-new", 41.0, 6.0)], + # site_rows intentionally omitted — no Site 999 + ) + + r = run_ingest(src, oc, out) + combined = r.stdout + r.stderr + assert r.returncode != 0, ( + f"Expected ingest to FAIL (non-zero exit) when p__sampling_site ref is unresolvable, " + f"but it exited 0.\nSTDOUT:\n{r.stdout}\nSTDERR:\n{r.stderr}" + ) + assert "SILENT-DROP" in combined or "GUARD" in combined or "FATAL" in combined or "mismatch" in combined.lower(), ( + f"Expected a silent-drop guard / FATAL error message; got:\n{combined[:500]}" + ) + assert not os.path.exists(out), ( + "Output file must NOT be written when the silent-drop guard fires" + ) + + +# ============================================================================ +# Fix 1 (Round 7) — p__keywords concepts fully extracted and preserved +# ============================================================================ + +# Keyword concept IDs in Eric's space (INTEGER) +OC_KW1_CONCEPT_ID = 950 # keyword concept already in src +OC_KW2_CONCEPT_ID = 951 # keyword concept NOT in src — must be minted + +KW1_URI = "https://w3id.org/isample/keyword/1.0/existing_keyword" +KW2_URI = "https://w3id.org/isample/keyword/1.0/new_keyword" # absent from src → minted + +# src concept ID for the existing keyword (BIGINT) +SRC_KW1_CONCEPT_ID = 10 + + +def test_new_msr_keywords_preserved(tmp_path): + """New OC MSR with p__keywords pointing to concept(s) — at least one NOT in src + (forcing a mint) — must have all keyword refs preserved in output: + - output p__keywords array length == source array length + - all targets resolve to IdentifiedConcept rows in output + - zero dangling keyword refs in output + + This test MUST FAIL on old HEAD (where keywords were silently dropped because + keyword concept URIs were not collected in new_concept_refs and thus not minted, + causing the remap_msr_kw inner join to produce no matches → remap length 0 vs + source length 2 → silent-drop guard fires → FATAL). + + After FIX 1: keywords are included in new_concept_refs; missing keyword concepts + are minted; the full-length remap is verified by the silent-drop guard; all refs + are valid in output. + """ + src = str(tmp_path / "src_kw.parquet") + oc = str(tmp_path / "oc_kw.parquet") + out = str(tmp_path / "out_kw.parquet") + + # ---- src wide: has existing keyword concept (KW1), NOT KW2 ---- + src_concepts = list(SRC_CONCEPT_ROWS) + [(SRC_KW1_CONCEPT_ID, KW1_URI)] + + build_src_wide( + src, + msr_rows=[ + # Stale OC MSR (to ensure removal path is exercised) + { + "row_id": 1000, "pid": "pid_stale", "n": "OPENCONTEXT", + "p__produced_by": [103], + "p__has_material_category": [SRC_ROCK_CONCEPT_ID], + "latitude": 60.0, "longitude": 20.0, + }, + ], + concept_rows=src_concepts, + se_rows=[(103, "se-stale", [203], None)], + geo_rows=[(203, "geo-stale", 60.0, 20.0)], + ) + + # ---- OC wide: new MSR with p__keywords=[OC_KW1_CONCEPT_ID, OC_KW2_CONCEPT_ID] + # KW1 is already in src (must be looked up by URI, not minted) + # KW2 is NOT in src (must be minted as a new IdentifiedConcept row) + oc_concepts = list(OC_CONCEPT_ROWS) + [ + (OC_KW1_CONCEPT_ID, KW1_URI, "Existing Keyword"), + (OC_KW2_CONCEPT_ID, KW2_URI, "New Keyword"), + ] + + build_oc_wide( + oc, + msr_rows=[ + { + "row_id": 1, "pid": "pid_kw", + "p__produced_by": [101], + "p__has_material_category": [OC_ROCK_CONCEPT_ID], + "p__keywords": [OC_KW1_CONCEPT_ID, OC_KW2_CONCEPT_ID], + }, + ], + concept_rows=oc_concepts, + se_rows=[(101, "se-kw", [201], None)], + geo_rows=[(201, "geo-kw", 45.0, 10.0)], + ) + + r = run_ingest(src, oc, out) + assert r.returncode == 0, ( + f"ingest failed:\nSTDERR: {r.stderr}\nSTDOUT: {r.stdout}" + ) + + con = duckdb.connect() + + # 1. New OC MSR must be present with n='OPENCONTEXT' + kw_msr = con.sql( + f"SELECT p__keywords FROM read_parquet('{out}') " + f"WHERE pid='pid_kw' AND otype='MaterialSampleRecord'" + ).fetchone() + assert kw_msr is not None, "pid_kw MSR missing from output" + kw_refs = kw_msr[0] + assert kw_refs is not None, "p__keywords is NULL in output (should be a 2-element array)" + assert len(kw_refs) == 2, ( + f"p__keywords length mismatch: expected 2, got {len(kw_refs)}. " + f"Array: {kw_refs}" + ) + + # 2. Both keyword targets must resolve to IdentifiedConcept rows in output + for ref_id in kw_refs: + resolved = con.sql( + f"SELECT pid, otype FROM read_parquet('{out}') WHERE row_id = {ref_id}" + ).fetchone() + assert resolved is not None, ( + f"Keyword ref row_id={ref_id} not found in output (dangling ref!)" + ) + assert resolved[1] == "IdentifiedConcept", ( + f"Keyword ref row_id={ref_id} resolves to otype={resolved[1]!r}, " + f"expected 'IdentifiedConcept'" + ) + + # 3. Verify both URIs are resolvable in output IdentifiedConcept rows + kw1_out = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') " + f"WHERE otype='IdentifiedConcept' AND pid='{KW1_URI}'" + ).fetchone()[0] + assert kw1_out == 1, f"KW1 concept ({KW1_URI}) missing from output" + + kw2_out = con.sql( + f"SELECT COUNT(*) FROM read_parquet('{out}') " + f"WHERE otype='IdentifiedConcept' AND pid='{KW2_URI}'" + ).fetchone()[0] + assert kw2_out == 1, f"KW2 concept ({KW2_URI}) not minted in output (should have been minted)" + + # 4. Zero dangling p__keywords refs in output + dangling = con.sql(f""" + WITH all_row_ids AS (SELECT row_id FROM read_parquet('{out}')), + kw_refs AS ( + SELECT unnest(p__keywords) AS ref_id + FROM read_parquet('{out}') + WHERE p__keywords IS NOT NULL AND len(p__keywords) > 0 + ) + SELECT COUNT(*) FROM kw_refs + LEFT JOIN all_row_ids ON kw_refs.ref_id = all_row_ids.row_id + WHERE all_row_ids.row_id IS NULL + """).fetchone()[0] + assert dangling == 0, f"p__keywords: {dangling} dangling refs in output (expected 0)" + + con.close()