pipeline: OC sync (#272) + #277/#283 fixes → isamples_202608 [DRAFT, no publish]#10

Merged: rdhyee merged 8 commits into main from pipeline/oc-sync-272-202608 on Jun 13, 2026

@rdhyee (Owner) commented Jun 13, 2026

DRAFT — held for RY sign-off. Code/pipeline only; produces the isamples_202608 data but does NOT publish it. R2 upload + explorer.qmd cutover remain a separate, human-gated step.

What this is

Data-pipeline code for the isamplesorg#272 next phase: a true sync of OpenContext records against @ekansa's current export (+67,187 new / −21,227 stale Murlo re-IDs, per his Option B), plus three diagnosed bug fixes (isamplesorg#277 descriptions, #283a/b facets). Reproducible recipe; the data itself is staged at ~/Data/iSample/pqg_refining/staged_202608/, not in the repo.

Review history — converged after 4 Codex rounds + independent verification

| Round | Found | Resolution |
| --- | --- | --- |
| Codex R1 | 3 blockers: cross-source orphan filter, incomplete new-row ref remap, non-deterministic output | fixed |
| Independent sweep | 4,606 dangling p__site_location (orphan-Geo only protected the SE path) | fixed + in-script dangling gate added |
| Codex R2 | over-retention (path-specific orphan logic) + silent ref-drop via inner-join NULL | fixed via fixpoint orphan rule + silent-drop guard |
| Codex R3 | verified fixpoint/determinism/gate correct; 1 blocker: 252,978 keyword refs dropped on new records | fixed: keywords extracted/minted like other concept dims |
| Codex R4 | none | MERGE |

Verified numbers (independently re-checked against the staged build)

| Check | Result |
| --- | --- |
| OC MaterialSampleRecord count | 1,110,791 (= 1,064,831 − 21,227 + 67,187, matches @ekansa's wide) |
| material/1.0/rock | 37,953 |
| Dangling refs, all 12 reference columns | 0 |
| New-record keyword refs preserved | 252,978 (1,046 concepts minted) |
| Cyprus-in-description (isamplesorg#277) | 69,230 (was 0) |
| Blank facet entries (#283a) | 0 |
| Concept double-minting / duplicate row_ids | 0 / 0 |
| Validator | 26/26 · pytest 33 passed |

Hardening that outlasts this PR

  • In-script dangling-ref gate over all 12 reference columns — a build with any dangling ref cannot be emitted.
  • Fixpoint orphan removal — keeps any entity referenced by any surviving row through any column; no path enumeration to miss.
  • Silent-drop guard — an unresolved new-row reference hard-fails instead of vanishing into a NULL.

Scope / not in this PR

🤖 Generated with Claude Code

rdhyee and others added 8 commits June 12, 2026 08:01
…rg#272 phase 2)

Gap analysis against Eric's 2026-06-09 OC wide + local 202604 wide:
- 67,187 new MSRs, 152,311 total new entity rows
- All new records have coords via MSR->SE->GeoCoordLoc graph path (100%)
- 1 new concept to mint against 202606 base: earthsurface
- Row-id strategy: dense rank from max(src)+1 (20,729,359+)
- Key bug found+fixed: geometry type mismatch (BLOB vs GEOMETRY) needs ST_AsWKB()
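The row-id strategy above (dense rank starting at max(src)+1) can be sketched as follows. This is a minimal plain-Python version; ordering by the (otype, pid) pair and the name `assign_row_ids` are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (otype, pid); an assumed identity key for an entity row


def assign_row_ids(new_keys: Iterable[Key], max_src_row_id: int) -> Dict[Key, int]:
    """Dense-rank distinct keys in sorted order and offset them past the source's max row_id."""
    distinct_sorted = sorted(set(new_keys))
    return {key: max_src_row_id + rank for rank, key in enumerate(distinct_sorted, start=1)}


# With max(src) = 20,729,358 the first new id is 20,729,359, as quoted above.
demo_ids = assign_row_ids(
    [("MaterialSampleRecord", "b"), ("Agent", "a"), ("MaterialSampleRecord", "b")],
    max_src_row_id=20_729_358,
)
```

Because the rank depends only on the sorted set of keys, rerunning on the same input yields the same ids, and a repeated key maps to a single id rather than two.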

Deliverables:
- DESIGN_272_INGEST.md: gap characterization, schema mapping, pipeline plan,
  trust-gate invariants, 8 open decisions for RY, honest-gaps section
- scripts/ingest_oc_records.py: spike implementation (dry-run verified +
  full write verified: all post-write trust gates pass, Stage 4 ALL CHECKS PASS)
- SPIKE_RESULTS.md: real numbers from all executed queries

No push, no R2 publish, no PR. Spike only — for RY review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ale (isamplesorg#272 phase 2)

D3 decision (RY 2026-06-12): remove stale OC pids rather than keep them.
OpenContext mass-updated Murlo project PIDs; old PIDs duplicate same physical samples.

What changed:
- scripts/ingest_oc_records.py: production sync logic
    - Removes 21,227 stale OC MSR pids + 43,382 orphan subgraph entities (SE/Geo/Site)
    - Adds 67,187 new OC pids + full entity subgraph (152,311 rows)
    - Mints 1 new IdentifiedConcept (sampledfeature/1.0/earthsurface)
    - Output: 20,817,062 rows, OC MSR = 1,110,791 (matches Eric exactly)
    - All 25 validate_frontend_derived.py checks pass
    - Base: isamples_202606_wide.parquet; tag: isamples_202608
- tests/test_ingest_oc_records.py: 20 fixture tests (all pass)
    - removal, orphan cleanup, subgraph ingestion, remapping, trust-gate invariants
- DESIGN_272_INGEST.md: all decisions resolved; orphan analysis results added
- SYNC_RESULTS.md: actual run numbers and verification query outputs
- Makefile: add ingest-272 and all-202608 targets

Rock count after sync: 37,953 OC rocks (30,272 - 85 removed + 7,766 new).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s into 202608 rebuild

Phase 3 bug fixes applied before staging 202608:

isamplesorg#277 (OC description enrichment):
- Added Phase J to scripts/ingest_oc_records.py: after sync write, UNION ALL
  approach overwrites `description` for all OC MSR pids from Eric's OC wide.
  Uses SELECT * REPLACE for schema-agnostic column handling. Non-OC rows and
  OC rows with NULL description in Eric's wide are unchanged.
- Trust gate: Cyprus OC MSR count = 69,230 (was 0 before fix).
- Applied to existing 202608 wide; derived files rebuilt.

#283a (empty-string facet filter):
- In scripts/build_frontend_derived.py: changed IS NOT NULL filters to
  IS NOT NULL AND {d} <> '' in build_facet_summaries and build_facet_cross_filter.
  Removes the blank selectable facet entry from 586 GEOME records with empty-string
  concept URI.
- Updated scripts/validate_frontend_derived.py: algebraic recompute now matches
  the builder's empty-string filter; added new check 5b
  "facet_summaries no blank values (#283a)".
- Trust gate: blank facet_value count = 0.
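The effect of the `IS NOT NULL AND {d} <> ''` filter can be illustrated in plain Python. This is only a sketch of the filtering idea; `facet_counts` is a hypothetical helper, not the builder's actual code.

```python
from collections import Counter
from typing import Iterable, Optional


def facet_counts(values: Iterable[Optional[str]]) -> Counter:
    """Count facet values, skipping NULLs and empty strings so no blank facet entry appears."""
    return Counter(v for v in values if v is not None and v != "")


# An empty-string concept URI no longer produces a selectable blank facet.
demo_counts = facet_counts(["material/1.0/rock", "", None, "material/1.0/rock"])
```

Filtering only on `is not None` would leave a `""` key in the summary, which is the blank entry that #283a describes.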

#283b (deprecated specimentype/1.0 labels):
- In scripts/build_vocab_labels.py: added MANUAL_LABEL_OVERRIDES list with two
  entries for deprecated SESAR URIs absent from live TTLs:
    specimentype/1.0/othersolidobject -> "Other solid object" (en)
    specimentype/1.0/physicalspecimen -> "Material sample"   (en)
  Rebuilt vocab_labels.parquet (539 rows, up from 537).
- Trust gate: both specimentype URIs present in vocab_labels with correct labels.
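A minimal sketch of how a manual override list might be merged into labels parsed from the live TTLs. The tuple shape, the `(uri, lang)` key, and `apply_label_overrides` are assumptions for illustration; only the two URI/label pairs come from the commit message above.

```python
from typing import Dict, List, Tuple

# Assumed entry shape: (concept URI, label, language tag).
MANUAL_LABEL_OVERRIDES: List[Tuple[str, str, str]] = [
    ("specimentype/1.0/othersolidobject", "Other solid object", "en"),
    ("specimentype/1.0/physicalspecimen", "Material sample", "en"),
]

LabelKey = Tuple[str, str]  # (uri, lang)


def apply_label_overrides(
    labels: Dict[LabelKey, str], overrides: List[Tuple[str, str, str]]
) -> Dict[LabelKey, str]:
    """Return a copy of labels with each override added (or replacing an existing entry)."""
    merged = dict(labels)
    for uri, label, lang in overrides:
        merged[(uri, lang)] = label
    return merged


# Illustrative input: one label that would have come from a live TTL.
demo_labels = apply_label_overrides({("material/1.0/rock", "en"): "Rock"}, MANUAL_LABEL_OVERRIDES)
```

With two URIs absent from the parsed labels, the merged table grows by exactly two rows, consistent with the 537 to 539 change reported above.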

Fixture tests (tests/test_ingest_oc_records.py):
- 8 new tests added (28 total, all pass):
  Fix isamplesorg#277: test_oc_description_enriched_from_eric_wide,
            test_non_oc_description_unchanged_by_enrichment,
            test_oc_msr_count_unchanged_by_enrichment
  Fix #283a: test_empty_string_facet_values_filtered_from_summaries,
             test_empty_string_facet_values_filtered_from_cross_filter
  Fix #283b: test_specimentype_othersolidobject_in_vocab_labels,
             test_specimentype_physicalspecimen_in_vocab_labels,
             test_specimentype_labels_have_lang_en

Validator: 26/26 checks pass (was 25/25).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Reviewer hygiene: ISAMPLES_VOCAB_LABELS env var instead of a machine-
specific /tmp default; falls back to building vocab_labels on the fly.
28/28 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… ref extraction/trust gate, deterministic output

B1 (cross-source orphan protection): surviving_se_refs previously filtered to
n='OPENCONTEXT', allowing SEs/Sites/Geos shared with SESAR/GEOME MSRs to be
wrongly deleted. Fixed by removing the source filter — all surviving MSRs
(any source) now protect shared subgraph entities.

B2A (incomplete agent extraction): Agents in p__responsibility were remapped
but their rows were never extracted (only p__registrant was queried). Fixed
by UNIONing p__responsibility into agent_ids.

B2B (trust gate extension): pre-write gate only checked p__produced_by and 3
concept dims. Added comprehensive post-write HARD FAIL gate checking all 10
p__* array columns (BIGINT[] + INTEGER[]) on new rows.

B3 (deterministic Phase J): Phase J UNION ALL lacked ORDER BY row_id, making
sha256 non-reproducible. Added ORDER BY row_id before COPY TO PARQUET.
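To see why an explicit ordering matters for a reproducible sha256, consider this small sketch. It hashes a JSON serialization instead of a parquet file, and `content_sha256` is a hypothetical helper; the point is only that the digest depends on row order unless the rows are sorted first.

```python
import hashlib
import json
from typing import Dict, List


def content_sha256(rows: List[Dict], ordered: bool = True) -> str:
    """Hash rows after (optionally) sorting by row_id, so input order cannot change the digest."""
    if ordered:
        rows = sorted(rows, key=lambda r: r["row_id"])
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The same two rows arriving in different orders, as an unordered UNION ALL may produce.
run_a = [{"row_id": 1, "description": "x"}, {"row_id": 2, "description": "y"}]
run_b = list(reversed(run_a))

stable = content_sha256(run_a) == content_sha256(run_b)
unstable = content_sha256(run_a, ordered=False) != content_sha256(run_b, ordered=False)
```

Both `stable` and `unstable` evaluate to `True` here: sorting by row_id makes the two runs hash identically, while the unsorted serializations differ.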

Nit A: whitespace-only facet filter (NULLIF(TRIM(d),'') IS NOT NULL) in
build_frontend_derived.py + validator check 5b uses TRIM().

Nit B: Cyprus enrichment is now a hard RuntimeError at production scale
(out_oc_count > 1M), not just a log statement.

Nit C: regression test test_cross_source_shared_entity_not_orphaned confirms
B1 FAILS on old code and PASSES on fixed code.

Rebuild: 29/29 tests pass. All verification queries pass (b-k). Validator
26/26. Zero dangling refs across all 10 p__* columns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…te (isamplesorg#272)

- Replace per-path orphan logic with general formulation: compute orphan
  SamplingSites BEFORE orphan Geos so that surviving_geo_refs includes Geos
  referenced by non-orphan SamplingSites via p__site_location (UNION with
  Geos from surviving SEs via p__sample_location).
- Fixes 4,606 dangling p__site_location refs introduced by Murlo sync:
  those Geos were wrongly deleted because orphan determination only checked
  p__sample_location on SEs, missing the p__site_location path from Sites.
- Replace new-rows-only B2B gate with mandatory ALL-rows ALL-columns gate
  before manifest emission: scans every p__* (BIGINT[] + INTEGER[]) column
  across entire output; RuntimeError + build abort if any dangling ref.
- New orphan counts: geo=16,621 (was 21,227), total removed=60,003 (was 64,609).
  4,606 Geos correctly retained; output rows=20,821,668 (was 20,817,062).
- Regression test test_site_location_geo_not_orphaned: FAILS on old code
  (Geo deleted), PASSES on fixed code (Geo retained).
- All 30 pytest tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nd-6)

Fix A: replace hand-enumerated path-specific orphan tables with a true
fixpoint algorithm. remove_set := stale MSR row_ids; iterate until stable:
compute survivor_refs = all row_ids referenced by any surviving row through
ALL 12 p__* columns (enumerated from schema, not hand-picked); add to
remove_set any non-MSR, non-IdentifiedConcept candidate row not in
survivor_refs. Converges in 4 passes on real data. Correctly removes 4
Agent orphans that path-specific logic missed (over-retention = 0).
Total rows removed: 60,007 (Phase 5 was 60,003).
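The fixpoint rule described in Fix A can be sketched in a few lines of plain Python. This is an illustrative version over in-memory dicts, not the script's implementation; `fixpoint_remove`, the short otype labels ("SE", "Site", "Geo"), and the demo graph are assumptions.

```python
from typing import Dict, Iterable, List, Set

# Rows of these types are never removed as orphans (only stale MSRs are removed explicitly).
PROTECTED_OTYPES = {"MaterialSampleRecord", "IdentifiedConcept"}


def fixpoint_remove(rows: List[Dict], stale_msr_ids: Iterable[int], ref_columns: List[str]) -> Set[int]:
    """Grow the removal set until no surviving unprotected row is left unreferenced."""
    remove: Set[int] = set(stale_msr_ids)
    while True:
        survivors = [r for r in rows if r["row_id"] not in remove]
        # Every row_id referenced by any surviving row through any reference column.
        survivor_refs = {t for r in survivors for c in ref_columns for t in (r.get(c) or [])}
        newly_orphaned = {
            r["row_id"]
            for r in survivors
            if r["otype"] not in PROTECTED_OTYPES and r["row_id"] not in survivor_refs
        }
        if not newly_orphaned:
            return remove
        remove |= newly_orphaned


REF_COLUMNS = ["p__produced_by", "p__sampling_site", "p__sample_location", "p__site_location"]
demo_rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "p__produced_by": [10]},  # stale
    {"row_id": 2, "otype": "MaterialSampleRecord", "p__produced_by": [11]},
    {"row_id": 10, "otype": "SE", "p__sampling_site": [20], "p__sample_location": [30]},
    {"row_id": 11, "otype": "SE", "p__sampling_site": [21], "p__sample_location": [31]},
    {"row_id": 20, "otype": "Site", "p__site_location": [32]},
    {"row_id": 21, "otype": "Site", "p__site_location": [30]},  # keeps Geo 30 alive
    {"row_id": 30, "otype": "Geo"},
    {"row_id": 31, "otype": "Geo"},
    {"row_id": 32, "otype": "Geo"},
]
demo_removed = fixpoint_remove(demo_rows, stale_msr_ids=[1], ref_columns=REF_COLUMNS)
```

In the demo, removing MSR 1 orphans SE 10, then Site 20, then Geo 32, while Geo 30 survives because Site 21 still references it through p__site_location. No path is enumerated by hand; any column in `REF_COLUMNS` can keep a row alive.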

Fix B: silent-drop guard. For every p__* structural/concept column on new
rows, assert remapped array length == source array length after inner-join
remapping. Any mismatch → RuntimeError, build aborted. p__keywords excluded
(best-effort URI lookup, documented). Correctly catches SE.p__sampling_site
pointing to absent site (new regression test).
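A minimal sketch of the silent-drop guard from Fix B: remap a reference array with inner-join semantics, then compare lengths and fail if anything vanished. `remap_refs_strict` and the id map are hypothetical names used only for illustration.

```python
from typing import Dict, List


def remap_refs_strict(source_refs: List[int], id_map: Dict[int, int], column: str) -> List[int]:
    """Remap refs via id_map; raise instead of silently dropping unresolved references."""
    # Inner-join semantics: a ref with no match in id_map simply produces no output element.
    remapped = [id_map[ref] for ref in source_refs if ref in id_map]
    if len(remapped) != len(source_refs):
        missing = [ref for ref in source_refs if ref not in id_map]
        raise RuntimeError(f"{column}: {len(missing)} unresolved refs, e.g. {missing[:3]}")
    return remapped


demo_map = {1: 101, 2: 102}
demo_ok = remap_refs_strict([1, 2], demo_map, "p__sampling_site")

try:
    remap_refs_strict([999], demo_map, "p__sampling_site")  # no target 999 exists
    demo_raised = False
except RuntimeError:
    demo_raised = True
```

The length comparison is what turns a quiet loss into a hard failure: without it, `[999]` would remap to an empty array and the reference would disappear from the output.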

Regression tests (both FAIL on old code, PASS on fixed):
- test_orphan_geo_via_site_only_removed: Geo only via orphan Site's
  p__site_location (no surviving refs) must be REMOVED, not retained
- test_unresolved_new_ref_hard_fails: SE p__sampling_site=[999], no
  Site 999 in Eric's wide → RAISES, does NOT silently emit NULL

All verification checks:
- 0 dangling refs across all 12 p__* columns (in-script gate + independent)
- OC MSR: 1,110,791 | Rock: 37,953 | Cyprus: 69,230 | Blank facets: 0
- Total rows: 20,821,664 | Validator: 26/26 PASS
- 32/32 fixture tests pass

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

 round-7)

FIX 1 (BLOCKER) — p__keywords now fully extracted and preserved:
  - Extend new_concept_refs to include p__keywords alongside p__has_* dims
  - All keyword concept URIs from new OC MSRs are now minted if absent from src
  - Remove p__keywords carve-out from silent-drop guard; strict length check enforced
  - Remove misleading "WHAT IT DOES NOT DO" note about keyword concepts
  - Result: 252,978 keyword refs preserved across 67,183 new OC MSRs
  - 1,046 concepts minted total (1 earthsurface + 1,045 new keyword concepts)
  - 0 dangling p__keywords refs in output (all 12 columns clean)

FIX 2 (NIT) — our_row_id uniqueness assertion on eric_id_map:
  - Hard-fail if DENSE_RANK produces duplicate our_row_ids (duplicate otype,pid pairs)
  - Cheap insurance against silent collision before write
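The uniqueness assertion in FIX 2 might look roughly like the sketch below. The row shape `(otype, pid, our_row_id)` and the name `assert_unique_row_ids` are assumptions; the real check runs against eric_id_map.

```python
from collections import Counter
from typing import List, Tuple

IdRow = Tuple[str, str, int]  # (otype, pid, our_row_id); an assumed row shape


def assert_unique_row_ids(id_rows: List[IdRow]) -> None:
    """Raise if any our_row_id is assigned to more than one row of the id map."""
    counts = Counter(our_row_id for _, _, our_row_id in id_rows)
    duplicates = {row_id: n for row_id, n in counts.items() if n > 1}
    if duplicates:
        raise RuntimeError(f"duplicate our_row_ids: {sorted(duplicates)[:5]}")


# A duplicated (otype, pid) pair gets the same dense rank, hence the same id twice.
demo_map_rows = [
    ("Agent", "a", 20_729_359),
    ("MaterialSampleRecord", "b", 20_729_360),
    ("MaterialSampleRecord", "b", 20_729_360),
]
try:
    assert_unique_row_ids(demo_map_rows)
    demo_collision = False
except RuntimeError:
    demo_collision = True
```

Here `demo_collision` ends up `True`, because the repeated pair shares one id and the check refuses to continue to the write.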

REGRESSION TEST (fail-old/pass-new):
  - test_new_msr_keywords_preserved: new OC MSR with 2 keyword refs (1 in src, 1
    not in src → minted); verifies array length == 2, both targets resolve to
    IdentifiedConcept rows, 0 dangling p__keywords refs; update build_oc_wide to
    accept p__keywords in MSR rows

Build verified (round-7, /tmp/ingest_202608_r7/):
  - total rows: 20,822,709 (+1,045 vs round-6's 20,821,664 — exactly the new concepts)
  - OC MSR: 1,110,791 | Cyprus: 69,230 | blank facets: 0
  - removed/stale pids in output: 0
  - dangling refs all 12 p__* cols: 0
  - pytest: 33/33 PASS | validator: 26/26 PASS
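The counts quoted across this thread reconcile arithmetically; the short check below re-derives them. Every figure is copied from the PR description and commit messages, and nothing here is newly measured.

```python
# OC MaterialSampleRecord count: base minus stale Murlo pids plus new pids.
oc_msr_after = 1_064_831 - 21_227 + 67_187

# OC rock count after sync.
rock_after = 30_272 - 85 + 7_766

# Removed rows: stale MSRs + orphans, minus the 4,606 Geos later retained, plus 4 Agent orphans.
removed_initial = 21_227 + 43_382
removed_after_site_fix = removed_initial - 4_606
removed_after_fixpoint = removed_after_site_fix + 4

# Total output rows across the successive rebuilds.
rows_initial = 20_817_062
rows_after_site_fix = rows_initial + 4_606
rows_round6 = rows_after_site_fix - 4
rows_round7 = rows_round6 + 1_045  # newly minted keyword concepts

concepts_minted = 1 + 1_045  # earthsurface + keyword concepts

print(oc_msr_after, rock_after, removed_after_fixpoint, rows_round7, concepts_minted)
# prints: 1110791 37953 60007 20822709 1046
```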

No push per standing rules (RY independently re-verifies + Codex round 4).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

rdhyee marked this pull request as ready for review June 13, 2026 15:34
rdhyee merged commit 0ca84d0 into main Jun 13, 2026
3 checks passed