pipeline: OC sync (#272) + #277/#283 fixes → isamples_202608 [DRAFT, no publish] by rdhyee · Pull Request #10 · rdhyee/isamplesorg.github.io

rdhyee · 2026-06-13T13:36:42Z

DRAFT — held for RY sign-off. Code/pipeline only; produces the isamples_202608 data but does NOT publish it. R2 upload + explorer.qmd cutover remain a separate, human-gated step.

What this is

Data-pipeline code for the isamplesorg#272 next phase: a true sync of OpenContext records against @ekansa's current export (+67,187 new / −21,227 stale Murlo re-IDs, per his Option B), plus three diagnosed bug fixes (isamplesorg#277 descriptions, #283a/b facets). Reproducible recipe; the data itself is staged at ~/Data/iSample/pqg_refining/staged_202608/, not in the repo.

Review history — converged after 4 Codex rounds + independent verification

| Round | Found | Resolution |
| --- | --- | --- |
| Codex R1 | 3 blockers: cross-source orphan filter, incomplete new-row ref remap, non-deterministic output | fixed |
| Independent sweep | 4,606 dangling `p__site_location` (orphan-Geo only protected the SE path) | fixed + in-script dangling gate added |
| Codex R2 | over-retention (path-specific orphan logic) + silent ref-drop via inner-join NULL | fixed via fixpoint orphan rule + silent-drop guard |
| Codex R3 | verified fixpoint/determinism/gate correct; 1 blocker: 252,978 keyword refs dropped on new records | fixed: keywords extracted/minted like other concept dims |
| Codex R4 | none | MERGE ✅ |

Verified numbers (independently re-checked against the staged build)

| Check | Result |
| --- | --- |
| OC MaterialSampleRecord count | 1,110,791 (= 1,064,831 − 21,227 + 67,187, matches @ekansa's wide) |
| `material/1.0/rock` | 37,953 |
| Dangling refs, all 12 reference columns | 0 |
| New-record keyword refs preserved | 252,978 (1,046 concepts minted) |
| Cyprus-in-description (isamplesorg#277) | 69,230 (was 0) |
| Blank facet entries (#283a) | 0 |
| Concept double-minting / duplicate row_ids | 0 / 0 |
| Validator | 26/26 · pytest 33 passed |
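
The reconciliations in this table can be re-derived from figures quoted elsewhere in this PR. A minimal sketch, using only numbers stated in the description and commit messages (the variable names are illustrative and do not come from the pipeline):

```python
# All figures are copied from the PR description and commit messages;
# the variable names are hypothetical.
oc_msr_before = 1_064_831
stale_removed = 21_227
new_added = 67_187
oc_msr_after = oc_msr_before - stale_removed + new_added

# Rock count: previous OC rocks, minus removed, plus new.
rock_after = 30_272 - 85 + 7_766

# Total output rows across review rounds.
rows_initial = 20_817_062
rows_after_geo_fix = rows_initial + 4_606        # Geos correctly retained
rows_after_fixpoint = rows_after_geo_fix - 4     # Agent orphans removed
rows_after_keywords = rows_after_fixpoint + 1_045  # new keyword concepts minted

print(oc_msr_after)         # 1110791
print(rock_after)           # 37953
print(rows_after_keywords)  # 20822709
```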

Hardening that outlasts this PR

- In-script dangling-ref gate over all 12 reference columns: a build with any dangling ref cannot be emitted.
- Fixpoint orphan removal: keeps any entity referenced by any surviving row through any column; no path enumeration to miss.
- Silent-drop guard: an unresolved new-row reference hard-fails instead of vanishing into a NULL.
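
As an illustration of the dangling-ref gate idea, here is a minimal in-memory sketch. It assumes rows are plain dicts with a `row_id` and list-valued `p__*` columns; the function names and the two-column `REF_COLUMNS` list are hypothetical, and the real gate in `scripts/ingest_oc_records.py` runs over the full parquet output and may look quite different.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]

def count_dangling_refs(rows: List[Row], ref_columns: Iterable[str]) -> Dict[str, int]:
    """Return {column: number of references whose target row_id does not exist}."""
    known_ids = {row["row_id"] for row in rows}
    dangling: Dict[str, int] = {}
    for col in ref_columns:
        missing = sum(
            1
            for row in rows
            for target in (row.get(col) or [])  # NULL/absent column counts as no refs
            if target not in known_ids
        )
        if missing:
            dangling[col] = missing
    return dangling

def gate_no_dangling_refs(rows: List[Row], ref_columns: Iterable[str]) -> None:
    """Hard-fail, so nothing is emitted, if any reference column has a dangling target."""
    dangling = count_dangling_refs(rows, ref_columns)
    if dangling:
        raise RuntimeError(f"dangling refs found, build aborted: {dangling}")

# Two of the reference columns named in this PR, for illustration only.
REF_COLUMNS = ["p__produced_by", "p__site_location"]

clean = [
    {"row_id": 1, "p__produced_by": [2]},
    {"row_id": 2, "p__site_location": None},
]
broken = clean + [{"row_id": 3, "p__site_location": [99]}]  # row 99 does not exist

gate_no_dangling_refs(clean, REF_COLUMNS)        # passes silently
print(count_dangling_refs(broken, REF_COLUMNS))  # {'p__site_location': 1}
```

Calling `gate_no_dangling_refs(broken, REF_COLUMNS)` raises `RuntimeError`, which mirrors the "cannot be emitted" behaviour described in the list.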

Scope / not in this PR

- Does NOT publish. Promotion to R2 + cutover is the separate human-gated step.
- isamplesorg/isamplesorg.github.io#277 (Interactive Explorer: Search for "pottery Cyprus" no longer returns results) is half-closed: geography text search now works; the literal query "pottery Cyprus" still returns ~0 results until concept-labels-in-search ships (awaiting @ekansa).
- Suggested path: squash-PR to isamplesorg:main after staging inspection, the same flow as isamplesorg/isamplesorg.github.io#274 (Rigorous, reproducible derived-parquet pipeline + AI-free tests (#273)) and isamplesorg/isamplesorg.github.io#275 (OC concept reintegration: overlay Eric's material/object-type mappings onto the wide (#272, fixes #260)).

🤖 Generated with Claude Code

…rg#272 phase 2)

Gap analysis against Eric's 2026-06-09 OC wide + local 202604 wide:
- 67,187 new MSRs, 152,311 total new entity rows
- All new records have coords via MSR->SE->GeoCoordLoc graph path (100%)
- 1 new concept to mint against 202606 base: earthsurface
- Row-id strategy: dense rank from max(src)+1 (20,729,359+)
- Key bug found+fixed: geometry type mismatch (BLOB vs GEOMETRY) needs ST_AsWKB()

Deliverables:
- DESIGN_272_INGEST.md: gap characterization, schema mapping, pipeline plan, trust-gate invariants, 8 open decisions for RY, honest-gaps section
- scripts/ingest_oc_records.py: spike implementation (dry-run verified + full write verified: all post-write trust gates pass, Stage 4 ALL CHECKS PASS)
- SPIKE_RESULTS.md: real numbers from all executed queries

No push, no R2 publish, no PR. Spike only, for RY review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
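
The row-id strategy mentioned in this commit (dense rank offset past the source's maximum) can be sketched in plain Python. This is an assumption-laden illustration: `assign_row_ids` is a hypothetical helper, the `(otype, pid)` key follows the pairing named in a later commit in this PR, and the source maximum of 20,729,358 is inferred from the "20,729,359+" figure rather than stated outright.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (otype, pid)

def assign_row_ids(keys: Iterable[Key], max_src_row_id: int) -> Dict[Key, int]:
    """Dense-rank distinct (otype, pid) keys and offset them past the source's max row_id."""
    distinct = sorted(set(keys))  # sorting makes the assignment deterministic
    id_map = {key: max_src_row_id + rank for rank, key in enumerate(distinct, start=1)}
    # Cheap uniqueness check: no two keys may share a row_id.
    assert len(set(id_map.values())) == len(id_map)
    return id_map

new_keys = [
    ("SamplingEvent", "pid-b"),
    ("MaterialSampleRecord", "pid-a"),
    ("SamplingEvent", "pid-b"),  # duplicate key collapses to a single id
]
id_map = assign_row_ids(new_keys, max_src_row_id=20_729_358)
print(id_map[("MaterialSampleRecord", "pid-a")])  # 20729359
print(id_map[("SamplingEvent", "pid-b")])         # 20729360
```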

…ale (isamplesorg#272 phase 2)

D3 decision (RY 2026-06-12): remove stale OC pids rather than keep them. OpenContext mass-updated Murlo project PIDs; old PIDs duplicate same physical samples.

What changed:
- scripts/ingest_oc_records.py: production sync logic
  - Removes 21,227 stale OC MSR pids + 43,382 orphan subgraph entities (SE/Geo/Site)
  - Adds 67,187 new OC pids + full entity subgraph (152,311 rows)
  - Mints 1 new IdentifiedConcept (sampledfeature/1.0/earthsurface)
  - Output: 20,817,062 rows, OC MSR = 1,110,791 (matches Eric exactly)
  - All 25 validate_frontend_derived.py checks pass
  - Base: isamples_202606_wide.parquet; tag: isamples_202608
- tests/test_ingest_oc_records.py: 20 fixture tests (all pass)
  - removal, orphan cleanup, subgraph ingestion, remapping, trust-gate invariants
- DESIGN_272_INGEST.md: all decisions resolved; orphan analysis results added
- SYNC_RESULTS.md: actual run numbers and verification query outputs
- Makefile: add ingest-272 and all-202608 targets

Rock count after sync: 37,953 OC rocks (30,272 - 85 removed + 7,766 new).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s into 202608 rebuild

Phase 3 bug fixes applied before staging 202608:

isamplesorg#277 (OC description enrichment):
- Added Phase J to scripts/ingest_oc_records.py: after sync write, UNION ALL approach overwrites `description` for all OC MSR pids from Eric's OC wide. Uses SELECT * REPLACE for schema-agnostic column handling. Non-OC rows and OC rows with NULL description in Eric's wide are unchanged.
- Trust gate: Cyprus OC MSR count = 69,230 (was 0 before fix).
- Applied to existing 202608 wide; derived files rebuilt.

#283a (empty-string facet filter):
- In scripts/build_frontend_derived.py: changed IS NOT NULL filters to IS NOT NULL AND {d} <> '' in build_facet_summaries and build_facet_cross_filter. Removes the blank selectable facet entry from 586 GEOME records with empty-string concept URI.
- Updated scripts/validate_frontend_derived.py: algebraic recompute now matches the builder's empty-string filter; added new check 5b "facet_summaries no blank values (#283a)".
- Trust gate: blank facet_value count = 0.

#283b (deprecated specimentype/1.0 labels):
- In scripts/build_vocab_labels.py: added MANUAL_LABEL_OVERRIDES list with two entries for deprecated SESAR URIs absent from live TTLs:
  - specimentype/1.0/othersolidobject -> "Other solid object" (en)
  - specimentype/1.0/physicalspecimen -> "Material sample" (en)
- Rebuilt vocab_labels.parquet (539 rows, up from 537).
- Trust gate: both specimentype URIs present in vocab_labels with correct labels.

Fixture tests (tests/test_ingest_oc_records.py):
- 8 new tests added (28 total, all pass):
  - Fix isamplesorg#277: test_oc_description_enriched_from_eric_wide, test_non_oc_description_unchanged_by_enrichment, test_oc_msr_count_unchanged_by_enrichment
  - Fix #283a: test_empty_string_facet_values_filtered_from_summaries, test_empty_string_facet_values_filtered_from_cross_filter
  - Fix #283b: test_specimentype_othersolidobject_in_vocab_labels, test_specimentype_physicalspecimen_in_vocab_labels, test_specimentype_labels_have_lang_en

Validator: 26/26 checks pass (was 25/25).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
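
To make the #283a filter change concrete, the sketch below compares three predicates on a toy table: the original `IS NOT NULL`, the `<> ''` version from this commit, and the whitespace-aware `NULLIF(TRIM(...), '')` form that a later commit in this PR adopts. It uses the standard-library `sqlite3` module purely as a stand-in SQL engine; the table, column, and helper names are hypothetical and the real builder's queries are not reproduced here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (pid TEXT, material TEXT)")
con.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [
        ("a", "material/1.0/rock"),
        ("b", ""),      # empty-string concept URI (the #283a case)
        ("c", None),    # genuine NULL
        ("d", "   "),   # whitespace only
        ("e", "material/1.0/rock"),
    ],
)

def facet_counts(predicate: str):
    """Group by facet value, keeping only rows that satisfy the given predicate."""
    sql = (
        "SELECT material, COUNT(*) FROM samples "
        f"WHERE {predicate} GROUP BY material ORDER BY material"
    )
    return con.execute(sql).fetchall()

before = facet_counts("material IS NOT NULL")
after = facet_counts("material IS NOT NULL AND material <> ''")
strict = facet_counts("NULLIF(TRIM(material), '') IS NOT NULL")

print(before)  # [('', 1), ('   ', 1), ('material/1.0/rock', 2)]
print(after)   # [('   ', 1), ('material/1.0/rock', 2)]
print(strict)  # [('material/1.0/rock', 2)]
```

The `before` result shows how an empty string survives a plain NULL check and becomes a blank selectable facet entry.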

Reviewer hygiene: ISAMPLES_VOCAB_LABELS env var instead of a machine-specific /tmp default; falls back to building vocab_labels on the fly. 28/28 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
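
A minimal sketch of an env-driven fast path with a fallback, under the assumption that the variable holds a path to a prebuilt file. Only the `ISAMPLES_VOCAB_LABELS` name comes from the commit message; the helper name and the "return `None` to signal build-on-the-fly" convention are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "ISAMPLES_VOCAB_LABELS"

def vocab_labels_fast_path(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the prebuilt file named by the env var, or None to signal 'build on the fly'."""
    value = env.get(ENV_VAR)
    if value and Path(value).is_file():
        return Path(value)
    return None

# Variable unset: the caller falls back to building vocab_labels itself.
print(vocab_labels_fast_path({}))  # None

with tempfile.TemporaryDirectory() as tmp:
    prebuilt = Path(tmp) / "vocab_labels.parquet"
    prebuilt.write_bytes(b"")  # empty stand-in file, not a real parquet
    print(vocab_labels_fast_path({ENV_VAR: str(prebuilt)}) == prebuilt)  # True
```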

… ref extraction/trust gate, deterministic output

B1 (cross-source orphan protection): surviving_se_refs previously filtered to n='OPENCONTEXT', allowing SEs/Sites/Geos shared with SESAR/GEOME MSRs to be wrongly deleted. Fixed by removing the source filter: all surviving MSRs (any source) now protect shared subgraph entities.

B2A (incomplete agent extraction): Agents in p__responsibility were remapped but their rows were never extracted (only p__registrant was queried). Fixed by UNIONing p__responsibility into agent_ids.

B2B (trust gate extension): pre-write gate only checked p__produced_by and 3 concept dims. Added comprehensive post-write HARD FAIL gate checking all 10 p__* array columns (BIGINT[] + INTEGER[]) on new rows.

B3 (deterministic Phase J): Phase J UNION ALL lacked ORDER BY row_id, making sha256 non-reproducible. Added ORDER BY row_id before COPY TO PARQUET.

Nit A: whitespace-only facet filter (NULLIF(TRIM(d),'') IS NOT NULL) in build_frontend_derived.py + validator check 5b uses TRIM().

Nit B: Cyprus enrichment is now a hard RuntimeError at production scale (out_oc_count > 1M), not just a log statement.

Nit C: regression test test_cross_source_shared_entity_not_orphaned confirms B1 FAILS on old code and PASSES on fixed code.

Rebuild: 29/29 tests pass. All verification queries pass (b-k). Validator 26/26. Zero dangling refs across all 10 p__* columns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
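
The B3 point (row order affects the checksum) can be shown with a small sketch. It hashes JSON lines rather than parquet bytes, which is an assumption made to keep the example self-contained; `digest` and the sample rows are hypothetical.

```python
import hashlib
import json

def digest(rows) -> str:
    """sha256 over a line-per-row serialization; sensitive to row order."""
    payload = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def by_row_id(row):
    return row["row_id"]

# The same logical rows arriving in two different orders, as an unordered
# UNION ALL is free to produce.
run_a = [{"row_id": 2, "pid": "b"}, {"row_id": 1, "pid": "a"}]
run_b = list(reversed(run_a))

print(digest(run_a) == digest(run_b))  # False: unordered output is not reproducible
print(digest(sorted(run_a, key=by_row_id)) == digest(sorted(run_b, key=by_row_id)))  # True
```

Sorting on a unique key before writing is what makes two runs byte-comparable; sorting on a non-unique key would still leave ties in arbitrary order.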

…te (isamplesorg#272)

- Replace per-path orphan logic with general formulation: compute orphan SamplingSites BEFORE orphan Geos so that surviving_geo_refs includes Geos referenced by non-orphan SamplingSites via p__site_location (UNION with Geos from surviving SEs via p__sample_location).
- Fixes 4,606 dangling p__site_location refs introduced by Murlo sync: those Geos were wrongly deleted because orphan determination only checked p__sample_location on SEs, missing the p__site_location path from Sites.
- Replace new-rows-only B2B gate with mandatory ALL-rows ALL-columns gate before manifest emission: scans every p__* (BIGINT[] + INTEGER[]) column across entire output; RuntimeError + build abort if any dangling ref.
- New orphan counts: geo=16,621 (was 21,227), total removed=60,003 (was 64,609). 4,606 Geos correctly retained; output rows=20,821,668 (was 20,817,062).
- Regression test test_site_location_geo_not_orphaned: FAILS on old code (Geo deleted), PASSES on fixed code (Geo retained).
- All 30 pytest tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nd-6)

Fix A: replace hand-enumerated path-specific orphan tables with a true fixpoint algorithm. remove_set := stale MSR row_ids; iterate until stable: compute survivor_refs = all row_ids referenced by any surviving row through ALL 12 p__* columns (enumerated from schema, not hand-picked); add to remove_set any non-MSR, non-IdentifiedConcept candidate row not in survivor_refs. Converges in 4 passes on real data. Correctly removes 4 Agent orphans that path-specific logic missed (over-retention = 0). Total rows removed: 60,007 (Phase 5 was 60,003).

Fix B: silent-drop guard. For every p__* structural/concept column on new rows, assert remapped array length == source array length after inner-join remapping. Any mismatch → RuntimeError, build aborted. p__keywords excluded (best-effort URI lookup, documented). Correctly catches SE.p__sampling_site pointing to absent site (new regression test).

Regression tests (both FAIL on old code, PASS on fixed):
- test_orphan_geo_via_site_only_removed: Geo only via orphan Site's p__site_location (no surviving refs) must be REMOVED, not retained
- test_unresolved_new_ref_hard_fails: SE p__sampling_site=[999], no Site 999 in Eric's wide → RAISES, does NOT silently emit NULL

All verification checks:
- 0 dangling refs across all 12 p__* columns (in-script gate + independent)
- OC MSR: 1,110,791 | Rock: 37,953 | Cyprus: 69,230 | Blank facets: 0
- Total rows: 20,821,664 | Validator: 26/26 PASS
- 32/32 fixture tests pass

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
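
The fixpoint rule described under Fix A can be sketched on an in-memory graph. This is an illustration under stated assumptions, not the script's implementation: rows are plain dicts, every non-MSR, non-IdentifiedConcept row is treated as a removal candidate, and the otype strings other than `MaterialSampleRecord` and `IdentifiedConcept` are hypothetical labels.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, object]
PROTECTED_OTYPES = {"MaterialSampleRecord", "IdentifiedConcept"}

def fixpoint_remove_set(rows: List[Row], stale_msr_ids: Iterable[int],
                        ref_columns: List[str]) -> Set[int]:
    """Grow the remove set until no surviving candidate row is unreferenced."""
    remove = set(stale_msr_ids)
    while True:
        # Everything referenced by any surviving row, through any reference column.
        survivor_refs: Set[int] = set()
        for row in rows:
            if row["row_id"] not in remove:
                for col in ref_columns:
                    survivor_refs.update(row.get(col) or [])
        newly_orphaned = {
            row["row_id"]
            for row in rows
            if row["row_id"] not in remove
            and row["otype"] not in PROTECTED_OTYPES
            and row["row_id"] not in survivor_refs
        }
        if not newly_orphaned:
            return remove
        remove |= newly_orphaned

REF_COLUMNS = ["p__produced_by", "p__sampling_site", "p__sample_location", "p__site_location"]
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "p__produced_by": [10]},  # stale
    {"row_id": 2, "otype": "MaterialSampleRecord", "p__produced_by": [11]},  # survives
    {"row_id": 10, "otype": "SamplingEvent", "p__sampling_site": [20], "p__sample_location": [32]},
    {"row_id": 11, "otype": "SamplingEvent", "p__sampling_site": [21], "p__sample_location": [32]},
    {"row_id": 20, "otype": "SamplingSite", "p__site_location": [30]},
    {"row_id": 21, "otype": "SamplingSite", "p__site_location": [31]},
    {"row_id": 30, "otype": "GeospatialCoordLocation"},  # reachable only via orphan Site 20
    {"row_id": 31, "otype": "GeospatialCoordLocation"},
    {"row_id": 32, "otype": "GeospatialCoordLocation"},  # shared with surviving SE 11
]

print(sorted(fixpoint_remove_set(rows, stale_msr_ids=[1], ref_columns=REF_COLUMNS)))
# [1, 10, 20, 30]
```

In this toy graph the shared Geo 32 is retained because a surviving row still references it, while Geo 30 is removed once its only referrer (Site 20) has been orphaned in an earlier pass.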

round-7)

FIX 1 (BLOCKER): p__keywords now fully extracted and preserved:
- Extend new_concept_refs to include p__keywords alongside p__has_* dims
- All keyword concept URIs from new OC MSRs are now minted if absent from src
- Remove p__keywords carve-out from silent-drop guard; strict length check enforced
- Remove misleading "WHAT IT DOES NOT DO" note about keyword concepts
- Result: 252,978 keyword refs preserved across 67,183 new OC MSRs
- 1,046 concepts minted total (1 earthsurface + 1,045 new keyword concepts)
- 0 dangling p__keywords refs in output (all 12 columns clean)

FIX 2 (NIT): our_row_id uniqueness assertion on eric_id_map:
- Hard-fail if DENSE_RANK produces duplicate our_row_ids (duplicate otype,pid pairs)
- Cheap insurance against silent collision before write

REGRESSION TEST (fail-old/pass-new):
- test_new_msr_keywords_preserved: new OC MSR with 2 keyword refs (1 in src, 1 not in src → minted); verifies array length == 2, both targets resolve to IdentifiedConcept rows, 0 dangling p__keywords refs; update build_oc_wide to accept p__keywords in MSR rows

Build verified (round-7, /tmp/ingest_202608_r7/):
- total rows: 20,822,709 (+1,045 vs round-6's 20,821,664, exactly the new concepts)
- OC MSR: 1,110,791 | Cyprus: 69,230 | blank facets: 0
- removed/stale pids in output: 0
- dangling refs all 12 p__* cols: 0
- pytest: 33/33 PASS | validator: 26/26 PASS

No push per standing rules (RY independently re-verifies + Codex round 4).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
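
The combination of minting missing keyword concepts and enforcing a strict length check can be sketched as follows. The helper names, the example URIs, and the starting row ids are all hypothetical; the sketch only mirrors the behaviour described in this commit (one keyword already in the source, one minted, remapped array length equal to source length).

```python
from typing import Dict, List

def mint_missing_concepts(uris: List[str], concept_ids: Dict[str, int], next_row_id: int) -> int:
    """Assign a new row_id to every URI not already known; return the next free id."""
    for uri in sorted(set(uris)):  # sorted for a deterministic assignment
        if uri not in concept_ids:
            concept_ids[uri] = next_row_id
            next_row_id += 1
    return next_row_id

def remap_strict(source_refs: List[str], concept_ids: Dict[str, int]) -> List[int]:
    """Inner-join style remap that hard-fails instead of silently dropping a reference."""
    remapped = [concept_ids[uri] for uri in source_refs if uri in concept_ids]
    if len(remapped) != len(source_refs):
        missing = [uri for uri in source_refs if uri not in concept_ids]
        raise RuntimeError(f"unresolved references would be dropped: {missing}")
    return remapped

concept_ids = {"kw/pottery": 100}        # already present in the source
keywords = ["kw/pottery", "kw/cyprus"]   # second keyword is new

next_free = mint_missing_concepts(keywords, concept_ids, next_row_id=101)
print(remap_strict(keywords, concept_ids))  # [100, 101]
print(next_free)                            # 102
```

Without the minting step, `remap_strict` raises on `kw/cyprus`, which is the hard-fail behaviour the silent-drop guard is meant to provide.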

rdhyee and others added 8 commits June 12, 2026 08:01

test: make vocab_labels fast-path env-driven, drop hardcoded /tmp

9789ca6


rdhyee marked this pull request as ready for review June 13, 2026 15:34

rdhyee merged commit 0ca84d0 into main Jun 13, 2026
3 checks passed

rdhyee merged 8 commits into main from pipeline/oc-sync-272-202608
