pipeline: OC sync (#272) + #277/#283 fixes → isamples_202608 [DRAFT, no publish] #10
Merged
…rg#272 phase 2)

Gap analysis against Eric's 2026-06-09 OC wide + local 202604 wide:
- 67,187 new MSRs, 152,311 total new entity rows
- All new records have coords via MSR->SE->GeoCoordLoc graph path (100%)
- 1 new concept to mint against 202606 base: earthsurface
- Row-id strategy: dense rank from max(src)+1 (20,729,359+)
- Key bug found+fixed: geometry type mismatch (BLOB vs GEOMETRY) needs ST_AsWKB()

Deliverables:
- DESIGN_272_INGEST.md: gap characterization, schema mapping, pipeline plan, trust-gate invariants, 8 open decisions for RY, honest-gaps section
- scripts/ingest_oc_records.py: spike implementation (dry-run verified + full write verified: all post-write trust gates pass, Stage 4 ALL CHECKS PASS)
- SPIKE_RESULTS.md: real numbers from all executed queries

No push, no R2 publish, no PR. Spike only, for RY review.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
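The row-id strategy in this commit (dense rank starting at max(src)+1) can be illustrated with a minimal sketch. It assumes new entities are keyed by an (otype, pid) pair; the function name `assign_row_ids` and the sample keys are hypothetical and are not taken from scripts/ingest_oc_records.py.

```python
def assign_row_ids(keys, max_src_row_id):
    """Dense-rank distinct (otype, pid) keys and offset them past existing ids.

    Illustrative only: the real pipeline does this in SQL; this mirrors the idea.
    """
    ordered = sorted(set(keys))  # distinct keys in a stable order
    return {key: max_src_row_id + 1 + rank for rank, key in enumerate(ordered)}


# Hypothetical sample keys; the duplicate must map to the same id.
new_keys = [
    ("MaterialSampleRecord", "ark:/demo/b"),
    ("SamplingEvent", "ark:/demo/a"),
    ("MaterialSampleRecord", "ark:/demo/b"),
]
id_map = assign_row_ids(new_keys, max_src_row_id=99)
```

Because the ranks are dense and start one past the source maximum, new ids cannot collide with existing rows and leave no gaps.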
…ale (isamplesorg#272 phase 2)

D3 decision (RY 2026-06-12): remove stale OC pids rather than keep them. OpenContext mass-updated Murlo project PIDs; old PIDs duplicate same physical samples.

What changed:
- scripts/ingest_oc_records.py: production sync logic
  - Removes 21,227 stale OC MSR pids + 43,382 orphan subgraph entities (SE/Geo/Site)
  - Adds 67,187 new OC pids + full entity subgraph (152,311 rows)
  - Mints 1 new IdentifiedConcept (sampledfeature/1.0/earthsurface)
  - Output: 20,817,062 rows, OC MSR = 1,110,791 (matches Eric exactly)
  - All 25 validate_frontend_derived.py checks pass
  - Base: isamples_202606_wide.parquet; tag: isamples_202608
- tests/test_ingest_oc_records.py: 20 fixture tests (all pass)
  - removal, orphan cleanup, subgraph ingestion, remapping, trust-gate invariants
- DESIGN_272_INGEST.md: all decisions resolved; orphan analysis results added
- SYNC_RESULTS.md: actual run numbers and verification query outputs
- Makefile: add ingest-272 and all-202608 targets

Rock count after sync: 37,953 OC rocks (30,272 - 85 removed + 7,766 new).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
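The counts quoted in this commit can be cross-checked with plain arithmetic. The sketch below only re-adds the numbers stated in the commit messages; it assumes the minted concept is counted separately from the 152,311 entity rows, which the messages do not state explicitly.

```python
# Rock count: before, minus removed, plus new.
rocks_after = 30_272 - 85 + 7_766

# Rows removed: stale MSR pids plus orphan subgraph entities.
total_removed = 21_227 + 43_382

# Rows added: new entity subgraph plus one minted concept (assumed separate).
total_added = 152_311 + 1

# Base row count implied by the reported output of 20,817,062 rows.
implied_base_rows = 20_817_062 - total_added + total_removed
```

Under that assumption the implied base comes out at 20,729,359, the same figure the earlier commit quotes for the start of the new row-id range.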
…s into 202608 rebuild

Phase 3 bug fixes applied before staging 202608:

isamplesorg#277 (OC description enrichment):
- Added Phase J to scripts/ingest_oc_records.py: after sync write, UNION ALL approach overwrites `description` for all OC MSR pids from Eric's OC wide. Uses SELECT * REPLACE for schema-agnostic column handling. Non-OC rows and OC rows with NULL description in Eric's wide are unchanged.
- Trust gate: Cyprus OC MSR count = 69,230 (was 0 before fix).
- Applied to existing 202608 wide; derived files rebuilt.

#283a (empty-string facet filter):
- In scripts/build_frontend_derived.py: changed IS NOT NULL filters to IS NOT NULL AND {d} <> '' in build_facet_summaries and build_facet_cross_filter. Removes the blank selectable facet entry from 586 GEOME records with empty-string concept URI.
- Updated scripts/validate_frontend_derived.py: algebraic recompute now matches the builder's empty-string filter; added new check 5b "facet_summaries no blank values (#283a)".
- Trust gate: blank facet_value count = 0.

#283b (deprecated specimentype/1.0 labels):
- In scripts/build_vocab_labels.py: added MANUAL_LABEL_OVERRIDES list with two entries for deprecated SESAR URIs absent from live TTLs:
    specimentype/1.0/othersolidobject -> "Other solid object" (en)
    specimentype/1.0/physicalspecimen -> "Material sample" (en)
  Rebuilt vocab_labels.parquet (539 rows, up from 537).
- Trust gate: both specimentype URIs present in vocab_labels with correct labels.

Fixture tests (tests/test_ingest_oc_records.py):
- 8 new tests added (28 total, all pass):
    Fix isamplesorg#277: test_oc_description_enriched_from_eric_wide, test_non_oc_description_unchanged_by_enrichment, test_oc_msr_count_unchanged_by_enrichment
    Fix #283a: test_empty_string_facet_values_filtered_from_summaries, test_empty_string_facet_values_filtered_from_cross_filter
    Fix #283b: test_specimentype_othersolidobject_in_vocab_labels, test_specimentype_physicalspecimen_in_vocab_labels, test_specimentype_labels_have_lang_en

Validator: 26/26 checks pass (was 25/25).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
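The effect of the #283a filter change can be reproduced on a toy table. This is a sketch only: `sqlite3` from the standard library stands in for the pipeline's SQL engine, and the `samples` table and `material` column are hypothetical. The third predicate is the whitespace-aware form that a later commit in this PR adopts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (pid TEXT, material TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("a", "rock"), ("b", ""), ("c", None), ("d", "  "), ("e", "rock")],
)


def facet_counts(predicate):
    """Count rows per facet value, keeping only rows that satisfy `predicate`."""
    sql = (
        "SELECT material, COUNT(*) FROM samples "
        f"WHERE {predicate} GROUP BY material ORDER BY material"
    )
    return conn.execute(sql).fetchall()


null_only = facet_counts("material IS NOT NULL")                     # old filter
non_empty = facet_counts("material IS NOT NULL AND material <> ''")  # #283a filter
trimmed = facet_counts("NULLIF(TRIM(material), '') IS NOT NULL")     # whitespace-aware
```

With the old filter the empty string survives as its own facet value; the #283a predicate drops it, and only the trimmed form also drops the whitespace-only value.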
Reviewer hygiene: ISAMPLES_VOCAB_LABELS env var instead of a machine-specific /tmp default; falls back to building vocab_labels on the fly.

28/28 tests pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
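A minimal sketch of the env-var-with-fallback pattern this commit describes. Only the variable name ISAMPLES_VOCAB_LABELS comes from the commit; the function `resolve_vocab_labels`, its `build_fn` parameter, and the paths are hypothetical, and the sketch assumes the fallback triggers whenever the variable is unset or empty.

```python
import os
from pathlib import Path


def resolve_vocab_labels(build_fn, env=os.environ):
    """Return the configured vocab_labels path, or build one when none is set."""
    configured = env.get("ISAMPLES_VOCAB_LABELS")
    if configured:
        return Path(configured)
    # No machine-specific default: build the file on demand instead.
    return build_fn()


# Hypothetical builder that would normally write the parquet file first.
def build_on_the_fly():
    return Path("vocab_labels.parquet")


from_env = resolve_vocab_labels(
    build_on_the_fly, env={"ISAMPLES_VOCAB_LABELS": "/data/vocab_labels.parquet"}
)
from_fallback = resolve_vocab_labels(build_on_the_fly, env={})
```

Passing the environment as a parameter keeps the lookup testable without mutating `os.environ`.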
… ref extraction/trust gate, deterministic output

B1 (cross-source orphan protection): surviving_se_refs previously filtered to n='OPENCONTEXT', allowing SEs/Sites/Geos shared with SESAR/GEOME MSRs to be wrongly deleted. Fixed by removing the source filter: all surviving MSRs (any source) now protect shared subgraph entities.

B2A (incomplete agent extraction): Agents in p__responsibility were remapped but their rows were never extracted (only p__registrant was queried). Fixed by UNIONing p__responsibility into agent_ids.

B2B (trust gate extension): pre-write gate only checked p__produced_by and 3 concept dims. Added comprehensive post-write HARD FAIL gate checking all 10 p__* array columns (BIGINT[] + INTEGER[]) on new rows.

B3 (deterministic Phase J): Phase J UNION ALL lacked ORDER BY row_id, making sha256 non-reproducible. Added ORDER BY row_id before COPY TO PARQUET.

Nit A: whitespace-only facet filter (NULLIF(TRIM(d),'') IS NOT NULL) in build_frontend_derived.py + validator check 5b uses TRIM().

Nit B: Cyprus enrichment is now a hard RuntimeError at production scale (out_oc_count > 1M), not just a log statement.

Nit C: regression test test_cross_source_shared_entity_not_orphaned confirms B1 FAILS on old code and PASSES on fixed code.

Rebuild: 29/29 tests pass. All verification queries pass (b-k). Validator 26/26. Zero dangling refs across all 10 p__* columns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
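The B3 point (row order decides whether the output hash is reproducible) can be shown in a few lines. This is a sketch under stated assumptions: JSON serialization stands in for the Parquet writer, and the `digest` helper and sample rows are hypothetical.

```python
import hashlib
import json


def digest(rows, order_by_row_id):
    """Hash a serialized row set, optionally sorting by row_id first."""
    if order_by_row_id:
        rows = sorted(rows, key=lambda row: row["row_id"])
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rows = [
    {"row_id": 2, "description": "sherd"},
    {"row_id": 1, "description": "tile"},
]
shuffled = list(reversed(rows))  # same rows, different arrival order

unordered_match = digest(rows, False) == digest(shuffled, False)
ordered_match = digest(rows, True) == digest(shuffled, True)
```

Without the sort, the same logical data hashes differently depending on arrival order; with it, the digest is stable, which is the property ORDER BY row_id restores.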
…te (isamplesorg#272)

- Replace per-path orphan logic with general formulation: compute orphan SamplingSites BEFORE orphan Geos so that surviving_geo_refs includes Geos referenced by non-orphan SamplingSites via p__site_location (UNION with Geos from surviving SEs via p__sample_location).
- Fixes 4,606 dangling p__site_location refs introduced by Murlo sync: those Geos were wrongly deleted because orphan determination only checked p__sample_location on SEs, missing the p__site_location path from Sites.
- Replace new-rows-only B2B gate with mandatory ALL-rows ALL-columns gate before manifest emission: scans every p__* (BIGINT[] + INTEGER[]) column across entire output; RuntimeError + build abort if any dangling ref.
- New orphan counts: geo=16,621 (was 21,227), total removed=60,003 (was 64,609). 4,606 Geos correctly retained; output rows=20,821,668 (was 20,817,062).
- Regression test test_site_location_geo_not_orphaned: FAILS on old code (Geo deleted), PASSES on fixed code (Geo retained).
- All 30 pytest tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
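The ALL-rows ALL-columns gate described above amounts to checking that every referenced row_id exists somewhere in the output. A minimal in-memory sketch follows; the function name `assert_no_dangling_refs` and the row dictionaries are hypothetical, and the column list is hand-written here for brevity rather than enumerated from a schema.

```python
def assert_no_dangling_refs(rows, ref_columns):
    """Raise RuntimeError if any reference points at a row_id that is absent."""
    known_ids = {row["row_id"] for row in rows}
    dangling = [
        (row["row_id"], column, ref)
        for row in rows
        for column in ref_columns
        for ref in (row.get(column) or [])  # a NULL array counts as empty
        if ref not in known_ids
    ]
    if dangling:
        raise RuntimeError(f"{len(dangling)} dangling refs, first: {dangling[0]}")


REF_COLUMNS = ["p__site_location", "p__sample_location"]
clean = [{"row_id": 1, "p__site_location": [2]}, {"row_id": 2}]
broken = [{"row_id": 1, "p__site_location": [2]}]  # Geo 2 was wrongly deleted

assert_no_dangling_refs(clean, REF_COLUMNS)  # passes silently
try:
    assert_no_dangling_refs(broken, REF_COLUMNS)
    gate_tripped = False
except RuntimeError:
    gate_tripped = True
```

Scanning every row, not just new ones, is what lets the gate catch damage done to pre-existing rows by a deletion.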
…nd-6)

Fix A: replace hand-enumerated path-specific orphan tables with a true fixpoint algorithm. remove_set := stale MSR row_ids; iterate until stable: compute survivor_refs = all row_ids referenced by any surviving row through ALL 12 p__* columns (enumerated from schema, not hand-picked); add to remove_set any non-MSR, non-IdentifiedConcept candidate row not in survivor_refs. Converges in 4 passes on real data. Correctly removes 4 Agent orphans that path-specific logic missed (over-retention = 0). Total rows removed: 60,007 (Phase 5 was 60,003).

Fix B: silent-drop guard. For every p__* structural/concept column on new rows, assert remapped array length == source array length after inner-join remapping. Any mismatch → RuntimeError, build aborted. p__keywords excluded (best-effort URI lookup, documented). Correctly catches SE.p__sampling_site pointing to absent site (new regression test).

Regression tests (both FAIL on old code, PASS on fixed):
- test_orphan_geo_via_site_only_removed: Geo only via orphan Site's p__site_location (no surviving refs) must be REMOVED, not retained
- test_unresolved_new_ref_hard_fails: SE p__sampling_site=[999], no Site 999 in Eric's wide → RAISES, does NOT silently emit NULL

All verification checks:
- 0 dangling refs across all 12 p__* columns (in-script gate + independent)
- OC MSR: 1,110,791 | Rock: 37,953 | Cyprus: 69,230 | Blank facets: 0
- Total rows: 20,821,664 | Validator: 26/26 PASS
- 32/32 fixture tests pass

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
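The fixpoint in Fix A can be sketched on an in-memory graph. This follows the description in the commit message (seed with stale MSRs, recompute survivor refs, remove unreferenced non-MSR, non-IdentifiedConcept rows, repeat until stable); the function name `find_removals`, the row layout, and the toy data are hypothetical, and the real script works on the wide table rather than Python dictionaries.

```python
def find_removals(rows, stale_ids, ref_columns, protected_types):
    """Iterate to a fixpoint: remove rows that no surviving row references."""
    remove = set(stale_ids)
    passes = 0
    while True:
        passes += 1
        survivor_refs = set()
        for row in rows:
            if row["row_id"] in remove:
                continue  # removed rows protect nothing
            for column in ref_columns:
                survivor_refs.update(row.get(column) or [])
        newly_orphaned = {
            row["row_id"]
            for row in rows
            if row["row_id"] not in remove
            and row["otype"] not in protected_types
            and row["row_id"] not in survivor_refs
        }
        if not newly_orphaned:
            return remove, passes
        remove |= newly_orphaned


COLUMNS = ["p__produced_by", "p__sampling_site", "p__sample_location", "p__site_location"]
rows = [
    {"row_id": 1, "otype": "MaterialSampleRecord", "p__produced_by": [2]},  # stale
    {"row_id": 2, "otype": "SamplingEvent", "p__sampling_site": [3], "p__sample_location": [5]},
    {"row_id": 3, "otype": "SamplingSite", "p__site_location": [4]},
    {"row_id": 4, "otype": "GeospatialCoordLocation"},
    {"row_id": 5, "otype": "GeospatialCoordLocation"},  # shared with row 7
    {"row_id": 6, "otype": "MaterialSampleRecord", "p__produced_by": [7]},  # survives
    {"row_id": 7, "otype": "SamplingEvent", "p__sample_location": [5]},
    {"row_id": 8, "otype": "IdentifiedConcept"},  # never a removal candidate
]
removed, passes = find_removals(
    rows, {1}, COLUMNS, {"MaterialSampleRecord", "IdentifiedConcept"}
)
```

In this toy graph the removal cascades one hop per pass (event, then site, then its location), while the location shared with a surviving record from another source is retained.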
round-7)

FIX 1 (BLOCKER): p__keywords now fully extracted and preserved:
- Extend new_concept_refs to include p__keywords alongside p__has_* dims
- All keyword concept URIs from new OC MSRs are now minted if absent from src
- Remove p__keywords carve-out from silent-drop guard; strict length check enforced
- Remove misleading "WHAT IT DOES NOT DO" note about keyword concepts
- Result: 252,978 keyword refs preserved across 67,183 new OC MSRs
- 1,046 concepts minted total (1 earthsurface + 1,045 new keyword concepts)
- 0 dangling p__keywords refs in output (all 12 columns clean)

FIX 2 (NIT): our_row_id uniqueness assertion on eric_id_map:
- Hard-fail if DENSE_RANK produces duplicate our_row_ids (duplicate otype,pid pairs)
- Cheap insurance against silent collision before write

REGRESSION TEST (fail-old/pass-new):
- test_new_msr_keywords_preserved: new OC MSR with 2 keyword refs (1 in src, 1 not in src → minted); verifies array length == 2, both targets resolve to IdentifiedConcept rows, 0 dangling p__keywords refs; update build_oc_wide to accept p__keywords in MSR rows

Build verified (round-7, /tmp/ingest_202608_r7/):
- total rows: 20,822,709 (+1,045 vs round-6's 20,821,664, exactly the new concepts)
- OC MSR: 1,110,791 | Cyprus: 69,230 | blank facets: 0
- removed/stale pids in output: 0
- dangling refs all 12 p__* cols: 0
- pytest: 33/33 PASS | validator: 26/26 PASS

No push per standing rules (RY independently re-verifies + Codex round 4).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
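The interaction between minting and the strict length check can be sketched as follows. The helper names `remap_refs` and `mint_missing`, the keyword URIs, and the id values are hypothetical; the sketch only mirrors the behaviour the commit describes, where an inner-join style remap that loses a reference must abort instead of silently shortening the array.

```python
def remap_refs(source_refs, id_map):
    """Remap references, raising if any source ref fails to resolve."""
    remapped = [id_map[ref] for ref in source_refs if ref in id_map]  # inner-join style
    if len(remapped) != len(source_refs):
        missing = [ref for ref in source_refs if ref not in id_map]
        raise RuntimeError(f"silent drop prevented, unresolved refs: {missing}")
    return remapped


def mint_missing(uris, concept_ids, next_row_id):
    """Assign fresh row ids to concept URIs that are absent from the map."""
    for uri in uris:
        if uri not in concept_ids:
            concept_ids[uri] = next_row_id
            next_row_id += 1
    return next_row_id


concept_ids = {"kw:pottery": 10}          # already present in the source
keywords = ["kw:pottery", "kw:bronze"]    # second keyword is new

try:
    remap_refs(keywords, concept_ids)
    raised_before_mint = False
except RuntimeError:
    raised_before_mint = True

next_id = mint_missing(keywords, concept_ids, next_row_id=11)
remapped = remap_refs(keywords, concept_ids)
```

Minting first and checking afterwards is what allows the keyword column to be held to the same strict length rule as the structural columns.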
What this is
Data-pipeline code for the isamplesorg#272 next phase: a true sync of OpenContext records against @ekansa's current export (+67,187 new / −21,227 stale Murlo re-IDs, per his Option B), plus three diagnosed bug fixes (isamplesorg#277 descriptions, #283a/b facets). This is a reproducible recipe; the data itself is staged at ~/Data/iSample/pqg_refining/staged_202608/, not in the repo.

Review history: converged after 4 Codex rounds + independent verification
p__site_location (orphan-Geo only protected the SE path)

Verified numbers (independently re-checked against the staged build)
material/1.0/rock

Hardening that outlasts this PR
Scope / not in this PR
isamplesorg:main after staging inspection, same flow as Rigorous, reproducible derived-parquet pipeline + AI-free tests (#273) isamplesorg/isamplesorg.github.io#274 / OC concept reintegration: overlay Eric's material/object-type mappings onto the wide (#272, fixes #260) isamplesorg/isamplesorg.github.io#275.

🤖 Generated with Claude Code