Merged
8 changes: 7 additions & 1 deletion .github/workflows/pipeline-tests.yml
@@ -7,7 +7,10 @@ on:
paths:
- "scripts/build_frontend_derived.py"
- "scripts/validate_frontend_derived.py"
- "scripts/enrich_wide_with_oc_concepts.py"
- "scripts/validate_oc_concept_enrichment.py"
- "tests/test_frontend_derived.py"
- "tests/test_oc_concept_enrichment.py"
- "scripts/requirements.txt"
- "Makefile"
- ".github/workflows/pipeline-tests.yml"
@@ -16,7 +19,10 @@ on:
paths:
- "scripts/build_frontend_derived.py"
- "scripts/validate_frontend_derived.py"
- "scripts/enrich_wide_with_oc_concepts.py"
- "scripts/validate_oc_concept_enrichment.py"
- "tests/test_frontend_derived.py"
- "tests/test_oc_concept_enrichment.py"
workflow_dispatch:

jobs:
@@ -33,4 +39,4 @@ jobs:
# builds tiny synthetic wides (WKB BLOB + DuckDB GEOMETRY), runs the real
# builder + algebraic validator, asserts the contract. Exits non-zero on
# any failure -> PR is blocked.
run: python -m pytest tests/test_frontend_derived.py -q
run: python -m pytest tests/test_frontend_derived.py tests/test_oc_concept_enrichment.py -q
10 changes: 7 additions & 3 deletions DATA_PROVENANCE.md
@@ -17,8 +17,11 @@ STAGE 0/1 export_client → JSONL → GeoParquet
STAGE 2 pqg/pqg/sql_converter.py (export → base PQG; 7-stage DuckDB SQL)
→ narrow (…_narrow.parquet, ~844MB, 106M rows) and wide (…_wide.parquet, ~282MB, 20M rows)
STAGE 3 sidecar/enrichment merge (LEFT JOIN by pid) ← Eric's independently-maintained OC PQG (GCS)
scripts/enrich_wide_with_oc_thumbnails.py → isamples_202604_wide.parquet (+47K thumbnails)
STAGE 3 sidecar/enrichment merges (LEFT JOIN by pid) ← Eric's independently-maintained OC PQG (GCS)
3a scripts/enrich_wide_with_oc_thumbnails.py → isamples_202604_wide.parquet (+47K thumbnails)
3b scripts/enrich_wide_with_oc_concepts.py → isamples_202606_wide.parquet (#272: OC material/
object-type concepts REPLACE the frozen export's for OC pids — OC wins unconditionally;
gate: scripts/validate_oc_concept_enrichment.py)
STAGE 4 wide → frontend derived files (NOW SCRIPTED: scripts/build_frontend_derived.py)
→ wide_h3 · h3_summary_res4/6/8 · samples_map_lite · sample_facets_v2 · facet_summaries · facet_cross_filter
@@ -36,7 +39,8 @@ DuckDB-WASM in the browser (explorer.qmd; parquet URLs ~L767-781)
|---|---|---|---|
| **0/1 Export** | Solr API → `isamples_export_*_geo.parquet` | `export_client` `ExportClient.perform_full_download()` (`export_client.py:423-469`) → `write_geoparquet_from_json_lines()`; schema `SOURCE_COLUMNS` (`duckdb_utilities.py:9-42`, incl. `keywords: STRUCT(keyword VARCHAR)[]` — **text only, no URI**, L17) | ❌ API offline; **frozen** |
| **2 Base PQG** | export → `*_narrow.parquet` / `*_wide.parquet` | `pqg/pqg/sql_converter.py` `convert_isamples_sql(input, output, wide=…)` (CLI `python pqg/sql_converter.py in.parquet out.parquet [--wide]`); 7 stages, decomposes nested structs → nodes+edges; site dedupe by rounded lat/lon+label | ✅ scripted (exact prod invocation not recorded — gap) |
| **3 Sidecar merge** | base wide + Eric's OC PQG → `isamples_202604_wide.parquet` | `scripts/enrich_wide_with_oc_thumbnails.py` — `LEFT JOIN` OC `(pid, thumbnail_url)` into wide (`COALESCE`). **This is the precedent for merging ANY per-source supplement (incl. concept URIs) by pid.** Drift check: `scripts/check_oc_pqg_drift.py` (detects only; no mirror) | ⚠️ merge scripted; OC mirror + R2 upload manual |
| **3a Sidecar: thumbnails** | base wide + Eric's OC PQG → `isamples_202604_wide.parquet` | `scripts/enrich_wide_with_oc_thumbnails.py` — `LEFT JOIN` OC `(pid, thumbnail_url)` into wide (`COALESCE`). Drift check: `scripts/check_oc_pqg_drift.py` (detects only; no mirror) | ⚠️ merge scripted; OC mirror + R2 upload manual |
| **3b Sidecar: OC concepts (#272)** | 3a wide + Eric's OC **wide** → `isamples_202606_wide.parquet` | `scripts/enrich_wide_with_oc_concepts.py` — REPLACES `p__has_material_category` / `p__has_sample_object_type` for OC pids with OC's ordered concept lists (**OC wins unconditionally** — RY decision 2026-06-10, #272); mints `IdentifiedConcept` rows for URIs the frozen export never had (e.g. `otheranthropogenicmaterial`, the #260 fix); deterministic; emits `.manifest.json`. Independent gate: `scripts/validate_oc_concept_enrichment.py` (re-derives from inputs; non-overlay rows must be byte-identical). Scope: overlay only — ~75K OC records absent from the frozen export are NOT ingested (follow-up); `p__has_context_category` untouched (follow-up). | ✅ merge + gate scripted (`make all-272`); R2 upload manual |
| **4 Frontend derived** | wide → 7 explorer files | The 6 map/facet files (`wide_h3`, `h3_summary_res4/6/8`, `samples_map_lite`, `sample_facets_v2`, `facet_summaries`, `facet_cross_filter`) ← **`scripts/build_frontend_derived.py`** (deterministic; geometry-agnostic; emits a manifest). `vocab_labels.parquet` ← `scripts/build_vocab_labels.py` (SKOS TTLs). Gated by `scripts/validate_frontend_derived.py` (algebraic + `--wide` semantic re-derivation) + `tests/test_frontend_derived.py` (fixtures, CI). | ✅ scripted; facet/map files semantic-tested; wide_h3 column-smoke-tested |
| **5 Publish** | files → R2 + Worker | Worker `workers/data-isamples-org/src/index.js` (`wrangler deploy`); immutable cache for `isamples_\d{6}_*.parquet`; `/current/<flavor>.parquet` → 302 via `current/manifest.json`. Bucket `isamples-ry` | ⚠️ Worker scripted; **file upload + manifest update are manual** |
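
The difference between the two Stage 3 merges in the table above (3a fills via `COALESCE`, 3b replaces unconditionally for OC pids) can be sketched in plain Python. This is an illustrative model only, not the code in the `scripts/enrich_wide_with_oc_*.py` files: the function names `coalesce_merge` and `overlay_replace` and the dict-based rows are hypothetical, and the sketch assumes the 3a `COALESCE` prefers the existing base value.

```python
from typing import Any, Dict


def coalesce_merge(base: Dict[str, Any], sidecar: Dict[str, Any]) -> Dict[str, Any]:
    """3a-style fill: keep the base value; use the sidecar only where base is None.

    ASSUMPTION: base-first argument order for COALESCE (not confirmed by the diff).
    """
    return {
        pid: (value if value is not None else sidecar.get(pid))
        for pid, value in base.items()
    }


def overlay_replace(base: Dict[str, Any], sidecar: Dict[str, Any]) -> Dict[str, Any]:
    """3b-style overlay: the sidecar wins for every pid it covers.

    Iterating over `base` only mirrors a LEFT JOIN: pids that exist solely in
    the sidecar are not ingested, and pids outside the sidecar stay untouched.
    """
    return {pid: sidecar.get(pid, value) for pid, value in base.items()}
```

Both helpers iterate over the base rows only, which is why sidecar-only records never appear in the output; that matches the "overlay only" scope stated for 3b.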

57 changes: 48 additions & 9 deletions Makefile
@@ -2,45 +2,84 @@
#
# make test # fast fixture tests (no network, no big data) — the CI gate
# make wide # download + checksum the canonical wide parquet
# make derived # build the derived files from $(WIDE) into $(OUTDIR)
# make oc-wide # download + checksum Eric's OC PQG wide (concept source of truth, #272)
# make enrich # overlay OC material/object-type concepts onto $(WIDE) -> $(ENRICHED)
# make validate-enrich # independent trust gate for the enrichment (non-zero exit on failure)
# make derived # build the derived files from $(DERIVED_WIDE) into $(OUTDIR)
# make validate # algebraic trust gate over the built files (non-zero exit on failure)
# make all # wide -> derived -> validate
# make all # wide -> derived -> validate (no enrichment)
# make all-272 # wide+oc-wide -> enrich -> validate-enrich -> derived -> validate
#
# Override on the command line, e.g.:
# make all WIDE_URL=https://data.isamples.org/isamples_202604_wide.parquet TAG=isamples_202606
# make all-272 TAG=isamples_202606
#
# Requirements: python with `pip install -r scripts/requirements.txt`, plus
# network access on first run (DuckDB pulls the h3 community extension).

PY ?= python
WIDE_URL ?= https://data.isamples.org/isamples_202604_wide.parquet
OC_WIDE_URL ?= https://storage.googleapis.com/opencontext-parquet/oc_isamples_pqg_wide.parquet
OUTDIR ?= build/derived
WIDE ?= $(OUTDIR)/wide.parquet
OC_WIDE ?= $(OUTDIR)/oc_wide.parquet
TAG ?= isamples_dev
ENRICHED ?= $(OUTDIR)/$(TAG)_wide.parquet
# derived files build from the plain wide by default; all-272 overrides to the enriched one
DERIVED_WIDE ?= $(WIDE)
BUILD := scripts/build_frontend_derived.py
VALIDATE := scripts/validate_frontend_derived.py
ENRICH := scripts/enrich_wide_with_oc_concepts.py
VALIDATE_ENRICH := scripts/validate_oc_concept_enrichment.py

.PHONY: help test wide derived validate all clean
.PHONY: help test wide oc-wide enrich validate-enrich derived validate all all-272 clean
help:
@grep -E '^# make' Makefile | sed 's/^# / /'

# Fast, deterministic fixture tests — the gate a human (or CI) runs without any AI.
test:
$(PY) -m pytest tests/test_frontend_derived.py -q
$(PY) -m pytest tests/test_frontend_derived.py tests/test_oc_concept_enrichment.py -q

wide: $(WIDE)
$(WIDE):
@mkdir -p $(OUTDIR)
curl -fSL -o $(WIDE) "$(WIDE_URL)"
@echo "sha256: $$(shasum -a 256 $(WIDE) | cut -d' ' -f1) $(WIDE)"

derived: $(WIDE)
$(PY) $(BUILD) --wide $(WIDE) --outdir $(OUTDIR) --tag $(TAG) --skip wide_h3
oc-wide: $(OC_WIDE)
$(OC_WIDE):
@mkdir -p $(OUTDIR)
curl -fSL -o $(OC_WIDE) "$(OC_WIDE_URL)"
@echo "sha256: $$(shasum -a 256 $(OC_WIDE) | cut -d' ' -f1) $(OC_WIDE)"

# real file dependency so `make -j` orders enrich before validate-enrich
enrich: $(ENRICHED)
$(ENRICHED): $(WIDE) $(OC_WIDE)
$(PY) $(ENRICH) --src $(WIDE) --oc-wide $(OC_WIDE) --out $(ENRICHED)

validate-enrich: $(ENRICHED)
$(PY) $(VALIDATE_ENRICH) --src $(WIDE) --oc-wide $(OC_WIDE) --out $(ENRICHED)

derived: $(DERIVED_WIDE)
$(PY) $(BUILD) --wide $(DERIVED_WIDE) --outdir $(OUTDIR) --tag $(TAG) --skip wide_h3

# Sentinel expectation tracks data vintage: the plain (non-enriched) chain
# validates a frozen-export wide -> legacy value; the all-272 chain overrides
# to the OC-corrected default baked into the validator.
LEGACY_SENTINEL := https://w3id.org/isample/vocabulary/material/1.0/anthropogenicmetal
SENTINEL_FLAG ?= --sentinel-material $(LEGACY_SENTINEL)

validate:
$(PY) $(VALIDATE) --dir $(OUTDIR) --tag $(TAG)
$(PY) $(VALIDATE) --dir $(OUTDIR) --tag $(TAG) $(SENTINEL_FLAG)

# ordered sub-makes: safe under `make -j` (derived must finish before validate)
all: wide
$(MAKE) derived
$(MAKE) validate

all: wide derived validate
# Full #272 chain: enrich the wide with OC concepts, gate it, then build+gate derived.
all-272: validate-enrich
$(MAKE) derived DERIVED_WIDE=$(ENRICHED) TAG=$(TAG)
$(MAKE) validate TAG=$(TAG) SENTINEL_FLAG=

clean:
rm -rf $(OUTDIR)
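
The `wide` and `oc-wide` targets above print a SHA-256 digest after each download but do not compare it to anything. As a minimal sketch of how that digest could be recomputed and checked from Python, assuming a caller who has recorded an expected digest elsewhere: the helper name `sha256_of` is hypothetical and is not part of this PR.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in chunks.

    Produces the same hex string as the first field of `shasum -a 256 <path>`.
    Chunked reads keep memory flat for multi-hundred-MB parquet files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A caller could then compare `sha256_of("build/derived/wide.parquet")` against a recorded value and stop before the enrich step on a mismatch.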
26 changes: 14 additions & 12 deletions explorer.qmd
@@ -12,8 +12,8 @@ format:
include-in-header:
text: |
<link rel="preconnect" href="https://data.isamples.org" crossorigin>
<link rel="preload" as="fetch" crossorigin="anonymous" href="https://data.isamples.org/isamples_202601_h3_summary_res4.parquet">
<link rel="preload" as="fetch" crossorigin="anonymous" href="https://data.isamples.org/isamples_202601_facet_summaries.parquet">
<link rel="preload" as="fetch" crossorigin="anonymous" href="https://data.isamples.org/isamples_202606_h3_summary_res4.parquet">
<link rel="preload" as="fetch" crossorigin="anonymous" href="https://data.isamples.org/isamples_202606_facet_summaries.parquet">
<link rel="preload" as="fetch" crossorigin="anonymous" href="https://data.isamples.org/vocab_labels.parquet">
---

@@ -764,18 +764,20 @@ R2_BASE = (() => {
// default and absolute overrides (http://localhost:8099/data) pass through.
return raw.startsWith('/') ? new URL(raw, location.origin).href : raw;
})()
h3_res4_url = `${R2_BASE}/isamples_202601_h3_summary_res4.parquet`
h3_res6_url = `${R2_BASE}/isamples_202601_h3_summary_res6.parquet`
h3_res8_url = `${R2_BASE}/isamples_202601_h3_summary_res8.parquet`
lite_url = `${R2_BASE}/isamples_202601_samples_map_lite.parquet`
// Stable alias that 302-redirects to the current enriched wide parquet
// (isamples_YYYYMM_wide.parquet). Gets OpenContext thumbnails populated.
wide_url = `${R2_BASE}/current/wide.parquet`
h3_res4_url = `${R2_BASE}/isamples_202606_h3_summary_res4.parquet`
h3_res6_url = `${R2_BASE}/isamples_202606_h3_summary_res6.parquet`
h3_res8_url = `${R2_BASE}/isamples_202606_h3_summary_res8.parquet`
lite_url = `${R2_BASE}/isamples_202606_samples_map_lite.parquet`
// Explicit versioned wide (#272: OC concept-enriched — popups read material/
// object-type from this file). The stable alias `current/wide.parquet` still
// points at the previous wide until the production cutover flips the manifest;
// pinning the version here keeps staging and prod each self-consistent.
wide_url = `${R2_BASE}/isamples_202606_wide.parquet`
// v2 carries object_type alongside material and context (URI-string columns).
facets_url = `${R2_BASE}/isamples_202601_sample_facets_v2.parquet`
facet_summaries_url = `${R2_BASE}/isamples_202601_facet_summaries.parquet`
facets_url = `${R2_BASE}/isamples_202606_sample_facets_v2.parquet`
facet_summaries_url = `${R2_BASE}/isamples_202606_facet_summaries.parquet`
// Pre-aggregated single-filter cache for fast cross-filtered facet counts.
cross_filter_url = `${R2_BASE}/isamples_202601_facet_cross_filter.parquet`
cross_filter_url = `${R2_BASE}/isamples_202606_facet_cross_filter.parquet`
// SKOS prefLabels for Material / Sampled Feature / Specimen Type URIs.
// ~60 KB lookup; falls back to URI tail if a URI isn't covered.
vocab_labels_url = `${R2_BASE}/vocab_labels.parquet`
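
The hunk above pins every data file to an explicit `isamples_202606_…` name instead of the `current/` alias. A small Python sketch of that naming scheme follows, useful for checking that a given tag yields only versioned names. The helper names and the `DERIVED` list are hypothetical, and the regular expression is one reading of the `isamples_\d{6}_*.parquet` pattern that DATA_PROVENANCE.md gives for the Worker's immutable cache, not the Worker's actual code.

```python
import re
from typing import Dict

# File stems taken from the URLs pinned in explorer.qmd.
DERIVED = [
    "h3_summary_res4",
    "h3_summary_res6",
    "h3_summary_res8",
    "samples_map_lite",
    "wide",
    "sample_facets_v2",
    "facet_summaries",
    "facet_cross_filter",
]

# ASSUMPTION: interpretation of the documented `isamples_\d{6}_*.parquet` pattern.
VERSIONED = re.compile(r"isamples_\d{6}_.+\.parquet")


def versioned_urls(base: str, tag: str) -> Dict[str, str]:
    """Build `<base>/<tag>_<stem>.parquet` for every derived file."""
    root = base.rstrip("/")
    return {stem: f"{root}/{tag}_{stem}.parquet" for stem in DERIVED}


def is_versioned(url: str) -> bool:
    """True when the final path segment looks like a versioned, cacheable file."""
    return VERSIONED.fullmatch(url.rsplit("/", 1)[-1]) is not None
```

Under this reading, `vocab_labels.parquet` and `current/wide.parquet` are not versioned names, which is consistent with the diff leaving the former unversioned and replacing the latter with a pinned URL.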