feat: subchunk write order#3826
Conversation
| end = "end" | ||
|
|
||
|
|
||
| class SubchunkWriteOrder(Enum): |
There was a problem hiding this comment.
advantage of an enum over Literal["morton", "unordered", "lexicographic", "colexicographic"]?
There was a problem hiding this comment.
Just copied what was done for ShardingCodecIndexLocation!
There was a problem hiding this comment.
I'm not a huge fan of enums in python (including ShardingCodecIndexingLocation), so unless you object I think it would be better to use a simple Literal + a final tuple of strings, like:
SubchunkWriteOrder = Literal["morton", "unordered", "lexicographic", "colexicographic"]
SUBCHUNK_WRITE_ORDER: Final[tuple[str, str, str, str]] = ("morton", "unordered", "lexicographic", "colexicographic")
There was a problem hiding this comment.
Done (hopefully)!
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
|
|
||
| if self._is_complete_shard_write(indexer, chunks_per_shard): | ||
| shard_dict = dict.fromkeys(morton_order_iter(chunks_per_shard)) | ||
| shard_dict = dict.fromkeys(np.ndindex(chunks_per_shard)) |
There was a problem hiding this comment.
cc @mkitti
Here and below, I don't think there is any need to construct the dict in morton order, right? There should be no correctness or performance hit here?
@d-v-b This now ensures we only shuffle in the unordered case once so the test is nice and clean - write once + get order, create a new codec with the same seed + create the iterator from that codec, match orders
There was a problem hiding this comment.
In Python, dicts are ordered and I think the optimal iteration order may need to be encoded in the dict the last time I examined the situation. I was just trying to preserve the situation before my edits.
There was a problem hiding this comment.
So this wasn't about dictionary order, but instead in the vectorized case, the order to ShardReader.to_dict_vectorized had to match that of what ShardReader was internally generating, as it turned out morton order. So I'm glad I caught this because I think it means the data was being corrupted for the other orders (which weren't getting hypothesis-tested).
So I'm going to add something to the hyptothesis tests for this.
I had the same feeling initially that the dictionary order mattered, but it turns out the final call to _encode_shard_dict actually handles the ordering for us to the output buffer while writing to the intermediate shard_dict can be done in any order, as long as the final buffer is done in the correct order
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3826 +/- ##
==========================================
+ Coverage 93.39% 93.43% +0.04%
==========================================
Files 88 88
Lines 11839 11874 +35
==========================================
+ Hits 11057 11095 +38
+ Misses 782 779 -3
🚀 New features to boost your workflow:
|
d-v-b
left a comment
There was a problem hiding this comment.
this looks good, thanks!
we have some interesting follow-up work:
- finding a nice way to expose the sharding codec settings in the signature of
create_array - writing a chunk writing routine that can incrementally stream out a shard
- formalizing the distinction between codec attributes that get serialized to JSON and runtime-only attributes
…#3826, partial-read opt zarr-developers#3004, _ShardIndex refactor zarr-developers#3975) Resolves conflicts in sharding.py (kept FusedCodecPipeline sync methods + main's _subchunk_order_iter / _load_partial_shard_maybe; fixed _ShardIndex construction to main's 2-arg signature), array.py (took main's cached regular_chunk_spec), test_codec_pipeline.py (kept the dual-pipeline suite + main's evolve test), .gitignore (union). 423 codec/sharding/parity + 807 codecs/indexing tests pass under both pipelines. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…al reads Three integration gaps surfaced when the Fused pipeline met main's new subchunk_write_order (zarr-developers#3826), partial-read coalescing (zarr-developers#3004), and _ShardIndex refactor. Under Fused these caused 25 sharding/parity failures (data was correct in the partial-read cases; the failures were write-order layout + IO-pattern divergence). Fixes: 1. Write order: _encode_shard_dict_sync laid out chunks in hardcoded morton order, ignoring subchunk_write_order. Now iterates _subchunk_order_iter(self.subchunk_write_order), matching the async _encode_shard_dict. Fixes lexicographic/colexicographic/unordered storage. 2. Coalesced sync partial reads: add Store.get_ranges_sync (a synchronous, coalescing counterpart of get_ranges, reusing coalesce_ranges) and ShardingCodec._load_partial_shard_maybe_sync; route _decode_partial_sync's partial branch through it. Sync stores now get zarr-developers#3004's byte-range coalescing without an event loop (fewer, merged reads). 3. Non-sync fallback: FusedCodecPipeline.read now routes non-sync stores (e.g. ZipStore) through the async partial-decode path when the AB codec supports it, instead of _async_read_fallback's whole-shard get(). Matches Batched's IO behavior; avoids over-reading whole shards on partial reads. Tests: the zarr-developers#3004 partial-read tests are made pipeline-aware (assert the active method family: get/get_ranges vs get_sync/get_ranges_sync, gated on store sync support). 573 sharding+parity+pipeline+indexing and 657 codec tests pass under BOTH pipelines (was 25 failing under Fused). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: subchunk write order (#3826) * feat: subchunk write order * chore: export `SubchunkWriteOrder` * chore: docs * chore: relnote * rename * refactor: no enums * Update docs/user-guide/performance.md Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com> * feat: deterministic but random order * fix: make vectorized fetching less reliant on matching order * chore: add hypothesis * refactor: dead code * refactor: more cleanup * don't shard unless there is something to shard * fix: dont mix chunk grid and sharding --------- Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com> (cherry picked from commit 093a153) * refactor: unordered subchunk order means no-promise, not random Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: pin subchunk_write_order survival through pickle Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: remove rng from ShardingCodec; carry subchunk_write_order through pickle Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: describe unordered subchunk order as no-guarantee, drop rng Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: drive sharding strategy through serializer to exercise subchunk_write_order The hypothesis arrays() strategy passed both shards= and a ShardingCodec serializer, which nested the codecs and left subchunk_write_order governing only a 1-element inner grid. Drive sharding through the serializer alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * harden: guard _subchunk_order_iter; document write-order is not persisted Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cleanup: use np.ndindex for immaterial intermediate order; drop stale FIXME Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * polish: guard scalar arrays in sharding strategy; align doc value ordering Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Revert strategies.py changes — unrelated to this PR These changes were patching a latent bug in `arrays()` where ShardingCodec-as-serializer was being double-stacked with `shards=...`, producing nested sharding. Splitting to a follow-up PR so this one stays focused on removing the `rng`/random-subchunk-order surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ilan Gold <ilanbassgold@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf: cache lexicographic chunk coords in sharding codec The subchunk_write_order feature (#3826) regressed sharded write performance: _encode_partial_single rebuilt the full per-shard chunk coordinate grid on every write via `np.array(list(_subchunk_order_iter(..., "lexicographic")))`, and `to_dict_vectorized` rebuilt a tuple key per row with `tuple(coords.ravel())`. For a single-chunk write into a shard with tens of thousands of chunks this roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk, matching the -44% CodSpeed regression). Add cached `_lexicographic_order` (array) and `_lexicographic_order_keys` (tuples) helpers in indexing.py, mirroring `_morton_order`/`_morton_order_keys`, and pass the cached keys into `to_dict_vectorized` instead of deriving them row-by-row. This restores write throughput to the pre-#3826 baseline while preserving identical chunk ordering (verified equal to np.ndindex across shapes including 0-d and empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(deps): bump the actions group across 1 directory with 8 updates (#176) Bumps the actions group with 8 updates in the / directory: | Package | From | To | | --- | --- | --- | | [prefix-dev/setup-pixi](https://github.com/prefix-dev/setup-pixi) | `0.9.5` | `0.9.6` | | [codecov/codecov-action](https://github.com/codecov/codecov-action) | `6.0.0` | `6.0.1` | | [github/issue-metrics](https://github.com/github/issue-metrics) | `4.2.2` | `4.2.7` | | [j178/prek-action](https://github.com/j178/prek-action) | `2.0.3` | `2.0.4` | | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `7.0.0` | `7.0.1` | | [actions/download-artifact](https://github.com/actions/download-artifact) | `7.0.0` | `8.0.1` | | [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) | `1.13.0` | `1.14.0` | | [zizmorcore/zizmor-action](https://github.com/zizmorcore/zizmor-action) | `0.5.3` | `0.5.6` | Updates `prefix-dev/setup-pixi` from 0.9.5 to 0.9.6 - [Release notes](https://github.com/prefix-dev/setup-pixi/releases) - [Commits](prefix-dev/setup-pixi@1b2de7f...5185adf) Updates `codecov/codecov-action` from 6.0.0 to 6.0.1 - [Release notes](https://github.com/codecov/codecov-action/releases) - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md) - [Commits](codecov/codecov-action@57e3a13...e79a696) Updates `github/issue-metrics` from 4.2.2 to 4.2.7 - [Release notes](https://github.com/github/issue-metrics/releases) - [Commits](github-community-projects/issue-metrics@c9e9838...1e38d5e) Updates `j178/prek-action` from 2.0.3 to 2.0.4 - [Release notes](https://github.com/j178/prek-action/releases) - [Commits](j178/prek-action@6ad8027...bdca6f1) Updates `actions/upload-artifact` from 7.0.0 to 7.0.1 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v7...043fb46) Updates `actions/download-artifact` from 7.0.0 to 8.0.1 - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](actions/download-artifact@v7...3e5f45b) Updates `pypa/gh-action-pypi-publish` from 1.13.0 to 1.14.0 - [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases) - [Commits](pypa/gh-action-pypi-publish@v1.13.0...cef2210) Updates `zizmorcore/zizmor-action` from 0.5.3 to 0.5.6 - [Release notes](https://github.com/zizmorcore/zizmor-action/releases) - [Commits](zizmorcore/zizmor-action@b1d7e1f...5f14fd0) --- updated-dependencies: - dependency-name: prefix-dev/setup-pixi dependency-version: 0.9.6 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: actions - dependency-name: codecov/codecov-action dependency-version: 6.0.1 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: actions - dependency-name: github/issue-metrics dependency-version: 4.2.7 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: actions - dependency-name: j178/prek-action dependency-version: 2.0.4 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: actions - dependency-name: actions/upload-artifact dependency-version: 7.0.1 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: actions - dependency-name: actions/download-artifact dependency-version: 8.0.1 dependency-type: direct:production update-type: version-update:semver-major dependency-group: actions - dependency-name: pypa/gh-action-pypi-publish dependency-version: 1.14.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: actions - dependency-name: zizmorcore/zizmor-action dependency-version: 0.5.6 dependency-type: direct:production update-type: version-update:semver-patch dependency-group: actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * refactor(sharding): derive coords inside to_dict_vectorized Address review feedback: `_ShardReader.to_dict_vectorized` took the lexicographic coordinate array and key tuples as parameters, even though the reader already knows its own `chunks_per_shard` and both structures are `lru_cache`d. Thread nothing in — fetch them inside the method via `_lexicographic_order`/`_lexicographic_order_keys`. Same cache, so no perf change; the call site collapses to `to_dict_vectorized()`. Add a unit test covering the method directly across 0-d, 1-d, and 2-d shard grids: present chunks map to their stored bytes, empty chunks to None, and every lexicographic coordinate appears as a key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Update src/zarr/core/indexing.py Co-authored-by: Ilan Gold <ilanbassgold@gmail.com> * refactor(sharding): drop redundant lexicographic_order_iter Address review feedback from @ilan-gold and @chuckwondo on the `lexicographic_order_iter` helper. `lexicographic_order_iter` returned a *lazy* iterator over an *eagerly-built, cached* tuple (`_lexicographic_order_keys`), which chuckwondo rightly flagged as confusing — and its output is byte-for-byte identical to the pre-existing, genuinely-lazy `c_order_iter` (verified across 0-d, empty, and N-d shapes). So the name promised laziness the implementation didn't provide, over a sequence we could already produce. Remove the wrapper and use the cached `_lexicographic_order_keys` tuple directly at the two `dict.fromkeys` call sites and in `_subchunk_order_iter`. This keeps the eager/cached coordinate tuples — which is the actual optimization: `dict.fromkeys` over the cached tuple is ~1.4x faster than over lazy `c_order_iter` at 32^3 (≈900us vs ≈1300us), because the cache amortizes tuple construction across repeated writes to same-shaped shards. Switching to `c_order_iter` would have reintroduced that cost, so it is deliberately not used here. Also drop the now-dead `tuple()` wrap in `morton_order_iter` (its argument is typed `tuple[int, ...]` and every caller passes one), per ilan-gold. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(indexing): prefer lexicographic_order_iter, soft-deprecate c_order_iter `c_order_iter` names a memory layout ("C order") rather than what the iterator actually yields. Reintroduce `lexicographic_order_iter` as the clearer name for the same row-major coordinate sequence, and make `c_order_iter` a thin alias that delegates to it, with a docstring note steering new code to the preferred name. No runtime warning — these are internal helpers. `lexicographic_order_iter` keeps the eager/cached implementation (iter over the lru_cached `_lexicographic_order_keys` tuple), which is ~1.4x faster than the old lazy `itertools.product` on the `dict.fromkeys` shard-write path and is the optimization this branch exists to deliver. The alias therefore changes `c_order_iter` from lazy to eager/cached; all in-repo callers (_ShardReader.__iter__, _is_total_shard, _subchunk_order_iter, and two tests) are migrated to `lexicographic_order_iter`, so nothing in-tree relies on the old laziness. Output is unchanged: lexicographic_order_iter, the c_order_iter alias, and np.ndindex all agree across 0-d, empty, and N-d shapes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(indexing): make lexicographic_order_iter the lazy primitive Per review from @mkitti: invert the relationship between the lazy iterator and the eagerly-collected tuple. `lexicographic_order_iter` is now a genuine lazy generator over the chunk-grid coordinates, and `_lexicographic_order_keys` collects it into a cached tuple — the eager version is "collect the lazy one", not the other way around. Previously lexicographic_order_iter returned iter() over the cached tuple, so any consumer that only needed a prefix still paid to materialize the entire grid. _is_total_shard does exactly that — an early-exit `all(coord in set for coord in ...)` — and on a cold cache for a 32^3 shard whose first coordinate is absent this dropped from ~15.8ms to ~24us (the lazy generator builds one coordinate and bails). The hot path is unchanged: the two dict.fromkeys sites consume the full grid and use the cached `_lexicographic_order_keys` tuple directly (~0.9ms at 32^3), so the regression fix this branch delivers is intact. This also resolves @chuckwondo's point — the iterator is now actually lazy rather than a thin wrapper over eager data. Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(indexing): make morton_order_iter the lazy primitive too Per @mkitti: the morton pair was backwards in the same way the lexicographic pair was. Invert it to match — `morton_order_iter` is now the lazy generator primitive and `_morton_order_keys` collects it into a cached tuple, mirroring `lexicographic_order_iter` / `_lexicographic_order_keys`. No behavioral change for the in-tree consumers (all fully consume the sequence) and the Z-order is identical; this keeps the two coordinate- order families symmetric and gives morton the same lazy/early-exit option lexicographic now has. Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(indexing): expose chunk-order coordinates as cached sequences Replace the morton/lexicographic order iterators (and the c_order_iter alias) with two cached, numpy-backed sequences: `morton_order_coords(shape)` and `lexicographic_order_coords(shape)`, each returning the grid coordinates in that order as a tuple of coordinate tuples. This addresses several points from review: - The earlier "lazy primitive" inversion de-optimized the hot write path: `morton_order_iter` rebuilt every coordinate tuple from the array on each call, and that path runs in `_encode_shard_dict` on every shard write (~16ms/write at 32^3 chunks-per-shard). The coords are a finite set of known length reused in full, so they are an indexable sequence built once and cached, not a lazily-rebuilt generator. (per @mkitti) - `lexicographic_order_iter` was never genuinely lazy — `_lexicographic_order` materializes the whole `np.indices` grid up front — so the early-exit framing was inaccurate. (per @Copilot, @chuckwondo) - Two functions differing only in caching vs laziness was redundant (per @ilan-gold); there is now one sequence per order. `_ShardReader.__iter__` wraps it in `iter()`, the only site that needs an iterator. - `_is_total_shard` no longer iterates the order at all: `all_chunk_coords` is always a subset of the shard grid (guaranteed by `validate`'s shard/chunk divisibility check), so a count check proves totality. A subset assertion documents the invariant. Coordinates are Python int tuples because every consumer uses them as dict keys / set members, which numpy arrays cannot be (unhashable, mutable); the numpy array is kept only for the vectorized index lookup in `to_dict_vectorized`. The per-shape cache holds ~prod(chunks_per_shard) tuples (~0.07% of shard size for multi-GB shards with (64,64,64) chunks), capped at 16 shapes per order. Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(bench): add warm-cache shard-write benchmark The existing test_sharded_morton_write_single_chunk clears the chunk-order cache before every iteration, so it only measures the cold grid-build cost. That made it blind to a regression where the per-shard coordinate tuples were rebuilt on every write instead of being reused from the cache — the cold benchmark could not distinguish the two (both pay the build each iteration). Add test_sharded_morton_write_single_chunk_warm_cache, which warms the cache once and then times repeated same-shape writes — the amortized regime the cache exists to optimize (many shards of one shape per array). Verified it discriminates: with the cached sequence it is ~4x faster than the cold benchmark, and a rebuild-every-write regression shows up as a ~4x slowdown here while staying invisible to the cold benchmark. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: update changelog for full-shard write coverage The fix caches the per-shard coordinate grid for every shard write, not only partial writes, and the win is amortized across repeated writes to same-shaped shards. Reword the note accordingly; keep it user-facing (the internal indexing helper refactor is not part of the public API). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * perf: build order-coord tuples via .tolist(); document dual representation `morton_order_coords` / `lexicographic_order_coords` built their tuple-of- tuples with a row-by-row `tuple(int(x) for x in row)` comprehension. Using `map(tuple, arr.tolist())` instead does the int conversion in a single C-level call, producing byte-identical native-int tuples ~8-9x faster (~16ms -> ~1.9ms cold build at 32^3). It is a per-shape cached build, so this only speeds the first write to each shard shape, but it is free. Also document in `to_dict_vectorized` why the chunk coordinates are needed in two forms — a numpy array for the vectorized index lookup and hashable tuples for the dict keys — since numpy rows are unhashable and a tuple list can't be used for the vectorized modulo/advanced-indexing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * perf,test: address code-review findings in the sharding coord cache - Drop the O(n_chunks) assert in _is_total_shard. It built a fresh set(lexicographic_order_coords(...)) on every partial read/write to check an invariant `validate` already guarantees, regressing the very partial-access hot path this PR optimizes (~673us vs ~112ns at 32^3 chunks-per-shard) and vanishing under -O. The invariant is documented in the comment; the count check alone proves totality. - Cache the colexicographic subchunk order. The colex branch of _subchunk_order_iter rebuilt the grid via uncached np.ndindex on every write while its morton/lexicographic siblings hit the cache; add colexicographic_order_coords (cached, derived from lexicographic_order_coords of the reversed shape) and use it. - Fix two benchmark docstrings: the cold benchmark now clears the lexicographic caches too (the write path builds that grid via dict.fromkeys / to_dict_vectorized, so a morton-only clear left it warm and under-reported the cold cost); the warm benchmark docstring now describes what it actually exercises (repeated writes to one shard, which reuse the cache identically to writes across same-shaped shards). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ilan Gold <ilanbassgold@gmail.com> Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
In order to encourage ecosystem compatibility + reserve runtime setting strings/enums (see zarrs/zarrs-python#160), subchunk write order is expanded from
mortonto includelexicographic,colexicographic, andunordered(which is randomized).TODO:
docs/user-guide/*.mdchanges/