Skip to content

feat: subchunk write order#3826

Merged
d-v-b merged 27 commits into
zarr-developers:mainfrom
ilan-gold:ig/shard_order
May 22, 2026
Merged

feat: subchunk write order#3826
d-v-b merged 27 commits into
zarr-developers:mainfrom
ilan-gold:ig/shard_order

Conversation

@ilan-gold

@ilan-gold ilan-gold commented Mar 24, 2026

Copy link
Copy Markdown
Contributor

In order to encourage ecosystem compatibility + reserve runtime setting strings/enums (see zarrs/zarrs-python#160), subchunk write order is expanded from morton to include lexicographic, colexicographic, and unordered (which is randomized).

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Comment thread src/zarr/codecs/sharding.py Outdated
end = "end"


class SubchunkWriteOrder(Enum):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

advantage of an enum over Literal["morton", "unordered", "lexicographic", "colexicographic"]?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just copied what was done for ShardingCodecIndexLocation!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of enums in python (including ShardingCodecIndexingLocation), so unless you object I think it would be better to use a simple Literal + a final tuple of strings, like:

SubchunkWriteOrder = Literal["morton", "unordered", "lexicographic", "colexicographic"]
SUBCHUNK_WRITE_ORDER: Final[tuple[str, str, str, str]] = ("morton", "unordered", "lexicographic", "colexicographic")

@ilan-gold ilan-gold Mar 24, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (hopefully)!

Comment thread src/zarr/codecs/sharding.py
@ilan-gold ilan-gold requested a review from d-v-b March 24, 2026 17:51
Comment thread src/zarr/codecs/sharding.py Outdated
Comment thread docs/user-guide/performance.md Outdated
Comment thread src/zarr/codecs/sharding.py Outdated

if self._is_complete_shard_write(indexer, chunks_per_shard):
shard_dict = dict.fromkeys(morton_order_iter(chunks_per_shard))
shard_dict = dict.fromkeys(np.ndindex(chunks_per_shard))

@ilan-gold ilan-gold Mar 27, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mkitti

Here and below, I don't think there is any need to construct the dict in morton order, right? There should be no correctness or performance hit here?

@d-v-b This now ensures we only shuffle in the unordered case once so the test is nice and clean - write once + get order, create a new codec with the same seed + create the iterator from that codec, match orders

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python, dicts are ordered and I think the optimal iteration order may need to be encoded in the dict the last time I examined the situation. I was just trying to preserve the situation before my edits.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

027c469

So this wasn't about dictionary order, but instead in the vectorized case, the order to ShardReader.to_dict_vectorized had to match that of what ShardReader was internally generating, as it turned out morton order. So I'm glad I caught this because I think it means the data was being corrupted for the other orders (which weren't getting hypothesis-tested).

So I'm going to add something to the hyptothesis tests for this.

I had the same feeling initially that the dictionary order mattered, but it turns out the final call to _encode_shard_dict actually handles the ordering for us to the output buffer while writing to the intermediate shard_dict can be done in any order, as long as the final buffer is done in the correct order

@ilan-gold ilan-gold requested a review from d-v-b March 27, 2026 16:36
@codecov

codecov Bot commented May 20, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.42857% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.43%. Comparing base (5ca1690) to head (18d952e).

Files with missing lines Patch % Lines
src/zarr/codecs/sharding.py 93.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3826      +/-   ##
==========================================
+ Coverage   93.39%   93.43%   +0.04%     
==========================================
  Files          88       88              
  Lines       11839    11874      +35     
==========================================
+ Hits        11057    11095      +38     
+ Misses        782      779       -3     
Files with missing lines Coverage Δ
src/zarr/codecs/__init__.py 100.00% <100.00%> (ø)
src/zarr/testing/strategies.py 94.66% <100.00%> (+1.77%) ⬆️
src/zarr/codecs/sharding.py 92.13% <93.33%> (+0.19%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@d-v-b d-v-b left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good, thanks!

we have some interesting follow-up work:

  • finding a nice way to expose the sharding codec settings in the signature of create_array
  • writing a chunk writing routine that can incrementally stream out a shard
  • formalizing the distinction between codec attributes that get serialized to JSON and runtime-only attributes

@d-v-b d-v-b merged commit 093a153 into zarr-developers:main May 22, 2026
30 checks passed
d-v-b added a commit to d-v-b/zarr-python that referenced this pull request Jun 4, 2026
…#3826, partial-read opt zarr-developers#3004, _ShardIndex refactor zarr-developers#3975)

Resolves conflicts in sharding.py (kept FusedCodecPipeline sync methods +
main's _subchunk_order_iter / _load_partial_shard_maybe; fixed _ShardIndex
construction to main's 2-arg signature), array.py (took main's cached
regular_chunk_spec), test_codec_pipeline.py (kept the dual-pipeline suite +
main's evolve test), .gitignore (union).

423 codec/sharding/parity + 807 codecs/indexing tests pass under both pipelines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d-v-b added a commit to d-v-b/zarr-python that referenced this pull request Jun 5, 2026
…al reads

Three integration gaps surfaced when the Fused pipeline met main's new
subchunk_write_order (zarr-developers#3826), partial-read coalescing (zarr-developers#3004), and _ShardIndex
refactor. Under Fused these caused 25 sharding/parity failures (data was
correct in the partial-read cases; the failures were write-order layout +
IO-pattern divergence). Fixes:

1. Write order: _encode_shard_dict_sync laid out chunks in hardcoded morton
   order, ignoring subchunk_write_order. Now iterates
   _subchunk_order_iter(self.subchunk_write_order), matching the async
   _encode_shard_dict. Fixes lexicographic/colexicographic/unordered storage.

2. Coalesced sync partial reads: add Store.get_ranges_sync (a synchronous,
   coalescing counterpart of get_ranges, reusing coalesce_ranges) and
   ShardingCodec._load_partial_shard_maybe_sync; route _decode_partial_sync's
   partial branch through it. Sync stores now get zarr-developers#3004's byte-range coalescing
   without an event loop (fewer, merged reads).

3. Non-sync fallback: FusedCodecPipeline.read now routes non-sync stores (e.g.
   ZipStore) through the async partial-decode path when the AB codec supports
   it, instead of _async_read_fallback's whole-shard get(). Matches Batched's
   IO behavior; avoids over-reading whole shards on partial reads.

Tests: the zarr-developers#3004 partial-read tests are made pipeline-aware (assert the active
method family: get/get_ranges vs get_sync/get_ranges_sync, gated on store sync
support). 573 sharding+parity+pipeline+indexing and 657 codec tests pass under
BOTH pipelines (was 25 failing under Fused).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d-v-b added a commit that referenced this pull request Jun 5, 2026
* feat: subchunk write order (#3826)

* feat: subchunk write order

* chore: export `SubchunkWriteOrder`

* chore: docs

* chore: relnote

* rename

* refactor: no enums

* Update docs/user-guide/performance.md

Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>

* feat: deterministic but random order

* fix: make vectorized fetching less reliant on matching order

* chore: add hypothesis

* refactor: dead code

* refactor: more cleanup

* don't shard unless there is something to shard

* fix: dont mix chunk grid and sharding

---------

Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
(cherry picked from commit 093a153)

* refactor: unordered subchunk order means no-promise, not random

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: pin subchunk_write_order survival through pickle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: remove rng from ShardingCodec; carry subchunk_write_order through pickle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: describe unordered subchunk order as no-guarantee, drop rng

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: drive sharding strategy through serializer to exercise subchunk_write_order

The hypothesis arrays() strategy passed both shards= and a ShardingCodec
serializer, which nested the codecs and left subchunk_write_order governing
only a 1-element inner grid. Drive sharding through the serializer alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* harden: guard _subchunk_order_iter; document write-order is not persisted

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cleanup: use np.ndindex for immaterial intermediate order; drop stale FIXME

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* polish: guard scalar arrays in sharding strategy; align doc value ordering

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert strategies.py changes — unrelated to this PR

These changes were patching a latent bug in `arrays()` where
ShardingCodec-as-serializer was being double-stacked with `shards=...`,
producing nested sharding. Splitting to a follow-up PR so this one stays
focused on removing the `rng`/random-subchunk-order surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d-v-b added a commit that referenced this pull request Jun 12, 2026
* perf: cache lexicographic chunk coords in sharding codec

The subchunk_write_order feature (#3826) regressed sharded write
performance: _encode_partial_single rebuilt the full per-shard chunk
coordinate grid on every write via
`np.array(list(_subchunk_order_iter(..., "lexicographic")))`, and
`to_dict_vectorized` rebuilt a tuple key per row with `tuple(coords.ravel())`.
For a single-chunk write into a shard with tens of thousands of chunks this
roughly doubled write time (~22ms -> ~40ms on test_sharded_morton_write_single_chunk,
matching the -44% CodSpeed regression).

Add cached `_lexicographic_order` (array) and `_lexicographic_order_keys`
(tuples) helpers in indexing.py, mirroring `_morton_order`/`_morton_order_keys`,
and pass the cached keys into `to_dict_vectorized` instead of deriving them
row-by-row. This restores write throughput to the pre-#3826 baseline while
preserving identical chunk ordering (verified equal to np.ndindex across
shapes including 0-d and empty).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump the actions group across 1 directory with 8 updates (#176)

Bumps the actions group with 8 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [prefix-dev/setup-pixi](https://github.com/prefix-dev/setup-pixi) | `0.9.5` | `0.9.6` |
| [codecov/codecov-action](https://github.com/codecov/codecov-action) | `6.0.0` | `6.0.1` |
| [github/issue-metrics](https://github.com/github/issue-metrics) | `4.2.2` | `4.2.7` |
| [j178/prek-action](https://github.com/j178/prek-action) | `2.0.3` | `2.0.4` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `7.0.0` | `7.0.1` |
| [actions/download-artifact](https://github.com/actions/download-artifact) | `7.0.0` | `8.0.1` |
| [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) | `1.13.0` | `1.14.0` |
| [zizmorcore/zizmor-action](https://github.com/zizmorcore/zizmor-action) | `0.5.3` | `0.5.6` |



Updates `prefix-dev/setup-pixi` from 0.9.5 to 0.9.6
- [Release notes](https://github.com/prefix-dev/setup-pixi/releases)
- [Commits](prefix-dev/setup-pixi@1b2de7f...5185adf)

Updates `codecov/codecov-action` from 6.0.0 to 6.0.1
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@57e3a13...e79a696)

Updates `github/issue-metrics` from 4.2.2 to 4.2.7
- [Release notes](https://github.com/github/issue-metrics/releases)
- [Commits](github-community-projects/issue-metrics@c9e9838...1e38d5e)

Updates `j178/prek-action` from 2.0.3 to 2.0.4
- [Release notes](https://github.com/j178/prek-action/releases)
- [Commits](j178/prek-action@6ad8027...bdca6f1)

Updates `actions/upload-artifact` from 7.0.0 to 7.0.1
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v7...043fb46)

Updates `actions/download-artifact` from 7.0.0 to 8.0.1
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](actions/download-artifact@v7...3e5f45b)

Updates `pypa/gh-action-pypi-publish` from 1.13.0 to 1.14.0
- [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases)
- [Commits](pypa/gh-action-pypi-publish@v1.13.0...cef2210)

Updates `zizmorcore/zizmor-action` from 0.5.3 to 0.5.6
- [Release notes](https://github.com/zizmorcore/zizmor-action/releases)
- [Commits](zizmorcore/zizmor-action@b1d7e1f...5f14fd0)

---
updated-dependencies:
- dependency-name: prefix-dev/setup-pixi
  dependency-version: 0.9.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: codecov/codecov-action
  dependency-version: 6.0.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: github/issue-metrics
  dependency-version: 4.2.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: j178/prek-action
  dependency-version: 2.0.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: actions/upload-artifact
  dependency-version: 7.0.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
- dependency-name: actions/download-artifact
  dependency-version: 8.0.1
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: pypa/gh-action-pypi-publish
  dependency-version: 1.14.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: actions
- dependency-name: zizmorcore/zizmor-action
  dependency-version: 0.5.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* refactor(sharding): derive coords inside to_dict_vectorized

Address review feedback: `_ShardReader.to_dict_vectorized` took the
lexicographic coordinate array and key tuples as parameters, even though
the reader already knows its own `chunks_per_shard` and both structures
are `lru_cache`d. Thread nothing in — fetch them inside the method via
`_lexicographic_order`/`_lexicographic_order_keys`. Same cache, so no
perf change; the call site collapses to `to_dict_vectorized()`.

Add a unit test covering the method directly across 0-d, 1-d, and 2-d
shard grids: present chunks map to their stored bytes, empty chunks to
None, and every lexicographic coordinate appears as a key.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update src/zarr/core/indexing.py

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

* refactor(sharding): drop redundant lexicographic_order_iter

Address review feedback from @ilan-gold and @chuckwondo on the
`lexicographic_order_iter` helper.

`lexicographic_order_iter` returned a *lazy* iterator over an
*eagerly-built, cached* tuple (`_lexicographic_order_keys`), which
chuckwondo rightly flagged as confusing — and its output is byte-for-byte
identical to the pre-existing, genuinely-lazy `c_order_iter`
(verified across 0-d, empty, and N-d shapes). So the name promised
laziness the implementation didn't provide, over a sequence we could
already produce.

Remove the wrapper and use the cached `_lexicographic_order_keys` tuple
directly at the two `dict.fromkeys` call sites and in
`_subchunk_order_iter`. This keeps the eager/cached coordinate tuples —
which is the actual optimization: `dict.fromkeys` over the cached tuple
is ~1.4x faster than over lazy `c_order_iter` at 32^3 (≈900us vs ≈1300us),
because the cache amortizes tuple construction across repeated writes to
same-shaped shards. Switching to `c_order_iter` would have reintroduced
that cost, so it is deliberately not used here.

Also drop the now-dead `tuple()` wrap in `morton_order_iter` (its
argument is typed `tuple[int, ...]` and every caller passes one), per
ilan-gold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(indexing): prefer lexicographic_order_iter, soft-deprecate c_order_iter

`c_order_iter` names a memory layout ("C order") rather than what the
iterator actually yields. Reintroduce `lexicographic_order_iter` as the
clearer name for the same row-major coordinate sequence, and make
`c_order_iter` a thin alias that delegates to it, with a docstring note
steering new code to the preferred name. No runtime warning — these are
internal helpers.

`lexicographic_order_iter` keeps the eager/cached implementation
(iter over the lru_cached `_lexicographic_order_keys` tuple), which is
~1.4x faster than the old lazy `itertools.product` on the
`dict.fromkeys` shard-write path and is the optimization this branch
exists to deliver. The alias therefore changes `c_order_iter` from lazy
to eager/cached; all in-repo callers (_ShardReader.__iter__,
_is_total_shard, _subchunk_order_iter, and two tests) are migrated to
`lexicographic_order_iter`, so nothing in-tree relies on the old laziness.

Output is unchanged: lexicographic_order_iter, the c_order_iter alias,
and np.ndindex all agree across 0-d, empty, and N-d shapes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(indexing): make lexicographic_order_iter the lazy primitive

Per review from @mkitti: invert the relationship between the lazy
iterator and the eagerly-collected tuple. `lexicographic_order_iter` is
now a genuine lazy generator over the chunk-grid coordinates, and
`_lexicographic_order_keys` collects it into a cached tuple — the eager
version is "collect the lazy one", not the other way around.

Previously lexicographic_order_iter returned iter() over the cached
tuple, so any consumer that only needed a prefix still paid to
materialize the entire grid. _is_total_shard does exactly that — an
early-exit `all(coord in set for coord in ...)` — and on a cold cache
for a 32^3 shard whose first coordinate is absent this dropped from
~15.8ms to ~24us (the lazy generator builds one coordinate and bails).

The hot path is unchanged: the two dict.fromkeys sites consume the full
grid and use the cached `_lexicographic_order_keys` tuple directly
(~0.9ms at 32^3), so the regression fix this branch delivers is intact.
This also resolves @chuckwondo's point — the iterator is now actually
lazy rather than a thin wrapper over eager data.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(indexing): make morton_order_iter the lazy primitive too

Per @mkitti: the morton pair was backwards in the same way the
lexicographic pair was. Invert it to match — `morton_order_iter` is now
the lazy generator primitive and `_morton_order_keys` collects it into a
cached tuple, mirroring `lexicographic_order_iter` /
`_lexicographic_order_keys`.

No behavioral change for the in-tree consumers (all fully consume the
sequence) and the Z-order is identical; this keeps the two coordinate-
order families symmetric and gives morton the same lazy/early-exit
option lexicographic now has.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(indexing): expose chunk-order coordinates as cached sequences

Replace the morton/lexicographic order iterators (and the c_order_iter
alias) with two cached, numpy-backed sequences:
`morton_order_coords(shape)` and `lexicographic_order_coords(shape)`,
each returning the grid coordinates in that order as a tuple of
coordinate tuples.

This addresses several points from review:

- The earlier "lazy primitive" inversion de-optimized the hot write
  path: `morton_order_iter` rebuilt every coordinate tuple from the
  array on each call, and that path runs in `_encode_shard_dict` on
  every shard write (~16ms/write at 32^3 chunks-per-shard). The coords
  are a finite set of known length reused in full, so they are an
  indexable sequence built once and cached, not a lazily-rebuilt
  generator. (per @mkitti)

- `lexicographic_order_iter` was never genuinely lazy — `_lexicographic_order`
  materializes the whole `np.indices` grid up front — so the early-exit
  framing was inaccurate. (per @Copilot, @chuckwondo)

- Two functions differing only in caching vs laziness was redundant
  (per @ilan-gold); there is now one sequence per order. `_ShardReader.__iter__`
  wraps it in `iter()`, the only site that needs an iterator.

- `_is_total_shard` no longer iterates the order at all: `all_chunk_coords`
  is always a subset of the shard grid (guaranteed by `validate`'s
  shard/chunk divisibility check), so a count check proves totality. A
  subset assertion documents the invariant.

Coordinates are Python int tuples because every consumer uses them as
dict keys / set members, which numpy arrays cannot be (unhashable,
mutable); the numpy array is kept only for the vectorized index lookup
in `to_dict_vectorized`. The per-shape cache holds ~prod(chunks_per_shard)
tuples (~0.07% of shard size for multi-GB shards with (64,64,64) chunks),
capped at 16 shapes per order.

Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(bench): add warm-cache shard-write benchmark

The existing test_sharded_morton_write_single_chunk clears the chunk-order
cache before every iteration, so it only measures the cold grid-build cost.
That made it blind to a regression where the per-shard coordinate tuples
were rebuilt on every write instead of being reused from the cache — the
cold benchmark could not distinguish the two (both pay the build each
iteration).

Add test_sharded_morton_write_single_chunk_warm_cache, which warms the
cache once and then times repeated same-shape writes — the amortized
regime the cache exists to optimize (many shards of one shape per array).
Verified it discriminates: with the cached sequence it is ~4x faster than
the cold benchmark, and a rebuild-every-write regression shows up as a ~4x
slowdown here while staying invisible to the cold benchmark.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: update changelog for full-shard write coverage

The fix caches the per-shard coordinate grid for every shard write, not
only partial writes, and the win is amortized across repeated writes to
same-shaped shards. Reword the note accordingly; keep it user-facing
(the internal indexing helper refactor is not part of the public API).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf: build order-coord tuples via .tolist(); document dual representation

`morton_order_coords` / `lexicographic_order_coords` built their tuple-of-
tuples with a row-by-row `tuple(int(x) for x in row)` comprehension. Using
`map(tuple, arr.tolist())` instead does the int conversion in a single
C-level call, producing byte-identical native-int tuples ~8-9x faster
(~16ms -> ~1.9ms cold build at 32^3). It is a per-shape cached build, so
this only speeds the first write to each shard shape, but it is free.

Also document in `to_dict_vectorized` why the chunk coordinates are needed
in two forms — a numpy array for the vectorized index lookup and hashable
tuples for the dict keys — since numpy rows are unhashable and a tuple list
can't be used for the vectorized modulo/advanced-indexing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf,test: address code-review findings in the sharding coord cache

- Drop the O(n_chunks) assert in _is_total_shard. It built a fresh
  set(lexicographic_order_coords(...)) on every partial read/write to
  check an invariant `validate` already guarantees, regressing the very
  partial-access hot path this PR optimizes (~673us vs ~112ns at 32^3
  chunks-per-shard) and vanishing under -O. The invariant is documented
  in the comment; the count check alone proves totality.

- Cache the colexicographic subchunk order. The colex branch of
  _subchunk_order_iter rebuilt the grid via uncached np.ndindex on every
  write while its morton/lexicographic siblings hit the cache; add
  colexicographic_order_coords (cached, derived from lexicographic_order_coords
  of the reversed shape) and use it.

- Fix two benchmark docstrings: the cold benchmark now clears the
  lexicographic caches too (the write path builds that grid via
  dict.fromkeys / to_dict_vectorized, so a morton-only clear left it warm
  and under-reported the cold cost); the warm benchmark docstring now
  describes what it actually exercises (repeated writes to one shard,
  which reuse the cache identically to writes across same-shaped shards).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Mark Kittisopikul <mkitti@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants