Add Compaction encoding for normalizing arrays to compact form#8265
Draft
joseph-isaacs wants to merge 2 commits into
Add a transient `Compaction` wrapper encoding that normalizes its child into a compact canonical form when executed:

- ListView arrays are rebuilt to be zero-copy convertible to a ListArray (overlaps deduplicated, leading/trailing garbage trimmed).
- VarBinView buffers are garbage collected.
- Dict arrays are either decoded to flat canonical or garbage collected in place (dead values dropped, codes remapped), whichever is cheaper. This is driven by a `Dict` `CompactKernel` registered as an `execute_parent` of `compaction(dict(..))`.
- Struct fields are recursively compacted.

This extends the shallow `Canonical::compact()` idea into a recursive, encoding-aware operation exposed via `ArrayRef::compact`.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
`compact_canonical` is private to the module, so the rustdoc link could not resolve under -D warnings. Describe the behavior in prose instead. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | `chunked_varbinview_canonical_into[(1000, 10)]` | 162 µs | 198.2 µs | -18.26% |
| ❌ | Simulation | `compare[15]` | 120.4 µs | 146.5 µs | -17.8% |
| ❌ | Simulation | `compare[14]` | 118 µs | 142.2 µs | -17% |
| ❌ | Simulation | `compare[13]` | 116.1 µs | 138.5 µs | -16.14% |
| ⚡ | Simulation | `varbinview_zip_block_mask` | 3.7 ms | 2.9 ms | +27.51% |
| ⚡ | Simulation | `bitwise_not_vortex_buffer_mut[128]` | 275.3 ns | 216.9 ns | +26.89% |
| ⚡ | Simulation | `bitwise_not_vortex_buffer_mut[1024]` | 336.9 ns | 278.6 ns | +20.94% |
| ⚡ | Simulation | `bitwise_not_vortex_buffer_mut[2048]` | 400.6 ns | 342.2 ns | +17.05% |
| ⚡ | Simulation | `varbinview_zip_fragmented_mask` | 6.9 ms | 6.1 ms | +13.07% |
| ⚡ | Simulation | `chunked_varbinview_canonical_into[(100, 100)]` | 309.6 µs | 274.3 µs | +12.88% |
| ⚡ | Simulation | `compare[5]` | 77.5 µs | 70.1 µs | +10.5% |
Comparing claude/quirky-carson-1NaxF (1b3a817) with develop (d97d2bd)
This is more of an idea for now
Summary
This PR introduces the `Compaction` encoding, a transient array wrapper that normalizes its child into a compact canonical form when executed. This is similar to the existing `Slice` encoding in that it's non-serializable and exists primarily to drive execution.

The `Compaction` encoding handles structural normalization of various array types:

- `ListView` arrays are rebuilt to be zero-copy convertible to Arrow-style `ListArray` (overlapping views deduplicated, leading/trailing garbage trimmed)
- `VarBinView` arrays have their data buffers garbage collected
- `Dict` arrays are either decoded to flat canonical form or garbage collected in place (dead values removed, codes remapped) based on a cost heuristic
- `Struct` fields are recursively compacted
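To make the `Dict` case concrete, here is a minimal standalone sketch of the "dead values removed, codes remapped" step. It does not use the Vortex API: `DictSketch`, `gc_dict`, and `decode` are hypothetical names invented for illustration, values are plain `String`s, and the cost heuristic that chooses between decoding and in-place garbage collection is not modeled.

```rust
/// Hypothetical stand-in for a dictionary-encoded array (not a Vortex type):
/// `codes[i]` is an index into `values`.
#[derive(Debug, Clone, PartialEq)]
struct DictSketch {
    values: Vec<String>,
    codes: Vec<u32>,
}

/// Drop values that no code references and remap the codes onto the
/// surviving values, preserving the relative order of the live values.
fn gc_dict(dict: &DictSketch) -> DictSketch {
    // Pass 1: mark every value that is referenced by at least one code.
    let mut live = vec![false; dict.values.len()];
    for &code in &dict.codes {
        live[code as usize] = true;
    }

    // Pass 2: copy live values and record old-index -> new-index.
    // Entries for dead values keep the sentinel and are never read.
    let mut remap = vec![u32::MAX; dict.values.len()];
    let mut values = Vec::new();
    for (old, value) in dict.values.iter().enumerate() {
        if live[old] {
            remap[old] = values.len() as u32;
            values.push(value.clone());
        }
    }

    // Pass 3: rewrite each code through the remap table.
    let codes = dict.codes.iter().map(|&c| remap[c as usize]).collect();
    DictSketch { values, codes }
}

/// Expand a dictionary back into its flat logical values.
fn decode(dict: &DictSketch) -> Vec<String> {
    dict.codes
        .iter()
        .map(|&c| dict.values[c as usize].clone())
        .collect()
}
```

The invariant worth checking in any real implementation is the one this sketch preserves: decoding the compacted dictionary yields the same logical values as decoding the original, while `values` shrinks to only the entries still in use.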