Fix Iceberg ARRAY columns with dot-separated names returning empty lists by il9ue · Pull Request #1894 · Altinity/ClickHouse

il9ue · 2026-06-08T13:58:20Z

Backport of upstream ClickHouse/ClickHouse#105546 (commit f8467af) onto antalya-26.3. Re-opened from Altinity/ClickHouse (instead of a fork) so CI publishes direct .deb package URLs for clickhouse-regression. Cherry-pick applied cleanly with no contextual conflicts.

Upstream issue: ClickHouse/ClickHouse#90731

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix reading Iceberg tables whose ARRAY column names contain a dot (e.g. `a.b` ARRAY<STRING>), which previously returned empty arrays. Two upstream defects were responsible: ColumnsDescription::getAllRegisteredNames filtered out dotted names, and NestedUtils::getSubcolumnsOfNested misclassified lone dotted Array(T) columns as flattened Nested children.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Symptom

When querying an Iceberg table through the iceberg(...) table function or a DataLakeCatalog, a column whose name contains a . and whose type is Array(T) returned empty arrays instead of the stored values. The same data read by Spark returned the expected values.

-- Spark
CREATE TABLE table7 (`a.b` ARRAY<STRING>);
INSERT INTO table7 VALUES (ARRAY('a','b','c'));

-- ClickHouse (before fix)
SELECT `a.b` FROM iceberg('...');
-- got:      [ ]
-- expected: ['a','b','c']

Root cause

The Parquet V3 reader path (SchemaConverter + ColumnMapper + FormatFilterInfo) is already correct after the dotted-name field-id work in 0a218cd4e8b, 4b733bae561 and f24c1a46063 (present in antalya-26.3). The remaining symptom comes from two upstream defects, independent of Iceberg but exposed by it:

ColumnsDescription::getAllRegisteredNames explicitly filtered out any column whose name contained ., assuming such names were always flattened Nested subcolumns. A column whose stored name literally contains a dot (allowed by MergeTree with backticks, produced by Iceberg/Spark) is a first-class registered name. The function is only consumed by IHints-style suggestion paths (and StorageSystemZooKeeper column-name iteration, where no dotted names exist), so relaxing it has no effect on parsing, planning, storage, or wire protocol.
NestedUtils::getSubcolumnsOfNested treated every dotted-name Array(T) column as a flattened element of a synthetic Nested structure named after the prefix. This made the Arrow, ORC and pre-V3 Parquet readers look for a struct field with the prefix name rather than the literal dotted column, returning an empty array.
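
To make the two defects concrete, here is a minimal Python model of the behaviour before the fix. It is an illustration only, not the ClickHouse C++ code: the function names `registered_names_before` and `subcolumns_of_nested_before` are hypothetical, columns are modelled as `(name, type string)` pairs, and the prefix is assumed to be the text before the first dot.

```python
def registered_names_before(names):
    # Models the old dot filter: any name containing '.' is dropped,
    # on the assumption that it is a flattened Nested subcolumn.
    return [name for name in names if "." not in name]


def subcolumns_of_nested_before(columns):
    # Models the old grouping: every dotted Array(...) column is treated
    # as an element of a synthetic Nested structure named after its prefix.
    nested = {}
    for name, type_name in columns:
        prefix, _, rest = name.partition(".")
        if rest and type_name.startswith("Array("):
            nested.setdefault(prefix, []).append((rest, type_name))
    return nested


# A table with one ordinary column and one literal dotted Array column.
hint_names = registered_names_before(["id", "a.b"])
old_nested = subcolumns_of_nested_before([("id", "UInt64"), ("a.b", "Array(String)")])
```

In this model `a.b` disappears from the hint candidates, and the lone `a.b` column is reported as field `b` of a synthetic Nested `a`, which is the shape that would send a reader looking for a struct named `a` in the data file.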

Fix

ColumnsDescription::getAllRegisteredNames — drop the dot filter; return every registered column name.
NestedUtils::getSubcolumnsOfNested — two-pass scan: a synthetic Nested entry is emitted only when at least two Array(T) columns share the same dotted prefix. A lone a.b: Array(T) no longer appears in the synthetic-Nested map. Genuine flattened Nested with multiple fields is unaffected; the existing early-continue on isNested() covers the one-field edge case.
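
The two-pass scan described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual patch: `subcolumns_of_nested` is a hypothetical name, columns are `(name, type string)` pairs, the prefix is assumed to be the text before the first dot, and the `isNested()` early-continue is not modelled.

```python
from collections import Counter


def split_name(name):
    # 'a.b' -> ('a', 'b'); a name without a dot -> (name, '').
    prefix, _, rest = name.partition(".")
    return prefix, rest


def is_array(type_name):
    return type_name.startswith("Array(")


def subcolumns_of_nested(columns):
    # Pass 1: count how many Array columns share each dotted prefix.
    counts = Counter()
    for name, type_name in columns:
        prefix, rest = split_name(name)
        if rest and is_array(type_name):
            counts[prefix] += 1

    # Pass 2: emit a synthetic Nested group only for prefixes that are
    # shared by at least two Array columns.
    nested = {}
    for name, type_name in columns:
        prefix, rest = split_name(name)
        if rest and is_array(type_name) and counts[prefix] >= 2:
            nested.setdefault(prefix, []).append((rest, type_name))
    return nested


columns = [
    ("a.b", "Array(String)"),   # lone dotted column: stays a plain column
    ("n.x", "Array(UInt32)"),   # n.x and n.y share the prefix 'n'
    ("n.y", "Array(String)"),
    ("id", "UInt64"),
]
result = subcolumns_of_nested(columns)
```

Here `result` contains only the group for `n`; the lone `a.b` column is no longer placed in the synthetic-Nested map, while the two-field group is grouped as before.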

Tests

tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column: end-to-end repro of ClickHouse/ClickHouse#90731 (Iceberg: Selecting from ARRAY column with dot-separated name returns empty lists) against s3, azure and local storage.
…::test_dotted_array_alongside_real_nested: mixed-schema guard; a lone dotted Array column coexists with a genuine flattened-Nested sibling group.
tests/queries/0_stateless/04259_dotted_array_not_nested.sql: isolates the NestedUtils fix without Iceberg (Memory engine).
tests/queries/0_stateless/04260_dotted_column_in_hints.sh: verifies the ColumnsDescription fix via misspelling-hint output.

Risk / scope

Low. Five-line removal in ColumnsDescription (hint suggestions only) and a localised two-pass refactor in NestedUtils::getSubcolumnsOfNested. No header changes, no new settings, no public API change. Parquet V3 path is not modified.

CI/CD Options

Exclude tests:

Regression jobs to run:

When querying an Iceberg table through the `iceberg(...)` table function or a DataLakeCatalog, a column whose name contains a `.` and whose type is `Array(T)` (e.g. `` `a.b` ARRAY<STRING> ``) returned empty arrays instead of the stored values. The same data read by Spark returned the expected values. Fixes ClickHouse#90731.

The Parquet V3 reader path (`SchemaConverter` + `ColumnMapper` + `FormatFilterInfo`) is already correct after the dotted-name field-id work in 0a218cd, 4b733ba and f24c1a4. This change addresses two residual upstream defects that affect dotted-name `Array(T)` columns regardless of source:

* `ColumnsDescription::getAllRegisteredNames` explicitly filtered out any column whose name contained `.`, under the assumption such names were always flattened Nested subcolumns. A column whose stored name literally contains a dot (allowed by MergeTree with backticks, and produced by Iceberg / Spark) is a first-class registered name and must appear in `IHints` misspelling suggestions. The function is only consumed by `IHints`-style suggestion paths (and by `StorageSystemZooKeeper` for column-name iteration, where no dotted names exist), so relaxing it has no effect on parsing, planning, storage, or wire protocol.
* `NestedUtils::getSubcolumnsOfNested` treated every `Array(T)` column whose name contained `.` as a flattened element of a synthetic `Nested` structure named after the prefix. This caused the Arrow, ORC and pre-V3 Parquet readers to look for a struct field with the prefix name in the data file rather than the literal dotted column, returning an empty array. The fix uses a two-pass scan: a synthetic `Nested` entry is only emitted when at least two `Array(T)` columns share the same dotted prefix. A lone column such as `a.b: Array(T)` no longer appears in the synthetic-Nested map. Genuine flattened `Nested` with multiple fields is unaffected; the existing early-continue on `isNested()` also covers the one-field-Nested edge case.

Tests:

* `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column`: end-to-end repro of ClickHouse#90731 against s3, azure and local storage.
* `test_dotted_array_alongside_real_nested` in the same file: mixed-schema regression guard verifying a lone dotted `Array` column coexists with genuine flattened-Nested siblings.
* `tests/queries/0_stateless/04259_dotted_array_not_nested.sql`: isolates Bug B without Iceberg.
* `tests/queries/0_stateless/04260_dotted_column_in_hints.sh`: verifies Bug A by checking the misspelling hint output.

Changelog category (leave one):

- Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry:

Fix reading Iceberg tables whose `ARRAY` column names contain a dot (e.g. `` `a.b` ARRAY<STRING> ``), which previously returned empty arrays. Two upstream defects were responsible: `ColumnsDescription::getAllRegisteredNames` filtered out dotted names, and `NestedUtils::getSubcolumnsOfNested` misclassified lone dotted `Array(T)` columns as flattened `Nested` children.

(cherry picked from commit f8467af)

github-actions · 2026-06-08T13:59:28Z

Workflow [PR], commit [ef84f17]


il9ue · 2026-06-10T01:55:56Z

CI triage on `ef84f17`: the failures are unrelated to this change.

Fast test (the only blocker in the PR workflow) reported Failed: 0, Passed: 2112, Skipped: 479, Broken: 0. It exited non-zero only on the harness-level checks Server died and clickhouse-test, i.e. the server process crashed during the run. No functional test failed, including the new 04259_dotted_array_not_nested and 04260_dotted_column_in_hints added here.

The Community PR workflow's 25 "errors" are also unrelated: they are GitHub Actions if-evaluation failures (fromJson: empty input from an empty Config output), not test results.

This looks like a flaky Server died. Could a maintainer re-run the PR workflow? Happy to dig in if it reproduces.

ianton-ru · 2026-06-10T07:50:32Z

In upstream, the Fast tests fail too, with:

[2026-06-03 06:02:09]       | [ip-172-31-42-164] 2026.06.03 07:59:48.401706 [ 254507 ] {e7385def-2016-4f26-8090-78dfb25fcee1} <Error> executeQuery: Code: 341. 
                      DB::Exception: Exception happened during execution of mutation 'mutation_2.txt' with part 'all_1_1_0' reason: 'Code: 190. DB::Exception: 
                      Elements 'nested.arr1' and 'nested.arr2' of Nested data structure (Array columns) have different array sizes (2 and 0 respectively) on row
                       0: while executing 'FUNCTION validateNestedArraySizes(1_UInt8 : 2, nested.arr1 : 0, nested.arr2 : 1) -> validateNestedArraySizes(1_UInt8,
                       nested.arr1, nested.arr2) UInt8 : 4'. (SIZES_OF_ARRAYS_DONT_MATCH) (version 26.6.1.1)'. This error maybe retryable or not. In case of 
                      unretryable error, mutation can be killed with KILL MUTATION query 
...
[2026-06-03 06:02:09]       | . (UNFINISHED) (version 26.6.1.1) (from [::ffff:127.0.0.1]:60466) (comment: 02565_update_empty_nested.sql-test_xxain04q) (query 6,
                       line 17) (in query: ALTER TABLE t_update_empty_nested UPDATE `nested.arr2` = `nested.arr1` WHERE 1;), Stack trace (when copying this 
                      message, always include the lines below):

It looks like the reason is the changes in getSubcolumnsOfNested.

Also, I can't find 04259_dotted_array_not_nested in the Fast test results (sorted by name):
https://altinity-build-artifacts.s3.amazonaws.com/json.html?PR=1894&sha=ef84f172dccb85f5e294eb2df384e8d59711b42b&name_0=PR&name_1=Fast+test&name_2=Tests

zvonand

> Two upstream defects were responsible: ColumnsDescription::getAllRegisteredNames filtered out dotted names, and NestedUtils::getSubcolumnsOfNested misclassified lone dotted Array(T) columns as flattened Nested children.

This part does not belong in the changelog (both here and in your upstream PR). If you want, you can put it somewhere in the description (although I see you already have more there).
A changelog entry is a short user-facing line, one of tens of lines in the release notes; it does not need all those implementation details.

mkmkme

The fast test is failing due to nullptr dereference. Please fix it

il9ue mentioned this pull request Jun 8, 2026

Fix Iceberg ARRAY columns with dot-separated names returning empty lists #1826

Closed

28 tasks

Merge branch 'antalya-26.3' into fix/iceberg-dotted-array-90731-antalya-26.3 (ef84f17)

zvonand requested changes Jun 10, 2026


mkmkme requested changes Jun 10, 2026


il9ue wants to merge 2 commits into antalya-26.3 from fix/iceberg-dotted-array-90731-antalya-26.3
