Fix Iceberg ARRAY columns with dot-separated names returning empty lists #1894
il9ue wants to merge 2 commits
When querying an Iceberg table through the `iceberg(...)` table function or a DataLakeCatalog, a column whose name contains a `.` and whose type is `Array(T)` (e.g. `` `a.b` ARRAY<STRING> ``) returned empty arrays instead of the stored values. The same data read by Spark returned the expected values. Fixes ClickHouse#90731.

The Parquet V3 reader path (`SchemaConverter` + `ColumnMapper` + `FormatFilterInfo`) is already correct after the dotted-name field-id work in 0a218cd, 4b733ba and f24c1a4. This change addresses two residual upstream defects that affect dotted-name `Array(T)` columns regardless of source:

* `ColumnsDescription::getAllRegisteredNames` explicitly filtered out any column whose name contained `.`, under the assumption such names were always flattened Nested subcolumns. A column whose stored name literally contains a dot (allowed by MergeTree with backticks, and produced by Iceberg / Spark) is a first-class registered name and must appear in `IHints` misspelling suggestions. The function is only consumed by `IHints`-style suggestion paths (and by `StorageSystemZooKeeper` for column-name iteration, where no dotted names exist), so relaxing it has no effect on parsing, planning, storage, or wire protocol.
* `NestedUtils::getSubcolumnsOfNested` treated every `Array(T)` column whose name contained `.` as a flattened element of a synthetic `Nested` structure named after the prefix. This caused the Arrow, ORC and pre-V3 Parquet readers to look for a struct field with the prefix name in the data file rather than the literal dotted column, returning an empty array. The fix uses a two-pass scan: a synthetic `Nested` entry is only emitted when at least two `Array(T)` columns share the same dotted prefix. A lone column such as `a.b: Array(T)` no longer appears in the synthetic-Nested map. Genuine flattened `Nested` with multiple fields is unaffected; the existing early-continue on `isNested()` also covers the one-field-Nested edge case.
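The two-pass rule described for `NestedUtils::getSubcolumnsOfNested` can be sketched outside ClickHouse. This is only an illustration in Python, not the C++ from the patch: `synthetic_nested_groups` and `is_array` are hypothetical names, types are modelled as plain strings, the prefix is assumed to be everything before the first dot, and the `isNested()` early-continue is not modelled.

```python
from collections import defaultdict


def is_array(type_name: str) -> bool:
    # Simplified stand-in for a real type check; assumes types are given as strings.
    return type_name.startswith("Array(")


def synthetic_nested_groups(columns):
    """Group dotted Array columns by prefix, keeping only prefixes shared by >= 2 columns.

    `columns` is a list of (name, type_name) pairs. Hypothetical helper, not ClickHouse API.
    """
    # Pass 1: count how many Array(T) columns share each dotted prefix.
    counts = defaultdict(int)
    for name, type_name in columns:
        prefix, dot, _ = name.partition(".")
        if dot and is_array(type_name):
            counts[prefix] += 1

    # Pass 2: emit a synthetic group only for prefixes seen at least twice.
    groups = defaultdict(list)
    for name, type_name in columns:
        prefix, dot, _ = name.partition(".")
        if dot and is_array(type_name) and counts[prefix] >= 2:
            groups[prefix].append(name)
    return dict(groups)


columns = [
    ("id", "UInt64"),
    ("a.b", "Array(String)"),   # lone dotted Array column: stays a plain column
    ("n.x", "Array(UInt32)"),   # n.x and n.y share a prefix: treated as flattened Nested
    ("n.y", "Array(String)"),
]
print(synthetic_nested_groups(columns))  # {'n': ['n.x', 'n.y']}
```

Under these assumptions the lone `a.b` column never enters the synthetic map, so a reader would look it up by its literal dotted name, while `n.x` and `n.y` are still grouped under `n`.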
Tests:

* `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column`: end-to-end repro of ClickHouse#90731 against s3, azure and local storage.
* `test_dotted_array_alongside_real_nested` in the same file: mixed-schema regression guard verifying a lone dotted `Array` column coexists with genuine flattened-Nested siblings.
* `tests/queries/0_stateless/04259_dotted_array_not_nested.sql`: isolates Bug B (the `NestedUtils::getSubcolumnsOfNested` misclassification) without Iceberg.
* `tests/queries/0_stateless/04260_dotted_column_in_hints.sh`: verifies Bug A (the `ColumnsDescription::getAllRegisteredNames` dot filter) by checking the misspelling hint output.

Changelog category (leave one):

- Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry:

Fix reading Iceberg tables whose `ARRAY` column names contain a dot (e.g. `` `a.b` ARRAY<STRING> ``), which previously returned empty arrays. Two upstream defects were responsible: `ColumnsDescription::getAllRegisteredNames` filtered out dotted names, and `NestedUtils::getSubcolumnsOfNested` misclassified lone dotted `Array(T)` columns as flattened `Nested` children.

(cherry picked from commit f8467af)
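The hint-side defect that `04260_dotted_column_in_hints.sh` checks can be pictured with a small Python sketch. It is an assumption-laden stand-in: `registered_names` and `suggest` are hypothetical helpers, and `difflib.get_close_matches` replaces whatever matching ClickHouse's `IHints` actually performs. The only point is that a name filtered out of the candidate list can never be suggested.

```python
import difflib


def registered_names(columns, drop_dotted):
    # drop_dotted=True mimics the old behaviour described above (dotted names filtered out);
    # drop_dotted=False mimics the fixed behaviour (every registered name is returned).
    if drop_dotted:
        return [name for name in columns if "." not in name]
    return list(columns)


def suggest(misspelled, candidates):
    # Stand-in for a misspelling hint; difflib is used purely for illustration.
    return difflib.get_close_matches(misspelled, candidates, n=1, cutoff=0.6)


columns = ["id", "a.b"]
print(suggest("a.bb", registered_names(columns, drop_dotted=True)))   # []
print(suggest("a.bb", registered_names(columns, drop_dotted=False)))  # ['a.b']
```

With the filter in place the typo `a.bb` gets no hint at all; once dotted names are kept, the literal column `a.b` becomes a candidate.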
CI triage on ef84f17: the failures are unrelated to this change. Fast test (the only blocker in the PR workflow) reported [...]. The Community PR workflow's 25 "errors" are also unrelated; they're GitHub Actions [...]. This looks like a flaky [...]

The upstream Fast tests are failing too, with [...]. The reason looks like a change in [...]. I also can't find [...]
> Two upstream defects were responsible: `ColumnsDescription::getAllRegisteredNames` filtered out dotted names, and `NestedUtils::getSubcolumnsOfNested` misclassified lone dotted `Array(T)` columns as flattened `Nested` children.

This part does not belong in the changelog (both here and in your upstream PR). If you want, you can put it somewhere in the description (though I see you already have more there).
A changelog entry is a short user-facing line, one of tens of lines in the release notes; it does not need all those implementation details.
mkmkme left a comment
The fast test is failing due to a nullptr dereference. Please fix it.

Backport of upstream ClickHouse/ClickHouse#105546 (commit f8467af) onto `antalya-26.3`. Re-opened from `Altinity/ClickHouse` (instead of a fork) so CI publishes direct `.deb` package URLs for clickhouse-regression. Cherry-pick applied cleanly with no contextual conflicts.

Upstream issue: ClickHouse/ClickHouse#90731
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix reading Iceberg tables whose `ARRAY` column names contain a dot (e.g. `` `a.b` ARRAY<STRING> ``), which previously returned empty arrays. Two upstream defects were responsible: `ColumnsDescription::getAllRegisteredNames` filtered out dotted names, and `NestedUtils::getSubcolumnsOfNested` misclassified lone dotted `Array(T)` columns as flattened `Nested` children.

Documentation entry for user-facing changes
Symptom
When querying an Iceberg table through the `iceberg(...)` table function or a `DataLakeCatalog`, a column whose name contains a `.` and whose type is `Array(T)` returned empty arrays instead of the stored values. The same data read by Spark returned the expected values.

Root cause
The Parquet V3 reader path (`SchemaConverter` + `ColumnMapper` + `FormatFilterInfo`) is already correct after the dotted-name field-id work in `0a218cd4e8b`, `4b733bae561` and `f24c1a46063` (present in `antalya-26.3`). The remaining symptom comes from two upstream defects, independent of Iceberg but exposed by it:

* `ColumnsDescription::getAllRegisteredNames` explicitly filtered out any column whose name contained `.`, assuming such names were always flattened Nested subcolumns. A column whose stored name literally contains a dot (allowed by MergeTree with backticks, produced by Iceberg/Spark) is a first-class registered name. The function is only consumed by `IHints`-style suggestion paths (and `StorageSystemZooKeeper` column-name iteration, where no dotted names exist), so relaxing it has no effect on parsing, planning, storage, or wire protocol.
* `NestedUtils::getSubcolumnsOfNested` treated every dotted-name `Array(T)` column as a flattened element of a synthetic `Nested` structure named after the prefix. This made the Arrow, ORC and pre-V3 Parquet readers look for a struct field with the prefix name rather than the literal dotted column, returning an empty array.

Fix
* `ColumnsDescription::getAllRegisteredNames`: drop the dot filter; return every registered column name.
* `NestedUtils::getSubcolumnsOfNested`: use a two-pass scan, in which a synthetic `Nested` entry is emitted only when at least two `Array(T)` columns share the same dotted prefix. A lone `a.b: Array(T)` no longer appears in the synthetic-Nested map. Genuine flattened `Nested` with multiple fields is unaffected; the existing early-continue on `isNested()` covers the one-field edge case.

Tests
* `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column`: end-to-end repro of ClickHouse/ClickHouse#90731 against `s3`, `azure` and `local` storage.
* `…::test_dotted_array_alongside_real_nested`: mixed-schema guard; a lone dotted `Array` column coexists with a genuine flattened-Nested sibling group.
* `tests/queries/0_stateless/04259_dotted_array_not_nested.sql`: isolates the `NestedUtils` fix without Iceberg (Memory engine).
* `tests/queries/0_stateless/04260_dotted_column_in_hints.sh`: verifies the `ColumnsDescription` fix via misspelling-hint output.

Risk / scope
Low. Five-line removal in `ColumnsDescription` (hint suggestions only) and a localised two-pass refactor in `NestedUtils::getSubcolumnsOfNested`. No header changes, no new settings, no public API change. Parquet V3 path is not modified.

CI/CD Options
Exclude tests:
Regression jobs to run: