Fix Iceberg ARRAY columns with dot-separated names returning empty lists #1894

Open
il9ue wants to merge 2 commits into antalya-26.3 from fix/iceberg-dotted-array-90731-antalya-26.3

il9ue commented Jun 8, 2026

Backport of upstream ClickHouse/ClickHouse#105546 (commit f8467af) onto antalya-26.3. Re-opened from Altinity/ClickHouse (instead of a fork) so CI publishes direct .deb package URLs for clickhouse-regression. Cherry-pick applied cleanly with no contextual conflicts.

Upstream issue: ClickHouse/ClickHouse#90731

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix reading Iceberg tables whose ARRAY column names contain a dot (e.g. `a.b` ARRAY<STRING>), which previously returned empty arrays. Two upstream defects were responsible: ColumnsDescription::getAllRegisteredNames filtered out dotted names, and NestedUtils::getSubcolumnsOfNested misclassified lone dotted Array(T) columns as flattened Nested children.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Symptom

When querying an Iceberg table through the iceberg(...) table function or a DataLakeCatalog, a column whose name contains a dot and whose type is Array(T) returned empty arrays instead of the stored values. The same data read by Spark returned the expected values.

-- Spark
CREATE TABLE table7 (`a.b` ARRAY<STRING>);
INSERT INTO table7 VALUES (ARRAY('a','b','c'));

-- ClickHouse (before fix)
SELECT `a.b` FROM iceberg('...');
-- got:      [ ]
-- expected: ['a','b','c']

Root cause

The Parquet V3 reader path (SchemaConverter + ColumnMapper + FormatFilterInfo) is already correct after the dotted-name field-id work in 0a218cd4e8b, 4b733bae561 and f24c1a46063 (present in antalya-26.3). The remaining symptom comes from two upstream defects, independent of Iceberg but exposed by it:

  1. ColumnsDescription::getAllRegisteredNames explicitly filtered out any column whose name contained a dot, on the assumption that such names were always flattened Nested subcolumns. A column whose stored name literally contains a dot (allowed by MergeTree with backticks, produced by Iceberg/Spark) is a first-class registered name. The function is only consumed by IHints-style suggestion paths (and StorageSystemZooKeeper column-name iteration, where no dotted names exist), so relaxing it has no effect on parsing, planning, storage, or wire protocol.
  2. NestedUtils::getSubcolumnsOfNested treated every dotted-name Array(T) column as a flattened element of a synthetic Nested structure named after the prefix. This made the Arrow, ORC and pre-V3 Parquet readers look for a struct field with the prefix name rather than the literal dotted column, returning an empty array.
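
To make defect 1 concrete, here is a minimal C++ sketch of the kind of dot filter described above. It is not the ClickHouse source: `registeredNames` and its `skip_dotted` flag are hypothetical names introduced only for illustration, and columns are reduced to plain strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical simplification of the name list offered to the hint machinery.
// skip_dotted = true mimics the old filter; false mimics the relaxed behaviour.
std::vector<std::string> registeredNames(const std::vector<std::string> & columns, bool skip_dotted)
{
    std::vector<std::string> names;
    for (const auto & name : columns)
    {
        // Old behaviour: any dotted name is assumed to be a flattened Nested subcolumn.
        if (skip_dotted && name.find('.') != std::string::npos)
            continue;
        names.push_back(name);
    }
    return names;
}
```

With columns `id` and `a.b`, the filtered variant only ever offers `id`, so a literal dotted column can never be suggested for a misspelled name; the relaxed variant offers both.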

Fix

  • ColumnsDescription::getAllRegisteredNames — drop the dot filter; return every registered column name.
  • NestedUtils::getSubcolumnsOfNested — two-pass scan: a synthetic Nested entry is emitted only when at least two Array(T) columns share the same dotted prefix. A lone a.b: Array(T) no longer appears in the synthetic-Nested map. Genuine flattened Nested with multiple fields is unaffected; the existing early-continue on isNested() covers the one-field edge case.
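
The two-pass rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual NestedUtils code: `Column` and `syntheticNested` are hypothetical, type checks are reduced to an `is_array` flag, and the prefix is assumed to be everything before the first dot.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a column description (not a ClickHouse type).
struct Column
{
    std::string name;
    bool is_array;
};

// Returns prefix -> member column names, but only for prefixes that are
// shared by at least two Array columns.
std::map<std::string, std::vector<std::string>> syntheticNested(const std::vector<Column> & columns)
{
    // Pass 1: group dotted Array columns by the part of the name before the first dot.
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto & column : columns)
    {
        if (!column.is_array)
            continue;
        const auto dot = column.name.find('.');
        if (dot == std::string::npos || dot == 0)
            continue;
        groups[column.name.substr(0, dot)].push_back(column.name);
    }

    // Pass 2: emit a synthetic Nested entry only for groups with two or more members.
    std::map<std::string, std::vector<std::string>> result;
    for (const auto & entry : groups)
        if (entry.second.size() >= 2)
            result.insert(entry);
    return result;
}
```

Given Array columns `a.b`, `n.x` and `n.y`, only `n` receives a synthetic entry; the lone `a.b` stays an ordinary column whose literal name is looked up in the data file. Note that the outcome depends on which columns are passed in: a subcolumn of a real Nested seen on its own would also count as lone in this sketch.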

Tests

  • tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column, an end-to-end repro of ClickHouse/ClickHouse#90731 ("Iceberg: Selecting from ARRAY column with dot-separated name returns empty lists") against s3, azure and local storage.
  • …::test_dotted_array_alongside_real_nested — mixed-schema guard: a lone dotted Array column coexists with a genuine flattened-Nested sibling group.
  • tests/queries/0_stateless/04259_dotted_array_not_nested.sql — isolates the NestedUtils fix without Iceberg (Memory engine).
  • tests/queries/0_stateless/04260_dotted_column_in_hints.sh — verifies the ColumnsDescription fix via misspelling-hint output.

Risk / scope

Low. Five-line removal in ColumnsDescription (hint suggestions only) and a localised two-pass refactor in NestedUtils::getSubcolumnsOfNested. No header changes, no new settings, no public API change. Parquet V3 path is not modified.


When querying an Iceberg table through the `iceberg(...)` table function
or a DataLakeCatalog, a column whose name contains a `.` and whose type
is `Array(T)` (e.g. `` `a.b` ARRAY<STRING> ``) returned empty arrays
instead of the stored values. The same data read by Spark returned the
expected values. Fixes ClickHouse#90731.

The Parquet V3 reader path (`SchemaConverter` + `ColumnMapper` +
`FormatFilterInfo`) is already correct after the dotted-name field-id
work in 0a218cd, 4b733ba and f24c1a4. This change addresses
two residual upstream defects that affect dotted-name `Array(T)`
columns regardless of source:

* `ColumnsDescription::getAllRegisteredNames` explicitly filtered out
  any column whose name contained `.`, under the assumption such names
  were always flattened Nested subcolumns. A column whose stored name
  literally contains a dot (allowed by MergeTree with backticks, and
  produced by Iceberg / Spark) is a first-class registered name and
  must appear in `IHints` misspelling suggestions. The function is only
  consumed by `IHints`-style suggestion paths (and by
  `StorageSystemZooKeeper` for column-name iteration, where no dotted
  names exist), so relaxing it has no effect on parsing, planning,
  storage, or wire protocol.

* `NestedUtils::getSubcolumnsOfNested` treated every `Array(T)` column
  whose name contained `.` as a flattened element of a synthetic
  `Nested` structure named after the prefix. This caused the Arrow,
  ORC and pre-V3 Parquet readers to look for a struct field with the
  prefix name in the data file rather than the literal dotted column,
  returning an empty array. The fix uses a two-pass scan: a synthetic
  `Nested` entry is only emitted when at least two `Array(T)` columns
  share the same dotted prefix. A lone column such as `a.b: Array(T)`
  no longer appears in the synthetic-Nested map. Genuine flattened
  `Nested` with multiple fields is unaffected; the existing
  early-continue on `isNested()` also covers the one-field-Nested
  edge case.

Tests:
* `tests/integration/test_storage_iceberg_with_spark/test_column_names_with_dots.py::test_dotted_array_column` —
  end-to-end repro of ClickHouse#90731 against s3, azure and local storage.
* `test_dotted_array_alongside_real_nested` in the same file — mixed-
  schema regression guard verifying a lone dotted `Array` column
  coexists with genuine flattened-Nested siblings.
* `tests/queries/0_stateless/04259_dotted_array_not_nested.sql` —
  isolates the `NestedUtils::getSubcolumnsOfNested` fix without Iceberg.
* `tests/queries/0_stateless/04260_dotted_column_in_hints.sh` —
  verifies the `ColumnsDescription::getAllRegisteredNames` fix by checking the misspelling hint output.

Changelog category (leave one):
- Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry:
Fix reading Iceberg tables whose `ARRAY` column names contain a dot
(e.g. `` `a.b` ARRAY<STRING> ``), which previously returned empty
arrays. Two upstream defects were responsible:
`ColumnsDescription::getAllRegisteredNames` filtered out dotted names,
and `NestedUtils::getSubcolumnsOfNested` misclassified lone dotted
`Array(T)` columns as flattened `Nested` children.

(cherry picked from commit f8467af)

github-actions Bot commented Jun 8, 2026

Workflow [PR], commit [ef84f17]

il9ue (Author) commented Jun 10, 2026

CI triage on ef84f17: the failures are unrelated to this change.

Fast test (the only blocker in the PR workflow) reported Failed: 0, Passed: 2112, Skipped: 479, Broken: 0. It exited non-zero only on the harness-level checks "Server died" and "clickhouse-test", i.e. the server process crashed during the run rather than a functional test failing. No test failed, including the new 04259_dotted_array_not_nested and 04260_dotted_column_in_hints added here.

The Community PR workflow's 25 "errors" are also unrelated — they're GitHub Actions if-evaluation failures (fromJson: empty input from an empty Config output), not test results.

This looks like a flaky Server died. Could a maintainer re-run the PR workflow? Happy to dig in if it reproduces.

ianton-ru commented Jun 10, 2026

Fast tests fail upstream too, with:

[2026-06-03 06:02:09]       | [ip-172-31-42-164] 2026.06.03 07:59:48.401706 [ 254507 ] {e7385def-2016-4f26-8090-78dfb25fcee1} <Error> executeQuery: Code: 341. 
                      DB::Exception: Exception happened during execution of mutation 'mutation_2.txt' with part 'all_1_1_0' reason: 'Code: 190. DB::Exception: 
                      Elements 'nested.arr1' and 'nested.arr2' of Nested data structure (Array columns) have different array sizes (2 and 0 respectively) on row
                       0: while executing 'FUNCTION validateNestedArraySizes(1_UInt8 : 2, nested.arr1 : 0, nested.arr2 : 1) -> validateNestedArraySizes(1_UInt8,
                       nested.arr1, nested.arr2) UInt8 : 4'. (SIZES_OF_ARRAYS_DONT_MATCH) (version 26.6.1.1)'. This error maybe retryable or not. In case of 
                      unretryable error, mutation can be killed with KILL MUTATION query 
...
[2026-06-03 06:02:09]       | . (UNFINISHED) (version 26.6.1.1) (from [::ffff:127.0.0.1]:60466) (comment: 02565_update_empty_nested.sql-test_xxain04q) (query 6,
                       line 17) (in query: ALTER TABLE t_update_empty_nested UPDATE `nested.arr2` = `nested.arr1` WHERE 1;), Stack trace (when copying this 
                      message, always include the lines below):

It looks like the cause is the change in getSubcolumnsOfNested.

I also can't find 04259_dotted_array_not_nested in the Fast test results, sorted by name:
https://altinity-build-artifacts.s3.amazonaws.com/json.html?PR=1894&sha=ef84f172dccb85f5e294eb2df384e8d59711b42b&name_0=PR&name_1=Fast+test&name_2=Tests

zvonand (Member) left a comment

> Two upstream defects were responsible: ColumnsDescription::getAllRegisteredNames filtered out dotted names, and NestedUtils::getSubcolumnsOfNested misclassified lone dotted Array(T) columns as flattened Nested children.

This part does not belong in the changelog (both here and in your upstream PR). If you want, you can put it somewhere in the description (though I see you already have more detail there).
A changelog entry is a short user-facing line, one of tens of lines in the release notes; it does not need all those implementation details.

mkmkme (Collaborator) left a comment

The Fast test is failing due to a nullptr dereference. Please fix it.
