Skip to content

feat(sqllogictest): SLT-opt-in memory overdraft factor + actionable error#22776

Open
avantgardnerio wants to merge 5 commits into
apache:mainfrom
coralogix:brent/headroom-developer-ergonomics
Open

feat(sqllogictest): SLT-opt-in memory overdraft factor + actionable error#22776
avantgardnerio wants to merge 5 commits into
apache:mainfrom
coralogix:brent/headroom-developer-ergonomics

Conversation

@avantgardnerio
Copy link
Copy Markdown
Contributor

@avantgardnerio avantgardnerio commented Jun 5, 2026

Which issue does this PR close?

Rationale for this change

@alamb's concern in #22723 was that a contributor working on something unrelated could hit OverdraftPanic and have no idea what to do — the message rendered as allocator overdraft: account balance at panic = -1384887 bytes and the underlying allocation was lost across spawn boundaries. The ask was: (1) self-explanatory error message with a remediation path, (2) per-test opt-in instead of a global headroom knob, (3) a written long-term plan.

What changes are included in this PR?

  1. Per-test opt-in for overdraft factor. New SLT-only knob SET datafusion.sqllogictest.memory_overdraft_factor = N lifts the bank multiplier for the next SET datafusion.runtime.memory_limit, then auto-resets. The DataFusion parser doesn't know about this variable — the SLT runner intercepts it. No effect on datafusion-cli or anything else.

  2. Chained panic hook (install_overdraft_panic_hook). OverdraftPanic frequently fires inside spawned tokio tasks (e.g. RepartitionExec workers); tokio's task harness catches the unwind on the worker and the joiner converts it into DataFusionError::External("task NN panicked") — original payload lost. The hook runs on the worker before the harness eats it and writes the actionable message + forced backtrace to stderr. The runner's catch_unwind arm now reduces to a one-line pointer at the stderr.

  3. DEFAULT_BUDGET defaults to isize::MAX / 2 (effectively infinite). An un-armed bank is now a no-op; real enforcement only kicks in once --default-pool-size-mb or SET datafusion.runtime.memory_limit arms the bank. Previously, file-setup allocations settled into a 0-budget bank and tripped before any real work ran, with a misleading backtrace pointing at RuntimeEnv::default.

  4. Self-explanatory error text. The stderr message now reads:

    Memory accounting mismatch: this thread allocated N bytes more than DataFusion's MemoryPool accounted for. Some operator is allocating outside the pool's tracking — this is a real accounting bug worth fixing.

    If you made changes to an operator or UDF this is probably related to your work and should be investigated. If you do not believe this is related to your change, the fastest path forward is to opt the failing SLT into a larger overdraft tolerance and file the gap against the epic so we can pay it down:

    SET datafusion.sqllogictest.memory_overdraft_factor = N;

    where N is roughly 2 * <bytes the query actually needs> / datafusion.runtime.memory_limit. The override applies to the next SET datafusion.runtime.memory_limit only, then auto-resets.

    Please record the query + observed overshoot at:
    [EPIC] A collection of Memory Accounting Limitations and Improvements #22758

    The untracked allocation is in this backtrace: …

  5. README documents the workflow — when the message appears, how to opt out, why the default factor is loose (8.0), and the long-term plan to drive it toward 1.1 via [EPIC] A collection of Memory Accounting Limitations and Improvements #22758.

Are these changes tested?

  • Unit tests cover set_memory_overdraft_factor overriding then resetting, and the SET-statement interceptor parsing bare / quoted / signed / bogus values.
  • End-to-end verification: tightening nested_loop_join_spill.slt to SET datafusion.sqllogictest.memory_overdraft_factor = 1.0 (with --default-pool-size-mb 16384) trips the hook and produces a backtrace pointing at nested_loop_join.rs:2156arrow_arith::add_wrappingVec::collect — a real untracked allocation worth filing as a sub-issue.
  • Test suite continues to pass with or without --default-pool-size-mb (default-pool-mb-absent is now a no-op rather than tripping spuriously during RuntimeEnv::default).

Are there any user-facing changes?

Only SLT-runner-facing:

  • New SLT variable datafusion.sqllogictest.memory_overdraft_factor.
  • New panic hook installed when running with --features memory-accounting.
  • README has a new sub-section on the developer workflow when the check fires.

Three small fixes to the developer ergonomics of the memory-accounting
SLT mode, addressing alamb's feedback in apache#22723:

1. Install a chained panic hook (`install_overdraft_panic_hook`) that
   formats `OverdraftPanic` payloads as the actionable remediation text +
   forced backtrace on stderr. Without it, OverdraftPanic firing inside
   a tokio worker (e.g. RepartitionExec) became "External error: task NN
   panicked" with the payload lost and the operator backtrace gone.

2. Slim the runner's `catch_unwind` arm — the long remediation message
   now lives in the hook, so the SLT failure surface is a one-line
   pointer to stderr instead of a duplicate.

3. Default `DEFAULT_BUDGET` to `isize::MAX / 2`. An un-armed bank is
   now a no-op; real enforcement only kicks in after either
   `--default-pool-size-mb` or `SET datafusion.runtime.memory_limit`
   arms the bank. Otherwise file-setup allocations settled into a
   0-budget bank and tripped before any real work ran, with a
   misleading backtrace pointing at `RuntimeEnv::default`.

Verified end-to-end by tightening `nested_loop_join_spill.slt` to
`SET datafusion.sqllogictest.memory_overdraft_factor = 1.0` and
confirming the panic fires with a backtrace pointing at
`nested_loop_join.rs:2156` → `arrow_arith::add_wrapping` →
`Vec::collect` — i.e. a real untracked allocation, ready to file as a
sub-issue of apache#22758.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant