feat(sqllogictest): SLT-opt-in memory overdraft factor + actionable error#22776
Open
avantgardnerio wants to merge 5 commits into
Open
feat(sqllogictest): SLT-opt-in memory overdraft factor + actionable error#22776avantgardnerio wants to merge 5 commits into
avantgardnerio wants to merge 5 commits into
Conversation
Three small fixes to the developer ergonomics of the memory-accounting SLT mode, addressing alamb's feedback in apache#22723: 1. Install a chained panic hook (`install_overdraft_panic_hook`) that formats `OverdraftPanic` payloads as the actionable remediation text + forced backtrace on stderr. Without it, OverdraftPanic firing inside a tokio worker (e.g. RepartitionExec) became "External error: task NN panicked" with the payload lost and the operator backtrace gone. 2. Slim the runner's `catch_unwind` arm — the long remediation message now lives in the hook, so the SLT failure surface is a one-line pointer to stderr instead of a duplicate. 3. Default `DEFAULT_BUDGET` to `isize::MAX / 2`. An un-armed bank is now a no-op; real enforcement only kicks in after either `--default-pool-size-mb` or `SET datafusion.runtime.memory_limit` arms the bank. Otherwise file-setup allocations settled into a 0-budget bank and tripped before any real work ran, with a misleading backtrace pointing at `RuntimeEnv::default`. Verified end-to-end by tightening `nested_loop_join_spill.slt` to `SET datafusion.sqllogictest.memory_overdraft_factor = 1.0` and confirming the panic fires with a backtrace pointing at `nested_loop_join.rs:2156` → `arrow_arith::add_wrapping` → `Vec::collect` — i.e. a real untracked allocation, ready to file as a sub-issue of apache#22758.
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
@alamb's concern in #22723 was that a contributor working on something unrelated could hit
OverdraftPanicand have no idea what to do — the message rendered asallocator overdraft: account balance at panic = -1384887 bytesand the underlying allocation was lost across spawn boundaries. The ask was: (1) self-explanatory error message with a remediation path, (2) per-test opt-in instead of a global headroom knob, (3) a written long-term plan.What changes are included in this PR?
Per-test opt-in for overdraft factor. New SLT-only knob
SET datafusion.sqllogictest.memory_overdraft_factor = Nlifts the bank multiplier for the nextSET datafusion.runtime.memory_limit, then auto-resets. The DataFusion parser doesn't know about this variable — the SLT runner intercepts it. No effect ondatafusion-clior anything else.Chained panic hook (
install_overdraft_panic_hook).OverdraftPanicfrequently fires inside spawned tokio tasks (e.g. RepartitionExec workers); tokio's task harness catches the unwind on the worker and the joiner converts it intoDataFusionError::External("task NN panicked")— original payload lost. The hook runs on the worker before the harness eats it and writes the actionable message + forced backtrace to stderr. The runner'scatch_unwindarm now reduces to a one-line pointer at the stderr.DEFAULT_BUDGETdefaults toisize::MAX / 2(effectively infinite). An un-armed bank is now a no-op; real enforcement only kicks in once--default-pool-size-mborSET datafusion.runtime.memory_limitarms the bank. Previously, file-setup allocations settled into a 0-budget bank and tripped before any real work ran, with a misleading backtrace pointing atRuntimeEnv::default.Self-explanatory error text. The stderr message now reads:
README documents the workflow — when the message appears, how to opt out, why the default factor is loose (
8.0), and the long-term plan to drive it toward1.1via [EPIC] A collection of Memory Accounting Limitations and Improvements #22758.Are these changes tested?
set_memory_overdraft_factoroverriding then resetting, and the SET-statement interceptor parsing bare / quoted / signed / bogus values.nested_loop_join_spill.slttoSET datafusion.sqllogictest.memory_overdraft_factor = 1.0(with--default-pool-size-mb 16384) trips the hook and produces a backtrace pointing atnested_loop_join.rs:2156→arrow_arith::add_wrapping→Vec::collect— a real untracked allocation worth filing as a sub-issue.--default-pool-size-mb(default-pool-mb-absent is now a no-op rather than tripping spuriously duringRuntimeEnv::default).Are there any user-facing changes?
Only SLT-runner-facing:
datafusion.sqllogictest.memory_overdraft_factor.--features memory-accounting.