PHOENIX-7765 :- Eliminate spurious ERROR log on first cluster-role fetch #2516
Merged
lokiore merged 1 commit on Jun 12, 2026
GetClusterRoleRecordUtil.getClusterRoleRecord and ValidateLastDDLTimestampUtil.validateLastDDLTimestamp call ConnectionQueryServicesImpl.getLiveRegionServers(), which returns null until refreshLiveRegionServers() runs. The list is only auto-populated at CQSI init when LAST_DDL_TIMESTAMP_VALIDATION_ENABLED is true (default false), so callers on the default config path always NPE on first invocation. The existing catch block already recovers (refreshes the list and recurses), but logs an ERROR line per first call -- N false errors per CCF cluster init, polluting operator log surfaces.

Add inline null-check + refresh + re-fetch at the call site, before the NPE could fire. The retry catch block remains for genuine exceptions (auth, transport, RPC). First-call path no longer logs ERROR; only real failures do.

Verified empirically on FailoverPhoenixConnection2IT: pre-fix the output log carried 36 "Error in getting ClusterRoleRecord" lines; post-fix it carries 34 lines, and zero of those 34 have a NullPointerException in their stack -- the remaining errors are legitimate ZK-down / failover / connection-restart scenarios.

Generated-by: Claude Code (Opus 4.7)
tkhurana
approved these changes
Jun 11, 2026
What changes were proposed in this pull request?
Defensive null/empty check at the two production call sites of ConnectionQueryServicesImpl.getLiveRegionServers():
- phoenix-core-client/.../util/GetClusterRoleRecordUtil.java::getClusterRoleRecord (~L92), used by HighAvailabilityGroup.getClusterRoleRecordFromEndpoint(...) on the CCF cluster-role fetch path.
- phoenix-core-client/.../util/ValidateLastDDLTimestampUtil.java::validateLastDDLTimestamp (~L103), used by query-path callers gated on LAST_DDL_TIMESTAMP_VALIDATION_ENABLED.

If getLiveRegionServers() returns null or an empty list, refresh inline via refreshLiveRegionServers() and re-fetch. If the list is still null/empty after refresh, throw a descriptive SQLException so the caller sees a clear error instead of a stack-walked NullPointerException.

The existing retry/recovery block at GetClusterRoleRecordUtil.java:122-148 is unchanged. It catches generic Exception and refreshes/recurses, and continues to do so for genuine failures (auth, transport, RPC). What this PR removes is its NPE-recovery responsibility: now there is no NPE to catch on the first-call path, and the catch block only fires for real failures.

JIRA: https://issues.apache.org/jira/browse/PHOENIX-7765 (sub-task of PHOENIX-7562)
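The null-check, refresh, and re-fetch sequence described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the patch: the RegionServerSource interface, the LazySource test double, the pickRegionServer helper, the use of String for server entries, and the exception message are all assumptions made for the example. Only the method names getLiveRegionServers() and refreshLiveRegionServers() and the SQLException type come from the description.

```java
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class LiveRegionServersGuardSketch {

    /** Hypothetical stand-in for the two ConnectionQueryServicesImpl methods named above. */
    interface RegionServerSource {
        List<String> getLiveRegionServers(); // may return null before the first refresh
        void refreshLiveRegionServers();
    }

    /** Null/empty check, inline refresh, re-fetch, then a descriptive SQLException. */
    static String pickRegionServer(RegionServerSource source) throws SQLException {
        List<String> servers = source.getLiveRegionServers();
        if (servers == null || servers.isEmpty()) {
            source.refreshLiveRegionServers();
            servers = source.getLiveRegionServers();
        }
        if (servers == null || servers.isEmpty()) {
            // Message text is illustrative, not the one used in the patch.
            throw new SQLException("No live region servers available after refresh");
        }
        return servers.get(ThreadLocalRandom.current().nextInt(servers.size()));
    }

    /** Test double: the list stays null until refresh is called. */
    static class LazySource implements RegionServerSource {
        private volatile List<String> live; // no initializer, so null at first
        private final List<String> afterRefresh;
        int refreshCount = 0;

        LazySource(List<String> afterRefresh) { this.afterRefresh = afterRefresh; }

        @Override public List<String> getLiveRegionServers() { return live; }

        @Override public void refreshLiveRegionServers() {
            refreshCount++;
            live = afterRefresh;
        }
    }

    public static void main(String[] args) throws SQLException {
        // First call on a lazily populated source: one refresh, no exception.
        LazySource healthy = new LazySource(Arrays.asList("rs-1"));
        System.out.println(pickRegionServer(healthy) + " refreshes=" + healthy.refreshCount);

        // Refresh yields an empty list: the caller gets a clear SQLException.
        LazySource empty = new LazySource(Collections.emptyList());
        try {
            pickRegionServer(empty);
        } catch (SQLException e) {
            System.out.println("SQLException: " + e.getMessage());
        }
    }
}
```

In this sketch the guard runs before any dereference of the list, so a lazily initialized source never reaches the point where a NullPointerException could be raised, and an outer catch block would only see failures thrown by the refresh or by later calls.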
Why are the changes needed?
ConnectionQueryServicesImpl declares liveRegionServers as a volatile field with no initializer, defaulting to null. The list is auto-populated at CQSI init only when LAST_DDL_TIMESTAMP_VALIDATION_ENABLED is true (default false). On the default config path, the field stays null after init.

Both util callers (getClusterRoleRecord, validateLastDDLTimestamp) call getLiveRegionServers() and immediately invoke regionServers.get(ThreadLocalRandom.current().nextInt(regionServers.size())). Calling .size() on a null reference produces a NullPointerException on first invocation.

The existing catch block recovers transparently: it logs an ERROR, refreshes the list, and recurses with doRetry=false. The recursive call sees the now-populated list and returns successfully. So no exception escapes to the caller; this is not a user-visible defect.

What is user-visible is the spurious ERROR log line per first invocation. On a busy CCF cluster, every client connection's first cluster-role fetch logs an "Error in getting ClusterRoleRecord" ERROR line with a NullPointerException in its stack.

These are not real errors; they are lazy-init noise. They flood operator log surfaces, mask genuine errors, and create false-positive alerts in log-based monitoring.

This PR addresses the lazy-init noise, not the (already-handled) NPE escape. The framing is "remove spurious ERROR log churn on first cluster-role / DDL-timestamp call," not "fix NPE."
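To make the pre-fix sequence concrete, the following sketch reproduces it in isolation. It is a simplified model under stated assumptions, not the Phoenix source: the class name, the fetchPreFix method, the errorLog list standing in for LOGGER.error, the made-up server names, and the exception thrown once the retry is exhausted are all hypothetical. The field declaration, the random-pick expression, and the log-refresh-recurse order follow the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

class FirstCallNoiseSketch {
    // Mirrors the description: volatile field, no initializer, so it starts as null.
    private volatile List<String> liveRegionServers;

    // Stand-in for LOGGER.error(...) so the number of ERROR lines can be inspected.
    final List<String> errorLog = new ArrayList<>();

    List<String> getLiveRegionServers() {
        return liveRegionServers;
    }

    void refreshLiveRegionServers() {
        liveRegionServers = Arrays.asList("rs-1", "rs-2"); // made-up server names
    }

    /** Pre-fix shape: dereference first, rely on the catch block to recover. */
    String fetchPreFix(boolean doRetry) {
        try {
            List<String> regionServers = getLiveRegionServers();
            // On the very first call regionServers is null, so size() throws an NPE.
            return regionServers.get(
                ThreadLocalRandom.current().nextInt(regionServers.size()));
        } catch (Exception e) {
            errorLog.add("Error in getting ClusterRoleRecord: " + e);
            if (doRetry) {
                refreshLiveRegionServers();
                return fetchPreFix(false); // recursive call sees the populated list
            }
            // Assumption: what happens after the single retry is not shown in the source.
            throw new IllegalStateException("retry exhausted", e);
        }
    }

    public static void main(String[] args) {
        FirstCallNoiseSketch cqs = new FirstCallNoiseSketch();
        String server = cqs.fetchPreFix(true);
        System.out.println("server=" + server + " errorLines=" + cqs.errorLog.size());
        cqs.fetchPreFix(true);
        System.out.println("errorLines after second call=" + cqs.errorLog.size());
    }
}
```

The first call returns a server successfully yet leaves one entry in errorLog, while the second call adds none. That single entry per first invocation is the noise this PR targets.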
Does this PR introduce any user-facing change?
No

This is a pre-launch CCF feature branch (PHOENIX-7562-feature-new). The change is operator-log-cleanup only; no behavioral or wire-format change. The first-call path that previously logged ERROR + recovered now silently populates the list and proceeds.

How was this patch tested?
Tests were run locally on PHOENIX-7562-feature-new at HEAD 8f9655a69e.

Empirical evidence: the LOGGER.error count on FailoverPhoenixConnection2IT dropped from 36 "Error in getting ClusterRoleRecord" lines pre-fix to 34 post-fix.

Verified all 34 remaining errors are genuine: a search of the post-fix output for stack frames containing NullPointerException co-occurring with "Error in getting ClusterRoleRecord" finds none.

The 34 remaining errors are from genuine ZK-down / failover / connection-restart scenarios.

Caller sweep: the only two production callers of getLiveRegionServers() in phoenix-core-client/ are the two files this PR patches; this was verified during the sweep.

Was this patch authored or co-authored using generative AI tooling?
This change was developed with assistance from an AI coding tool (Claude Code, model Opus 4.7) per the ASF Generative Tooling policy. All code was reviewed by a human committer before commit; all tests were authored and verified locally.
Generated-by: Claude Code (Opus 4.7)