PHOENIX-7765 :- Eliminate spurious ERROR log on first cluster-role fetch by lokiore · Pull Request #2516 · apache/phoenix

lokiore · 2026-06-11T05:32:41Z

What changes were proposed in this pull request?

Defensive null/empty check at the two production callsites of ConnectionQueryServicesImpl.getLiveRegionServers():

phoenix-core-client/.../util/GetClusterRoleRecordUtil.java::getClusterRoleRecord (~L92) — used by HighAvailabilityGroup.getClusterRoleRecordFromEndpoint(...) on the CCF cluster-role fetch path.
phoenix-core-client/.../util/ValidateLastDDLTimestampUtil.java::validateLastDDLTimestamp (~L103) — used by query-path callers gated on LAST_DDL_TIMESTAMP_VALIDATION_ENABLED.

If getLiveRegionServers() returns null or an empty list, refresh inline via refreshLiveRegionServers() and re-fetch. If the list is still null/empty after refresh, throw a descriptive SQLException so the caller sees a clear error instead of a stack-walked NullPointerException.

The existing retry/recovery block at GetClusterRoleRecordUtil.java:122-148 is unchanged. It catches generic Exception and refreshes/recurses, and continues to do so for genuine failures (auth, transport, RPC). What this PR removes is its NPE-recovery responsibility: now there is no NPE to catch on the first-call path, and the catch block only fires for real failures.

JIRA: https://issues.apache.org/jira/browse/PHOENIX-7765 (sub-task of PHOENIX-7562)

Why are the changes needed?

ConnectionQueryServicesImpl declares liveRegionServers as a volatile field with no initializer, defaulting to null. The list is auto-populated at CQSI init only when LAST_DDL_TIMESTAMP_VALIDATION_ENABLED is true (default false). On the default config path, the field stays null after init.

Both util callers (getClusterRoleRecord, validateLastDDLTimestamp) call getLiveRegionServers() and immediately invoke regionServers.get(ThreadLocalRandom.current().nextInt(regionServers.size())) — .size() on a null reference produces a NullPointerException on first invocation.

The existing catch block recovers transparently — it logs an ERROR, refreshes the list, and recurses with doRetry=false. The recursive call sees the now-populated list and returns successfully. So no exception escapes to the caller; this is not a user-visible defect.

What is user-visible is the spurious ERROR log line per first invocation. On a busy CCF cluster, every client connection's first cluster-role fetch logs:

ERROR util.GetClusterRoleRecordUtil(125): Error in getting ClusterRoleRecord for <ha-group> from url <url>
java.lang.NullPointerException
  at org.apache.phoenix.util.GetClusterRoleRecordUtil.getClusterRoleRecord(GetClusterRoleRecordUtil.java:95)
  ...

These are not real errors — they are lazy-init noise. They flood operator log surfaces, mask genuine errors, and create false-positive alerts in log-based monitoring.

This PR addresses the lazy-init noise, not the (already-handled) NPE escape. The framing is "remove spurious ERROR log churn on first cluster-role / DDL-timestamp call," not "fix NPE."

Does this PR introduce any user-facing change?

No

This is a pre-launch CCF feature branch (PHOENIX-7562-feature-new). The change is operator-log-cleanup only; no behavioral or wire-format change. The first-call path that previously logged ERROR + recovered now silently populates the list and proceeds.

How was this patch tested?

Local commands run on PHOENIX-7562-feature-new HEAD 8f9655a69e:

mvn install -pl phoenix-core-client,phoenix-core,phoenix-core-server -am -DskipTests
# BUILD SUCCESS, 5 reactor modules, 14.863s

mvn spotless:apply -DskipTests -q
# silent success, zero drift

mvn -pl phoenix-core failsafe:integration-test \
    -Dit.test=FailoverPhoenixConnection2IT -DfailIfNoTests=false
# Tests run: 13, Failures: 1, Errors: 0, Skipped: 1
# (1 pre-existing unrelated failure on testAllWrappedConnectionsClosedAfterStandbyAndZKDownAsync —
#  connection-cleanup timing in STANDBY+ZK-down test, unaffected by this change)

Empirical evidence — LOGGER.error count delta on FailoverPhoenixConnection2IT:

$ grep -c "Error in getting ClusterRoleRecord" failsafe-output.txt
# pre-fix:  36
# post-fix: 34

Verified all 34 remaining errors are genuine — searched the post-fix output for stack frames containing NullPointerException co-occurring with Error in getting ClusterRoleRecord:

$ grep -B1 "java.lang.NullPointerException" failsafe-output.txt \
    | grep "Error in getting ClusterRoleRecord" | wc -l
# 0

The 34 remaining errors are from genuine ZK-down / failover / connection-restart scenarios:

   9 testConnectionWhenTwoZKRestarts
   7 testFailoverCanFinishWhenOneZKDownWithCQS
   7 testFailoverCanFinishWhenOneZKDown
   5 testAllWrappedConnectionsClosedAfterStandbyAndZKDownAsync
   3 testConnectionWhenStandbyZKRestarts
   1 testStaleCrrVersionDetection
   1 testAllWrappedConnectionsNotClosedAfterStandbyURLChange
   1 testAllWrappedConnectionsClosedAfterActiveURLChange

Caller sweep: the only two production callers of getLiveRegionServers() in phoenix-core-client/ are the two files this PR patches. Verified via:

$ git grep -n "getLiveRegionServers()" phoenix-core-client/src/main/java/
phoenix-core-client/src/main/java/org/apache/phoenix/query/ConnectionQueryServices.java:203:  List<ServerName> getLiveRegionServers();
phoenix-core-client/src/main/java/org/apache/phoenix/query/ConnectionQueryServicesImpl.java:5957:  return this.liveRegionServers;
phoenix-core-client/src/main/java/org/apache/phoenix/query/ConnectionlessQueryServicesImpl.java:564: ...
phoenix-core-client/src/main/java/org/apache/phoenix/query/DelegateConnectionQueryServices.java:236: ...
phoenix-core-client/src/main/java/org/apache/phoenix/util/GetClusterRoleRecordUtil.java:96: ← patched
phoenix-core-client/src/main/java/org/apache/phoenix/util/ValidateLastDDLTimestampUtil.java:103: ← patched

Was this patch authored or co-authored using generative AI tooling?

This change was developed with assistance from an AI coding tool (Claude Code, model Opus 4.7) per the ASF Generative Tooling policy. All code was reviewed by a human committer before commit; all tests were authored and verified locally.

Generated-by: Claude Code (Opus 4.7)

GetClusterRoleRecordUtil.getClusterRoleRecord and ValidateLastDDLTimestampUtil.validateLastDDLTimestamp call ConnectionQueryServicesImpl.getLiveRegionServers(), which returns null until refreshLiveRegionServers() runs. The list is only auto-populated at CQSI init when LAST_DDL_TIMESTAMP_VALIDATION_ENABLED is true (default false), so callers on the default config path always NPE on first invocation. The existing catch block already recovers (refreshes the list and recurses), but logs an ERROR line per first call -- N false errors per CCF cluster init, polluting operator log surfaces. Add inline null-check + refresh + re-fetch at the call site, before the NPE could fire. The retry catch block remains for genuine exceptions (auth, transport, RPC). First-call path no longer logs ERROR; only real failures do. Verified empirically on FailoverPhoenixConnection2IT: pre-fix the output log carried 36 "Error in getting ClusterRoleRecord" lines; post-fix it carries 34 lines, and zero of those 34 have a NullPointerException in their stack -- the remaining errors are legitimate ZK-down / failover / connection-restart scenarios. Generated-by: Claude Code (Opus 4.7)

tkhurana approved these changes Jun 11, 2026

View reviewed changes

lokiore merged commit efe34a5 into apache:PHOENIX-7562-feature-new Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PHOENIX-7765 :- Eliminate spurious ERROR log on first cluster-role fetch#2516

PHOENIX-7765 :- Eliminate spurious ERROR log on first cluster-role fetch#2516
lokiore merged 1 commit into
apache:PHOENIX-7562-feature-newfrom
lokiore:PHOENIX-7765/clusterrole-npe-fix

lokiore commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lokiore commented Jun 11, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants