Skip to content

PHOENIX-7765 :- Eliminate spurious ERROR log on first cluster-role fetch#2516

Merged
lokiore merged 1 commit into
apache:PHOENIX-7562-feature-newfrom
lokiore:PHOENIX-7765/clusterrole-npe-fix
Jun 12, 2026
Merged

PHOENIX-7765 :- Eliminate spurious ERROR log on first cluster-role fetch#2516
lokiore merged 1 commit into
apache:PHOENIX-7562-feature-newfrom
lokiore:PHOENIX-7765/clusterrole-npe-fix

Conversation

@lokiore

@lokiore lokiore commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Defensive null/empty check at the two production callsites of ConnectionQueryServicesImpl.getLiveRegionServers():

  1. phoenix-core-client/.../util/GetClusterRoleRecordUtil.java::getClusterRoleRecord (~L92) — used by HighAvailabilityGroup.getClusterRoleRecordFromEndpoint(...) on the CCF cluster-role fetch path.
  2. phoenix-core-client/.../util/ValidateLastDDLTimestampUtil.java::validateLastDDLTimestamp (~L103) — used by query-path callers gated on LAST_DDL_TIMESTAMP_VALIDATION_ENABLED.

If getLiveRegionServers() returns null or an empty list, refresh inline via refreshLiveRegionServers() and re-fetch. If the list is still null/empty after refresh, throw a descriptive SQLException so the caller sees a clear error instead of a stack-walked NullPointerException.

The existing retry/recovery block at GetClusterRoleRecordUtil.java:122-148 is unchanged. It catches generic Exception and refreshes/recurses, and continues to do so for genuine failures (auth, transport, RPC). What this PR removes is its NPE-recovery responsibility: now there is no NPE to catch on the first-call path, and the catch block only fires for real failures.

JIRA: https://issues.apache.org/jira/browse/PHOENIX-7765 (sub-task of PHOENIX-7562)

Why are the changes needed?

ConnectionQueryServicesImpl declares liveRegionServers as a volatile field with no initializer, defaulting to null. The list is auto-populated at CQSI init only when LAST_DDL_TIMESTAMP_VALIDATION_ENABLED is true (default false). On the default config path, the field stays null after init.

Both util callers (getClusterRoleRecord, validateLastDDLTimestamp) call getLiveRegionServers() and immediately invoke regionServers.get(ThreadLocalRandom.current().nextInt(regionServers.size())).size() on a null reference produces a NullPointerException on first invocation.

The existing catch block recovers transparently — it logs an ERROR, refreshes the list, and recurses with doRetry=false. The recursive call sees the now-populated list and returns successfully. So no exception escapes to the caller; this is not a user-visible defect.

What is user-visible is the spurious ERROR log line per first invocation. On a busy CCF cluster, every client connection's first cluster-role fetch logs:

ERROR util.GetClusterRoleRecordUtil(125): Error in getting ClusterRoleRecord for <ha-group> from url <url>
java.lang.NullPointerException
  at org.apache.phoenix.util.GetClusterRoleRecordUtil.getClusterRoleRecord(GetClusterRoleRecordUtil.java:95)
  ...

These are not real errors — they are lazy-init noise. They flood operator log surfaces, mask genuine errors, and create false-positive alerts in log-based monitoring.

This PR addresses the lazy-init noise, not the (already-handled) NPE escape. The framing is "remove spurious ERROR log churn on first cluster-role / DDL-timestamp call," not "fix NPE."

Does this PR introduce any user-facing change?

No

This is a pre-launch CCF feature branch (PHOENIX-7562-feature-new). The change is operator-log-cleanup only; no behavioral or wire-format change. The first-call path that previously logged ERROR + recovered now silently populates the list and proceeds.

How was this patch tested?

Local commands run on PHOENIX-7562-feature-new HEAD 8f9655a69e:

mvn install -pl phoenix-core-client,phoenix-core,phoenix-core-server -am -DskipTests
# BUILD SUCCESS, 5 reactor modules, 14.863s

mvn spotless:apply -DskipTests -q
# silent success, zero drift

mvn -pl phoenix-core failsafe:integration-test \
    -Dit.test=FailoverPhoenixConnection2IT -DfailIfNoTests=false
# Tests run: 13, Failures: 1, Errors: 0, Skipped: 1
# (1 pre-existing unrelated failure on testAllWrappedConnectionsClosedAfterStandbyAndZKDownAsync —
#  connection-cleanup timing in STANDBY+ZK-down test, unaffected by this change)

Empirical evidence — LOGGER.error count delta on FailoverPhoenixConnection2IT:

$ grep -c "Error in getting ClusterRoleRecord" failsafe-output.txt
# pre-fix:  36
# post-fix: 34

Verified all 34 remaining errors are genuine — searched the post-fix output for stack frames containing NullPointerException co-occurring with Error in getting ClusterRoleRecord:

$ grep -B1 "java.lang.NullPointerException" failsafe-output.txt \
    | grep "Error in getting ClusterRoleRecord" | wc -l
# 0

The 34 remaining errors are from genuine ZK-down / failover / connection-restart scenarios:

   9 testConnectionWhenTwoZKRestarts
   7 testFailoverCanFinishWhenOneZKDownWithCQS
   7 testFailoverCanFinishWhenOneZKDown
   5 testAllWrappedConnectionsClosedAfterStandbyAndZKDownAsync
   3 testConnectionWhenStandbyZKRestarts
   1 testStaleCrrVersionDetection
   1 testAllWrappedConnectionsNotClosedAfterStandbyURLChange
   1 testAllWrappedConnectionsClosedAfterActiveURLChange

Caller sweep: the only two production callers of getLiveRegionServers() in phoenix-core-client/ are the two files this PR patches. Verified via:

$ git grep -n "getLiveRegionServers()" phoenix-core-client/src/main/java/
phoenix-core-client/src/main/java/org/apache/phoenix/query/ConnectionQueryServices.java:203:  List<ServerName> getLiveRegionServers();
phoenix-core-client/src/main/java/org/apache/phoenix/query/ConnectionQueryServicesImpl.java:5957:  return this.liveRegionServers;
phoenix-core-client/src/main/java/org/apache/phoenix/query/ConnectionlessQueryServicesImpl.java:564: ...
phoenix-core-client/src/main/java/org/apache/phoenix/query/DelegateConnectionQueryServices.java:236: ...
phoenix-core-client/src/main/java/org/apache/phoenix/util/GetClusterRoleRecordUtil.java:96: ← patched
phoenix-core-client/src/main/java/org/apache/phoenix/util/ValidateLastDDLTimestampUtil.java:103: ← patched

Was this patch authored or co-authored using generative AI tooling?

This change was developed with assistance from an AI coding tool (Claude Code, model Opus 4.7) per the ASF Generative Tooling policy. All code was reviewed by a human committer before commit; all tests were authored and verified locally.

Generated-by: Claude Code (Opus 4.7)

GetClusterRoleRecordUtil.getClusterRoleRecord and
ValidateLastDDLTimestampUtil.validateLastDDLTimestamp call
ConnectionQueryServicesImpl.getLiveRegionServers(), which returns
null until refreshLiveRegionServers() runs. The list is only
auto-populated at CQSI init when LAST_DDL_TIMESTAMP_VALIDATION_ENABLED
is true (default false), so callers on the default config path always
NPE on first invocation. The existing catch block already recovers
(refreshes the list and recurses), but logs an ERROR line per first
call -- N false errors per CCF cluster init, polluting operator log
surfaces.

Add inline null-check + refresh + re-fetch at the call site, before
the NPE could fire. The retry catch block remains for genuine
exceptions (auth, transport, RPC). First-call path no longer logs
ERROR; only real failures do.

Verified empirically on FailoverPhoenixConnection2IT: pre-fix the
output log carried 36 "Error in getting ClusterRoleRecord" lines;
post-fix it carries 34 lines, and zero of those 34 have a
NullPointerException in their stack -- the remaining errors are
legitimate ZK-down / failover / connection-restart scenarios.

Generated-by: Claude Code (Opus 4.7)
@lokiore lokiore merged commit efe34a5 into apache:PHOENIX-7562-feature-new Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants