feat(QTDI-2030): Add health and readiness endpoints to component-server#1245

Open
undx wants to merge 1 commit into master from copilot/QTDI-2030_health_readiness_endpoints

undx commented Jun 23, 2026

Member

Summary

Implements two new Kubernetes-style probe endpoints for the component-server, resolving recurring cluster restart failures in Talend Cloud (AU/AP incidents).

The existing /environment endpoint was insufficient because it returns 200 even when the server cannot process requests. This PR introduces:

  • GET /api/v1/health (liveness probe): checks heap memory availability, component index validity, and Vault connectivity. Returns 503 {"status":"DOWN","cause":"..."} if any check fails, triggering a pod restart.
  • GET /api/v1/readiness (readiness probe): checks that the component index has finished loading. Returns 503 while the index is being built, preventing premature traffic routing.
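
The status-to-HTTP mapping described above can be sketched in plain Java. This is an illustration only, not the code in this PR: `ProbeStatus` and `firstFailure` are hypothetical names, the JSON is hand-rolled to keep the example dependency-free, and omitting `cause` when the status is UP is an assumption.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for the PR's HealthStatus payload; the real class may differ. */
final class ProbeStatus {
    final String status; // "UP" or "DOWN"
    final String cause;  // null when UP (assumption)

    private ProbeStatus(String status, String cause) {
        this.status = status;
        this.cause = cause;
    }

    static ProbeStatus up() {
        return new ProbeStatus("UP", null);
    }

    static ProbeStatus down(String cause) {
        return new ProbeStatus("DOWN", cause);
    }

    boolean isUp() {
        return "UP".equals(status);
    }

    /** HTTP code the probe answers with: 200 when UP, 503 when DOWN. */
    int httpCode() {
        return isUp() ? 200 : 503;
    }

    /** Minimal hand-rolled JSON; a real server would use its own JSON stack. */
    String toJson() {
        if (cause == null) {
            return "{\"status\":\"" + status + "\"}";
        }
        return "{\"status\":\"" + status + "\",\"cause\":\"" + escape(cause) + "\"}";
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Runs the checks in order and reports the first failure, or UP if none fails. */
    static ProbeStatus firstFailure(List<Supplier<ProbeStatus>> checks) {
        for (Supplier<ProbeStatus> check : checks) {
            ProbeStatus result = check.get();
            if (!result.isUp()) {
                return result;
            }
        }
        return up();
    }
}
```

Under these assumptions, the liveness endpoint would pass its three checks (memory, component index, Vault) to `firstFailure`, while the readiness endpoint would pass a single check on the index load state.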

Key design decisions:

  • Memory threshold is configurable via talend.server.health.memory.threshold (default: 10%)
  • Runtime.maxMemory() == Long.MAX_VALUE guard prevents false DOWN on unbounded heap
  • Vault check is skipped when vaultUrl == "no-vault" (standalone mode)
  • catch (Throwable t) in component index check is intentional — required to catch OOME (documented in PR comment below)
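
As a rough sketch of the first and third decisions above: the property name, the 10% default, and the "no-vault" marker come from this PR, but `HealthConfigSketch`, the accepted value formats ("10" or "10%"), the fallback to the default on invalid input, and the treatment of a null URL are all assumptions. The Vault comparison uses `equals` because `==` on Java strings compares references, not contents.

```java
/** Hypothetical helper; the real server presumably reads these through its own configuration layer. */
final class HealthConfigSketch {
    static final String THRESHOLD_KEY = "talend.server.health.memory.threshold"; // name from the PR
    static final int DEFAULT_THRESHOLD_PERCENT = 10;                             // default from the PR
    static final String NO_VAULT = "no-vault";                                   // standalone marker from the PR

    /** Parses "10" or "10%"; falls back to the default when missing, malformed, or outside 0..100 (assumption). */
    static int parseThreshold(String raw) {
        if (raw == null) {
            return DEFAULT_THRESHOLD_PERCENT;
        }
        String text = raw.trim();
        if (text.endsWith("%")) {
            text = text.substring(0, text.length() - 1).trim();
        }
        try {
            int value = Integer.parseInt(text);
            return (value >= 0 && value <= 100) ? value : DEFAULT_THRESHOLD_PERCENT;
        } catch (NumberFormatException e) {
            return DEFAULT_THRESHOLD_PERCENT;
        }
    }

    /** Reads the threshold from a JVM system property (one possible source; an assumption). */
    static int thresholdFromSystemProperty() {
        return parseThreshold(System.getProperty(THRESHOLD_KEY));
    }

    /** The Vault check is skipped in standalone mode; a null URL is also treated as "no Vault" (assumption). */
    static boolean shouldCheckVault(String vaultUrl) {
        return vaultUrl != null && !NO_VAULT.equals(vaultUrl.trim());
    }
}
```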

Jira: QTDI-2030

Checklist

  • Tests added or updated (7 UT + 3 IT — all GREEN)
  • mvn clean verify passes locally
  • Spotless formatting applied (mvn spotless:apply)
  • No new Sonar issues (pending CI confirmation)

AI generated code

https://internal.qlik.dev/general/ways-of-working/code-reviews/#guidelines-for-ai-generated-code

  • this PR has been written with the help of GitHub Copilot or another generative AI tool

Introduces GET /api/v1/health (liveness probe) and GET /api/v1/readiness
(readiness probe) as dedicated Kubernetes-style endpoints, replacing the
insufficient /environment endpoint for cluster health monitoring.

Health checks: heap memory availability (configurable threshold),
component index validity, and Vault connectivity via VaultClient.ping().
Readiness check: component index load state (isStarted()).

Both endpoints return {"status":"UP"|"DOWN","cause":"..."} with HTTP 200/503.

Closes QTDI-2030
Co-Authored-By: GitHub Copilot <copilot@noreply.github.com>
undx commented Jun 23, 2026

Member Author

Scope & Design Review — Round 1 — APPROVED

Findings

BLOCKER B1 — Wrong path for readiness endpoint (FIXED ✅)

  • File: HealthResource.java, HealthResourceImpl.java, HealthResourceImplIT.java
  • Issue: Readiness was at /api/v1/health/readiness instead of /api/v1/readiness as confirmed by Dev.
  • Fix: Created ReadinessResource.java and ReadinessResourceImpl.java as separate root-level resources at @Path("readiness"). Updated IT to use /readiness path.

MAJOR M1 — Memory check divide-by-zero risk with unbounded heap (FIXED ✅)

  • File: HealthService.java#checkMemory()
  • Issue: Runtime.maxMemory() returns Long.MAX_VALUE when the JVM has no inherent heap limit (the scenario raised here is a launch without -Xmx). In that case the formula availableMemory * 100L / maxMemory would overflow or produce 0%, causing a false DOWN.
  • Fix: Added if (maxMemory == Long.MAX_VALUE) return UP guard before the calculation.
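
To make the failure mode concrete: in 64-bit arithmetic `Long.MAX_VALUE * 100L` wraps to `-100`, and dividing that by `Long.MAX_VALUE` truncates to `0`, which reads as 0% available. The sketch below shows the guard placed before the calculation. `MemoryCheckSketch` is a hypothetical name, and both the definition of available memory (max minus used) and the "below threshold means DOWN" rule are assumptions about `HealthService#checkMemory()`, not its actual code.

```java
/** Hypothetical sketch of a guarded heap check; returns null when healthy, otherwise a cause message. */
final class MemoryCheckSketch {

    static String checkMemory(long maxMemory, long totalMemory, long freeMemory, int thresholdPercent) {
        // Guard first: no inherent heap limit means the percentage below is meaningless.
        if (maxMemory == Long.MAX_VALUE) {
            return null;
        }
        long usedMemory = totalMemory - freeMemory;
        long availableMemory = maxMemory - usedMemory; // assumption: available = max - used
        long availablePercent = availableMemory * 100L / maxMemory;
        if (availablePercent < thresholdPercent) {
            return "Available heap " + availablePercent + "% is below threshold " + thresholdPercent + "%";
        }
        return null;
    }

    /** Convenience overload that reads the live values from the running JVM. */
    static String checkMemory(int thresholdPercent) {
        Runtime runtime = Runtime.getRuntime();
        return checkMemory(runtime.maxMemory(), runtime.totalMemory(), runtime.freeMemory(), thresholdPercent);
    }
}
```

Taking the three memory values as parameters keeps the guard and the threshold comparison testable without depending on the heap state of the test JVM.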

MINOR Mi1 — Throwable catch for component index check (ACCEPTED — intentional)

  • File: HealthService.java#checkComponentIndex()
  • Justification: The approved plan explicitly requires catching OutOfMemoryError (OOME), which means catching Throwable. t.getMessage() may return null for an OOME, but the resulting cause, "Component index check failed: null", is acceptable.
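
For illustration, here is the pattern under discussion, with one possible refinement that this PR does not make: falling back to the exception class name when the message is null, so the cause reads "OutOfMemoryError" rather than "null". `IndexCheckSketch` and the `Runnable` probe are hypothetical; the real check inspects the component index directly.

```java
/** Hypothetical sketch; returns null when the index probe succeeds, otherwise a cause message. */
final class IndexCheckSketch {

    static String checkComponentIndex(Runnable indexProbe) {
        try {
            indexProbe.run();
            return null;
        } catch (Throwable t) { // intentional: OutOfMemoryError is an Error, so catching Exception would miss it
            // Refinement not in the PR: avoid "null" in the cause when the throwable has no message.
            String detail = t.getMessage() != null ? t.getMessage() : t.getClass().getSimpleName();
            return "Component index check failed: " + detail;
        }
    }
}
```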

Compliance Check — Summary

| Severity | Count |
| --- | --- |
| Critical (fixed) | 1 |
| Warning (accepted) | 2 |

CRITICAL C1 — catch (final Exception e) in VaultClient.ping() (FIXED ✅)

  • Fix: Changed to catch (final javax.ws.rs.ProcessingException e) — the correct transport-level exception for JAX-RS client failures.

WARNING W1 — catch (final Throwable t) in HealthService.checkComponentIndex() (ACCEPTED)

  • Justification: Intentional per approved plan — OOME can only be caught with Throwable.

WARNING W2 — MockitoAnnotations.openMocks() instead of @ExtendWith(MockitoExtension.class) (ACCEPTABLE)

  • Justification: mockito-junit-jupiter:4.8.1 is incompatible with JUnit 5.10.0 on the test classpath. MockitoAnnotations.openMocks() achieves the same result; @InjectMocks and @Mock annotations are still used as prescribed.

AC Coverage

| AC | Covered? |
| --- | --- |
| GET /api/v1/health → 200 when healthy | ✅ UT + IT |
| GET /api/v1/readiness → 200 when ready | ✅ UT + IT |
| /health → 503 + cause when memory low | ✅ UT |
| /health → 503 + cause when Vault fails | ✅ UT |
| /readiness → 503 + cause when index not ready | ✅ UT |
| /environment independent from health/readiness | ✅ IT |
| Memory threshold is configurable | ✅ config property |
| Vault check uses VaultClient.ping() | ✅ new method added |
| Error message with cause in JSON | ✅ HealthStatus.cause field |

sonar-rnd Bot commented Jun 23, 2026


Quality Gate failed

  • 0.00% Coverage on New Code (is less than 80.00%)
  • 10 New Issues (is greater than 0)

Project ID: org.talend.sdk.component:component-runtime
