feat(QTDI-2030): Add health and readiness endpoints to component-server #1245
Open
undx wants to merge 1 commit into
Introduces GET /api/v1/health (liveness probe) and GET /api/v1/readiness
(readiness probe) as dedicated Kubernetes-style endpoints, replacing the
insufficient /environment endpoint for cluster health monitoring.
Health checks: heap memory availability (configurable threshold),
component index validity, and Vault connectivity via VaultClient.ping().
Readiness check: component index load state (isStarted()).
Both endpoints return {"status":"UP"|"DOWN","cause":"..."} with HTTP 200/503.
Closes QTDI-2030
Co-Authored-By: GitHub Copilot <copilot@noreply.github.com>
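
The response contract described above (a `status` of `UP` or `DOWN`, a `cause`, and HTTP 200 or 503) could be modelled roughly as follows. This is a minimal sketch, not the PR's code: `HealthStatusSketch` is a hypothetical stand-in for the PR's `HealthStatus`, and omitting `cause` from the JSON when the status is `UP` is an assumption.

```java
import java.util.Objects;

// Hypothetical stand-in for the PR's HealthStatus; only the "status" and
// "cause" field names and the 200/503 mapping come from the description.
final class HealthStatusSketch {
    final String status; // "UP" or "DOWN"
    final String cause;  // null when UP (assumption)

    private HealthStatusSketch(final String status, final String cause) {
        this.status = status;
        this.cause = cause;
    }

    static HealthStatusSketch up() {
        return new HealthStatusSketch("UP", null);
    }

    static HealthStatusSketch down(final String cause) {
        return new HealthStatusSketch("DOWN", Objects.requireNonNull(cause));
    }

    // 200 when UP, 503 (Service Unavailable) otherwise.
    int httpCode() {
        return "UP".equals(status) ? 200 : 503;
    }

    // Hand-rolled JSON to keep the sketch dependency-free.
    String toJson() {
        if (cause == null) {
            return "{\"status\":\"" + status + "\"}";
        }
        return "{\"status\":\"" + status + "\",\"cause\":\"" + escape(cause) + "\"}";
    }

    private static String escape(final String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A JAX-RS resource would then set the response status from `httpCode()` and the entity from the status object; the real endpoint presumably relies on the server's JSON binding rather than string concatenation.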
Scope & Design Review - Round 1 - APPROVED

Findings
BLOCKER B1 - Wrong path for readiness endpoint (FIXED ✅)
MAJOR M1 - Memory check divide-by-zero risk with unbounded heap (FIXED ✅)
MINOR Mi1 -
| Severity | Count |
|---|---|
| Critical (fixed) | 1 |
| Warning (accepted) | 2 |
CRITICAL C1 - `catch (final Exception e)` in `VaultClient.ping()` (FIXED ✅)
- Fix: Changed to `catch (final javax.ws.rs.ProcessingException e)`, the correct transport-level exception for JAX-RS client failures.
WARNING W1 - `catch (final Throwable t)` in `HealthService.checkComponentIndex()` (ACCEPTED)
- Justification: Intentional per approved plan. `OutOfMemoryError` is an `Error`, not an `Exception`, so a `catch (Exception e)` clause would miss it; catching `Throwable` covers it.
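
To illustrate why W1 needs the broad catch, here is a minimal sketch of a check that turns any failure, including an `OutOfMemoryError`, into a DOWN cause. The class name, the `Supplier`-based probe, and the cause strings are hypothetical and do not reflect the actual `HealthService.checkComponentIndex()` signature.

```java
import java.util.function.Supplier;

// Hypothetical sketch; not the PR's HealthService.
final class IndexCheckSketch {

    // Returns null when the index is healthy, otherwise a cause string.
    static String check(final Supplier<Boolean> indexProbe) {
        try {
            return Boolean.TRUE.equals(indexProbe.get())
                    ? null
                    : "component index is not valid";
        } catch (final Throwable t) {
            // Deliberately broad: OutOfMemoryError extends Error, so
            // catch (Exception e) would let it escape the health check.
            return "component index check failed: " + t.getClass().getSimpleName();
        }
    }
}
```

For example, `IndexCheckSketch.check(() -> { throw new OutOfMemoryError("simulated"); })` yields a cause string instead of propagating the error, so the endpoint can still answer 503.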
WARNING W2 - `MockitoAnnotations.openMocks()` instead of `@ExtendWith(MockitoExtension.class)` (ACCEPTABLE)
- Justification: `mockito-junit-jupiter:4.8.1` is incompatible with JUnit 5.10.0 on the test classpath. `MockitoAnnotations.openMocks()` achieves the same result; `@InjectMocks` and `@Mock` annotations are still used as prescribed.
AC Coverage
| AC | Covered? |
|---|---|
| GET /api/v1/health → 200 when healthy | ✅ UT + IT |
| GET /api/v1/readiness → 200 when ready | ✅ UT + IT |
| /health → 503 + cause when memory low | ✅ UT |
| /health → 503 + cause when Vault fails | ✅ UT |
| /readiness → 503 + cause when index not ready | ✅ UT |
| /environment independent from health/readiness | ✅ IT |
| Memory threshold is configurable | ✅ config property |
| Vault check uses VaultClient.ping() | ✅ new method added |
| Error message with cause in JSON | ✅ HealthStatus.cause field |

Summary
Implements two new Kubernetes-style probe endpoints for the component-server, resolving recurring cluster restart failures in Talend Cloud (AU/AP incidents).
The existing `/environment` endpoint was insufficient because it returns 200 even when the server cannot process requests. This PR introduces:
- `GET /api/v1/health` (liveness probe): returns `503 {"status":"DOWN","cause":"..."}` if any check fails, triggering a pod restart.
- `GET /api/v1/readiness` (readiness probe): returns `503` while the index is being built, preventing premature traffic routing.

Key design decisions:
- Memory threshold is configurable via `talend.server.health.memory.threshold` (default: 10%)
- `Runtime.maxMemory() == Long.MAX_VALUE` guard prevents false DOWN on unbounded heap
- Vault check special-cases `vaultUrl == "no-vault"` (standalone mode)
- `catch (Throwable t)` in component index check is intentional: required to catch OOME (documented in PR comment below)

Jira: QTDI-2030
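
The heap check with its unbounded-heap guard might look something like the sketch below. It assumes the threshold is a minimum percentage of the maximum heap that must remain available; the class and method names are hypothetical, and the PR's actual formula may differ.

```java
// Hypothetical sketch of the memory check; not the PR's implementation.
final class HeapCheckSketch {

    // Pure function so the arithmetic can be tested without a real JVM heap.
    static boolean hasEnoughHeap(final long max, final long total, final long free,
                                 final double minFreePercent) {
        // Runtime.maxMemory() returns Long.MAX_VALUE when there is no
        // inherent limit; treat that (and a non-positive max, which would
        // otherwise risk a divide-by-zero) as healthy.
        if (max == Long.MAX_VALUE || max <= 0) {
            return true;
        }
        final long used = total - free;
        final double freePercent = (max - used) * 100.0 / max;
        return freePercent >= minFreePercent;
    }

    // Convenience overload reading the current JVM figures.
    static boolean hasEnoughHeap(final double minFreePercent) {
        final Runtime rt = Runtime.getRuntime();
        return hasEnoughHeap(rt.maxMemory(), rt.totalMemory(), rt.freeMemory(), minFreePercent);
    }
}
```

With a 10% threshold, a heap capped at 1000 bytes with 950 in use leaves 5% available and would be reported as DOWN, while an unbounded heap is always reported as UP.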
Checklist
- `mvn clean verify` passes locally
- Formatting applied (`mvn spotless:apply`)

AI generated code
https://internal.qlik.dev/general/ways-of-working/code-reviews/#guidelines-for-ai-generated-code