Remove grpc.WaitForReady to make eviction handler more robust#1069
Remove grpc.WaitForReady to make eviction handler more robust#1069kushnaidu wants to merge 5 commits into
Conversation
✅ Deploy Preview for kpt-porch ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
9b9c314 to
0a07281
Compare
There was a problem hiding this comment.
Pull request overview
This PR aims to eliminate long hangs and reduce false-positive evictions in the function-runner pod cache by removing grpc.WaitForReady(true) usage and instead gating pod availability on gRPC connection readiness, while making eviction behavior consult actual Kubernetes Pod state.
Changes:
- Added a gRPC readiness wait (
waitForGrpcReady) during pod creation, with a test-only skip flag. - Removed
grpc.WaitForReady(true)fromEvaluateFunctionRPCs. - Updated eviction handling to check the Kubernetes Pod state before removing from cache and to delete only Pods (not Services); updated/added related tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
func/internal/podmanager.go |
Adds waitForGrpcReady and uses it during pod creation (with test skip flag). |
func/internal/podevaluator.go |
Removes WaitForReady and updates retry rationale around Unavailable. |
func/internal/podcachemanager.go |
Changes eviction to consult Kubernetes Pod state and delete only Pods. |
func/internal/podevaluator_podmanager_test.go |
Sets skipGrpcReadyCheck for tests. |
func/internal/podevaluator_podcachemanager_test.go |
Sets skipGrpcReadyCheck for tests. |
func/internal/podcachemanager_eventloop_test.go |
Refactors eviction tests into a table-driven test and sets skipGrpcReadyCheck. |
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
|
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
| } else if k8sPod.Status.Phase != corev1.PodRunning || k8sPod.DeletionTimestamp != nil { | ||
| klog.Infof("Evicting dead pod %s from cache for image %s (Unavailable)", evict.podKey.Name, evict.image) | ||
| if fn.pods[idx].grpcConnection != nil { | ||
| fn.pods[idx].grpcConnection.Close() | ||
| } | ||
| pcm.DeletePodInBackground(k8sPod) | ||
| fn.pods = slices.Delete(fn.pods, idx, idx+1) | ||
| } |
There was a problem hiding this comment.
Just a question on this, if the Pod is Running, don't we end up leaving the same dead grpcConnection in the cache?
There was a problem hiding this comment.
Yes, it will be dead and the grpc will timeout eventually and let go of it. We don't want to close it because it might crash other parallel in RPC calls. This can happen especially when the liveliness probe makes the pod go down under high load. However, the GC as well on its next scan interval will clean up if the pod is dead.


Title
Remove grpc.WaitForReady to make eviction handler more robust
Description
What changed:
Removed WaitForReady(true) from EvaluateFunction call options — no longer needed since pods in cache are guaranteed reachable.
Added waitForGrpcReady function in podmanager.go that calls cc.Connect() and waits for the gRPC connection state to reach READY before marking a pod as available.
Added skipGrpcReadyCheck field to podManager struct for test skipping.
Eviction handler now checks k8s pod state (Get + status check) instead of unconditionally removing — keeps healthy Running pods, only evicts if pod is gone/Failed/deleting.
Eviction deletes only the Pod, not the Service. Cleaned up by GC.
Why it’s needed:
WaitForReady(true) on retries blocked indefinitely on dead pods (whose IP was unreachable), causing 4+ minute hangs until parent context timeout
Without waitForGrpcReady, pods were added to cache before their gRPC server was listening, causing immediate "connection refused" on first use and triggering spurious evictions of healthy starting pods
Unconditional eviction (old handler) deleted services during burst scale-up, causing cascading "service not found" failures for new pods trying to reuse the same service name
How it works:
During pod creation, waitForGrpcReady triggers cc.Connect() and polls connection state transitions until READY or podReadyTimeout (60s) expires — pod only enters cache once gRPC is confirmed reachable
Since all cached pods have verified connections, Unavailable means the pod genuinely died — no ambiguity with starting pods
Eviction handler does a single k8s Get on the specific pod: if not found, Failed, or DeletionTimestamp set → removes from cache + deletes pod in background. If still Running → keeps it (false positive from transient issue, GC handles within 1 scan interval(default))
Only the Pod is deleted during eviction — Service persists for reuse by retrieveOrCreateService on the next scale-up
Related Issue(s)
Type of Change
Checklist
Testing Instructions (Optional)
Additional Notes (Optional)
AI Disclosure
If so, please describe how: