Remove grpc.WaitForReady to make eviction handler more robust by kushnaidu · Pull Request #1069 · kptdev/porch

kushnaidu · 2026-06-23T07:28:03Z

Title

Remove grpc.WaitForReady to make eviction handler more robust

Description

What changed:
Removed WaitForReady(true) from EvaluateFunction call options — no longer needed since pods in cache are guaranteed reachable.
Added waitForGrpcReady function in podmanager.go that calls cc.Connect() and waits for the gRPC connection state to reach READY before marking a pod as available.
Added skipGrpcReadyCheck field to podManager struct for test skipping.
Eviction handler now checks k8s pod state (Get + status check) instead of unconditionally removing — keeps healthy Running pods, only evicts if pod is gone/Failed/deleting.
Eviction deletes only the Pod, not the Service. Cleaned up by GC.
Why it’s needed:
WaitForReady(true) on retries blocked indefinitely on dead pods (whose IP was unreachable), causing 4+ minute hangs until parent context timeout
Without waitForGrpcReady, pods were added to cache before their gRPC server was listening, causing immediate "connection refused" on first use and triggering spurious evictions of healthy starting pods
Unconditional eviction (old handler) deleted services during burst scale-up, causing cascading "service not found" failures for new pods trying to reuse the same service name
How it works:
During pod creation, waitForGrpcReady triggers cc.Connect() and polls connection state transitions until READY or podReadyTimeout (60s) expires — pod only enters cache once gRPC is confirmed reachable
Since all cached pods have verified connections, Unavailable means the pod genuinely died — no ambiguity with starting pods
Eviction handler does a single k8s Get on the specific pod: if not found, Failed, or DeletionTimestamp set → removes from cache + deletes pod in background. If still Running → keeps it (false positive from transient issue, GC handles within 1 scan interval(default))
Only the Pod is deleted during eviction — Service persists for reuse by retrieveOrCreateService on the next scale-up

Related Issue(s)

Closes/Fixes #

Type of Change

Checklist

Testing Instructions (Optional)

Additional Notes (Optional)

Known issues:
Further improvements:
Review notes:

AI Disclosure

I have used AI in the creation of this PR.

If so, please describe how:

Microsoft Copilot to analyse the code.
Kiro to generate unit tests and code analysis.

netlify · 2026-06-23T07:28:08Z

✅ Deploy Preview for kpt-porch ready!

Name	Link
🔨 Latest commit	`5d98dc1`
🔍 Latest deploy log	https://app.netlify.com/projects/kpt-porch/deploys/6a3ab231b4014e0008ebf1d4
😎 Deploy Preview	https://deploy-preview-1069--kpt-porch.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.
🤖 Make changes	Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

Copilot

Pull request overview

This PR aims to eliminate long hangs and reduce false-positive evictions in the function-runner pod cache by removing grpc.WaitForReady(true) usage and instead gating pod availability on gRPC connection readiness, while making eviction behavior consult actual Kubernetes Pod state.

Changes:

Added a gRPC readiness wait (waitForGrpcReady) during pod creation, with a test-only skip flag.
Removed grpc.WaitForReady(true) from EvaluateFunction RPCs.
Updated eviction handling to check the Kubernetes Pod state before removing from cache and to delete only Pods (not Services); updated/added related tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`func/internal/podmanager.go`	Adds `waitForGrpcReady` and uses it during pod creation (with test skip flag).
`func/internal/podevaluator.go`	Removes `WaitForReady` and updates retry rationale around `Unavailable`.
`func/internal/podcachemanager.go`	Changes eviction to consult Kubernetes Pod state and delete only Pods.
`func/internal/podevaluator_podmanager_test.go`	Sets `skipGrpcReadyCheck` for tests.
`func/internal/podevaluator_podcachemanager_test.go`	Sets `skipGrpcReadyCheck` for tests.
`func/internal/podcachemanager_eventloop_test.go`	Refactors eviction tests into a table-driven test and sets `skipGrpcReadyCheck`.

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

sonarqubecloud · 2026-06-23T08:48:32Z

Quality Gate failed

Failed conditions
75.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

rendre-greyling · 2026-06-24T07:39:58Z

+				} else if k8sPod.Status.Phase != corev1.PodRunning || k8sPod.DeletionTimestamp != nil {
+					klog.Infof("Evicting dead pod %s from cache for image %s (Unavailable)", evict.podKey.Name, evict.image)
+					if fn.pods[idx].grpcConnection != nil {
+						fn.pods[idx].grpcConnection.Close()
+					}
+					pcm.DeletePodInBackground(k8sPod)
+					fn.pods = slices.Delete(fn.pods, idx, idx+1)
+				}


Just a question on this, if the Pod is Running, don't we end up leaving the same dead grpcConnection in the cache?

Yes, it will be dead and the grpc will timeout eventually and let go of it. We don't want to close it because it might crash other parallel in RPC calls. This can happen especially when the liveliness probe makes the pod go down under high load. However, the GC as well on its next scan interval will clean up if the pod is dead.

kushnaidu requested review from a team and Copilot June 23, 2026 07:28

kushnaidu requested review from efiacor, kispaljr, liamfallon and mozesl-nokia as code owners June 23, 2026 07:28

dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Jun 23, 2026

Copilot started reviewing on behalf of kushnaidu June 23, 2026 07:28 View session

Remove grpc.WaitForReady to make eviction handler more robust

0a07281

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

kushnaidu force-pushed the evict-pod-on-demand2 branch from 9b9c314 to 0a07281 Compare June 23, 2026 07:29

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread func/internal/podevaluator.go

Comment thread func/internal/podcachemanager.go

Comment thread func/internal/podmanager.go

Address copilot comments

007b0cd

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

Copilot AI review requested due to automatic review settings June 23, 2026 07:55

Copilot started reviewing on behalf of kushnaidu June 23, 2026 07:56 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Address copilot comments and add waitForGrpcReady before podReadyCh

ab9ab40

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

kushnaidu requested a review from rendre-greyling June 23, 2026 08:47

kushnaidu self-assigned this Jun 23, 2026

kushnaidu added the do-not-merge/hold #ededed label Jun 23, 2026

Increment concurrency counter in redistributeLoad

bb47571

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

Copilot AI review requested due to automatic review settings June 23, 2026 16:09

kushnaidu removed the do-not-merge/hold #ededed label Jun 23, 2026

Copilot started reviewing on behalf of kushnaidu June 23, 2026 16:10 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Comment thread func/internal/podcachemanager.go

Comment thread func/internal/podcachemanager.go

Address copilot comments

5d98dc1

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>

rendre-greyling reviewed Jun 24, 2026

View reviewed changes

rendre-greyling approved these changes Jun 24, 2026

View reviewed changes

dosubot Bot added the lgtm #ededed label Jun 24, 2026

kushnaidu added the do-not-merge/hold #ededed label Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove grpc.WaitForReady to make eviction handler more robust#1069

Remove grpc.WaitForReady to make eviction handler more robust#1069
kushnaidu wants to merge 5 commits into
kptdev:mainfrom
Nordix:evict-pod-on-demand2

kushnaidu commented Jun 23, 2026 •

edited

Loading

Uh oh!

netlify Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

rendre-greyling Jun 24, 2026

Uh oh!

kushnaidu Jun 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kushnaidu commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Title

Description

Related Issue(s)

Type of Change

Checklist

Testing Instructions (Optional)

Additional Notes (Optional)

AI Disclosure

Uh oh!

netlify Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for kpt-porch ready!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 23, 2026

Quality Gate failed

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

rendre-greyling Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

kushnaidu Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kushnaidu commented Jun 23, 2026 •

edited

Loading

netlify Bot commented Jun 23, 2026 •

edited

Loading

kushnaidu Jun 24, 2026 •

edited

Loading