Skip to content

Remove grpc.WaitForReady to make eviction handler more robust#1069

Open
kushnaidu wants to merge 5 commits into
kptdev:mainfrom
Nordix:evict-pod-on-demand2
Open

Remove grpc.WaitForReady to make eviction handler more robust#1069
kushnaidu wants to merge 5 commits into
kptdev:mainfrom
Nordix:evict-pod-on-demand2

Conversation

@kushnaidu

@kushnaidu kushnaidu commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Title

Remove grpc.WaitForReady to make eviction handler more robust


Description

  • What changed:
    Removed WaitForReady(true) from EvaluateFunction call options — no longer needed since pods in cache are guaranteed reachable.
    Added waitForGrpcReady function in podmanager.go that calls cc.Connect() and waits for the gRPC connection state to reach READY before marking a pod as available.
    Added skipGrpcReadyCheck field to podManager struct for test skipping.
    Eviction handler now checks k8s pod state (Get + status check) instead of unconditionally removing — keeps healthy Running pods, only evicts if pod is gone/Failed/deleting.
    Eviction deletes only the Pod, not the Service. Cleaned up by GC.

  • Why it’s needed:
    WaitForReady(true) on retries blocked indefinitely on dead pods (whose IP was unreachable), causing 4+ minute hangs until parent context timeout
    Without waitForGrpcReady, pods were added to cache before their gRPC server was listening, causing immediate "connection refused" on first use and triggering spurious evictions of healthy starting pods
    Unconditional eviction (old handler) deleted services during burst scale-up, causing cascading "service not found" failures for new pods trying to reuse the same service name

  • How it works:
    During pod creation, waitForGrpcReady triggers cc.Connect() and polls connection state transitions until READY or podReadyTimeout (60s) expires — pod only enters cache once gRPC is confirmed reachable
    Since all cached pods have verified connections, Unavailable means the pod genuinely died — no ambiguity with starting pods
    Eviction handler does a single k8s Get on the specific pod: if not found, Failed, or DeletionTimestamp set → removes from cache + deletes pod in background. If still Running → keeps it (false positive from transient issue, GC handles within 1 scan interval(default))
    Only the Pod is deleted during eviction — Service persists for reuse by retrieveOrCreateService on the next scale-up


Related Issue(s)

  • Closes/Fixes #

Type of Change

  • Bug fix
  • New feature
  • Enhancement
  • Refactor
  • Documentation
  • Tests
  • Other: ________

Checklist

  • Code follows project style guidelines
  • Self-reviewed changes
  • Tests added/updated
  • Documentation added/updated
  • All tests and gating checks pass

Testing Instructions (Optional)


Additional Notes (Optional)

  • Known issues:
  • Further improvements:
  • Review notes:

AI Disclosure

  • I have used AI in the creation of this PR.

If so, please describe how:

  • Microsoft Copilot to analyse the code.
  • Kiro to generate unit tests and code analysis.

@kushnaidu kushnaidu requested review from a team and Copilot June 23, 2026 07:28
@netlify

netlify Bot commented Jun 23, 2026

Copy link
Copy Markdown

Deploy Preview for kpt-porch ready!

Name Link
🔨 Latest commit 5d98dc1
🔍 Latest deploy log https://app.netlify.com/projects/kpt-porch/deploys/6a3ab231b4014e0008ebf1d4
😎 Deploy Preview https://deploy-preview-1069--kpt-porch.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Jun 23, 2026
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
@kushnaidu kushnaidu force-pushed the evict-pod-on-demand2 branch from 9b9c314 to 0a07281 Compare June 23, 2026 07:29

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to eliminate long hangs and reduce false-positive evictions in the function-runner pod cache by removing grpc.WaitForReady(true) usage and instead gating pod availability on gRPC connection readiness, while making eviction behavior consult actual Kubernetes Pod state.

Changes:

  • Added a gRPC readiness wait (waitForGrpcReady) during pod creation, with a test-only skip flag.
  • Removed grpc.WaitForReady(true) from EvaluateFunction RPCs.
  • Updated eviction handling to check the Kubernetes Pod state before removing from cache and to delete only Pods (not Services); updated/added related tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
func/internal/podmanager.go Adds waitForGrpcReady and uses it during pod creation (with test skip flag).
func/internal/podevaluator.go Removes WaitForReady and updates retry rationale around Unavailable.
func/internal/podcachemanager.go Changes eviction to consult Kubernetes Pod state and delete only Pods.
func/internal/podevaluator_podmanager_test.go Sets skipGrpcReadyCheck for tests.
func/internal/podevaluator_podcachemanager_test.go Sets skipGrpcReadyCheck for tests.
func/internal/podcachemanager_eventloop_test.go Refactors eviction tests into a table-driven test and sets skipGrpcReadyCheck.

Comment thread func/internal/podevaluator.go
Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podmanager.go
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Copilot AI review requested due to automatic review settings June 23, 2026 07:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podevaluator.go
Comment thread func/internal/podevaluator_podmanager_test.go Outdated
Comment thread func/internal/podcachemanager_eventloop_test.go
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
@kushnaidu kushnaidu self-assigned this Jun 23, 2026
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
75.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Copilot AI review requested due to automatic review settings June 23, 2026 16:09
@kushnaidu kushnaidu removed the do-not-merge/hold #ededed label Jun 23, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podcachemanager.go
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Comment on lines +235 to +242
} else if k8sPod.Status.Phase != corev1.PodRunning || k8sPod.DeletionTimestamp != nil {
klog.Infof("Evicting dead pod %s from cache for image %s (Unavailable)", evict.podKey.Name, evict.image)
if fn.pods[idx].grpcConnection != nil {
fn.pods[idx].grpcConnection.Close()
}
pcm.DeletePodInBackground(k8sPod)
fn.pods = slices.Delete(fn.pods, idx, idx+1)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question on this, if the Pod is Running, don't we end up leaving the same dead grpcConnection in the cache?

@kushnaidu kushnaidu Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will be dead and the grpc will timeout eventually and let go of it. We don't want to close it because it might crash other parallel in RPC calls. This can happen especially when the liveliness probe makes the pod go down under high load. However, the GC as well on its next scan interval will clean up if the pod is dead.

@dosubot dosubot Bot added the lgtm #ededed label Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold #ededed lgtm #ededed size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants