feat(dind): multi-queue support, Docker mode fixes, and gVisor sandbox support by michaellzc · Pull Request #876 · sourcegraph/deploy-sourcegraph-helm

michaellzc · 2026-06-04T02:19:00Z

related https://github.com/sourcegraph/sourcegraph/pull/12881

Summary

Multi-queue & executor improvements

Add queues list to values: when set, renders one Deployment per queue instead of the single executor Deployment; each queue merges with global executor.env (queue overrides global) and supports replicaCount and resources overrides
Add structured executor fields (frontendUrl, frontendPassword, queueName/queueNames, log, etc.) with validation that blocks managed env vars from env
Per-queue queueName/queueNames fields (deployment name suffix vs queue env var are now independent)

Security

Bind dockerd to 127.0.0.1:2375 instead of 0.0.0.0 — prevents unauthenticated Docker API access from other cluster pods

gVisor support (`dind.gVisor.enabled: true`)

Enables running executor pods on GKE node pools with sandbox-type=gvisor, providing stronger container-to-host isolation.
See: https://gvisor.dev/docs/tutorials/docker-in-gvisor/

Test plan

Deployed to dev cluster with both batches and codeintel queues
Verified executors are running in Docker mode (no RBAC errors)
Verified gVisor pods reach 2/2 Running on sandbox.gke.io/runtime=gvisor nodes with 0 restarts
helm template renders correctly for both multi-queue and single-deployment modes

🤖 Generated with Claude Code

marcleblanc2 · 2026-06-04T02:29:15Z

This syntax seems awkward to operate (easier to get wrong than to get right for a new customer's junior admin)

queues:
  - name: batches
    replicaCount: 4
    env:
      EXECUTOR_QUEUE_NAME:
        value: batches
  - name: codeintel
    replicaCount: 4
    env:
      EXECUTOR_QUEUE_NAME:
        value: codeintel

If possible, this would be simpler:

queues:
  batches:
    replicaCount: 4
  codeintel:
    replicaCount: 4

Or this, if either queue can use the same number of replicas

queues:
  batches:
  codeintel:

marcleblanc2 · 2026-06-04T02:32:26Z

Actually, to help reduce the transition cost of moving customers from the k8s Helm chart to this one, it'd be ideal if this Helm chart could read the same values from an override file which was successful with the k8s Helm chart @michaellzc

marcleblanc2 · 2026-06-04T02:41:04Z

Some examples here: https://sourcegraph.sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/customer-assets$+lang:YAML+%5Eexecutor:$&patternType=regexp&sm=0

Also to note, it's common to use one Helm values override.yaml file when deploying multiple of our Helm charts, so the values which get read from multiple charts don't need to be maintained in duplicate in separate override files.

marcleblanc2

Left comments in the PR thread re: ergonomics of using the added YAML syntax; it looks easier to get wrong than it is to get right, especially for some of our self-hosted customer admins.

- Add `queues` list to values: when set, renders one Deployment + Service per queue instead of the single executor Deployment - Each queue merges with global executor.env (queue overrides global) and supports replicaCount and resources overrides - Add global executor.resources and per-queue resources override - Fix readiness probe path: /ready does not exist, use /healthz - Set EXECUTOR_USE_KUBERNETES=false to prevent the executor from auto-detecting Kubernetes mode via KUBERNETES_SERVICE_HOST/PORT (which are always injected into pods); dind chart runs in Docker mode using the dind sidecar - Set enableServiceLinks: false to reduce noise in pod env - Fix private docker registry image tag (registry:3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

--registry-mirror was pointing at http://executor:5000 which doesn't resolve; the private registry service is private-docker-registry:5000. Also fix ConfigMap condition to render in both single and multi-queue modes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

StatefulSet gives stable pod identity and manages its own PVC via volumeClaimTemplates, replacing the separate PersistentVolumeClaim. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add first-class values for options that map directly to executor env vars, mirroring the k8s chart interface: - executor.frontendUrl/frontendPassword/frontendExistingSecret - executor.queueName/queueNames (single-deployment mode) - executor.maximumNumJobs/maximumRuntimePerJob - executor.log.level/format - executor.dockerAddHostGateway - executor.debug.keepWorkspaces In multi-queue mode, each queue's EXECUTOR_QUEUE_NAME is set automatically from its name field. Add executor.validateEnv helper that fails at render time if executor.env or any queue.env contains a managed env var name. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- executor.enabled had no meaningful use; the chart always deploys an executor, so the flag is removed - dind.daemonConfig is now a values map rendered as daemon.json, allowing arbitrary Docker daemon configuration via values overrides Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extract the full Deployment body into executor.deployment in _helpers.tpl. executor.Deployment.yaml is now pure dispatch — resolves the per-queue vs single-deployment args and delegates to the shared template. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Helm trims whitespace between range iterations, causing the --- inside executor.deployment to land on the same line as the prior document's last character. Both deployments ended up in a single YAML document, making helm track them incorrectly. Move the separator to the call site. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Each queue now supports explicit queueName (EXECUTOR_QUEUE_NAME) and queueNames (EXECUTOR_QUEUE_NAMES) fields, mirroring the global executor.queueName/queueNames. The deployment name still comes from queue.name; queueName defaults to name when omitted. The two env vars are mutually exclusive: queueNames takes precedence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Change --host=tcp://0.0.0.0:2375 to 127.0.0.1:2375 so the unauthenticated Docker API is not reachable from other cluster pods - Add dind.gVisor.enabled (default false): sets runtimeClassName: gvisor and replaces privileged: true with explicit capabilities per https://gvisor.dev/docs/tutorials/docker-in-gke-sandbox/ gVisor intercepts these capabilities in-sandbox; no host privileges are granted regardless of the capability list Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gVisor's kernel does not implement the ioctls required by overlay2, aufs, or devicemapper — all three fail with "invalid argument" or "not found". vfs is the only storage driver that works inside the gVisor sandbox. Add --storage-driver=vfs to the dockerd command when dind.gVisor.enabled is true so the daemon starts successfully on gVisor nodes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gVisor's virtual network layer exposes docker0 as a plain interface, not a bridge, causing dockerd to fail at network controller initialization. --iptables=false bypasses the bridge/netfilter setup path that requires kernel modules (bridge, br_netfilter, nf_conntrack) unavailable in gVisor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Docker 29 + gVisor requires specific setup before dockerd can start: - ip_forward must be enabled manually (gVisor doesn't auto-enable it) - iptables-legacy NAT POSTROUTING rules must be set up before dockerd since --iptables=false prevents dockerd from managing them - docker0 must be deleted before restart to avoid "existing interface docker0 is not a bridge" crash on CrashLoopBackOff recovery - --feature containerd-snapshotter=false is required for Docker 29+ on gVisor since the containerd image store needs tmpfs as upper layer - --ip6tables=false required alongside --iptables=false per gVisor docs Also bumps dind liveness/readiness initialDelaySeconds (5/10 → 15/20) to give the setup script time to run before the probe fires. References: https://gvisor.dev/docs/tutorials/docker-in-gvisor/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two bugs: 1. `ip link del docker0` in gVisor silently fails (or docker0 is pre-existing and can't be deleted), leaving a non-bridge docker0 when dockerd starts. Fix: explicitly create docker0 as a bridge (del, add type bridge, assign 172.17.0.1/16, bring up) before starting dockerd so Docker finds a proper bridge interface. 2. `sed '\s'` is not portable on Alpine's busybox sed — the \s escape is not recognized, causing the device-name variable to capture all ip-route output lines. Fix: use awk to extract the first 'dev' token from the default route. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ating gVisor's netlink implementation does not report IFLA_INFO_KIND="bridge" when querying an interface created with `ip link add type bridge`. This causes Docker's bridge driver to fail the pre-existing interface type check (link.Type() != "bridge"). The fix: don't pre-create docker0. When Docker creates it via Go's netlink library (bridgeAlreadyExists=false), the type check is skipped entirely. Keep `ip link del docker0` for restart resilience — on CrashLoopBackOff restarts, docker0 from the previous run is deleted before dockerd starts, so Docker always sees it as a fresh create. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tcpSocket probes connect from the kubelet using the pod IP, not 127.0.0.1, so they always fail when dockerd is bound to localhost only. After failureThreshold (5 × 5s = 25s past initialDelay), kubelet kills dind — leaving docker0 in the network namespace and causing the "existing interface docker0 is not a bridge" error on the next restart. Switch to exec probe using `docker -H tcp://127.0.0.1:2375 version`, which runs inside the container and can reach the loopback address. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…erwrite Instead of appending gVisor-specific flags to the dockerd command, merge them into daemon.json via the ConfigMap template using mergeOverwrite. - Add dind.gVisor.daemonConfig with the gVisor defaults: storage-driver, iptables, ip6tables, features.containerd-snapshotter - ConfigMap deep-merges gVisor.daemonConfig into dind.daemonConfig when gVisor is enabled, so users can override individual keys if needed - Exec line is now identical for both gVisor and non-gVisor paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… body Move the {{- if gVisor }} inside the command: block instead of duplicating the whole block, eliminating the repeated dockerd arguments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tls, mtu, hosts, registry-mirrors are now in dind.daemonConfig so the command is just [dockerd] (non-gVisor) or [/bin/sh, -c, setup+exec dockerd] (gVisor). No flags passed on the command line. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

filiphaftek

@michaellzc the changes are very good and makes executore more mature, I only wonder about backward compatibility - I confirmed with DS that this might break existing executors, do we fix this for customers?

marcleblanc2 · 2026-06-05T15:44:13Z

I only wonder about backward compatibility - I confirmed with DS that this might break existing executors, do we fix this for customers?

@filiphaftek I'm not aware of any customers currently using the DinD deployment method for Executors, and haven't found anything while searching for it in Slack. I'm not sure if we can decipher this via telemetry, or if a new telemetry field needs to be added.

Extract the hardcoded capabilities list and shell setup script out of the template and into dind.gVisor.securityContext and dind.gVisor.setupScript, following the same pattern as daemonConfig. Both fields are fully overridable in user values files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ommand Add dind.command (default: [dockerd]) and dind.gVisor.command (the gVisor shell wrapper) as plain YAML arrays in values. Template renders whichever is active via toYaml — no string interpolation needed. Removes the setupScript field. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

michaellzc · 2026-06-05T19:42:57Z

@michaellzc the changes are very good and makes executore more mature, I only wonder about backward compatibility - I confirmed with DS that this might break existing executors, do we fix this for customers?

there is no backward compatibility concern. the chart has been broken from day one.

for example, the executor deployment manifest has been missing .metadata.name, the registry image has a typo regisry instead of registry.

Replace `docker version` with `wget -qO- http://127.0.0.1:2375/_ping` — lighter check against Docker's purpose-built ping endpoint, using busybox wget which is always present in the Alpine-based dind image. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

some suggestion has been addressed

sourcegraph-release-bot · 2026-06-05T21:27:14Z

The backport to 7.3.x failed at https://github.com/sourcegraph/deploy-sourcegraph-helm/actions/runs/27037785081:

Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the sourcegraph/deploy-sourcegraph-helm repository.

To backport this PR manually, you can either:

Via the sg tool

Use the sg backport command to backport your commit to the release branch.

sg backport -r 7.3.x -p 876

Via your terminal

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-7.3.x 7.3.x
# Navigate to the new working tree
cd .worktrees/backport-7.3.x
# Create a new branch
git switch --create backport-876-to-7.3.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 09972342271f0863de1ed98be81ca25327d86933
# Push it to GitHub
git push --set-upstream origin backport-876-to-7.3.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-7.3.x

If you encouter conflict, first resolve the conflict and stage all files, then run the commands below:

git cherry-pick --continue
# Push it to GitHub
git push --set-upstream origin backport-876-to-7.3.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-7.3.x

Follow above instructions to backport the commit.
Create a pull request where the base branch is 7.3.x and the compare/head branch is backport-876-to-7.3.x., click here to create the pull request.

Once the pull request has been created, please ensure the following:

Make sure to tag @sourcegraph/release in the pull request description.
kindly remove the release-blocker from this pull request.

…and gVisor sandbox support (#879) Backport 0997234 from #876 related sourcegraph/sourcegraph#12881 ## Summary ### Multi-queue & executor improvements - Add `queues` list to values: when set, renders one Deployment per queue instead of the single executor Deployment; each queue merges with global `executor.env` (queue overrides global) and supports `replicaCount` and `resources` overrides - Add structured `executor` fields (`frontendUrl`, `frontendPassword`, `queueName`/`queueNames`, `log`, etc.) with validation that blocks managed env vars from `env` - Per-queue `queueName`/`queueNames` fields (deployment name suffix vs queue env var are now independent) ### Security - Bind dockerd to `127.0.0.1:2375` instead of `0.0.0.0` — prevents unauthenticated Docker API access from other cluster pods ### gVisor support (`dind.gVisor.enabled: true`) Enables running executor pods on GKE node pools with `sandbox-type=gvisor`, providing stronger container-to-host isolation. See: https://gvisor.dev/docs/tutorials/docker-in-gvisor/ ## Test plan - [x] Deployed to dev cluster with both `batches` and `codeintel` queues - [x] Verified executors are running in Docker mode (no RBAC errors) - [x] Verified gVisor pods reach `2/2 Running` on `sandbox.gke.io/runtime=gvisor` nodes with 0 restarts - [x] `helm template` renders correctly for both multi-queue and single-deployment modes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Michael Lin <mlzc@hey.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

michaellzc force-pushed the michaellzc/dind-multi-queue branch from eca8932 to 33125ec Compare June 4, 2026 02:21

michaellzc requested review from a team June 4, 2026 02:22

marcleblanc2 previously requested changes Jun 4, 2026

View reviewed changes

DaedalusG approved these changes Jun 4, 2026

View reviewed changes

michaellzc requested review from a team and DaedalusG June 5, 2026 02:34

michaellzc marked this pull request as draft June 5, 2026 04:22

michaellzc marked this pull request as ready for review June 5, 2026 05:52

michaellzc changed the title ~~feat(dind): multi-queue support and Docker mode fixes~~ feat(dind): multi-queue support, Docker mode fixes, and gVisor sandbox support Jun 5, 2026

michaellzc and others added 16 commits June 4, 2026 22:55

remove executor Service — executors are pull-based, no Service needed

da47321

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(dind): convert private-docker-registry to StatefulSet

b3ad090

StatefulSet gives stable pod identity and manages its own PVC via volumeClaimTemplates, replacing the separate PersistentVolumeClaim. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

remove dind.override.yaml from tracking

00f3227

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: gitignore dind.override.yaml

d0d5ae3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix up

ecc6ea6

michaellzc and others added 9 commits June 4, 2026 22:55

debug: add set -x to gVisor dind startup script

5d37aae

refactor(dind): unify command block — single declaration, conditional…

ecefa73

… body Move the {{- if gVisor }} inside the command: block instead of duplicating the whole block, eliminating the repeated dockerd arguments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix up

2396cf9

fix up

365f89a

michaellzc force-pushed the michaellzc/dind-multi-queue branch from 0aae929 to 365f89a Compare June 5, 2026 05:55

filiphaftek approved these changes Jun 5, 2026

View reviewed changes

michaellzc and others added 2 commits June 5, 2026 09:12

michaellzc added 2 commits June 5, 2026 12:55

fix up

5fff990

fix up

2530b2b

michaellzc added the backport 7.3.x Backport to 7.3.x release branch label Jun 5, 2026

michaellzc enabled auto-merge (squash) June 5, 2026 20:13

michaellzc merged commit 0997234 into main Jun 5, 2026
5 checks passed

michaellzc deleted the michaellzc/dind-multi-queue branch June 5, 2026 20:14

sourcegraph-release-bot mentioned this pull request Jun 5, 2026

[Backport 7.3.x] feat(dind): multi-queue support, Docker mode fixes, and gVisor sandbox support #879

Merged

4 tasks

sourcegraph-release-bot added backports release-blocker failed-backport-to-7.3.x labels Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(dind): multi-queue support, Docker mode fixes, and gVisor sandbox support#876

feat(dind): multi-queue support, Docker mode fixes, and gVisor sandbox support#876
michaellzc merged 30 commits into
mainfrom
michaellzc/dind-multi-queue

michaellzc commented Jun 4, 2026 •

edited

Loading

Uh oh!

marcleblanc2 commented Jun 4, 2026 •

edited

Loading

Uh oh!

marcleblanc2 commented Jun 4, 2026 •

edited

Loading

Uh oh!

marcleblanc2 commented Jun 4, 2026 •

edited

Loading

Uh oh!

marcleblanc2 left a comment

Uh oh!

filiphaftek left a comment

Uh oh!

marcleblanc2 commented Jun 5, 2026

Uh oh!

michaellzc commented Jun 5, 2026 •

edited

Loading

Uh oh!

Uh oh!

sourcegraph-release-bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

michaellzc commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Multi-queue & executor improvements

Security

gVisor support (dind.gVisor.enabled: true)

Test plan

Uh oh!

marcleblanc2 commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcleblanc2 commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcleblanc2 commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcleblanc2 left a comment

Choose a reason for hiding this comment

Uh oh!

filiphaftek left a comment

Choose a reason for hiding this comment

Uh oh!

marcleblanc2 commented Jun 5, 2026

Uh oh!

michaellzc commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

sourcegraph-release-bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

michaellzc commented Jun 4, 2026 •

edited

Loading

gVisor support (`dind.gVisor.enabled: true`)

marcleblanc2 commented Jun 4, 2026 •

edited

Loading

marcleblanc2 commented Jun 4, 2026 •

edited

Loading

marcleblanc2 commented Jun 4, 2026 •

edited

Loading

michaellzc commented Jun 5, 2026 •

edited

Loading