feat(sdk): Go SDK for in-process SeiNetwork/SeiNode orchestration (WS-E) #421

Merged
bdchatham merged 8 commits into main from feat/wse-sdk-in-controller
Jun 20, 2026

Conversation

@bdchatham

Collaborator

What

Lands the WS-E integration refactor — a Go SDK under sdk/ — so CICD chaos/test harnesses can provision a genesis SeiNetwork + a follower SeiNode fleet, wait for readiness, read endpoints, and tear down in one Go program (replacing the bash in platform's k8s_nightly).

API (one-way-door surface)

  • sei.Open(ctx, name) (*Client, error): database/sql-style; provider registry with blank-import flavor selection (_ ".../sdk/sei/provider/k8s"), env precedence (explicit → SEI_PROVIDER → presence: SEI_NODE_CLUSTER⇒k8s / SEI_LOCAL⇒local; both-set fail-fast).
  • ProvisionNetwork / ProvisionFleet / typed Endpoints() (per-pod, read verbatim from .status.endpoint — never reconstructed) / idempotent Teardown. Class error enum + IsTimeout/IsFailed.
  • Canonical readiness probe (the SDK is now the single source of truth): the Sei CometBFT unwrapped-envelope /status decode + the consensus-honest gate (height>1 AND catching_up==false) then eth_blockNumber. Canonical sei.io/role=node / sei.io/seinetwork label constants.
  • provider/k8s (SSA + label stamping + .status.endpoint reads + serial fan-out); provider/local (registered stub); cmd/sei thin up shell that dogfoods the API.
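The registry and env precedence described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `Provider`, `stubProvider`, `envFrom`, `resolveMode`, and the lowercase `open` are hypothetical names, and the real `sei.Open` also takes a `context.Context` and returns a `*Client`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Provider is a hypothetical stand-in for whatever interface the SDK's
// providers implement; only the registry mechanics matter here.
type Provider interface{ Mode() string }

type stubProvider string

func (s stubProvider) Mode() string { return string(s) }

var (
	mu        sync.RWMutex
	providers = map[string]Provider{}
)

// Register is what a provider package would call from init(), which is why a
// blank import alone is enough to make that flavor selectable.
func Register(name string, p Provider) {
	mu.Lock()
	defer mu.Unlock()
	if _, dup := providers[name]; dup {
		panic("sei: Register called twice for provider " + name)
	}
	providers[name] = p
}

// envFrom adapts a map to a getenv-style lookup so the precedence logic can
// be exercised without touching the process environment.
func envFrom(m map[string]string) func(string) string {
	return func(k string) string { return m[k] }
}

// resolveMode applies the precedence from the description: explicit name,
// then SEI_PROVIDER, then presence of SEI_NODE_CLUSTER or SEI_LOCAL.
func resolveMode(explicit string, getenv func(string) string) (string, error) {
	if explicit != "" {
		return explicit, nil
	}
	if v := getenv("SEI_PROVIDER"); v != "" {
		return v, nil
	}
	cluster := getenv("SEI_NODE_CLUSTER") != ""
	local := getenv("SEI_LOCAL") != ""
	switch {
	case cluster && local:
		return "", errors.New("sei: both SEI_NODE_CLUSTER and SEI_LOCAL are set")
	case cluster:
		return "k8s", nil
	case local:
		return "local", nil
	}
	return "", errors.New("sei: no provider selected")
}

// open resolves the mode and looks it up in the registry.
func open(explicit string, getenv func(string) string) (Provider, error) {
	mode, err := resolveMode(explicit, getenv)
	if err != nil {
		return nil, err
	}
	mu.RLock()
	p, ok := providers[mode]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("sei: unknown provider %q (forgotten blank import?)", mode)
	}
	return p, nil
}

func init() {
	// In the real layout each provider package would do this in its own init().
	Register("k8s", stubProvider("k8s"))
	Register("local", stubProvider("local"))
}

func main() {
	p, err := open("", envFrom(map[string]string{"SEI_NODE_CLUSTER": "1"}))
	fmt.Println(p, err)
}
```

Injecting the getenv function keeps the both-set fail-fast case testable without mutating process state.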

Why in-module (not a standalone repo)

Per the decision to land in an existing repo: importing api/v1alpha1 is now in-module — no cross-module version skew, no replace. seitask lives here too, so its convergence onto the canonical probe is a trivial in-module import (follow-up issue). External importers (seictl/harnesses) pull the full controller graph — the accepted tradeoff; the api/ leaf-module split (#175) would lighten that later.

Validation

  • go build ./..., go vet ./..., go test ./... -race, golangci-lint run ./sdk/... (0 issues), gofmt -s -l sdk/ (clean), and make test over the whole module — all green.
  • SSA via ctrlclient.Apply intentionally matches the controller's internal/task pattern (one //nolint:staticcheck SA1019 with that rationale; module-wide migration tracked separately).
  • .golangci.yml: extended the existing internal/*/api/* lll/dupl exclusion to sdk/* (peer application code).

Provenance

WS-E LLD signed off (xreview RESOLVED R1→R2; SDK-canonical-probe decision); implementation idiom-reviewed clean (the cross-namespace peer-wiring bug it caught is fixed + regression-tested). Convergence follow-ups (seitask imports the canonical probe; #175; the provider-key drift guard) filed separately.

🤖 Generated with Claude Code

…tion (WS-E)

Lands the WS-E integration refactor in-module under sdk/ so CICD harnesses can
provision a genesis SeiNetwork + follower SeiNode fleet and read endpoints in
ONE Go program (replacing the bash in platform's k8s_nightly):

- sdk/sei: database/sql-style Open(ctx, name) + provider registry with
  blank-import flavor selection (SEI_NODE_CLUSTER⇒k8s, SEI_LOCAL⇒local;
  both-set fail-fast). Class error enum (Usage/Timeout/Failed/Infra) +
  IsTimeout/IsFailed. ProvisionNetwork / ProvisionFleet / typed Endpoints /
  idempotent Teardown.
- Canonical readiness probe (the SDK is the single source of truth): the
  CometBFT unwrapped-envelope /status decode + the consensus-honest gate
  (height>1 AND catching_up==false) then eth_blockNumber. Canonical
  sei.io/role=node + sei.io/seinetwork label constants.
- sdk/sei/provider/k8s: SSA apply + object-label stamping + .status.endpoint
  typed reads + serial fan-out; sdk/sei/provider/local: registered stub.
- sdk/cmd/sei: thin `up` shell dogfooding the API.
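As a rough illustration of the consensus-honest gate named above (height>1 AND catching_up==false), the sketch below decodes an unwrapped /status body. The struct and function names are hypothetical, and the JSON field names are an assumption based on CometBFT's `sync_info` layout, where the height is encoded as a string; the follow-on eth_blockNumber check is omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// statusBody models only the fields the gate needs. "Unwrapped" is taken to
// mean there is no JSON-RPC "result" envelope around the payload (assumption).
type statusBody struct {
	SyncInfo struct {
		LatestBlockHeight string `json:"latest_block_height"`
		CatchingUp        bool   `json:"catching_up"`
	} `json:"sync_info"`
}

// consensusReady reports whether the node has produced a block past height 1
// and is no longer catching up.
func consensusReady(body []byte) (bool, error) {
	var s statusBody
	if err := json.Unmarshal(body, &s); err != nil {
		return false, fmt.Errorf("decode /status: %w", err)
	}
	h, err := strconv.ParseInt(s.SyncInfo.LatestBlockHeight, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parse height %q: %w", s.SyncInfo.LatestBlockHeight, err)
	}
	return h > 1 && !s.SyncInfo.CatchingUp, nil
}

func main() {
	body := []byte(`{"sync_info":{"latest_block_height":"42","catching_up":false}}`)
	ready, err := consensusReady(body)
	fmt.Println(ready, err)
}
```

A wrapped body would leave the height empty and surface as a parse error rather than a false "not ready", which makes an envelope mismatch visible.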

In-module landing (Brandon's call): imports api/v1alpha1 directly — no
cross-module version skew, no replace. seitask convergence onto the canonical
probe is now an in-module import (tracked follow-up). External importers
(seictl/harnesses) pull the full controller graph — accepted tradeoff; the
api/ leaf-module split (#175) would lighten that later.

Design + xreview: WS-E LLD RESOLVED (R1→R2); idiom review clean (cross-namespace
peer-wiring bug fixed + verified). build/vet/test -race + golangci(sdk) + the
full module suite all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 20, 2026

PR Summary

Medium Risk
New stable public API and k8s SSA writes under sei-sdk can affect cluster objects and fleet-wide label/peer contracts; behavior is heavily tested but real-cluster impact depends on harness adoption.

Overview
Adds a new sdk/ tree with a database/sql-style API: blank-import providers, sei.Open(ctx, mode) (explicit mode or exactly one of SEI_NODE_CLUSTER / SEI_LOCAL / SEI_DOCKER), and CRUD handles CreateNetwork / GetNetwork / CreateNode / GetNode with caller-owned WaitReady and idempotent Delete. The SDK is intentionally not an orchestrator—no auto-teardown, no multi-node fan-out, no resource registry.

The k8s provider SSA-applies SeiNetwork/SeiNode under field owner sei-sdk, stamps canonical labels (sei.io/role, sei.io/seinetwork), wires peers (including NetworkNamespace for cross-namespace genesis), reads endpoints from .status, and runs light post-phase serve-probes (TM /status or EVM eth_blockNumber). local and docker register but return ErrNotImplemented. Shared constants, IsTimeout, docs (sdk/CLAUDE.md), and an Example_lifecycle compile-time example ship with broad unit/fake-client tests.

.golangci.yml extends existing dupl/lll path exclusions to sdk/*.

Reviewed by Cursor Bugbot for commit 2b96fc8.

Comment thread sdk/sei/provider/k8s/k8s.go Outdated
…default

Bugbot (High): FleetSpec.Namespace="" defaulted follower creation to the
provider default (p.defaultNS) while peer discovery targets the network's
namespace — so a network in a non-default namespace got followers created
where discovery couldn't find them, failing readiness with a misleading
ClassTimeout. spec.go documents "" => same as Network.

Default nodeNS to networkNS when FleetSpec.Namespace is empty (explicit still
wins). All downstream uses (create, wait, status re-read, fleetHandle) already
flow through nodeNS. Adds TestProvisionFleet_DefaultsNamespaceToNetwork
(fail-before/pass-after).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread sdk/sei/provider/k8s/k8s.go Outdated
Comment thread sdk/sei/provider/k8s/render.go
Two Bugbot Medium findings:
- ProvisionFleet best-effort deletes the SeiNodes it created on any error path
  after the first apply (named-error-return + deferred cleanup), so partial
  fleets don't orphan (the SDK's nodes have no Workflow ownerRef to cascade).
  The original provisioning error stays primary (IsTimeout/IsFailed still
  branch); a cleanup failure is surfaced as annotated context, never masks it.
  Tests: cleanup-on-failure (fail-before/pass-after) + annotate-not-mask.
- Remove NetworkSpec.Set / FleetSpec.Set — documented strategic-merge escape
  hatches that render* never read (silent no-op public fields). Deferred to
  when a consumer needs seictl --set parity; Overrides is the MVP config path
  (confirmed genuinely applied: genesis.overrides + spec.overrides).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread sdk/sei/provider/k8s/k8s.go Outdated
Comment thread sdk/cmd/sei/main.go Outdated
…context

Two Bugbot Medium findings (cleanup correctness):
- ProvisionNetwork now best-effort deletes the SeiNetwork it created on any
  post-apply error path (named-return + deferred cleanupNetwork), mirroring
  ProvisionFleet — networks have no owner ref, so a failed ready-wait would
  otherwise orphan one. Original error stays primary; cleanup failure annotated.
- SDK-internal rollback (cleanupFleet/cleanupNetwork) and cmd/sei's deferred
  Teardown now run under a FRESH context.Background()-derived timeout, not the
  provisioning ctx — on a deadline/SIGINT exit the provisioning ctx is already
  canceled, so reusing it made the deletes silently no-op exactly when needed.
  Teardown doc comments advise callers to pass a fresh ctx post-cancellation
  (signatures unchanged — caller owns that ctx).

Audited all Delete sites (4: 2 caller Teardowns, 2 internal rollbacks) — only
the internal ones use the fresh ctx. Tests: network cleanup-on-failure +
network/fleet cleanup-runs-on-canceled-ctx (interceptor asserts the Delete ctx
is live; non-vacuous).
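The fresh-context point can be shown with a small self-contained sketch. The names (`deleteFunc`, `ctxAwareDelete`, `cleanupFresh`, `demo`) are hypothetical, and the delete is a stand-in that simply reports whether its context is still live.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleteFunc stands in for a provider Delete call that honors its context.
type deleteFunc func(ctx context.Context) error

// ctxAwareDelete fails exactly when its context is already done, which is
// how a real client call would behave on a canceled context.
func ctxAwareDelete(ctx context.Context) error { return ctx.Err() }

// cleanupFresh runs the delete under a new, bounded context derived from
// context.Background(), so it still works after the provisioning context
// has been canceled.
func cleanupFresh(del deleteFunc, budget time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	return del(ctx)
}

// demo compares reusing the canceled provisioning context with a fresh one.
func demo() (reused, fresh error) {
	provCtx, cancel := context.WithCancel(context.Background())
	cancel() // simulate SIGINT or a deadline exit
	reused = ctxAwareDelete(provCtx)
	fresh = cleanupFresh(ctxAwareDelete, 5*time.Second)
	return reused, fresh
}

func main() {
	reused, fresh := demo()
	fmt.Println("reused ctx:", reused)
	fmt.Println("fresh ctx: ", fresh)
}
```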

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread sdk/sei/provider/k8s/k8s.go Outdated
…eout

Bugbot (Medium): poll/probeReady wrapped a canceled context (SIGINT/SIGTERM or
caller cancel) as ClassTimeout, so IsTimeout / cmd/sei exit-3 treated an
explicit abort like a readiness timeout.

Complete the error model (additive — enum unpublished): add ClassCanceled +
IsCanceled. poll/probeReady now split context.Canceled -> ClassCanceled vs
context.DeadlineExceeded -> ClassTimeout (reliable: PollUntilContextTimeout
returns Canceled on parent cancel, DeadlineExceeded on budget elapse).
cmd/sei maps a canceled error to exit 130 (distinct from timeout's 3).

Note: exit 130 becomes part of cmd/sei's exit-code contract (sibling to D4).
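A minimal sketch of the split described above, using only the standard library. The function names are hypothetical, and the exit code 1 for other failures is an assumption; only 3 (timeout) and 130 (canceled) come from this commit message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// classify separates an explicit abort from an elapsed budget.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, context.Canceled):
		return "canceled"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	default:
		return "failed"
	}
}

// exitCode maps the class to a process exit code.
func exitCode(err error) int {
	switch classify(err) {
	case "ok":
		return 0
	case "canceled":
		return 130
	case "timeout":
		return 3
	default:
		return 1 // assumption: generic failure code, not stated in the source
	}
}

// canceledErr returns a wrapped error from a context canceled by its parent.
func canceledErr() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return fmt.Errorf("wait ready: %w", ctx.Err())
}

// timeoutErr returns a wrapped error from a context whose deadline has passed.
func timeoutErr() error {
	ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(-time.Second))
	defer cancel()
	return fmt.Errorf("wait ready: %w", ctx.Err())
}

func main() {
	fmt.Println(classify(canceledErr()), exitCode(canceledErr()))
	fmt.Println(classify(timeoutErr()), exitCode(timeoutErr()))
}
```

Because the check uses `errors.Is`, the classification survives wrapping with `%w`.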

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread sdk/sei/provider/k8s/k8s.go Outdated
Comment thread sdk/sei/provider/k8s/client.go
…wline

Two Bugbot Medium findings:
- Rollback regression: ProvisionNetwork/ProvisionFleet deleted the created
  resource on ANY post-apply error, including post-Ready/post-Running re-reads
  and the readiness probe — so a transient failure after the resource was
  healthy destroyed it. Adopt a principled rule: a `provisioned` flag (set
  after waitNetworkReady/waitFleetRunning) gates the deferred cleanup to
  `err != nil && !provisioned`. Failure to come up still rolls back (round-4
  orphan case intact); a later error returns and leaves the healthy resource
  for the caller to Teardown. Doc comments state the rule.
- SA-namespace file carried a trailing newline ("nightly\n") → defaulted
  namespace 404'd on Get. TrimSpace it; defaultNamespace(saFile) made testable.
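Both fixes can be illustrated in a few lines. This is a sketch under stated assumptions: `provision`, `errBoom`, and `namespaceFromContents` are hypothetical names, the rollback is reduced to a boolean, and the namespace helper takes file contents rather than a path.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errBoom = errors.New("boom")

// provision mimics the gated rollback: the deferred cleanup fires only when
// the call failed before the resource was observed healthy.
func provision(waitErr, postErr error) (rolledBack bool, err error) {
	provisioned := false
	defer func() {
		if err != nil && !provisioned {
			rolledBack = true // stand-in for deleting the created resource
		}
	}()
	if waitErr != nil {
		return false, fmt.Errorf("wait ready: %w", waitErr)
	}
	provisioned = true // healthy from here on; later errors must not delete it
	if postErr != nil {
		return false, fmt.Errorf("post-ready re-read: %w", postErr)
	}
	return false, nil
}

// namespaceFromContents trims the trailing newline that a mounted
// service-account namespace file may carry.
func namespaceFromContents(contents string) string {
	return strings.TrimSpace(contents)
}

func main() {
	rb, err := provision(errBoom, nil)
	fmt.Println("failed to come up:", rb, err)
	rb, err = provision(nil, errBoom)
	fmt.Println("failed after ready:", rb, err)
	fmt.Printf("%q\n", namespaceFromContents("nightly\n"))
}
```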

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread sdk/cmd/sei/main.go Outdated
Per the honed purpose: the SDK is a thin, stateless, multi-mode Go-native CRUD
API for SeiNetwork/SeiNode lifecycle — NOT an orchestrator. Flow: create network
-> WaitReady -> create rpc nodes as peers -> WaitReady -> run tests against the
returned handles. Cleanup/GC/rollback/composition belong to the caller.

- Open(ctx, mode) selects k8s|local|docker (arg, else exactly one of
  SEI_NODE_CLUSTER/SEI_LOCAL/SEI_DOCKER; never guesses). Providers self-register
  via blank import. Stateless: holds only the mode connection, tracks no
  resources (runtime owns state).
- Mode-agnostic Network/Node handles: WaitReady (phase + ONE light serve-probe),
  endpoint getters from .status, caller-invoked Delete (SDK never auto-deletes),
  Object() any escape to the raw *v1alpha1 CR in k8s mode.
- docker provider stub added alongside local (both registered, "not implemented")
  so the mode seam exists from day one.

Removed (the orchestration the thin layer must not own): auto-rollback/cleanup
(cleanupFleet/Network, provisioned-disarm, fresh-ctx machinery), ProvisionFleet
composite + N-node fan-out + workflow-vars, cmd/sei `up`, the Class/ClassCanceled
error taxonomy (-> plain wrapped errors + IsTimeout), the heavy two-stage
catching_up+EVM gate, the typed Endpoints/FleetEndpoints leaves. Caller loops
CreateNode for N rpc nodes.

build/vet/test -race + golangci(sdk)=0 + gofmt all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread sdk/sei/provider/k8s/render.go
Bugbot (Medium): renderNode hardwired the synthesized LabelPeerSource.Namespace
to the node's own namespace, so a follower in a different namespace than its
SeiNetwork searched for genesis validators in the wrong place and never wired up.
NodeSpec had no way to express the network's namespace.

Add NodeSpec.NetworkNamespace ("" => same as Namespace, the co-located common
case); CreateNode threads it to the peer selector while the node's own
metadata.namespace stays NodeSpec.Namespace. Test: cross-ns (node rpc-ns,
NetworkNamespace genesis-ns) wires the selector to genesis-ns; co-located
default wires to the node's ns. (fail-before/pass-after)
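The defaulting rule for the peer selector reads roughly like this. `NodeSpec` below is a trimmed, hypothetical mirror of the SDK type, and `peerSelectorNamespace` is an illustrative helper name rather than the function in render.go.

```go
package main

import "fmt"

// NodeSpec carries only the fields relevant to namespace resolution.
type NodeSpec struct {
	Name             string
	Namespace        string // where the node itself is created
	NetworkNamespace string // where the SeiNetwork lives; "" means same as Namespace
}

// peerSelectorNamespace picks the namespace in which to look for genesis
// validators: the network's namespace if set, else the node's own.
func peerSelectorNamespace(s NodeSpec) string {
	if s.NetworkNamespace != "" {
		return s.NetworkNamespace
	}
	return s.Namespace
}

func main() {
	crossNS := NodeSpec{Name: "rpc-0", Namespace: "rpc-ns", NetworkNamespace: "genesis-ns"}
	coLocated := NodeSpec{Name: "rpc-1", Namespace: "rpc-ns"}
	fmt.Println(peerSelectorNamespace(crossNS))
	fmt.Println(peerSelectorNamespace(coLocated))
}
```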

Idiom nits folded: merge stray stdlib import-group split (handle.go); soften the
mode-const "MUST equal" comment to "kept in sync"; nil-guard the endpoint getters.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Comment thread sdk/sei/example_test.go
return
}
defer func() { _ = node.Delete(ctx) }()
nodes = append(nodes, node)

Loop defer deletes wrong nodes

Medium Severity

In Example_lifecycle, each loop iteration registers defer func() { _ = node.Delete(ctx) }() but every closure captures the same node variable. When defers run, only the last created node is deleted (possibly multiple times); earlier RPC nodes are left on the cluster.

@bdchatham

Collaborator Author

Bugbot's last finding ("Loop defer deletes wrong nodes", example_test.go:55) is a false positive: node is declared with := inside the loop body (line 47), so it's a fresh variable each iteration — each defer closes over its own node and all RPC nodes are deleted correctly. This is distinct from the loop-variable-capture bug (which is about i), and the module is go 1.26 where loop variables are per-iteration anyway. No change needed.
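The capture behavior argued here can be checked with a small standalone program. It is not the code from example_test.go; `deleteOrder` and the node names are hypothetical, and a string stands in for the node handle.

```go
package main

import "fmt"

// deleteOrder registers one deferred "delete" per loop iteration and returns
// the names in the order the defers actually ran.
func deleteOrder(names []string) []string {
	var deleted []string
	func() {
		for _, n := range names {
			node := n // declared with := inside the body: a fresh variable each iteration
			defer func() { deleted = append(deleted, node) }()
		}
	}() // the defers run here, last registered first
	return deleted
}

func main() {
	fmt.Println(deleteOrder([]string{"rpc-0", "rpc-1", "rpc-2"}))
}
```

Each closure sees its own `node`, so every name appears exactly once, in reverse registration order.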

Merging: the reshaped thin-CRUD SDK is design-confirmed, idiom-reviewed clean (zero findings), and CI-green; the earlier (real) Bugbot findings were all in the orchestration complexity that this reshape removed.

@bdchatham bdchatham merged commit 2523fde into main Jun 20, 2026
5 checks passed
@bdchatham bdchatham deleted the feat/wse-sdk-in-controller branch June 20, 2026 18:37