ROX-35000: Add optional OpenShift Virtualization support to openshift-4 flavor#1855
ROX-35000: Add optional OpenShift Virtualization support to openshift-4 flavor#1855vikin91 wants to merge 8 commits into
Conversation
Co-authored-by: Cursor <cursoragent@cursor.com>
…rtifact Co-authored-by: Cursor <cursoragent@cursor.com>
…artifact Co-authored-by: Cursor <cursoragent@cursor.com>
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds OpenShift Virtualization (KubeVirt) VM provisioning capability to the infra server. It exposes virtualization parameters in both openshift-4 flavors, implements a three-stage workflow that provisions virt nodes, installs the HyperConverged operator, and creates KubeVirt VirtualMachines with VSOCK and registry pull secret integration. ChangesOpenShift Virtualization (KubeVirt) provisioning
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
A single node development cluster (infra-pr-1855) was allocated in production infra for this PR. CI will attempt to deploy 🔌 You can connect to this cluster with: 🛠️ And pull infractl from the deployed dev infra-server with: 🚲 You can then use the dev infra instance e.g.: Further Development☕ If you make changes, you can commit and push and CI will take care of updating the development cluster. 🚀 If you only modify configuration (chart/infra-server/configuration) or templates (chart/infra-server/{static,templates}), you can get a faster update with: LogsLogs for the development infra depending on your @redhat.com authuser: Or: |
|
@coderabbitai full review |
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@chart/infra-server/static/workflow-openshift-4.yaml`:
- Around line 591-611: The loop waiting for VMI (using VM_NAME, NAMESPACE,
PHASE, timeout, elapsed) currently only logs a warning on timeout and exits 0;
change it so that when elapsed >= timeout you first persist any needed debugging
artifacts (e.g., oc get vmi -o yaml for "$VM_NAME" and logs) and then exit
non-zero to fail the virt path (use a descriptive error log and exit 1). Locate
the while-true block that checks PHASE=$(oc get vmi "$VM_NAME" -n "$NAMESPACE"
-o jsonpath='{.status.phase}') and replace the timeout branch to collect/persist
debug output before calling exit 1 so the workflow fails instead of succeeding.
- Around line 493-519: The script currently generates VM_PASSWORD and rewrites
vm-access.md before checking whether the VM already exists, causing
stale/unapplied credentials to be published; change the flow in the VM creation
logic so that VM_PASSWORD is only generated and vm-access.md is only
created/overwritten after confirming the VM does not already exist (i.e., move
the VM_PASSWORD assignment and the vm-access.md write block to run inside the
branch where the VM will be created), and if you must support resume/retry, read
any existing credential artifact or the VM/secret first and skip
regenerating/writing when credentials already exist.
- Around line 444-455: After patching KVM_EMULATION in install-virt-operator,
add a reconciliation wait before the create-vm step: poll the HyperConverged
resource (e.g., name hyperconverged-cluster or the HCO resource used in this
flow) in namespace openshift-cnv and loop until its status.conditions show
Available=True, Progressing=False and Degraded=False (and/or confirm the
subscription/env update has taken effect), with a timeout and backoff; only
proceed to the create-vm step when those health checks pass to ensure
KubeVirt/HCO has reconciled the new KVM_EMULATION setting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 656e2693-b4ae-4d7e-9497-ae0c3ec2dad1
📒 Files selected for processing (2)
chart/infra-server/static/flavors.yamlchart/infra-server/static/workflow-openshift-4.yaml
| # Patch KVM_EMULATION | ||
| current_kvm=$(oc get subscription kubevirt-hyperconverged -n openshift-cnv -o jsonpath='{.spec.config.env[?(@.name=="KVM_EMULATION")].value}' 2>/dev/null || echo "") | ||
| if [ "$current_kvm" = "true" ]; then | ||
| echo "KVM_EMULATION already set" | ||
| else | ||
| echo "Patching subscription with KVM_EMULATION..." | ||
| oc patch subscription kubevirt-hyperconverged \ | ||
| -n openshift-cnv \ | ||
| --type=merge \ | ||
| -p '{"spec":{"config":{"selector":{"matchLabels":{"name":"hyperconverged-cluster-operator"}},"env":[{"name":"KVM_EMULATION","value":"true"}]}}}' | ||
| echo "KVM_EMULATION patched" | ||
| fi |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In OpenShift Virtualization / HyperConverged Operator, after setting KVM_EMULATION=true through the Subscription config, what condition should be waited on before creating a VM?
💡 Result:
When you update the Subscription for the HyperConverged Operator (HCO) to include KVM_EMULATION=true, the operator reconciles this change by updating the underlying KubeVirt configuration [1][2]. You should wait for the HyperConverged Cluster Operator (HCO) to complete its reconciliation and for the KubeVirt custom resource (CR) to reflect the updated configuration [2]. Specifically, you should verify that the HCO is in a ready state and that the KubeVirt operand has successfully processed the new configuration [2]. While there is no single "KVM_EMULATION_READY" status condition to watch for, you can confirm the state by ensuring the HyperConverged operator pod(s) have finished reconciling and checking the KubeVirt status or associated virt-handler logs to ensure the system is correctly identifying the emulation mode (or lack of KVM) [3][4][5]. In many environments, these changes are applied immediately by the operator [2].
Citations:
- 1: https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/deploy/kustomize/kvm_emulation/subscription.patch.yaml
- 2: https://patriciocerda.com/?p=1629
- 3: https://access.redhat.com/solutions/7092883
- 4: kubevirt/kubevirt@0fb325a
- 5: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1992608
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
ls -l "$FILE"
sed -n '1,220p' "$FILE" | nl -ba | head -n 220
echo "----"
sed -n '220,420p' "$FILE" | nl -ba | head -n 200
echo "----"
sed -n '420,520p' "$FILE" | nl -ba
echo "----"
# Find where create-vm template defined and where it is invoked
rg -n "template: create-vm|name: create-vm|create-vm" "$FILE"
rg -n "KVM_EMULATION" "$FILE"
rg -n "install-virt-operator|template: install-virt-operator|name: install-virt-operator" "$FILE"Repository: stackrox/infra
Length of output: 221
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
echo "== file =="
wc -l "$FILE"
sed -n '1,140p' "$FILE" | nl -ba
echo "== mid =="
sed -n '140,340p' "$FILE" | nl -ba
echo "== end region around patch =="
sed -n '400,520p' "$FILE" | nl -ba
echo "== rg sections =="
rg -n "KVM_EMULATION|create-vm|install-virt-operator|create-vm" "$FILE"Repository: stackrox/infra
Length of output: 186
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
echo "== file =="
ls -l "$FILE"
wc -l "$FILE"
echo "== head (1-160) =="
sed -n '1,160p' "$FILE" | nl -ba
echo "== mid (160-360) =="
sed -n '160,360p' "$FILE" | nl -ba
echo "== around patch (420-510) =="
sed -n '420,510p' "$FILE" | nl -ba
echo "== around templates (scan more) =="
sed -n '360,440p' "$FILE" | nl -ba
echo "== rg matches =="
rg -n "KVM_EMULATION|create-vm|install-virt-operator|create-vm:|template: create-vm|template: install-virt-operator|name: create-vm|name: install-virt-operator" "$FILE"Repository: stackrox/infra
Length of output: 307
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
echo "== file =="
ls -l "$FILE"
wc -l "$FILE"
echo "== head (1-160) =="
sed -n '1,160p' "$FILE" | nl -ba
echo "== mid (160-360) =="
sed -n '160,360p' "$FILE" | nl -ba
echo "== around patch (420-510) =="
sed -n '420,510p' "$FILE" | nl -ba
echo "== around templates (360-440) =="
sed -n '360,440p' "$FILE" | nl -ba
echo "== rg matches =="
rg -n "KVM_EMULATION|create-vm|install-virt-operator|template: create-vm|template: install-virt-operator|name: create-vm|name: install-virt-operator" "$FILE"Repository: stackrox/infra
Length of output: 307
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
echo "== rg matches =="
rg -n "template: create-vm|name: create-vm|template: install-virt-operator|name: install-virt-operator|KVM_EMULATION|create-vm" "$FILE" || true
echo "== patch area (420-520) with line numbers via awk =="
sed -n '420,520p' "$FILE" | awk '{printf "%d\t%s\n", NR+419, $0}'
echo "== earlier context (360-440) =="
sed -n '360,440p' "$FILE" | awk '{printf "%d\t%s\n", NR+359, $0}'
echo "== find where create-vm is called/invoked (search for 'template: create-vm' blocks) =="
rg -n "template:\s*create-vm" "$FILE" || true
rg -n "name:\s*create-vm" "$FILE" || trueRepository: stackrox/infra
Length of output: 10137
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="chart/infra-server/static/workflow-openshift-4.yaml"
# Print the workflow steps section where install-virt-operator and create-vm are invoked
sed -n '1,120p' "$FILE" | awk '{printf "%d\t%s\n", NR, $0}'
echo "----"
# Also print around the invocation list lines ~50-90 if needed
sed -n '40,95p' "$FILE" | awk '{printf "%d\t%s\n", NR, $0}'Repository: stackrox/infra
Length of output: 5122
Add a post-KVM_EMULATION reconciliation wait before create-vm.
install-virt-operator patches KVM_EMULATION at the end (lines ~444–455) and the workflow then immediately runs create-vm (lines ~67–74 in the start steps). Add a second wait after the patch to ensure the HyperConverged Operator/KubeVirt has reconciled the updated subscription config (there’s no single KVM_EMULATION_READY condition to watch; re-check HCO’s health conditions like Available=True, Progressing=False, Degraded=False after the patch, and/or verify the updated operand state).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@chart/infra-server/static/workflow-openshift-4.yaml` around lines 444 - 455,
After patching KVM_EMULATION in install-virt-operator, add a reconciliation wait
before the create-vm step: poll the HyperConverged resource (e.g., name
hyperconverged-cluster or the HCO resource used in this flow) in namespace
openshift-cnv and loop until its status.conditions show Available=True,
Progressing=False and Degraded=False (and/or confirm the subscription/env update
has taken effect), with a timeout and backoff; only proceed to the create-vm
step when those health checks pass to ensure KubeVirt/HCO has reconciled the new
KVM_EMULATION setting.
There was a problem hiding this comment.
Good point. The install-virt-operator template already waits for HCO health (Available/Progressing/Degraded) before patching KVM_EMULATION, and the create-vm template waits up to 5 minutes for the VMI to reach Running (which will now fail hard on timeout). If the operator hasn't reconciled KVM_EMULATION by then, the VMI wait will catch it and fail. I'll defer an additional reconciliation loop as defense-in-depth to a follow-up if we see flakiness in practice.
There was a problem hiding this comment.
@vikin91, that reasoning makes sense — the VMI Running wait does act as a practical gate: if KubeVirt hasn't reconciled KVM_EMULATION in time, the VM won't come up and the step will fail, surfacing the problem. The deferral is a reasonable call.
Would you like me to open a follow-up GitHub issue to track adding the explicit post-patch HCO reconciliation wait as defense-in-depth, so it doesn't get lost?
| VM_PASSWORD=$(openssl rand -hex 10) | ||
|
|
||
| echo "=== Creating VM: $VM_NAME (OS: $VM_OS, dedicated=$DEDICATED) ===" | ||
|
|
||
| # Copy the already-provisioned Quay pull secret from the infra namespace into the target cluster | ||
| echo "Creating pull secret for quay.io/rhacs-eng in namespace $NAMESPACE from mounted infra secret..." | ||
| if [ ! -s /infra-secrets/quay/.dockerconfigjson ]; then | ||
| echo "ERROR: mounted registry secret /infra-secrets/quay/.dockerconfigjson is missing or empty" | ||
| exit 1 | ||
| fi | ||
| cat <<EOFSECRET | oc apply -f - | ||
| apiVersion: v1 | ||
| kind: Secret | ||
| metadata: | ||
| name: ${PULL_SECRET_NAME} | ||
| namespace: ${NAMESPACE} | ||
| type: kubernetes.io/dockerconfigjson | ||
| data: | ||
| .dockerconfigjson: $(base64 < /infra-secrets/quay/.dockerconfigjson | tr -d '\n') | ||
| EOFSECRET | ||
| echo "Pull secret created" | ||
|
|
||
| # Check if VM already exists | ||
| if oc get vm "$VM_NAME" -n "$NAMESPACE" &>/dev/null; then | ||
| STATUS=$(oc get vm "$VM_NAME" -n "$NAMESPACE" -o jsonpath='{.status.printableStatus}' 2>/dev/null || echo "Unknown") | ||
| echo "VM $VM_NAME already exists (status: $STATUS)" | ||
| fi |
There was a problem hiding this comment.
Don't regenerate exported credentials when the VM already exists.
VM_PASSWORD is generated before the existence check, and vm-access.md is always rewritten. On retries or resumed workflows, the artifact can publish a password that was never applied to the guest.
Also applies to: 624-647
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@chart/infra-server/static/workflow-openshift-4.yaml` around lines 493 - 519,
The script currently generates VM_PASSWORD and rewrites vm-access.md before
checking whether the VM already exists, causing stale/unapplied credentials to
be published; change the flow in the VM creation logic so that VM_PASSWORD is
only generated and vm-access.md is only created/overwritten after confirming the
VM does not already exist (i.e., move the VM_PASSWORD assignment and the
vm-access.md write block to run inside the branch where the VM will be created),
and if you must support resume/retry, read any existing credential artifact or
the VM/secret first and skip regenerating/writing when credentials already
exist.
The add-virt-node template now auto-detects whether the cluster uses legacy MachineSets (openshift-machine-api) or Cluster API resources (openshift-cluster-api-guests) and creates the virt worker accordingly. CAPI path (OCP 4.19+ on GCP): - Finds a worker GCPMachine, creates a GCPMachineTemplate with n2-standard-8, then creates a CAPI MachineSet (cluster.x-k8s.io) - Extracts bootstrap dataSecretName from existing worker Machine - Falls back to labeling via Machine.status.nodeRef if CAPI label propagation is slow - Applies NoSchedule taint manually (CAPI has no taint field) Co-authored-by: Cursor <cursoragent@cursor.com>
Move VM existence check before password generation so retries don't publish stale credentials that were never applied to the guest. Exit non-zero with VMI/event diagnostics when VMI fails to reach Running within the timeout. Co-authored-by: Cursor <cursoragent@cursor.com>
CAPI resources (cluster.x-k8s.io) don't exist on running OCP clusters. The installer uses CAPI only during provisioning via a local envtest control plane, then shuts it down. The running cluster uses the legacy Machine API (machine.openshift.io/v1beta1) exclusively. The real issue was timing: the Machine API operator needs a moment after cluster creation to reconcile worker MachineSets. Replace the CAPI fallback with a retry loop (up to 300s) that waits for worker MachineSets to appear in openshift-machine-api. Co-authored-by: Cursor <cursoragent@cursor.com>
MachineSet cloning failed on OCP 4.22 because the jq label filter relied on legacy labels that no longer exist. Instead of cloning, scale up an existing worker MachineSet by 1, wait for the new node, then label and optionally taint it for virt workloads. Also update install-virt help text to recommend n2 instances for best VM performance (KVM emulation fallback on e2). Co-authored-by: Cursor <cursoragent@cursor.com>
Add
install-virt,vm-os, andvirt-node-dedicatedparameters to theopenshift-4(andopenshift-4-perf-scale) flavor so users can optionally get a cluster with OpenShift Virtualization, a dedicated virt worker node, and a downloadable VM access artifact.When
install-virt=true, three conditional post-install steps run after cluster creation:node-role.kubernetes.io/virt(optionally tainted whenvirt-node-dedicated=true)quay.io/rhacs-eng/vm-imagesusing the selectedvm-os(rhel9 or rhel10), generates a random password, and writes a downloadablevm-accessartifact with SSH credentialsWhen
install-virt=false(the default), existing behavior is unchanged.Usage examples
How I validated my change
CheckWorkflowEquivalence): extracted workflow and flavor parameter names, diffed — bothopenshift-4andopenshift-4-perf-scalematchinstall-virt=trueand verify the virt node, operator, and VM are running, and thevm-accessartifact is downloadableAI-Assisted: cursor, ~90% generated from Superpowers implementation plan, user reviewed the plan twice.