openstack-caracal-ipv4 / runbooks / phase-08-workload-cluster-acceptance.md

Phase 08 -- Workload-Cluster Acceptance (D-011)

Prove tenant self-service Kubernetes end to end: create a workload cluster from the capi-k8s-v1-32 template, confirm it converges (Ready nodes, CNI, CCM/CSI, API LB), then run the D-011 acceptance bar. Passing D-011 is the gate that unlocks the project-completion tasks.

Decisions: D-011 (acceptance bar; amended by D-019 -- item 8 Designate deferred), D-031/D-036 (driver/engine/chart coherence), D-039 (app-cred roles incl. load-balancer_member), D-040 (reserved-host-memory), D-041 (non-HA mgmt manual start), D-042 (driver contract coherence -> health HEALTHY after phase-07). Troubleshooting: appendix-A -- stuck-delete finalizer, LB-failover, OOM/manual-start, uninitialized-taint, CNI-label, DOCFIX-021.


Prerequisites (must be true entering phase-08)

  • phase-04 done: the external provider network (provider-ext) exists. The workload cluster's API-LB floating IP and node FIPs are allocated from it.
  • phase-05 done: Octavia enabled and healthy. The magnum-capi-helm driver ALWAYS provisions an Octavia LB for the apiserver (--master-lb-enabled), so Octavia is a hard prerequisite for workload-cluster create (not optional).
  • phase-07 EXIT GATE passed: conductor grafted, contract-coherent driver (1.4.0). On a FRESH DEPLOY the HEALTHY + regression items of that gate are deferred to THIS phase (8.2 health gate; 8.1-8.5 create path). On an existing-cluster graft, health_status already reports HEALTHY (if the phase-07 1.4.0 upgrade was skipped, expect the COSMETIC UNHEALTHY of D-042 -- functional, but not an acceptance pass).
  • Image ubuntu-jammy-kube-v1.32.13 present AND carrying the Glance properties kube_version (e.g. v1.32.13) and os_distro=ubuntu (8.0 below verifies this and, on a fresh deploy, imports the image from the jumphost-staged qcow2). The driver reads the k8s version from the IMAGE, not a template label (P6-CONTRACT / L-P6-3); a missing property fails create.
  • Cluster template capi-k8s-v1-32 present (8.0 verifies/creates it).
  • D-039: the Magnum service path mints app-creds carrying load-balancer_member (+ member, reader). A frozen pre-D-039 app-cred 403s on the Octavia LB step and wedges create/delete (appendix-A: stuck-delete).
  • D-040: nova-compute reserved-host-memory = 8192 in effect on all compute hosts (baked into the hardened bundle; verify below). Default 512 over-commits the hyperconverged hosts and OOM-kills guests.
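The D-040 prerequisite says "all compute hosts", while the 8.0 check below samples only nova-compute/0. A minimal sketch of a stricter check follows. The helper name check_reserved_mem is hypothetical, and it assumes the option appears in nova.conf as an uncommented reserved_host_memory_mb line.

```shell
# Hypothetical helper: read nova.conf text on stdin and compare the
# uncommented reserved_host_memory_mb value with the target in $1.
check_reserved_mem() {
  want="$1"
  # Commented lines ("# reserved_host_memory_mb = ...") do not match the anchor.
  got=$(awk -F= '/^[ \t]*reserved_host_memory_mb[ \t]*=/ { gsub(/[ \t\r]/, "", $2); v = $2 } END { print v }')
  if [ "$got" = "$want" ]; then
    echo "OK reserved_host_memory_mb=$got"
  else
    echo "FAIL reserved_host_memory_mb=${got:-unset} (want $want)"
    return 1
  fi
}

# Usage sketch (assumes every compute unit belongs to the nova-compute application;
# replace the unit list with the units reported by `juju status nova-compute`):
#   for u in nova-compute/0 nova-compute/1 nova-compute/2; do
#     juju ssh "$u" 'sudo cat /etc/nova/nova.conf' </dev/null | check_reserved_mem 8192 || echo "check $u"
#   done
```

The helper only parses text, so it can be tried on the jumphost against a saved copy of nova.conf before it is pointed at live units.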

Constants and env-literals (TAG: confirm per site / run on rebuild)

  • ENV(project) capi-mgmt (id 674171fd28d446d3a37073b6a761e910)
  • ENV(admin-project) admin (id 65ce73e6798e4d1e8dd066609b7033ef)
  • ENV(template) capi-k8s-v1-32 (uuid e2549d8b-4b89-4947-8b9a-0f4fdbe87d59)
  • ENV(image) ubuntu-jammy-kube-v1.32.13 (id de69c243-bd1f-4182-8e9e-33933e926857)
  • ENV(ext-net) provider-ext (id 70b34bb2-3afb-4b43-96d3-f520dbcbf9a8)
  • ENV(keypair) capi-mgmt-key
  • ENV(cluster) capi-test-1
  • ENV(workload-cidr) 10.20.16.0/24
  • ENV(flavors) master gp.mid (8192/2) ; worker capi.node (4096/2)
  • run-specific (do NOT hardcode -- capture at run): API LB id, LB VIP (10.20.16.x), workload API FIP (10.12.7.180 on the as-built run).

Scope-hygiene preambles (the project-scope leak guard)

Capi-mgmt-scoped (cluster CRUD, show, config):

source ~/admin-openrc
unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID OS_PROJECT_DOMAIN_ID OS_PROJECT_DOMAIN_NAME
export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910      # ENV(project)

Admin-scoped (LB amphora/failover -- these 403 under tenant member scope):

source ~/admin-openrc
unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME            # token -> admin 65ce73e6...
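Because a leaked project scope is silent until a command 403s or lands in the wrong project, it may help to assert the token's scope before any mutation. The sketch below is one way to do that. The function name assert_scope is hypothetical, and the usage line assumes `openstack token issue` exposes a project_id column on this cloud.

```shell
# Hypothetical guard: compare the project id the token is scoped to ($2)
# with the project id the step expects ($1).
assert_scope() {
  expected="$1"
  actual="$2"
  if [ -z "$actual" ]; then
    echo "ABORT: token carries no project scope" >&2
    return 2
  fi
  if [ "$actual" != "$expected" ]; then
    echo "ABORT: token scoped to $actual, expected $expected" >&2
    return 1
  fi
  echo "scope OK: $actual"
}

# Usage sketch, after sourcing the capi-mgmt preamble above:
#   assert_scope 674171fd28d446d3a37073b6a761e910 \
#     "$(openstack token issue -f value -c project_id)" || exit 1
```

Passing the observed id as an argument keeps the comparison separate from the API call, so the guard itself can be exercised without a cloud.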

Step 8.0 -- Verify prerequisites; create the template if absent

# RUN: jumphost (capi-mgmt scope). Read-only checks consolidated; template create gated separately. (NOTE: template + image are tenant-setup artifacts; on a fully fresh build they may be produced by the magnum-setup step -- this phase verifies/creates the template for self-containment.)

( {
  set -u
  echo "=== image present + carries kube_version / os_distro ==="
  openstack image show ubuntu-jammy-kube-v1.32.13 -f json \
    | python3 -c 'import json,sys;d=json.load(sys.stdin);p=d.get("properties",d);print("kube_version=",d.get("kube_version") or p.get("kube_version"));print("os_distro=",d.get("os_distro") or p.get("os_distro"))'
  echo "=== reserved-host-memory (D-040) on a compute unit ==="
  juju ssh nova-compute/0 'sudo grep -i reserved_host_memory /etc/nova/nova.conf' </dev/null   # expect 8192
  echo "=== template present? ==="
  openstack coe cluster template show capi-k8s-v1-32 -f value -c uuid 2>/dev/null \
    && echo "template OK" || echo "template ABSENT -- create it below"
} )
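The image check above prints the two properties but does not fail when one is missing (the Python snippet prints None). A minimal sketch of a hard gate on those values follows. check_image_props is a hypothetical name, and the vMAJOR.MINOR.PATCH pattern is an assumption based on the v1.32.13 example.

```shell
# Hypothetical gate: $1 = kube_version value, $2 = os_distro value.
check_image_props() {
  kube_version="$1"
  os_distro="$2"
  rc=0
  case "$kube_version" in
    v[0-9]*.[0-9]*.[0-9]*) ;;   # e.g. v1.32.13
    *) echo "FAIL kube_version='$kube_version' (want vMAJOR.MINOR.PATCH)"; rc=1 ;;
  esac
  if [ "$os_distro" != ubuntu ]; then
    echo "FAIL os_distro='$os_distro' (want ubuntu)"
    rc=1
  fi
  if [ "$rc" -eq 0 ]; then
    echo "image properties OK"
  fi
  return "$rc"
}

# Usage sketch: print "-" for a missing property so the two words stay aligned.
#   props=$(openstack image show ubuntu-jammy-kube-v1.32.13 -f json | python3 -c \
#     'import json,sys;d=json.load(sys.stdin);p=d.get("properties",d);print(d.get("kube_version") or p.get("kube_version") or "-", d.get("os_distro") or p.get("os_distro") or "-")')
#   check_image_props $props || echo "fix the image properties before cluster create"
```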

If the image is ABSENT (fresh deploy -- nothing survives teardown), import it from the jumphost-staged qcow2. The command is the VERBATIM 2026-06-08 as-executed path (glance-direct; plain web-download 403s on this cloud). With the hardened bundle's Glance option image-conversion: true, the stored disk_format lands as raw on the redeploy. That is expected: the as-built run stored qcow2 only because conversion was off at the time.

( {
  set -u
  source ~/admin-openrc
  if openstack image show ubuntu-jammy-kube-v1.32.13 >/dev/null 2>&1; then
    echo "[SKIP] image ubuntu-jammy-kube-v1.32.13 present"
  else
    SRC="$HOME/ubuntu-jammy-kube-v1.32.13-260401-2014.qcow2"
    [ -f "$SRC" ] || { echo "ABORT: $SRC missing on the jumphost (azimuth-images source; see appendix-B)"; exit 1; }
    glance image-create-via-import \
      --import-method glance-direct \
      --file "$SRC" \
      --container-format bare --disk-format qcow2 \
      --property os_distro=ubuntu --property kube_version=v1.32.13 \
      --name ubuntu-jammy-kube-v1.32.13
  fi
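  echo "=== poll to active (3.7G stage + conversion; allow ~10 min) ==="
  ST='?'
  for i in $(seq 1 40); do
    ST=$(openstack image show ubuntu-jammy-kube-v1.32.13 -f value -c status 2>/dev/null || echo '?')
    echo "[$i] status=$ST"
    [ "$ST" = active ] && break
    sleep 15
  done
  # Fail loudly if the image never reached active (40 polls x 15 s = 10 min budget)
  [ "$ST" = active ] || { echo "ABORT: image not active after 10 min (status=$ST)"; exit 1; }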
} )

GATE: image active and the 8.0 property check above passes (kube_version v1.32.13 / os_distro ubuntu). Then create the template only if absent (spec from the as-built capture; the two labels are intentionally the whole config -- chart 0.25.1 + the conf.d drop-in govern the rest). --network-driver is OMITTED deliberately: under the 1.4.0 driver the option IS honored (it maps to the chart network_driver), so to keep the as-built chart default (Calico) we leave it unset. Setting flannel here would now switch the CNI -- do that only if Calico is being intentionally replaced (appendix-A: CNI-label / 1.4.0).

openstack coe cluster template create capi-k8s-v1-32 \
  --coe kubernetes --server-type vm \
  --image ubuntu-jammy-kube-v1.32.13 \
  --external-network provider-ext \
  --master-flavor gp.mid --flavor capi.node \
  --master-lb-enabled --floating-ip-enabled \
  --dns-nameserver 8.8.8.8 \
  --docker-storage-driver overlay2 \
  --labels fixed_subnet_cidr=10.20.16.0/24,octavia_provider=amphora

Step 8.1 -- Create the workload cluster (MUTATION)

# RUN: jumphost (capi-mgmt scope). 1 control-plane + 2 workers, matching the as-built capi-test-1. The driver auto-mints the app-cred (D-039) and always provisions an Octavia LB (+FIP) for the API.

openstack coe cluster create capi-test-1 \
  --cluster-template capi-k8s-v1-32 \
  --keypair capi-mgmt-key \
  --master-count 1 --node-count 2
openstack coe cluster show capi-test-1 -f value -c uuid -c status

Step 8.2 -- Watch to CREATE_COMPLETE; capture the LB/FIP

# RUN: jumphost (capi-mgmt scope). Poll; capture run-specific LB id + FIP.

( {
  for i in $(seq 1 40); do
    S=$(openstack coe cluster show capi-test-1 -f value -c status 2>/dev/null)
    echo "[$i] status=$S"
    case "$S" in CREATE_COMPLETE|CREATE_FAILED) break;; esac
    sleep 30
  done
  echo "=== api endpoint + node counts ==="
  openstack coe cluster show capi-test-1 -f value -c api_address -c master_count -c node_count -c health_status
} )

GATE: status = CREATE_COMPLETE. Record api_address (the FIP endpoint, e.g. https://10.12.7.180:6443) for 8.3. If CREATE_FAILED, see appendix-A (stuck-delete / app-cred 403 / OOM). With phase-07's driver, health_status should read HEALTHY.
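The polling loop in 8.2 prints the status but exits 0 whether the cluster completed, failed, or ran out of polls. A minimal sketch of a variant that returns a distinct exit code for each outcome follows. wait_for_cluster is a hypothetical helper, and the status command is passed in as arguments so the loop does not hardcode the cluster name.

```shell
# Hypothetical helper: wait_for_cluster TRIES PAUSE_SECONDS CMD...
# CMD must print the cluster status. Returns 0 on CREATE_COMPLETE,
# 1 on CREATE_FAILED, 2 when the poll budget is exhausted.
wait_for_cluster() {
  tries="$1"
  pause="$2"
  shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    status=$("$@" 2>/dev/null) || status=""
    echo "[$i] status=${status:-?}"
    case "$status" in
      CREATE_COMPLETE) return 0 ;;
      CREATE_FAILED)   return 1 ;;
    esac
    i=$((i + 1))
    # Do not sleep after the final poll.
    if [ "$i" -le "$tries" ]; then sleep "$pause"; fi
  done
  echo "timeout after $tries polls"
  return 2
}

# Usage sketch (same 40 x 30 s budget as the loop above):
#   wait_for_cluster 40 30 openstack coe cluster show capi-test-1 -f value -c status \
#     || echo "cluster did not reach CREATE_COMPLETE -- see appendix-A"
```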

Step 8.3 -- Retrieve the workload kubeconfig; verify nodes / CNI / addons

# RUN: jumphost. Pull the cluster's kubeconfig via Magnum, then inspect.

# capi-mgmt scope
openstack coe cluster config capi-test-1 --dir ~/capi-test-1 --force
export KUBECONFIG=~/capi-test-1/config
# LIVE-REVIEW: confirm `coe cluster config` returns a usable kubeconfig under the
#   capi-helm driver; alternative is the CAPI kubeconfig secret on the mgmt cluster:
#   KUBECONFIG=~/capi-mgmt.kubeconfig clusterctl -n <magnum-ns> get kubeconfig <cluster-name-suffix>

( {
  export KUBECONFIG=~/capi-test-1/config
  echo "=== nodes (expect 3 Ready, v1.32.13: 1 control-plane + 2 workers) ==="
  kubectl get nodes -o wide
  echo "=== CNI = Calico (chart default; --network-driver omitted) ==="
  kubectl -n kube-system get pods | grep -Ei 'calico|tigera' || kubectl get pods -A | grep -Ei 'calico|tigera'
  echo "=== CCM (OpenStack cloud-controller-manager) + Cinder CSI + CoreDNS Running ==="
  kubectl get pods -A | grep -Ei 'cloud-controller|openstack-cloud|cinder-csi|coredns'
  echo "=== any not-Running pods? (expect none) ==="
  kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
} )

GATE: 3 nodes Ready; Calico pods Running; CCM Running (NOT crash-looping -- this is D-011 item 5); Cinder CSI + CoreDNS Running; no stuck pods.
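One caveat on the "not-Running pods" query in 8.3: a crash-looping pod usually still reports phase Running, so the field selector alone can miss the CCM failure this gate is meant to catch. The sketch below filters on the READY and RESTARTS columns instead. find_unhealthy_pods is a hypothetical helper, the restart threshold of 5 is an arbitrary assumption, and the column layout assumed is that of `kubectl get pods -A --no-headers` (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin.
# Prints pods that are not fully ready, not Running, or above the restart
# threshold ($1, default 5). Completed pods are ignored. Exit 1 if any are found.
find_unhealthy_pods() {
  max_restarts="${1:-5}"
  awk -v max="$max_restarts" '
    BEGIN { bad = 0 }
    {
      split($3, r, "/")                      # READY column, e.g. 1/1
      if ($4 != "Completed" && (r[1] != r[2] || $4 != "Running" || $5 + 0 > max)) {
        print $1 "/" $2, "ready=" $3, "status=" $4, "restarts=" $5
        bad = 1
      }
    }
    END { exit bad }
  '
}

# Usage sketch:
#   kubectl get pods -A --no-headers | find_unhealthy_pods 5 && echo "no unhealthy pods"
```

A pod that restarted a few times during bring-up (for example while the uninitialized taint was still set) stays under the threshold. Lower the threshold if the acceptance run should be stricter.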

Step 8.4 -- D-011 acceptance bar (the gate)

Run each; record pass/fail. Wording adapted to the as-built IP-only endpoints (B5) where the original D-011 said "hostname."

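  • D-011.1 -- All charms active/idle.
    # RUN: jumphost
    # A unit passes only when BOTH states are good. Filtering on 'active|idle' alone would hide a unit that is e.g. workload:blocked with agent:idle.
    # Assumes the short format prints "(agent:idle, workload:active)" per unit line.
    juju status --format=short | grep -E '^- ' | grep -vE 'agent:idle, workload:active' || echo "all active/idle"
    Pass: nothing but active/idle (phase-03 re-confirmed here).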

  • D-011.2 -- API reachability from the jumphost (all public VIPs).
    # RUN: jumphost
    IP-only: hit each service VIP, e.g. Keystone:
    curl -sk https://10.12.4.50:5000/v3 -o /dev/null -w '%{http_code}\n'    # expect 200/300
    Repeat per public VIP (.50-.60 block).
    Pass: all respond.

  • D-011.3 -- API reachability from a tenant VM (Option B).
    # RUN: mgmt VM
    The generalized phase-06 GATE 1: a tenant VM reaches the provider VIP.
    ssh ... ubuntu@10.12.7.40 "timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.4.50/5000' && echo VIP-OK || echo VIP-FAIL" </dev/null
    Pass: VIP-OK (proves the shared-L2 Option B path).

  • D-011.4 -- Octavia LB pattern re-passes (round-robin, failover, recovery).
    Round-robin: 2-member pool behind a VIP; repeated curls hit both members.
    Recovery (admin scope): openstack loadbalancer failover <api-lb-id> -> watch ERROR/PENDING_UPDATE -> ACTIVE (~100s; single STANDALONE amphora -> brief blip; operating_status holds ONLINE). (appendix-A: LB-failover; amphora ops are admin-scope only.)
    Pass: round-robin distributes; failover returns to ACTIVE.
    TODO (before sign-off): this runbook does NOT yet contain the build steps for the standalone 2-member round-robin pool (LB + listener + pool + 2 backend members + health monitor). Add them here, or fold the round-robin check into the workload-cluster API LB the driver already builds, before D-011.4 is marked complete.

  • D-011.5 -- End-to-end Magnum CAPI cluster create, CCM not crash-looping. Satisfied by 8.1-8.3 (CREATE_COMPLETE + CCM Running). Pass = that gate.

  • D-011.6 -- Vault unseal (MANUAL is the v1 standard).
    # RUN: jumphost
    Confirm vault Sealed=false now. The v1 standard is MANUAL unseal after a unit reboot (3-of-5 key shares entered at the hidden prompt -- see phase-02); auto-unseal is an available option, adopted case-by-case (NOT configured in v1). This is a re-confirmation at acceptance, not a re-init.
    Pass: vault unsealed, and the operator can re-unseal manually after a reboot.

  • D-011.7 -- KVM snapshot baseline taken.
    # RUN: jumphost hypervisor
    Per D-012: Snapshot 1 (post-deploy, post-validation, pre-tenant-resources) and Snapshot 2 (post-tenant-setup). qcow2-level, per-VM, on the jumphost hypervisor.
    Pass: Snapshot 1 captured (Snapshot 2 after tenant setup).

  • D-011.8 -- Designate zones + tenant hostname resolution. DEFERRED. D-019 deferred Designate (dropped do-doc-10-dns). Also moot under IP-only B5: there are no API hostnames to resolve; tenants use IPs/VIPs. Re-scope when DNS returns (v2). NOT required for v1 acceptance.
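For the round-robin half of D-011.4, "repeated curls hit both members" can be turned into a countable check. The sketch below assumes each backend member answers with a body that identifies it (for example its hostname). That is an assumption about the test pool, which this runbook does not yet define, and check_round_robin is a hypothetical name.

```shell
# Hypothetical helper: read one backend identifier per line on stdin and
# succeed only if at least $1 (default 2) distinct backends answered.
check_round_robin() {
  min="${1:-2}"
  # Drop blank lines, then count unique identifiers.
  n=$(grep -v '^[[:space:]]*$' | sort -u | wc -l | tr -d ' ')
  echo "distinct backends: $n"
  [ "$n" -ge "$min" ]
}

# Usage sketch (<pool-vip> is a placeholder for the test pool VIP):
#   for i in $(seq 1 10); do curl -s http://<pool-vip>/; echo; done | check_round_robin 2
```

This counts distinct responders only. It does not check that the distribution is even, which is enough for the "hits both members" pass criterion.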

Step 8.5 -- (Optional) Clean delete verification

# RUN: jumphost (capi-mgmt scope). Confirms the manage/teardown path.

openstack coe cluster delete capi-test-1     # watch coe cluster list to gone

If a delete WEDGES (DELETE_IN_PROGRESS, CRs stuck Deleting on an Octavia 403 from a frozen app-cred): clear the OpenStackCluster finalizer (the Cluster auto-follows), then manual neutron cleanup in dependency order -- appendix-A: stuck-delete.

# NS=magnum-674171fd28d446d3a37073b6a761e910
# KUBECONFIG=~/capi-mgmt.kubeconfig kubectl -n $NS patch openstackcluster <cluster>-<suffix> \
#   --type=merge -p '{"metadata":{"finalizers":[]}}'
# then: openstack router remove subnet / router unset external-gateway / router delete /
#       subnet delete / network delete / security group delete  (dependency order)
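The dependency order in the comment above can be reviewed before it is executed by printing the plan first. The sketch below is a dry run only. print_cleanup_plan is a hypothetical helper, and the four resource names must be looked up for the wedged cluster; it assumes one router, one subnet, one network, and one security group. Leftover ports, floating IPs, or extra security groups are not covered here (see appendix-A: stuck-delete).

```shell
# Hypothetical dry-run helper: print the neutron cleanup commands in
# dependency order. Nothing is executed.
# $1 = router, $2 = subnet, $3 = network, $4 = security group
print_cleanup_plan() {
  router="$1"
  subnet="$2"
  network="$3"
  secgroup="$4"
  cat <<EOF
openstack router remove subnet $router $subnet
openstack router unset --external-gateway $router
openstack router delete $router
openstack subnet delete $subnet
openstack network delete $network
openstack security group delete $secgroup
EOF
}

# Usage sketch: review the output, then run the lines one at a time.
#   print_cleanup_plan <router> <subnet> <network> <secgroup>
```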

EXIT GATE (phase-08 / v1 acceptance)

  • 8.1-8.3 passed: capi-test-1 CREATE_COMPLETE, 3 Ready nodes, Calico, CCM/CSI/CoreDNS, API LB ACTIVE/ONLINE.
  • D-011 items 1-7 PASS; item 8 deferred (D-019).
  • health_status HEALTHY (phase-07 driver).
  • => v1 deployment is ACCEPTED. Project-completion tasks unlocked: consolidate the do-doc runbooks into docs/v1-deploy-runbook.md; revert the GitBucket repo OpenStack/openstack-caracal-ipv4 to PRIVATE.

As-built reference (capi-test-1, suffix kgwwe7c4qj6a, 2026-06-09)

  • create: --master-count 1 --node-count 2; uuid 6de15cf4-8805-4ac2-b413-8de2c48d92cf.
  • nodes: control-plane (xsc62) + 2 workers; v1.32.13; Calico CNI.
  • API LB id 0f968008-8429-4ac3-8b82-452e126982cf, VIP 10.20.16.144, FIP 10.12.7.180, endpoint https://10.12.7.180:6443; single STANDALONE amphora.
  • CCM / Cinder CSI / CoreDNS Running; all addons scheduled; CREATE_COMPLETE.
  • Incident on the as-built run (recovery patterns -> appendix-A): host OOM SHUTOFF the mgmt VM (D-041 manual openstack server start capi-mgmt-v2); API LB went provisioning_status ERROR -> admin-scope loadbalancer failover (ACTIVE ~100s); workers held the CAPI uninitialized taint until the mgmt API returned, then addons scheduled. Root remediation: D-040 reserved-host-memory 512 -> 8192.
  • health_status was UNHEALTHY on the as-built run (cosmetic, D-042) -- phase-07's contract-coherent driver clears it to HEALTHY.

Next

v1 acceptance passes here. Proceed to the project-completion workstream: runbook consolidation (this phase set -> docs/v1-deploy-runbook.md), appendix-A authoring, the repo change-list, and reverting repo visibility to private.