# Phase 08 -- Workload-Cluster Acceptance (D-011)

Prove tenant self-service Kubernetes end to end: create a workload cluster from
the `capi-k8s-v1-34` template, confirm it converges (Ready nodes, CNI, CCM/CSI,
API LB), then run the D-011 acceptance bar. Passing D-011 is the gate that unlocks
the project-completion tasks.

Decisions: D-011 (acceptance bar; amended by D-019 -- item 8 Designate deferred),
D-031/D-036 (driver/engine/chart coherence), D-039 (app-cred roles incl.
load-balancer_member), D-040 (reserved-host-memory), D-041 (non-HA mgmt manual
start), D-042 (driver contract coherence -> health HEALTHY after phase-07).
Troubleshooting: appendix-A -- stuck-delete finalizer, LB-failover, OOM/manual-start,
uninitialized-taint, CNI-label, DOCFIX-021.

---

## Prerequisites (must be true entering phase-08)
- phase-04 done: the external provider network (`provider-ext`) exists. The workload
  cluster's API-LB floating IP and node FIPs are allocated from it.
- phase-05 done: Octavia enabled and healthy. The magnum-capi-helm driver ALWAYS
  provisions an Octavia LB for the apiserver (`--master-lb-enabled`), so Octavia is a
  hard prerequisite for workload-cluster create (not optional).
- phase-07 EXIT GATE passed: conductor grafted, contract-coherent driver (1.4.0). On a
  FRESH DEPLOY the HEALTHY + regression items of that gate are deferred to THIS phase
  (8.2 health gate; 8.1-8.5 create path). On an existing-cluster graft, `health_status`
  already reports HEALTHY (if the phase-07 1.4.0 upgrade was skipped, expect the COSMETIC
  UNHEALTHY of D-042 -- functional, but not an acceptance pass).
- Image `ubuntu-jammy-kube-v1.34.8` present AND carrying the Glance properties
  `kube_version` (v1.34.8) and `os_distro=ubuntu`. Step 8.0 verifies both and, on a fresh
  deploy, stages and verifies the image from the azimuth CDN (FINDING-3). The driver reads
  the k8s version from the IMAGE, not a template label (P6-CONTRACT / L-P6-3); a missing
  property fails create. (D1: bumped from EOL v1.32.13 to v1.34.8, within CAPI v1.13.2 support.)
- Cluster template `capi-k8s-v1-34` present (8.0 verifies/creates it).
- D-039: the Magnum service path mints app-creds carrying `load-balancer_member`
  (+ member, reader). A frozen pre-D-039 app-cred 403s on the Octavia LB step and
  wedges create/delete (appendix-A: stuck-delete).
- D-040: `nova-compute reserved-host-memory = 8192` in effect on all compute hosts
  (baked into the hardened bundle; verify below). Default 512 over-commits the
  hyperconverged hosts and OOM-kills guests.

## Constants and env-literals (TAG: confirm per site / run on rebuild)
- `ENV(project)`       capi-mgmt    (resolve by name; this rebuild id d5bc125c7c1841d389b76cd0a7b0a915, domain capi)
- `ENV(admin-project)` admin        (id 65ce73e6798e4d1e8dd066609b7033ef)
- `ENV(template)`      capi-k8s-v1-34   (D1; uuid regenerates per rebuild -- resolve by name)
- `ENV(image)`         ubuntu-jammy-kube-v1.34.8 (D1; kube_version v1.34.8; id regenerates -- resolve by name)
- `ENV(ext-net)`       provider-ext (resolve by name; this rebuild id 0d00ddc1-d2bf-4849-a087-14c07d77f167)
- `ENV(keypair)`       capi-mgmt-key
- `ENV(cluster)`       capi-test-1
- `ENV(workload-cidr)` 10.20.16.0/24
- `ENV(flavors)`       master gp.mid (8192/2) ; worker capi.node (4096/2)
- run-specific (do NOT hardcode -- capture at run): API LB id, LB VIP (10.20.16.x),
  workload API FIP (10.12.7.180 on the 2026-06-09 as-built run; per-rebuild).

## Scope-hygiene preambles (the project-scope leak guard)
Capi-mgmt-scoped (cluster CRUD, show, config). DOCFIX-034: resolve the capi-mgmt project id
dynamically while admin-scoped, THEN narrow to it -- never hardcode (it regenerates per rebuild):
```bash
source ~/admin-openrc
CAPI_PID=$(openstack project show capi-mgmt --domain capi -f value -c id)   # ENV(project)
unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID OS_PROJECT_DOMAIN_ID OS_PROJECT_DOMAIN_NAME
export OS_PROJECT_ID="$CAPI_PID"
```
Admin-scoped (LB amphora/failover -- these 403 under tenant member scope):
```bash
source ~/admin-openrc
unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME            # token -> admin (the admin-openrc project)
```
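
Before any mutation it can help to confirm which project the current token is actually scoped
to. A minimal sketch, assuming `openstack token issue` reports `project_id` for a
project-scoped token; `assert_scope` is a hypothetical helper name, not part of the runbook tooling:

```bash
# Hypothetical helper: fail unless the current token is scoped to the expected project id.
assert_scope() {
  local want="$1" got
  got=$(openstack token issue -f value -c project_id 2>/dev/null) || got=""
  if [ -n "$want" ] && [ "$got" = "$want" ]; then
    echo "[OK] token scoped to project $got"
  else
    echo "GATE FAIL: token project='$got' expected='$want'" >&2
    return 1
  fi
}
# Usage after the capi-mgmt preamble:  assert_scope "$CAPI_PID"
```

An empty expected id is treated as a failure, so an unresolved `CAPI_PID` cannot pass silently.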

---

## Step 8.0 -- Verify prerequisites; create the template if absent
`# RUN: jumphost` (capi-mgmt scope). Read-only checks consolidated; template create
gated separately. (NOTE: template + image are tenant-setup artifacts; on a fully
fresh build they may be produced by the magnum-setup step -- this phase
verifies/creates the template for self-containment.)

```bash
( {
  set -u
  echo "=== image present + carries kube_version / os_distro ==="
  openstack image show ubuntu-jammy-kube-v1.34.8 -f json \
    | python3 -c 'import json,sys;d=json.load(sys.stdin);p=d.get("properties",d);print("kube_version=",d.get("kube_version") or p.get("kube_version"));print("os_distro=",d.get("os_distro") or p.get("os_distro"))'
  echo "=== reserved-host-memory (D-040) on a compute unit ==="
  juju ssh nova-compute/0 'sudo grep -i reserved_host_memory /etc/nova/nova.conf' </dev/null   # expect 8192
  echo "=== template present? ==="
  openstack coe cluster template show capi-k8s-v1-34 -f value -c uuid 2>/dev/null \
    && echo "template OK" || echo "template ABSENT -- create it below"
} )
```
If the image is ABSENT (fresh deploy -- nothing survives teardown), seed it by
STAGE-AND-VERIFY (FINDING-3 -- REQUIRED, not merely preferred, for azimuth kube images).
Glance's web-download plugin fetches with urllib (User-Agent `Python-urllib/3.x`), and the
azimuth CDN returns HTTP 403 to that UA, so a web-download import is 202-accepted and then
hangs in `queued` forever. curl sends a different UA and is NOT blocked. So: curl the qcow2
to the jumphost ($HOME -- snap-readable, NOT /tmp, L7), verify sha512 against the
azimuth-images 0.28.0 manifest, then run `openstack image create --file --import`.
This is client-safe: the openstack snap HAS `image create --import` (= glance-direct, and
image-conversion lands it `raw`). It does NOT have standalone `image stage`/`image import`
subcommands, and the standalone `glance` client is not assumed present.
```bash
( {
  set -u
  source ~/admin-openrc
  IMG_NAME=ubuntu-jammy-kube-v1.34.8                                  # ENV(image)
  KUBE_VER=v1.34.8                                                    # driver reads this from the image, not a label
  if openstack image show "$IMG_NAME" >/dev/null 2>&1; then
    echo "[SKIP] image $IMG_NAME present"
  else
    # azimuth-images 0.28.0 manifest (build 260518-1604) -- re-confirm vs manifest.json on any bump:
    URL="https://azimuth-images.stackhpc.cloud/ubuntu-jammy-kube-v1.34.8-260518-1604.qcow2"
    SHA512_EXP="7efde4857c9f9da045a98d71def30e229b3d7fffd8a5680e8aee0c5a8b13ba73fca3cf758a927230a1fbe3c451d8d21cfaeded96091e2a4f313c6a404760bdb3"
    SRC="$HOME/ubuntu-jammy-kube-v1.34.8-260518-1604.qcow2"
    if [ -f "$SRC" ] && [ "$(sha512sum "$SRC" | cut -d' ' -f1)" = "$SHA512_EXP" ]; then
      echo "[OK] staged image present + sha512-valid; skipping download"
    else
      echo "[..] curl the qcow2 to $SRC (curl UA passes the CDN; glance urllib UA 403s -- FINDING-3)"
      curl -fSL -o "$SRC" "$URL"
      GOT=$(sha512sum "$SRC" | cut -d' ' -f1)
      [ "$SHA512_EXP" = "$GOT" ] || { echo "GATE FAIL: sha512 mismatch exp=$SHA512_EXP got=$GOT"; exit 1; }
      echo "[OK] sha512 verified against the azimuth-images 0.28.0 manifest"
    fi
    # CORRECTION-1: a plain --file (no --import) PUT stores qcow2 (boots fine); --import runs
    # glance-direct + image-conversion -> raw (Ceph fast-clone alignment), so use --import here.
    openstack image create "$IMG_NAME" \
      --file "$SRC" --import \
      --container-format bare --disk-format qcow2 \
      --property os_distro=ubuntu --property kube_version="$KUBE_VER"
  fi
  echo "=== poll to active (multi-GB stage + conversion; allow ~10 min) ==="
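  for i in $(seq 1 40); do
    ST=$(openstack image show "$IMG_NAME" -f value -c status 2>/dev/null || echo '?')
    echo "[$i] status=$ST"
    [ "$ST" = active ] && break
    sleep 15
  done
  # Hard gate: do not fall through to template create with a non-active image.
  [ "$ST" = active ] || { echo "GATE FAIL: image $IMG_NAME status=$ST after ~10 min"; exit 1; }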
} )
```
GATE: image `active` and the 8.0 property check above passes (kube_version v1.34.8 /
os_distro ubuntu). Then create the template only if absent. DOCFIX-032: pin
`--network-driver calico` EXPLICITLY. Under the 1.4.0 driver `--network-driver` maps to the
chart `network_driver`, and chart 0.25.1 ships ONLY Calico (flannel is not packaged) -- an
explicit `calico` documents intent and removes reliance on the default staying Calico. Do NOT
set `flannel`: it is unsupported by chart 0.25.1 and would fail to converge.
```bash
openstack coe cluster template create capi-k8s-v1-34 \
  --coe kubernetes --server-type vm \
  --image ubuntu-jammy-kube-v1.34.8 \
  --external-network provider-ext \
  --master-flavor gp.mid --flavor capi.node \
  --master-lb-enabled --floating-ip-enabled \
  --network-driver calico \
  --dns-nameserver 8.8.8.8 \
  --docker-storage-driver overlay2 \
  --labels fixed_subnet_cidr=10.20.16.0/24,octavia_provider=amphora
```
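
After the create, the template can be read back to confirm the settings the driver depends on.
A minimal sketch, assuming the `network_driver` and `master_lb_enabled` columns of
`coe cluster template show` and that the value formatter prints the boolean as `True`;
`template_gate` is a hypothetical helper name:

```bash
# Hypothetical helper: confirm the template pins Calico and keeps the master LB enabled.
template_gate() {
  local name="$1" nd lb
  nd=$(openstack coe cluster template show "$name" -f value -c network_driver 2>/dev/null)
  lb=$(openstack coe cluster template show "$name" -f value -c master_lb_enabled 2>/dev/null)
  echo "network_driver=$nd master_lb_enabled=$lb"
  # Assumption: booleans render as the literal string "True".
  [ "$nd" = calico ] && [ "$lb" = True ]
}
# Usage:  template_gate capi-k8s-v1-34 || echo "GATE FAIL: template drifted"
```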

## Step 8.1 -- Create the workload cluster (MUTATION)
`# RUN: jumphost` (capi-mgmt scope). 1 control-plane + 2 workers, matching the
as-built capi-test-1. The driver auto-mints the app-cred (D-039) and always
provisions an Octavia LB (+FIP) for the API.

```bash
openstack coe cluster create capi-test-1 \
  --cluster-template capi-k8s-v1-34 \
  --keypair capi-mgmt-key \
  --master-count 1 --node-count 2
openstack coe cluster show capi-test-1 -f value -c uuid -c status
```

## Step 8.2 -- Watch to CREATE_COMPLETE; capture the LB/FIP
`# RUN: jumphost` (capi-mgmt scope). Poll; capture run-specific LB id + FIP.
```bash
( {
  for i in $(seq 1 40); do
    S=$(openstack coe cluster show capi-test-1 -f value -c status 2>/dev/null)
    echo "[$i] status=$S"
    case "$S" in CREATE_COMPLETE|CREATE_FAILED) break;; esac
    sleep 30
  done
  echo "=== api endpoint + node counts ==="
  openstack coe cluster show capi-test-1 -f value -c api_address -c master_count -c node_count -c health_status
} )
```
GATE: `status = CREATE_COMPLETE`. Record `api_address` (the FIP endpoint, e.g.
https://10.12.7.180:6443) for 8.3. If `CREATE_FAILED`, see appendix-A (stuck-delete
/ app-cred 403 / OOM). With phase-07's driver, `health_status` should read HEALTHY.
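
The poll above falls through silently on timeout. A sketch of the same loop as a function that
returns a distinct exit code per outcome, plus a parser that pulls the bare FIP out of
`api_address`. Both function names are hypothetical; the defaults (40 tries, 30 s) mirror the
loop above:

```bash
# Hypothetical helper: poll a Magnum cluster until a terminal create status or timeout.
# Returns 0 on CREATE_COMPLETE, 1 on CREATE_FAILED, 2 on timeout.
wait_cluster_created() {
  local name="$1" tries="${2:-40}" pause="${3:-30}" i s
  for i in $(seq 1 "$tries"); do
    s=$(openstack coe cluster show "$name" -f value -c status 2>/dev/null)
    echo "[$i] status=$s"
    case "$s" in
      CREATE_COMPLETE) return 0 ;;
      CREATE_FAILED)   return 1 ;;
    esac
    sleep "$pause"
  done
  return 2
}

# Extract the bare IP from an api_address such as https://10.12.7.180:6443.
api_fip() {
  local a="${1#*://}"   # drop the scheme
  echo "${a%%:*}"       # drop the port
}
# Usage:
#   wait_cluster_created capi-test-1 || echo "GATE FAIL (rc=$?)"
#   FIP=$(api_fip "$(openstack coe cluster show capi-test-1 -f value -c api_address)")
```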

## Step 8.3 -- Retrieve the workload kubeconfig; verify nodes / CNI / addons
`# RUN: jumphost`. Pull the cluster's kubeconfig via Magnum, then inspect.
```bash
# capi-mgmt scope
mkdir -p ~/capi-test-1                                   # DOCFIX-037: `coe cluster config --dir` does NOT create the dir
openstack coe cluster config capi-test-1 --dir ~/capi-test-1 --force
export KUBECONFIG=~/capi-test-1/config
# confirmed: `coe cluster config` returns a usable kubeconfig under the capi-helm driver.
# Alternative (CAPI kubeconfig secret on the mgmt cluster), magnum-ns resolved dynamically:
#   NS=magnum-$(openstack project show capi-mgmt --domain capi -f value -c id)
#   KUBECONFIG=~/capi-mgmt.kubeconfig clusterctl -n "$NS" get kubeconfig <cluster-name-suffix>

( {
  export KUBECONFIG=~/capi-test-1/config
  echo "=== nodes (expect 3 Ready, v1.34.8: 1 control-plane + 2 workers) ==="
  kubectl get nodes -o wide
  echo "=== CNI = Calico (DOCFIX-032: --network-driver calico pinned on the template) ==="
  kubectl -n kube-system get pods | grep -Ei 'calico|tigera' || kubectl get pods -A | grep -Ei 'calico|tigera'
  echo "=== CCM (OpenStack cloud-controller-manager) + Cinder CSI + CoreDNS Running ==="
  kubectl get pods -A | grep -Ei 'cloud-controller|openstack-cloud|cinder-csi|coredns'
  echo "=== any not-Running pods? (expect none) ==="
  kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
} )
```
GATE: 3 nodes Ready; Calico pods Running; CCM Running (NOT crash-looping -- this is
D-011 item 5); Cinder CSI + CoreDNS Running; no stuck pods.
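
The node half of this gate can be made machine-checkable. A minimal sketch that counts nodes
whose STATUS column is exactly `Ready`; `ready_nodes` and `nodes_gate` are hypothetical helper
names, and the expected count of 3 comes from the 1 + 2 topology above:

```bash
# Hypothetical helper: count nodes whose STATUS column is exactly "Ready".
# A cordoned node ("Ready,SchedulingDisabled") is deliberately NOT counted.
ready_nodes() {
  kubectl get nodes --no-headers 2>/dev/null | awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Pass only when the Ready count equals the expected node count (default 3).
nodes_gate() {
  local want="${1:-3}" got
  got=$(ready_nodes)
  echo "ready_nodes=$got expected=$want"
  [ "$got" -eq "$want" ]
}
# Usage (KUBECONFIG=~/capi-test-1/config):  nodes_gate 3 || echo "GATE FAIL: nodes not all Ready"
```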

---

## Step 8.4 -- D-011 acceptance bar (the gate)
Run each; record pass/fail. Wording adapted to the as-built IP-only endpoints (B5)
where the original D-011 said "hostname."

- **D-011.1 -- All charms active/idle.** `# RUN: jumphost`
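  `juju status --format=short | grep 'agent:' | grep -v 'agent:idle, workload:active' || echo "all active/idle"`
  Pass: nothing but active/idle (phase-03 re-confirmed here). Both fields are matched together
  because a blocked or waiting unit still reports `agent:idle`.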

- **D-011.2 -- API reachability from the jumphost (CORE service VIPs).** `# RUN: jumphost`
  IP-only: hit each CORE service VIP, e.g. Keystone:
  `curl -sk https://10.12.4.50:5000/v3 -o /dev/null -w '%{http_code}\n'` (expect 200/300).
  Repeat per core public VIP (.50-.60 block: keystone .50, barbican .51, cinder .52, glance .53,
  magnum .54, neutron .55, nova .56, octavia .57, horizon .58/.60, placement .59). DOCFIX-039:
  product-streams / glance-simplestreams (gss) is NOT a core API VIP -- it registers a unit-IP
  HTTP endpoint (this rebuild 10.12.8.196) with NO jumphost route to the container space, so it is
  EXPECTED unreachable from the jumphost and is OUT OF SCOPE for D-011.2. Pass: all core VIPs respond.

- **D-011.3 -- API reachability from a tenant VM (Option B).** `# RUN: jumphost -> mgmt VM`
  The generalized phase-06 GATE 1: a tenant VM reaches the provider VIP. DOCFIX-038: the mgmt
  FIP is per-rebuild -- source it (never hardcode the dead 10.12.7.40):
  `source ~/capi-mgmt-net.env`
  `ssh ... ubuntu@"$MGMT_FIP" "timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.4.50/5000' && echo VIP-OK || echo VIP-FAIL" </dev/null`
  Pass: VIP-OK (proves the shared-L2 Option B path).

- **D-011.4 -- Octavia LB pattern re-passes (round-robin, failover, recovery).**
  DOCFIX-040 -- do NOT hand-build a standalone LB/listener/pool/members. Exercise round-robin via
  a THROWAWAY Kubernetes `Service type=LoadBalancer` on the workload cluster: the OpenStack CCM
  provisions an Octavia LB + pool + members for it automatically (the Roosevelt-real path -- tenant
  workloads get LBs exactly this way), then tear it down. `# RUN: jumphost, KUBECONFIG=~/capi-test-1/config`
  ```bash
  export KUBECONFIG=~/capi-test-1/config
  kubectl create deploy rr --image=registry.k8s.io/e2e-test-images/agnhost:2.40 --replicas=2 -- /agnhost netexec --http-port=8080
  kubectl expose deploy rr --port=80 --target-port=8080 --type=LoadBalancer
  kubectl get svc rr -w        # Ctrl-C once EXTERNAL-IP is assigned (CCM builds the Octavia LB + FIP)
  EXT=$(kubectl get svc rr -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  for i in $(seq 1 10); do curl -s "http://$EXT/hostname"; echo; done   # expect BOTH pod names (round-robin)
  kubectl delete svc rr; kubectl delete deploy rr                       # tears down the Octavia LB
  ```
  Failover/recovery (admin scope -- against the workload-cluster API LB):
  `openstack loadbalancer failover <api-lb-id>` -> watch provisioning_status go
  ERROR/PENDING_UPDATE -> ACTIVE (~100s; single STANDALONE amphora -> brief blip;
  operating_status holds ONLINE). STANDALONE failover needs N+1 amphora placement headroom:
  it builds the replacement BEFORE reaping the old, so a cloud at its scheduler ceiling
  cannot self-heal its LBs (Roosevelt sizing implication). (appendix-A: LB-failover; amphora
  ops are admin-scope only.) Pass: round-robin distributes across both members; failover
  returns to ACTIVE.

- **D-011.5 -- End-to-end Magnum CAPI cluster create, CCM not crash-looping.**
  Satisfied by 8.1-8.3 (CREATE_COMPLETE + CCM Running). Pass = that gate.

- **D-011.6 -- Vault unseal (MANUAL is the v1 standard).** `# RUN: jumphost`
  Confirm vault `Sealed=false` now. The v1 standard is MANUAL unseal after a unit
  reboot (3-of-5 key shares entered at the hidden prompt -- see phase-02); auto-unseal
  is an available option, adopted case-by-case (NOT configured in v1). This is a
  re-confirmation at acceptance, not a re-init. Pass: vault unsealed, and the operator
  can re-unseal manually after a reboot.

- **D-011.7 -- KVM snapshot baseline taken.** `# RUN: jumphost hypervisor`
  Per D-012: Snapshot 1 (post-deploy, post-validation, pre-tenant-resources) and
  Snapshot 2 (post-tenant-setup). qcow2-level, per-VM, on the jumphost hypervisor.
  Pass: Snapshot 1 captured (Snapshot 2 after tenant setup).

- **D-011.8 -- Designate zones + tenant hostname resolution.** DEFERRED.
  D-019 deferred Designate (dropped do-doc-10-dns). Also moot under IP-only B5:
  there are no API hostnames to resolve; tenants use IPs/VIPs. Re-scope when DNS
  returns (v2). NOT required for v1 acceptance.
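
Two of the checks above (D-011.2 and D-011.4) reduce to small loops that can be scripted. A
sketch under stated assumptions: `vip_sweep` and `distinct_backends` are hypothetical helper
names, a 2xx/3xx answer is treated as a pass per the "expect 200/300" wording, and only the
Keystone URL comes from this document (the other services' ports are site-specific and must
be supplied by the operator):

```bash
# Hypothetical helper: probe "name=url" pairs; pass only if every one answers 2xx/3xx.
vip_sweep() {
  local pair name url code rc=0
  for pair in "$@"; do
    name="${pair%%=*}"; url="${pair#*=}"
    code=$(curl -sk -o /dev/null -m 10 -w '%{http_code}' "$url")
    case "$code" in
      2??|3??) echo "[OK]   $name $url -> $code" ;;
      *)       echo "[FAIL] $name $url -> $code"; rc=1 ;;
    esac
  done
  return "$rc"
}

# Count distinct backend names read from stdin (one hostname per line; blanks ignored).
distinct_backends() {
  sort -u | grep -c .
}
# Usage:
#   vip_sweep keystone=https://10.12.4.50:5000/v3        # add one name=url pair per core VIP
#   for i in $(seq 1 10); do curl -s "http://$EXT/hostname"; echo; done | distinct_backends   # expect 2
```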

## Step 8.5 -- (Optional) Clean delete verification
`# RUN: jumphost` (capi-mgmt scope). Confirms the manage/teardown path.
```bash
openstack coe cluster delete capi-test-1     # watch coe cluster list to gone
```
If a delete WEDGES (DELETE_IN_PROGRESS, CRs stuck Deleting on an Octavia 403 from a
frozen app-cred): clear the OpenStackCluster finalizer (the Cluster auto-follows),
then clean up the leftover neutron resources manually in dependency order -- appendix-A: stuck-delete.
```bash
# NS=magnum-$(openstack project show capi-mgmt --domain capi -f value -c id)   # resolve; never hardcode
# KUBECONFIG=~/capi-mgmt.kubeconfig kubectl -n "$NS" patch openstackcluster <cluster>-<suffix> \
#   --type=merge -p '{"metadata":{"finalizers":[]}}'
# then: openstack router remove subnet / router unset external-gateway / router delete /
#       subnet delete / network delete / security group delete  (dependency order)
```
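
"Watch to gone" can be scripted the same way as the create poll. A minimal sketch, assuming
`coe cluster show` exits non-zero once the cluster no longer exists; note that an auth or
scope error also exits non-zero, so confirm with `openstack coe cluster list` before treating
the result as final. `wait_cluster_gone` is a hypothetical helper name:

```bash
# Hypothetical helper: poll until `coe cluster show` stops resolving the cluster.
# Returns 0 once gone, 1 on DELETE_FAILED, 2 on timeout (the wedge case above).
wait_cluster_gone() {
  local name="$1" tries="${2:-40}" pause="${3:-30}" i s
  for i in $(seq 1 "$tries"); do
    if ! s=$(openstack coe cluster show "$name" -f value -c status 2>/dev/null); then
      echo "[$i] $name gone"
      return 0
    fi
    echo "[$i] status=$s"
    [ "$s" = DELETE_FAILED ] && return 1
    sleep "$pause"
  done
  return 2
}
# Usage:  wait_cluster_gone capi-test-1 || echo "delete not clean (rc=$?) -- appendix-A: stuck-delete"
```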

---

## EXIT GATE (phase-08 / v1 acceptance)
- 8.1-8.3 passed: capi-test-1 CREATE_COMPLETE, 3 Ready nodes, Calico, CCM/CSI/CoreDNS, API LB ACTIVE/ONLINE.
- D-011 items 1-6 PASS; item 7 (KVM snapshot baseline) OUTSTANDING -- it is the last gate before
  the accept-gate formally closes (D-012; dedicated pass); item 8 deferred (D-019).
- health_status HEALTHY (phase-07 1.4.0 driver clears the D-042 cosmetic UNHEALTHY).
- ACCEPTANCE SUMMARY (this rebuild): .1 charms PASS; .2 core VIPs PASS; .3 tenant->VIP PASS;
  .4 Octavia round-robin + admin-scope failover PASS; .5 E2E CAPI create PASS; .6 vault manual
  unseal PASS; .7 snapshot OUTSTANDING (operator; D-012); .8 Designate DEFERRED (D-019). => v1 is
  FUNCTIONALLY ACCEPTED; the .7 snapshot baseline is the only item left to formally close the gate.
- => Project-completion tasks unlocked: consolidate the per-phase runbooks into
  docs/v1-deploy-runbook.md; revert the GitBucket repo OpenStack/openstack-caracal-ipv4 to PRIVATE.

## As-built reference (capi-test-1, suffix kgwwe7c4qj6a, 2026-06-09 -- PRE-D1 v1.32.13 capture)
- D1 NOTE: the procedure above now targets capi-k8s-v1-34 / ubuntu-jammy-kube-v1.34.8. This
  capture is the 2026-06-09 v1.32.13 run (the D-011 acceptance ran on v1.32.13); re-validation on
  v1.34.8 follows the stage-and-verify seed (8.0). A later D-039-era recreate carried CAPI suffix
  qmyxu2xcsghz (CREATE_COMPLETE, HEALTHY).
- create: `--master-count 1 --node-count 2`; uuid 6de15cf4-8805-4ac2-b413-8de2c48d92cf.
- nodes: control-plane (xsc62) + 2 workers; v1.32.13; Calico CNI.
- API LB id 0f968008-8429-4ac3-8b82-452e126982cf, VIP 10.20.16.144, FIP 10.12.7.180,
  endpoint https://10.12.7.180:6443; single STANDALONE amphora.
- CCM / Cinder CSI / CoreDNS Running; all addons scheduled; CREATE_COMPLETE.
- Incident on the as-built run (recovery patterns -> appendix-A): host OOM SHUTOFF the
  mgmt VM (D-041 manual `openstack server start capi-mgmt-v2`); API LB went
  provisioning_status ERROR -> admin-scope `loadbalancer failover` (ACTIVE ~100s);
  workers held the CAPI uninitialized taint until the mgmt API returned, then addons
  scheduled. Root remediation: D-040 reserved-host-memory 512 -> 8192.
- health_status was UNHEALTHY on the as-built run (cosmetic, D-042) -- phase-07's
  contract-coherent driver clears it to HEALTHY.

## Next
v1 is functionally accepted here; the D-011.7 snapshot baseline formally closes the gate.
Proceed to the project-completion workstream: runbook consolidation (this phase set ->
docs/v1-deploy-runbook.md), appendix-A authoring, the repo change-list, and reverting repo
visibility to private.