# Phase 08 -- Workload-Cluster Acceptance (D-011)

Prove tenant self-service Kubernetes end to end: create a workload cluster from
the `capi-k8s-v1-32` template, confirm it converges (Ready nodes, CNI, CCM/CSI,
API LB), then run the D-011 acceptance bar. Passing D-011 is the gate that unlocks
the project-completion tasks.

Decisions: D-011 (acceptance bar; amended by D-019 -- item 8 Designate deferred),
D-031/D-036 (driver/engine/chart coherence), D-039 (app-cred roles incl.
load-balancer_member), D-040 (reserved-host-memory), D-041 (non-HA mgmt manual
start), D-042 (driver contract coherence -> health HEALTHY after phase-07).
Troubleshooting: appendix-A -- stuck-delete finalizer, LB-failover, OOM/manual-start,
uninitialized-taint, CNI-label, DOCFIX-021.

---

## Prerequisites (must be true entering phase-08)
- phase-04 done: the external provider network (`provider-ext`) exists. The workload
  cluster's API-LB floating IP and node FIPs are allocated from it.
- phase-05 done: Octavia enabled and healthy. The magnum-capi-helm driver ALWAYS
  provisions an Octavia LB for the apiserver (`--master-lb-enabled`), so Octavia is a
  hard prerequisite for workload-cluster create (not optional).
- phase-07 EXIT GATE passed: conductor grafted, contract-coherent driver (1.4.0). On a
  FRESH DEPLOY the HEALTHY + regression items of that gate are deferred to THIS phase
  (8.2 health gate; 8.1-8.5 create path). On an existing-cluster graft, `health_status`
  already reports HEALTHY (if the phase-07 1.4.0 upgrade was skipped, expect the COSMETIC
  UNHEALTHY of D-042 -- functional, but not an acceptance pass).
- Image `ubuntu-jammy-kube-v1.32.13` present AND carrying Glance properties
  (8.0 below verifies, and on a fresh deploy imports it from the jumphost-staged qcow2)
  `kube_version` (e.g. v1.32.13) and `os_distro=ubuntu`. The driver reads the k8s
  version from the IMAGE, not a template label (P6-CONTRACT / L-P6-3); a missing
  property fails create.
- Cluster template `capi-k8s-v1-32` present (8.0 verifies/creates it).
- D-039: the Magnum service path mints app-creds carrying `load-balancer_member`
  (+ member, reader). A frozen pre-D-039 app-cred 403s on the Octavia LB step and
  wedges create/delete (appendix-A: stuck-delete).
- D-040: `nova-compute reserved-host-memory = 8192` in effect on all compute hosts
  (baked into the hardened bundle; verify below). Default 512 over-commits the
  hyperconverged hosts and OOM-kills guests.

## Constants and env-literals (TAG: confirm per site / run on rebuild)
- `ENV(project)`       capi-mgmt    (id 674171fd28d446d3a37073b6a761e910)
- `ENV(admin-project)` admin        (id 65ce73e6798e4d1e8dd066609b7033ef)
- `ENV(template)`      capi-k8s-v1-32   (uuid e2549d8b-4b89-4947-8b9a-0f4fdbe87d59)
- `ENV(image)`         ubuntu-jammy-kube-v1.32.13 (id de69c243-bd1f-4182-8e9e-33933e926857)
- `ENV(ext-net)`       provider-ext (id 70b34bb2-3afb-4b43-96d3-f520dbcbf9a8)
- `ENV(keypair)`       capi-mgmt-key
- `ENV(cluster)`       capi-test-1
- `ENV(workload-cidr)` 10.20.16.0/24
- `ENV(flavors)`       master gp.mid (8192/2) ; worker capi.node (4096/2)
- run-specific (do NOT hardcode -- capture at run): API LB id, LB VIP (10.20.16.x),
  workload API FIP (10.12.7.180 on the as-built run).

## Scope-hygiene preambles (the project-scope leak guard)
Capi-mgmt-scoped (cluster CRUD, show, config):
```bash
source ~/admin-openrc
unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID OS_PROJECT_DOMAIN_ID OS_PROJECT_DOMAIN_NAME
export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910      # ENV(project)
```
Admin-scoped (LB amphora/failover -- these 403 under tenant member scope):
```bash
source ~/admin-openrc
unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME            # token -> admin 65ce73e6...
```

---

## Step 8.0 -- Verify prerequisites; create the template if absent
`# RUN: jumphost` (capi-mgmt scope). Read-only checks consolidated; template create
gated separately. (NOTE: template + image are tenant-setup artifacts; on a fully
fresh build they may be produced by the magnum-setup step -- this phase
verifies/creates the template for self-containment.)

```bash
( {
  set -u
  echo "=== image present + carries kube_version / os_distro ==="
  openstack image show ubuntu-jammy-kube-v1.32.13 -f json \
    | python3 -c 'import json,sys;d=json.load(sys.stdin);p=d.get("properties",d);print("kube_version=",d.get("kube_version") or p.get("kube_version"));print("os_distro=",d.get("os_distro") or p.get("os_distro"))'
  echo "=== reserved-host-memory (D-040) on a compute unit ==="
  juju ssh nova-compute/0 'sudo grep -i reserved_host_memory /etc/nova/nova.conf' </dev/null   # expect 8192
  echo "=== template present? ==="
  openstack coe cluster template show capi-k8s-v1-32 -f value -c uuid 2>/dev/null \
    && echo "template OK" || echo "template ABSENT -- create it below"
} )
```
If the image is ABSENT (fresh deploy -- nothing survives teardown), import it from
the jumphost-staged qcow2. The command is the VERBATIM 2026-06-08 as-executed path
(glance-direct; plain web-download 403s on this cloud). With the hardened bundle's
glance `image-conversion: true` the stored disk_format lands `raw` on the redeploy
(expected -- the as-built run stored qcow2 because conversion was off then):
```bash
( {
  set -u
  source ~/admin-openrc
  if openstack image show ubuntu-jammy-kube-v1.32.13 >/dev/null 2>&1; then
    echo "[SKIP] image ubuntu-jammy-kube-v1.32.13 present"
  else
    SRC="$HOME/ubuntu-jammy-kube-v1.32.13-260401-2014.qcow2"
    [ -f "$SRC" ] || { echo "ABORT: $SRC missing on the jumphost (azimuth-images source; see appendix-B)"; exit 1; }
    glance image-create-via-import \
      --import-method glance-direct \
      --file "$SRC" \
      --container-format bare --disk-format qcow2 \
      --property os_distro=ubuntu --property kube_version=v1.32.13 \
      --name ubuntu-jammy-kube-v1.32.13
  fi
  echo "=== poll to active (3.7G stage + conversion; allow ~10 min) ==="
  for i in $(seq 1 40); do
    ST=$(openstack image show ubuntu-jammy-kube-v1.32.13 -f value -c status 2>/dev/null || echo '?')
    echo "[$i] status=$ST"
    [ "$ST" = active ] && break
    sleep 15
  done
} )
```
GATE: image `active` and the 8.0 property check above passes (kube_version
v1.32.13 / os_distro ubuntu). Then create the template only if absent (spec from
the as-built capture; the two labels
are intentionally the whole config -- chart 0.25.1 + the conf.d drop-in govern the
rest). `--network-driver` is OMITTED deliberately: under the 1.4.0 driver the option
IS honored (it maps to the chart `network_driver`), so to keep the as-built chart
default (Calico) we leave it unset. Setting `flannel` here would now switch the CNI --
do that only if Calico is being intentionally replaced (appendix-A: CNI-label / 1.4.0).
```bash
openstack coe cluster template create capi-k8s-v1-32 \
  --coe kubernetes --server-type vm \
  --image ubuntu-jammy-kube-v1.32.13 \
  --external-network provider-ext \
  --master-flavor gp.mid --flavor capi.node \
  --master-lb-enabled --floating-ip-enabled \
  --dns-nameserver 8.8.8.8 \
  --docker-storage-driver overlay2 \
  --labels fixed_subnet_cidr=10.20.16.0/24,octavia_provider=amphora
```

## Step 8.1 -- Create the workload cluster (MUTATION)
`# RUN: jumphost` (capi-mgmt scope). 1 control-plane + 2 workers, matching the
as-built capi-test-1. The driver auto-mints the app-cred (D-039) and always
provisions an Octavia LB (+FIP) for the API.

```bash
openstack coe cluster create capi-test-1 \
  --cluster-template capi-k8s-v1-32 \
  --keypair capi-mgmt-key \
  --master-count 1 --node-count 2
openstack coe cluster show capi-test-1 -f value -c uuid -c status
```

## Step 8.2 -- Watch to CREATE_COMPLETE; capture the LB/FIP
`# RUN: jumphost` (capi-mgmt scope). Poll; capture run-specific LB id + FIP.
```bash
( {
  for i in $(seq 1 40); do
    S=$(openstack coe cluster show capi-test-1 -f value -c status 2>/dev/null)
    echo "[$i] status=$S"
    case "$S" in CREATE_COMPLETE|CREATE_FAILED) break;; esac
    sleep 30
  done
  echo "=== api endpoint + node counts ==="
  openstack coe cluster show capi-test-1 -f value -c api_address -c master_count -c node_count -c health_status
} )
```
GATE: `status = CREATE_COMPLETE`. Record `api_address` (the FIP endpoint, e.g.
https://10.12.7.180:6443) for 8.3. If `CREATE_FAILED`, see appendix-A (stuck-delete
/ app-cred 403 / OOM). With phase-07's driver, `health_status` should read HEALTHY.

## Step 8.3 -- Retrieve the workload kubeconfig; verify nodes / CNI / addons
`# RUN: jumphost`. Pull the cluster's kubeconfig via Magnum, then inspect.
```bash
# capi-mgmt scope
openstack coe cluster config capi-test-1 --dir ~/capi-test-1 --force
export KUBECONFIG=~/capi-test-1/config
# LIVE-REVIEW: confirm `coe cluster config` returns a usable kubeconfig under the
#   capi-helm driver; alternative is the CAPI kubeconfig secret on the mgmt cluster:
#   KUBECONFIG=~/capi-mgmt.kubeconfig clusterctl -n <magnum-ns> get kubeconfig <cluster-name-suffix>

( {
  export KUBECONFIG=~/capi-test-1/config
  echo "=== nodes (expect 3 Ready, v1.32.13: 1 control-plane + 2 workers) ==="
  kubectl get nodes -o wide
  echo "=== CNI = Calico (chart default; --network-driver omitted) ==="
  kubectl -n kube-system get pods | grep -Ei 'calico|tigera' || kubectl get pods -A | grep -Ei 'calico|tigera'
  echo "=== CCM (OpenStack cloud-controller-manager) + Cinder CSI + CoreDNS Running ==="
  kubectl get pods -A | grep -Ei 'cloud-controller|openstack-cloud|cinder-csi|coredns'
  echo "=== any not-Running pods? (expect none) ==="
  kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
} )
```
GATE: 3 nodes Ready; Calico pods Running; CCM Running (NOT crash-looping -- this is
D-011 item 5); Cinder CSI + CoreDNS Running; no stuck pods.

================================================================================
## Step 8.4 -- D-011 acceptance bar (the gate)
================================================================================
Run each; record pass/fail. Wording adapted to the as-built IP-only endpoints (B5)
where the original D-011 said "hostname."

- **D-011.1 -- All charms active/idle.** `# RUN: jumphost`
  `juju status --format=short | grep -vE 'active|idle' || echo "all active/idle"`
  Pass: nothing but active/idle (phase-03 re-confirmed here).

- **D-011.2 -- API reachability from the jumphost (all public VIPs).** `# RUN: jumphost`
  IP-only: hit each service VIP, e.g. Keystone:
  `curl -sk https://10.12.4.50:5000/v3 -o /dev/null -w '%{http_code}\n'` (expect 200/300).
  Repeat per public VIP (.50-.60 block). Pass: all respond.

- **D-011.3 -- API reachability from a tenant VM (Option B).** `# RUN: mgmt VM`
  The generalized phase-06 GATE 1: a tenant VM reaches the provider VIP.
  `ssh ... ubuntu@10.12.7.40 "timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.4.50/5000' && echo VIP-OK || echo VIP-FAIL" </dev/null`
  Pass: VIP-OK (proves the shared-L2 Option B path).

- **D-011.4 -- Octavia LB pattern re-passes (round-robin, failover, recovery).**
  Round-robin: 2-member pool behind a VIP, repeated curls hit both members.
  Recovery (admin scope): `openstack loadbalancer failover <api-lb-id>` -> watch
  ERROR/PENDING_UPDATE -> ACTIVE (~100s; single STANDALONE amphora -> brief blip;
  operating_status holds ONLINE). (appendix-A: LB-failover; amphora ops are
  admin-scope only.) Pass: round-robin distributes; failover returns to ACTIVE.
  TODO (before sign-off): this runbook does NOT yet contain the build steps for the
  standalone 2-member round-robin pool (LB + listener + pool + 2 backend members +
  health monitor). Add them here, or fold the round-robin check into the
  workload-cluster API LB the driver already builds, before D-011.4 is marked complete.

- **D-011.5 -- End-to-end Magnum CAPI cluster create, CCM not crash-looping.**
  Satisfied by 8.1-8.3 (CREATE_COMPLETE + CCM Running). Pass = that gate.

- **D-011.6 -- Vault unseal (MANUAL is the v1 standard).** `# RUN: jumphost`
  Confirm vault `Sealed=false` now. The v1 standard is MANUAL unseal after a unit
  reboot (3-of-5 key shares entered at the hidden prompt -- see phase-02); auto-unseal
  is an available option, adopted case-by-case (NOT configured in v1). This is a
  re-confirmation at acceptance, not a re-init. Pass: vault unsealed, and the operator
  can re-unseal manually after a reboot.

- **D-011.7 -- KVM snapshot baseline taken.** `# RUN: jumphost hypervisor`
  Per D-012: Snapshot 1 (post-deploy, post-validation, pre-tenant-resources) and
  Snapshot 2 (post-tenant-setup). qcow2-level, per-VM, on the jumphost hypervisor.
  Pass: Snapshot 1 captured (Snapshot 2 after tenant setup).

- **D-011.8 -- Designate zones + tenant hostname resolution.** DEFERRED.
  D-019 deferred Designate (dropped do-doc-10-dns). Also moot under IP-only B5:
  there are no API hostnames to resolve; tenants use IPs/VIPs. Re-scope when DNS
  returns (v2). NOT required for v1 acceptance.

## Step 8.5 -- (Optional) Clean delete verification
`# RUN: jumphost` (capi-mgmt scope). Confirms the manage/teardown path.
```bash
openstack coe cluster delete capi-test-1     # watch coe cluster list to gone
```
If a delete WEDGES (DELETE_IN_PROGRESS, CRs stuck Deleting on an Octavia 403 from a
frozen app-cred): clear the OpenStackCluster finalizer (the Cluster auto-follows),
then manual neutron cleanup in dependency order -- appendix-A: stuck-delete.
```bash
# NS=magnum-674171fd28d446d3a37073b6a761e910
# KUBECONFIG=~/capi-mgmt.kubeconfig kubectl -n $NS patch openstackcluster <cluster>-<suffix> \
#   --type=merge -p '{"metadata":{"finalizers":[]}}'
# then: openstack router remove subnet / router unset external-gateway / router delete /
#       subnet delete / network delete / security group delete  (dependency order)
```

---

## EXIT GATE (phase-08 / v1 acceptance)
- 8.1-8.3 passed: capi-test-1 CREATE_COMPLETE, 3 Ready nodes, Calico, CCM/CSI/CoreDNS, API LB ACTIVE/ONLINE.
- D-011 items 1-7 PASS; item 8 deferred (D-019).
- health_status HEALTHY (phase-07 driver).
- => v1 deployment is ACCEPTED. Project-completion tasks unlocked:
  consolidate the do-doc runbooks into docs/v1-deploy-runbook.md; revert the
  GitBucket repo OpenStack/openstack-caracal-ipv4 to PRIVATE.

## As-built reference (capi-test-1, suffix kgwwe7c4qj6a, 2026-06-09)
- create: `--master-count 1 --node-count 2`; uuid 6de15cf4-8805-4ac2-b413-8de2c48d92cf.
- nodes: control-plane (xsc62) + 2 workers; v1.32.13; Calico CNI.
- API LB id 0f968008-8429-4ac3-8b82-452e126982cf, VIP 10.20.16.144, FIP 10.12.7.180,
  endpoint https://10.12.7.180:6443; single STANDALONE amphora.
- CCM / Cinder CSI / CoreDNS Running; all addons scheduled; CREATE_COMPLETE.
- Incident on the as-built run (recovery patterns -> appendix-A): host OOM SHUTOFF the
  mgmt VM (D-041 manual `openstack server start capi-mgmt-v2`); API LB went
  provisioning_status ERROR -> admin-scope `loadbalancer failover` (ACTIVE ~100s);
  workers held the CAPI uninitialized taint until the mgmt API returned, then addons
  scheduled. Root remediation: D-040 reserved-host-memory 512 -> 8192.
- health_status was UNHEALTHY on the as-built run (cosmetic, D-042) -- phase-07's
  contract-coherent driver clears it to HEALTHY.

## Next
v1 acceptance passes here. Proceed to the project-completion workstream: runbook
consolidation (this phase set -> docs/v1-deploy-runbook.md), appendix-A authoring,
the repo change-list, and reverting repo visibility to private.
