Prove tenant self-service Kubernetes end to end: create a workload cluster from the capi-k8s-v1-32 template, confirm it converges (Ready nodes, CNI, CCM/CSI, API LB), then run the D-011 acceptance bar. Passing D-011 is the gate that unlocks the project-completion tasks.
Decisions: D-011 (acceptance bar; amended by D-019 -- item 8 Designate deferred), D-031/D-036 (driver/engine/chart coherence), D-039 (app-cred roles incl. load-balancer_member), D-040 (reserved-host-memory), D-041 (non-HA mgmt manual start), D-042 (driver contract coherence -> health HEALTHY after phase-07).
Troubleshooting: appendix-A -- stuck-delete finalizer, LB-failover, OOM/manual-start, uninitialized-taint, CNI-label, DOCFIX-021.
provider-ext) exists. The workload cluster's API-LB floating IP and node FIPs are allocated from it.
--master-lb-enabled), so Octavia is a hard prerequisite for workload-cluster create (not optional).
health_status already reports HEALTHY (if the phase-07 1.4.0 upgrade was skipped, expect the COSMETIC UNHEALTHY of D-042 -- functional, but not an acceptance pass).
ubuntu-jammy-kube-v1.32.13 present AND carrying Glance properties kube_version (e.g. v1.32.13) and os_distro=ubuntu. The driver reads the k8s version from the IMAGE, not a template label (P6-CONTRACT / L-P6-3); a missing property fails create.
capi-k8s-v1-32 present (8.0 verifies/creates it).
load-balancer_member (+ member, reader). A frozen pre-D-039 app-cred 403s on the Octavia LB step and wedges create/delete (appendix-A: stuck-delete).
nova-compute reserved-host-memory = 8192 in effect on all compute hosts (baked into the hardened bundle; verify below). Default 512 over-commits the hyperconverged hosts and OOM-kills guests.

ENV(project) capi-mgmt (id 674171fd28d446d3a37073b6a761e910)
ENV(admin-project) admin (id 65ce73e6798e4d1e8dd066609b7033ef)
ENV(template) capi-k8s-v1-32 (uuid e2549d8b-4b89-4947-8b9a-0f4fdbe87d59)
ENV(image) ubuntu-jammy-kube-v1.32.13 (id de69c243-bd1f-4182-8e9e-33933e926857)
ENV(ext-net) provider-ext (id 70b34bb2-3afb-4b43-96d3-f520dbcbf9a8)
ENV(keypair) capi-mgmt-key
ENV(cluster) capi-test-1
ENV(workload-cidr) 10.20.16.0/24
ENV(flavors) master gp.mid (8192/2) ; worker capi.node (4096/2)

Capi-mgmt-scoped (cluster CRUD, show, config):
source ~/admin-openrc
unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID OS_PROJECT_DOMAIN_ID OS_PROJECT_DOMAIN_NAME
export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 # ENV(project)
Admin-scoped (LB amphora/failover -- these 403 under tenant member scope):
source ~/admin-openrc
unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME # token -> admin 65ce73e6...
# RUN: jumphost (capi-mgmt scope). Read-only checks consolidated; template create gated separately. (NOTE: template + image are tenant-setup artifacts; on a fully fresh build they may be produced by the magnum-setup step -- this phase verifies/creates the template for self-containment.)
( {
set -u
echo "=== image present + carries kube_version / os_distro ==="
openstack image show ubuntu-jammy-kube-v1.32.13 -f json \
| python3 -c 'import json,sys;d=json.load(sys.stdin);p=d.get("properties",d);print("kube_version=",d.get("kube_version") or p.get("kube_version"));print("os_distro=",d.get("os_distro") or p.get("os_distro"))'
echo "=== reserved-host-memory (D-040) on a compute unit ==="
juju ssh nova-compute/0 'sudo grep -i reserved_host_memory /etc/nova/nova.conf' </dev/null # expect 8192
echo "=== template present? ==="
openstack coe cluster template show capi-k8s-v1-32 -f value -c uuid 2>/dev/null \
&& echo "template OK" || echo "template ABSENT -- create it below"
} )
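The D-040 check above prints the matching nova.conf line and leaves the comparison to the reader. A minimal sketch of turning it into a pass/fail test follows. The helper name reserved_mem_ok is hypothetical, and the sketch assumes the value appears in nova.conf as an uncommented reserved_host_memory_mb assignment. The prerequisite covers all compute hosts, so the check would need repeating per nova-compute unit.

```shell
# Hypothetical helper (not part of the as-built runbook).
# Reads nova.conf lines on stdin; succeeds only if the last uncommented
# reserved_host_memory_mb assignment equals the expected value (default 8192).
reserved_mem_ok() (
  want="${1:-8192}"
  # Strip carriage returns (remote output may carry them), then extract the number.
  got=$(tr -d '\r' \
    | sed -n 's/^[[:space:]]*reserved_host_memory_mb[[:space:]]*=[[:space:]]*\([0-9][0-9]*\)[[:space:]]*$/\1/p' \
    | tail -n 1)
  [ "$got" = "$want" ]
)

# Usage on the jumphost (unit name as in the check above):
#   juju ssh nova-compute/0 'sudo grep -i reserved_host_memory /etc/nova/nova.conf' </dev/null \
#     | reserved_mem_ok && echo "D-040 OK" || echo "D-040 NOT in effect"
```

A commented-out line or the default 512 makes the helper fail, which is the condition D-040 is meant to rule out.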
Create the template only if absent (spec from the as-built capture; the two labels are intentionally the whole config -- chart 0.25.1 + the conf.d drop-in govern the rest). --network-driver is OMITTED deliberately: under the 1.4.0 driver the option IS honored (it maps to the chart network_driver), so to keep the as-built chart default (Calico) we leave it unset. Setting flannel here would now switch the CNI -- do that only if Calico is being intentionally replaced (appendix-A: CNI-label / 1.4.0).
openstack coe cluster template create capi-k8s-v1-32 \
  --coe kubernetes --server-type vm \
  --image ubuntu-jammy-kube-v1.32.13 \
  --external-network provider-ext \
  --master-flavor gp.mid --flavor capi.node \
  --master-lb-enabled --floating-ip-enabled \
  --dns-nameserver 8.8.8.8 \
  --docker-storage-driver overlay2 \
  --labels fixed_subnet_cidr=10.20.16.0/24,octavia_provider=amphora
# RUN: jumphost (capi-mgmt scope). 1 control-plane + 2 workers, matching the as-built capi-test-1. The driver auto-mints the app-cred (D-039) and always provisions an Octavia LB (+FIP) for the API.
openstack coe cluster create capi-test-1 \
  --cluster-template capi-k8s-v1-32 \
  --keypair capi-mgmt-key \
  --master-count 1 --node-count 2
openstack coe cluster show capi-test-1 -f value -c uuid -c status
# RUN: jumphost (capi-mgmt scope). Poll; capture run-specific LB id + FIP.
( {
for i in $(seq 1 40); do
S=$(openstack coe cluster show capi-test-1 -f value -c status 2>/dev/null)
echo "[$i] status=$S"
case "$S" in CREATE_COMPLETE|CREATE_FAILED) break;; esac
sleep 30
done
echo "=== api endpoint + node counts ==="
openstack coe cluster show capi-test-1 -f value -c api_address -c master_count -c node_count -c health_status
} )
GATE: status = CREATE_COMPLETE. Record api_address (the FIP endpoint, e.g. https://10.12.7.180:6443) for 8.3. If CREATE_FAILED, see appendix-A (stuck-delete / app-cred 403 / OOM). With phase-07's driver, health_status should read HEALTHY.
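The poll loop above stops after 40 attempts without saying whether the cluster reached a terminal state. As a sketch, the three possible outcomes can be made explicit with a small classifier. The name gate_status is hypothetical, and only the two terminal states named in the loop are treated as final.

```shell
# Hypothetical helper: map a Magnum cluster status to a gate verdict.
gate_status() {
  case "$1" in
    CREATE_COMPLETE) echo PASS ;;
    CREATE_FAILED)   echo FAIL ;;
    *)               echo PENDING ;;  # any other value, including an empty status
  esac
}

# Usage after the loop:
#   V=$(gate_status "$(openstack coe cluster show capi-test-1 -f value -c status 2>/dev/null)")
#   [ "$V" = PASS ] || echo "create gate not met: $V (PENDING after 40 polls = timed out)"
```

Treating an empty status as PENDING keeps a transient CLI error from being recorded as a failure.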
# RUN: jumphost. Pull the cluster's kubeconfig via Magnum, then inspect.
# capi-mgmt scope
openstack coe cluster config capi-test-1 --dir ~/capi-test-1 --force
export KUBECONFIG=~/capi-test-1/config
# LIVE-REVIEW: confirm `coe cluster config` returns a usable kubeconfig under the
# capi-helm driver; alternative is the CAPI kubeconfig secret on the mgmt cluster:
# KUBECONFIG=~/capi-mgmt.kubeconfig clusterctl -n <magnum-ns> get kubeconfig <cluster-name-suffix>
( {
export KUBECONFIG=~/capi-test-1/config
echo "=== nodes (expect 3 Ready, v1.32.13: 1 control-plane + 2 workers) ==="
kubectl get nodes -o wide
echo "=== CNI = Calico (chart default; --network-driver omitted) ==="
kubectl -n kube-system get pods | grep -Ei 'calico|tigera' || kubectl get pods -A | grep -Ei 'calico|tigera'
echo "=== CCM (OpenStack cloud-controller-manager) + Cinder CSI + CoreDNS Running ==="
kubectl get pods -A | grep -Ei 'cloud-controller|openstack-cloud|cinder-csi|coredns'
echo "=== any not-Running pods? (expect none) ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
} )
GATE: 3 nodes Ready; Calico pods Running; CCM Running (NOT crash-looping -- this is D-011 item 5); Cinder CSI + CoreDNS Running; no stuck pods.
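The node part of this gate can be checked mechanically rather than by eye. The following is a sketch only: ready_nodes is a hypothetical helper, and it assumes the default column layout of `kubectl get nodes --no-headers` (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Hypothetical helper: count nodes whose STATUS column is exactly "Ready".
# Input: output of `kubectl get nodes --no-headers` on stdin.
ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage (KUBECONFIG exported as above):
#   [ "$(kubectl get nodes --no-headers | ready_nodes)" -eq 3 ] && echo "nodes OK" || echo "nodes NOT converged"
```

The exact match is deliberate: a cordoned node reports `Ready,SchedulingDisabled` and is not counted, so it shows up as a shortfall against the expected 3.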
================================================================================
================================================================================
Run each; record pass/fail. Wording adapted to the as-built IP-only endpoints (B5) where the original D-011 said "hostname."
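D-011.1 -- All charms active/idle.
# RUN: jumphost
# Match both fields together: filtering on 'active|idle' alone would also hide
# a unit that is agent:idle but workload:blocked.
juju status --format=short | grep -vE 'agent:idle, workload:active|^[[:space:]]*$' || echo "all active/idle"
Pass: nothing but active/idle (phase-03 re-confirmed here).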

D-011.2 -- API reachability from the jumphost (all public VIPs).
# RUN: jumphost
IP-only: hit each service VIP, e.g. Keystone:
curl -sk https://10.12.4.50:5000/v3 -o /dev/null -w '%{http_code}\n' # expect 200/300
Repeat per public VIP (.50-.60 block).
Pass: all respond.

D-011.3 -- API reachability from a tenant VM (Option B).
# RUN: mgmt VM
The generalized phase-06 GATE 1: a tenant VM reaches the provider VIP.
ssh ... ubuntu@10.12.7.40 "timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.4.50/5000' && echo VIP-OK || echo VIP-FAIL" </dev/null
Pass: VIP-OK (proves the shared-L2 Option B path).

D-011.4 -- Octavia LB pattern re-passes (round-robin, failover, recovery).
Round-robin: 2-member pool behind a VIP, repeated curls hit both members.
Recovery (admin scope): openstack loadbalancer failover <api-lb-id> -> watch ERROR/PENDING_UPDATE -> ACTIVE (~100s; single STANDALONE amphora -> brief blip; operating_status holds ONLINE). (appendix-A: LB-failover; amphora ops are admin-scope only.)
Pass: round-robin distributes; failover returns to ACTIVE.
TODO (before sign-off): this runbook does NOT yet contain the build steps for the standalone 2-member round-robin pool (LB + listener + pool + 2 backend members + health monitor). Add them here, or fold the round-robin check into the workload-cluster API LB the driver already builds, before D-011.4 is marked complete.

D-011.5 -- End-to-end Magnum CAPI cluster create, CCM not crash-looping.
Satisfied by 8.1-8.3 (CREATE_COMPLETE + CCM Running). Pass = that gate.

D-011.6 -- Vault unseal (MANUAL is the v1 standard).
# RUN: jumphost
Confirm vault Sealed=false now. The v1 standard is MANUAL unseal after a unit reboot (3-of-5 key shares entered at the hidden prompt -- see phase-02); auto-unseal is an available option, adopted case-by-case (NOT configured in v1). This is a re-confirmation at acceptance, not a re-init.
Pass: vault unsealed, and the operator can re-unseal manually after a reboot.

D-011.7 -- KVM snapshot baseline taken.
# RUN: jumphost hypervisor
Per D-012: Snapshot 1 (post-deploy, post-validation, pre-tenant-resources) and Snapshot 2 (post-tenant-setup). qcow2-level, per-VM, on the jumphost hypervisor.
Pass: Snapshot 1 captured (Snapshot 2 after tenant setup).

D-011.8 -- Designate zones + tenant hostname resolution. DEFERRED.
D-019 deferred Designate (dropped do-doc-10-dns). Also moot under IP-only B5: there are no API hostnames to resolve; tenants use IPs/VIPs. Re-scope when DNS returns (v2). NOT required for v1 acceptance.
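
Two of the checks above, D-011.2 and D-011.4, reduce to small predicates that can be scripted when recording pass/fail. The following is a sketch under stated assumptions, not part of the as-built runbook. Both function names are hypothetical. The D-011.4 helper assumes each pool member answers with a body that identifies it, which the runbook does not yet specify (see the TODO), and `<pool-vip>` in the usage comment is a placeholder.

```shell
# Hypothetical helpers for recording D-011.2 and D-011.4 results.

# D-011.2: succeed when the HTTP status is one the runbook expects (200 or 300).
vip_code_ok() {
  case "$1" in
    200|300) return 0 ;;
    *)       return 1 ;;
  esac
}

# D-011.4: read one response body per line on stdin and print the number of
# distinct non-empty bodies. ASSUMPTION: each member returns a body unique to it.
distinct_members() {
  grep . | sort -u | wc -l | tr -d ' '
}

# Usage:
#   code=$(curl -sk https://10.12.4.50:5000/v3 -o /dev/null -w '%{http_code}')
#   vip_code_ok "$code" && echo "keystone PASS" || echo "keystone FAIL ($code)"
#   for i in 1 2 3 4 5 6; do curl -s http://<pool-vip>/; echo; done | distinct_members   # 2 = both members hit
```

curl reports 000 when it cannot connect, so an unreachable VIP fails vip_code_ok instead of being silently skipped.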
# RUN: jumphost (capi-mgmt scope). Confirms the manage/teardown path.
openstack coe cluster delete capi-test-1 # watch coe cluster list to gone
If a delete WEDGES (DELETE_IN_PROGRESS, CRs stuck Deleting on an Octavia 403 from a frozen app-cred): clear the OpenStackCluster finalizer (the Cluster auto-follows), then manual neutron cleanup in dependency order -- appendix-A: stuck-delete.
# NS=magnum-674171fd28d446d3a37073b6a761e910
# KUBECONFIG=~/capi-mgmt.kubeconfig kubectl -n $NS patch openstackcluster <cluster>-<suffix> \
# --type=merge -p '{"metadata":{"finalizers":[]}}'
# then: openstack router remove subnet / router unset --external-gateway / router delete /
# subnet delete / network delete / security group delete (dependency order)
--master-count 1 --node-count 2; uuid 6de15cf4-8805-4ac2-b413-8de2c48d92cf.
openstack server start capi-mgmt-v2); API LB went provisioning_status ERROR -> admin-scope loadbalancer failover (ACTIVE ~100s); workers held the CAPI uninitialized taint until the mgmt API returned, then addons scheduled. Root remediation: D-040 reserved-host-memory 512 -> 8192.

v1 acceptance passes here. Proceed to the project-completion workstream: runbook consolidation (this phase set -> docs/v1-deploy-runbook.md), appendix-A authoring, the repo change-list, and reverting repo visibility to private.