diff --git a/bundle.yaml b/bundle.yaml index 1a2f21e..0c9afc2 100644 --- a/bundle.yaml +++ b/bundle.yaml @@ -477,8 +477,7 @@ to: [lxd:8] options: openstack-origin: *openstack-origin - # TODO(d008): set nameservers per omega.dc0.vr0 zone before deploy - # nameservers: "ns1.omega.dc0.vr0.cloud.neumatrix.local. ns2.omega.dc0.vr0.cloud.neumatrix.local." + nameservers: "ns1.omega.dc0.vr0.cloud.neumatrix.local. ns2.omega.dc0.vr0.cloud.neumatrix.local." vip: 10.12.4.227 os-public-hostname: designate.omega.dc0.vr0.cloud.neumatrix.local bindings: *api-bindings @@ -493,7 +492,9 @@ channel: 2024.1/stable num_units: 1 to: [lxd:8] - bindings: *internal-bindings + bindings: + "": provider # unit on provider so bind9:53 reachable from tenants (D-003) + cluster: metal # peer traffic stays internal (decorative with num_units=1) constraints: arch=amd64 # ===================================================================== diff --git a/runbooks/04a-capi-bootstrap-cluster.md b/runbooks/04a-capi-bootstrap-cluster.md index 8183b50..d98f20c 100644 --- a/runbooks/04a-capi-bootstrap-cluster.md +++ b/runbooks/04a-capi-bootstrap-cluster.md @@ -1,412 +1,1056 @@ -# Runbook 04a — CAPI bootstrap cluster install on capi-mgmt.maas +# Runbook 04a — CAPI bootstrap cluster -**Reference:** D-017 (full rebuild every cycle). Runs after `04-magnum-domain.md` -and before `05-magnum-capi-driver.md`. +**Status:** Executes after `02-deploy.md` (cloud up + all charms active/idle) +and `03-vault-init.md` (Vault initialized + root CA available). Precedes +`05-magnum-capi-driver.md` (driver graft consumes the workload kubeconfig +produced here). -**Goal:** From a MAAS-Ready `capi-mgmt` VM, produce a single-node k3s -running cluster-api, CAPO, canonical-kubernetes providers, cert-manager, -and ORC, with a workload-cluster kubeconfig delivered to the jumphost -for use by the Magnum CAPI driver in runbook 05. +**D-017 posture:** L3 full teardown and rebuild every deployment cycle. 
+Nothing is preserved across cycles. capi-mgmt is wiped to MAAS Ready on +teardown; rebuilt from scratch by this runbook. -**Pre-conditions:** +**Cross-references:** +- D-017 (CAPI bootstrap cluster lifecycle) +- D-007 (Magnum two-layer install) +- D-002 (channel matrix — informs Vault CA chain) +- Workstream 3b decision (2026-05-22): ship Vault CA (no tls-insecure); pivot mandatory -- OpenStack cloud is up and stable (`02-deploy.md` complete, all units - active/idle) -- Magnum trustee domain is created (`04-magnum-domain.md` complete) -- `capi-mgmt` MAAS machine is in **Ready** state (released after teardown, - not yet deployed) -- Jumphost has `~/admin-openrc` sourced and an authenticated `openstack` - CLI working against the new Caracal cloud -- Vault CA bundle is available on the jumphost at a known path - (issued by the Caracal Vault during `03-vault-init.md`) +--- -**Network preconditions:** +## 1. Purpose & scope -- `capi-mgmt` machine should be configured in MAAS with two interfaces: - - `eth0` on the metal fabric (DHCP from MAAS) — used for k3s API bind - - `eth1` on the provider fabric (static IP, no DHCP) — used for - workload-cluster FIP reach. This IP must NOT fall inside the Neutron - FIP allocation pool on the ext_net subnet. -- Verify the eth1 IP is outside the FIP pool before deploy: - ```bash - openstack subnet show -c allocation_pools -c gateway_ip - ``` +This runbook stands up the CAPI bootstrap cluster on `capi-mgmt.maas` and +pivots cluster state into a self-managing workload cluster. Output: -## Step 1 — Deploy Ubuntu 24.04 to capi-mgmt via MAAS +1. **Workload K8s cluster** (`capi-mgmt-cluster`) running in tenant VMs on + the cloud, self-managing post-pivot. +2. **Workload kubeconfig** copied to jumphost at a known path. Consumed by + `runbooks/05-magnum-capi-driver.md` for the Magnum CAPI Helm driver + graft. +3. **No remaining state** on the bootstrap k3s VM after pivot. capi-mgmt + becomes a disposable jump host. 
-Use MAAS UI: Machines → capi-mgmt → Take action → Deploy → Ubuntu 24.04 -LTS (Noble) → Deploy machine. Wait for Deployed status (~10 min). +**Scope:** v1 testcloud. Roosevelt deltas in section 20. -Verify SSH reachability once Deployed (note: SSH user is `ubuntu`, not -`jessea123`; MAAS cloud-init pattern): +**Out of scope:** + +- Magnum-side configuration (runbook 05). +- Workload cluster's tenant lifecycle (Magnum's job, not this runbook's). +- Backup / DR for the workload cluster (Roosevelt concern). + +--- + +## 2. Decisions captured + +Per workstream 3b sign-off (2026-05-22): + +| Decision | Choice | Roosevelt parallel | +|---|---|---| +| Version pinning | Pin-at-execution with discovery in §4 | Same pattern; pins captured in deploy record | +| Cloud TLS trust | Ship Vault CA to capi-mgmt + workload nodes (no `tls-insecure`) | Image-baked CA; CK8sConfig redundancy | +| `clusterctl move` pivot | Mandatory; workload cluster becomes self-managing | Same | +| K8s flavor | Canonical Kubernetes (CK8s) | Same | +| OpenStack auth | v3applicationcredential | Same | +| Pod CIDR | `10.244.0.0/16` | Same (does not conflict with cloud `10.12.0.0/16` or tenant pool `10.20.0.0/16`) | +| Service CIDR | `10.96.0.0/12` | Same | +| Workload cluster name | `capi-mgmt-cluster` | Same | +| Workload node SSH user | `ubuntu` (MAAS/cloud-init convention) | Same | + +**Naming convention:** + +- Keystone project for CAPI: `capi-mgmt` (in `admin_domain`) +- Keystone user for CAPI: `capo` (CAPO operator) +- App credential: `capo-app-cred` +- Workload image (Glance): `noble-amd64` (existing; do NOT duplicate as `ubuntu-24.04-capi` — Bobcat lesson) +- Workload flavor: `capi-mgmt-node` (4 vCPU / 4 GiB / 30 GB) — control plane node sizing + +--- + +## 3. 
Prerequisites + +| Prereq | Verification | +|---|---| +| Cloud deployed; all charms `active/idle` per D-011 | `juju status --color\| grep -v "active.*idle"` returns only the header | +| Vault initialized + unsealed | `juju ssh vault/leader -- sudo vault status` shows `Sealed=false` | +| Vault root CA available on jumphost | `test -f $HOME/vault-pki/root-ca.pem && openssl x509 -in $HOME/vault-pki/root-ca.pem -noout -subject` | +| Keystone reachable via FQDN | `curl -sf --cacert $HOME/vault-pki/root-ca.pem https://keystone.omega.dc0.vr0.cloud.neumatrix.local:5000/v3 \| jq .version.id` returns `"v3.14"` or current | +| capi-mgmt VM exists in MAAS as Ready | `maas $MAAS_PROFILE machines read \| jq '.[] \| select(.hostname=="capi-mgmt") \| .status_name'` returns `"Ready"` | +| Admin openrc available | `test -f $HOME/admin-openrc && source $HOME/admin-openrc && openstack token issue \| head -3` | +| Workspace path under $HOME (snap confinement) | `WORK=$HOME/capi-bootstrap; mkdir -p "$WORK"; cd "$WORK"; pwd` shows under home | + +**Set shell context for the runbook:** ```bash -ssh ubuntu@ 'hostname; uname -a; ip -br a' +export REPO=$HOME/repos/openstack-caracal-ipv4 # adjust if your clone is elsewhere +export WORK=$HOME/capi-bootstrap # runbook scratch dir +export VAULT_CA=$HOME/vault-pki/root-ca.pem # Vault root CA (from runbook 03) +export CAPI_MGMT_METAL_IP=10.12.8.21 # capi-mgmt metal interface +export CAPI_MGMT_PROVIDER_IP=10.12.4.21 # capi-mgmt provider interface +export CLUSTER_NAME=capi-mgmt-cluster +mkdir -p "$WORK" +cd "$WORK" ``` -Verify both interfaces show their expected IPs. +--- -## Step 2 — Install Vault CA on the bootstrap host +## 4. Version discovery (set pins) -The bootstrap host must trust the Caracal Vault root CA so that `openstack` -CLI calls and CAPO authentication to Keystone succeed over HTTPS. +Bobcat ran "dynamic latest." 
This runbook pins explicit versions captured at +execution time, with the discovery procedure documented inline so each +rebuild's pins are reproducible AND traceable. + +**GitHub API: authenticated vs unauthenticated.** Unauth has 60 req/hr; +authenticated has 5000. For multiple rebuilds in a day, set a token: ```bash -# From jumphost — replace with the deployed capi-mgmt IP -scp /vault-ca.crt ubuntu@:/tmp/vault-ca.crt - -ssh ubuntu@ << 'REMOTE' -sudo install -m 0644 /tmp/vault-ca.crt /usr/local/share/ca-certificates/vault-ca.crt -sudo update-ca-certificates -# Verify Keystone reachable with TLS -curl --cacert /etc/ssl/certs/ca-certificates.crt https://:5000/v3 -s -o /dev/null -w "%{http_code}\n" -# Expect: 200 -REMOTE +# Optional but recommended — avoids rate-limit headaches during rebuild +export GITHUB_TOKEN= +# Or skip if you can tolerate ~10 API calls slowly ``` -## Step 3 — Install k3s - -k3s defaults to binding 0.0.0.0:6443. Bind to the metal-network IP only to -keep the management API off the provider network. The TLS-SAN flags must -include both the IP and the FQDN. k3s does NOT auto-add 127.0.0.1 to the -SAN list; if 127.0.0.1 needs to be in the kubeconfig, add it explicitly as -a `--tls-san`. We do not — we rewrite the kubeconfig server URL instead. 
+**Discover current stable releases:** ```bash -ssh ubuntu@ 'bash -s' << 'REMOTE' +cd "$WORK" + +# Helper: fetch latest stable release tag from a GitHub repo +gh_latest() { + local repo=$1 + local auth=() + [ -n "${GITHUB_TOKEN:-}" ] && auth=(-H "Authorization: Bearer $GITHUB_TOKEN") + curl -sfL "${auth[@]}" "https://api.github.com/repos/$repo/releases/latest" \ + | jq -r '.tag_name' +} + +# Pin captures (one file per pin, for the deploy-record convention) +mkdir -p pins +gh_latest "kubernetes-sigs/cluster-api" | tee pins/CAPI_VERSION +gh_latest "kubernetes-sigs/cluster-api-provider-openstack" | tee pins/CAPO_VERSION +gh_latest "canonical/cluster-api-k8s" | tee pins/CK8S_VERSION +gh_latest "cert-manager/cert-manager" | tee pins/CERT_MANAGER_VERSION +gh_latest "k-orc/openstack-resource-controller" | tee pins/ORC_VERSION +gh_latest "k3s-io/k3s" | tee pins/K3S_VERSION +gh_latest "helm/helm" | tee pins/HELM_VERSION + +# Load into shell +export CAPI_VERSION=$(cat pins/CAPI_VERSION) +export CAPO_VERSION=$(cat pins/CAPO_VERSION) +export CK8S_VERSION=$(cat pins/CK8S_VERSION) +export CERT_MANAGER_VERSION=$(cat pins/CERT_MANAGER_VERSION) +export ORC_VERSION=$(cat pins/ORC_VERSION) +export K3S_VERSION=$(cat pins/K3S_VERSION) +export HELM_VERSION=$(cat pins/HELM_VERSION) + +# Display for the deploy log +cat pins/*_VERSION | paste -d= <(ls pins/) - +``` + +**Sanity check:** all values should look like `v1.X.Y` or `v0.X.Y`. If any +returned `null` or empty, the GitHub API call failed — most likely +rate-limited. Wait an hour or set `$GITHUB_TOKEN` and retry. + +**Capture pins as a deploy record:** + +The pin files in `$WORK/pins/` should be copied into a deploy-log artifact +(NOT committed to the repo — these are deploy-time captures). Suggested +location: `$HOME/deploy-records/$(date +%Y%m%d-%H%M%S)/capi-pins/`. 
+ +```bash +DEPLOY_RECORD=$HOME/deploy-records/$(date +%Y%m%d-%H%M%S)/capi-pins +mkdir -p "$DEPLOY_RECORD" +cp pins/*_VERSION "$DEPLOY_RECORD/" +ls -la "$DEPLOY_RECORD/" +``` + +--- + +## 5. MAAS-deploy capi-mgmt + +Prerequisite: capi-mgmt MAAS machine is in `Ready` state (see §3). +Network config in MAAS: + +- **eth0** on metal fabric, DHCP → `10.12.8.21` (MAAS-pinned static lease) +- **eth1** on provider fabric, static → `10.12.4.21` + +Deploy Ubuntu 24.04 (Noble): + +```bash +# Get the capi-mgmt system_id from MAAS +CAPI_MGMT_SYSTEM_ID=$(maas $MAAS_PROFILE machines read \ + | jq -r '.[] | select(.hostname=="capi-mgmt") | .system_id') +echo "capi-mgmt system_id: $CAPI_MGMT_SYSTEM_ID" + +# Deploy +maas $MAAS_PROFILE machine deploy "$CAPI_MGMT_SYSTEM_ID" \ + distro_series=noble \ + hwe_kernel=ga-24.04 +``` + +Poll for `Deployed`: + +```bash +while true; do + STATUS=$(maas $MAAS_PROFILE machine read "$CAPI_MGMT_SYSTEM_ID" \ + | jq -r '.status_name') + echo "$(date -Is) capi-mgmt status: $STATUS" + [ "$STATUS" = "Deployed" ] && break + [ "$STATUS" = "Failed deployment" ] && { echo "FAILED"; exit 1; } + sleep 30 +done +``` + +Typical deploy time: 5-8 minutes on this hardware. + +**SSH reachability:** + +```bash +# MAAS .maas zone may not resolve from jumphost — use IP directly per handoff lessons +ssh -o StrictHostKeyChecking=accept-new ubuntu@$CAPI_MGMT_METAL_IP -- hostname +# Expect: capi-mgmt +``` + +> **Gotcha:** MAAS-deployed Ubuntu uses the `ubuntu` user, not `jessea123`. +> See handoff "recurring technical pitfalls." + +--- + +## 6. 
SSH bootstrap + Vault CA install + +On the jumphost, prepare a transport bundle of essentials: + +```bash +mkdir -p "$WORK/bootstrap-bundle" +cp "$VAULT_CA" "$WORK/bootstrap-bundle/vault-ca.crt" +chmod 644 "$WORK/bootstrap-bundle/vault-ca.crt" + +# Bundle pin files so capi-mgmt can read versions +cp -r "$WORK/pins" "$WORK/bootstrap-bundle/" +``` + +SCP and install Vault CA on capi-mgmt: + +```bash +scp -r "$WORK/bootstrap-bundle" ubuntu@$CAPI_MGMT_METAL_IP:/home/ubuntu/ + +ssh ubuntu@$CAPI_MGMT_METAL_IP <<'EOF' set -euo pipefail -BIND_ADDR=$(ip -4 -br a show eth0 | awk '{print $3}' | cut -d/ -f1) -echo "bind addr: $BIND_ADDR" -if systemctl is-active --quiet k3s; then - echo "[skip] k3s already running" -else - curl -sfL https://get.k3s.io | \ - INSTALL_K3S_EXEC="server \ - --bind-address=${BIND_ADDR} \ - --advertise-address=${BIND_ADDR} \ - --node-ip=${BIND_ADDR} \ - --tls-san=${BIND_ADDR} \ - --tls-san=capi-mgmt.maas \ - --write-kubeconfig-mode=0644 \ - --disable=traefik" \ - sh - -fi +# Install Vault CA as a system-trusted root +sudo cp /home/ubuntu/bootstrap-bundle/vault-ca.crt /usr/local/share/ca-certificates/ +sudo update-ca-certificates 2>&1 | tail -3 -# Wait for Ready +# Verify +openssl s_client -connect keystone.omega.dc0.vr0.cloud.neumatrix.local:5000 \ + -CApath /etc/ssl/certs -verify_return_error </dev/null 2>&1 \ + | grep -E "(Verify return code|subject=)" || \ + { echo "TLS chain verify failed against Keystone — investigate before proceeding"; exit 1; } + +# Update apt + base utilities +sudo apt-get update -qq +sudo apt-get install -y -qq jq curl yq + +# Confirm +which jq curl yq +EOF +``` + +**Expected:** + +- `update-ca-certificates` reports "1 added" +- `openssl s_client` shows `Verify return code: 0 (ok)` and a Keystone cert + whose chain terminates at the Vault CA + +> **Why this matters:** Bobcat used `tls-insecure=true` in cloud.conf which +> skipped this entire trust path. 
Our workstream 3b decision (ship Vault CA) +> means OCCM and CAPO will validate certs against this trust store. If TLS +> verify fails here, OCCM will crashloop later. + +--- + +## 7. k3s install + +On capi-mgmt: + +```bash +ssh ubuntu@$CAPI_MGMT_METAL_IP "K3S_VERSION=$K3S_VERSION CAPI_MGMT_METAL_IP=$CAPI_MGMT_METAL_IP bash -s" <<'REMOTE_EOF' +set -euo pipefail + +# Install k3s with explicit bind/advertise/SAN flags +curl -sfL https://get.k3s.io | \ + INSTALL_K3S_VERSION="$K3S_VERSION" \ + sh -s - server \ + --bind-address="$CAPI_MGMT_METAL_IP" \ + --advertise-address="$CAPI_MGMT_METAL_IP" \ + --node-ip="$CAPI_MGMT_METAL_IP" \ + --tls-san="$CAPI_MGMT_METAL_IP" \ + --tls-san=capi-mgmt.maas \ + --write-kubeconfig-mode=0644 \ + --disable=traefik + +# Wait for k3s API to respond for i in $(seq 1 30); do - if sudo k3s kubectl get nodes 2>/dev/null | awk 'NR>1 && $2=="Ready"{n++} END{exit n<1}'; then - echo "[ok] node Ready after ${i} polls" - break + if sudo kubectl get nodes 2>/dev/null | grep -qw "Ready"; then + echo "k3s ready"; break fi - sleep 2 + echo "Waiting for k3s API... ($i/30)" + sleep 5 done -# Copy and rewrite kubeconfig -sudo install -o ubuntu -g ubuntu -m 0600 /etc/rancher/k3s/k3s.yaml /home/ubuntu/.kube-bootstrap.yaml -sed -i "s|server: https://127\\.0\\.0\\.1:6443|server: https://${BIND_ADDR}:6443|" /home/ubuntu/.kube-bootstrap.yaml -grep '^ server:' /home/ubuntu/.kube-bootstrap.yaml - -KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml kubectl get nodes -REMOTE +sudo kubectl get nodes +sudo kubectl get pods -A +REMOTE_EOF ``` -## Step 4 — Install helm and clusterctl +> **Gotcha:** `--bind-address=$IP` makes k3s listen ONLY on that IP — not +> also on 127.0.0.1. The default kubeconfig at +> `/etc/rancher/k3s/k3s.yaml` has `server: https://127.0.0.1:6443` and will +> NOT work as-is. Sed-rewrite below. -kubectl is provided by k3s as a symlink; do not re-install. +--- + +## 8. 
Kubeconfig server-URL rewrite ```bash -ssh ubuntu@ 'bash -s' << 'REMOTE' +ssh ubuntu@$CAPI_MGMT_METAL_IP "CAPI_MGMT_METAL_IP=$CAPI_MGMT_METAL_IP bash -s" <<'REMOTE_EOF' set -euo pipefail -# helm -if ! command -v helm >/dev/null 2>&1; then - curl -fL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash -fi +# Copy k3s kubeconfig to ubuntu user; rewrite server URL +mkdir -p /home/ubuntu/.kube +sudo cp /etc/rancher/k3s/k3s.yaml /home/ubuntu/.kube/config +sudo chown ubuntu:ubuntu /home/ubuntu/.kube/config +chmod 600 /home/ubuntu/.kube/config + +# Rewrite 127.0.0.1 → metal IP +sed -i "s|server: https://127.0.0.1:6443|server: https://$CAPI_MGMT_METAL_IP:6443|" \ + /home/ubuntu/.kube/config + +# Verify rewrite +grep "server:" /home/ubuntu/.kube/config +# Expect: server: https://10.12.8.21:6443 + +# Confirm kubectl works as ubuntu user (no sudo) +kubectl get nodes +REMOTE_EOF +``` + +--- + +## 9. helm + clusterctl install + +```bash +ssh ubuntu@$CAPI_MGMT_METAL_IP "HELM_VERSION=$HELM_VERSION CAPI_VERSION=$CAPI_VERSION bash -s" <<'REMOTE_EOF' +set -euo pipefail + +# helm install (get-helm-3 fetches the version we specify) +cd /tmp +curl -sfL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \ + | DESIRED_VERSION="$HELM_VERSION" bash helm version --short -# clusterctl — fetch latest from GitHub API, fall back to a pinned version if needed -if ! 
command -v clusterctl >/dev/null 2>&1; then - CLUSTERCTL_VER=$(curl -fsSL --max-time 15 \ - https://api.github.com/repos/kubernetes-sigs/cluster-api/releases/latest \ - | python3 -c 'import json,sys; print(json.load(sys.stdin)["tag_name"])') - curl -fLo /tmp/clusterctl --max-time 60 \ - "https://github.com/kubernetes-sigs/cluster-api/releases/download/${CLUSTERCTL_VER}/clusterctl-linux-amd64" - sudo install -o root -g root -m 0755 /tmp/clusterctl /usr/local/bin/clusterctl - rm /tmp/clusterctl -fi +# clusterctl install +CLUSTERCTL_URL="https://github.com/kubernetes-sigs/cluster-api/releases/download/${CAPI_VERSION}/clusterctl-linux-amd64" +sudo curl -sfL "$CLUSTERCTL_URL" -o /usr/local/bin/clusterctl +sudo chmod +x /usr/local/bin/clusterctl clusterctl version -REMOTE +REMOTE_EOF ``` -## Step 5 — clusterctl init with canonical-kubernetes providers +--- + +## 10. clusterctl init (CAPI controllers + cert-manager + ORC + CAPO + CK8s) ```bash -ssh ubuntu@ 'bash -s' << 'REMOTE' +ssh ubuntu@$CAPI_MGMT_METAL_IP "CK8S_VERSION=$CK8S_VERSION CERT_MANAGER_VERSION=$CERT_MANAGER_VERSION ORC_VERSION=$ORC_VERSION CAPO_VERSION=$CAPO_VERSION bash -s" <<'REMOTE_EOF' set -euo pipefail +# Configure clusterctl with provider URLs mkdir -p ~/.cluster-api -cat > ~/.cluster-api/clusterctl.yaml << 'CONFIG' +cat > ~/.cluster-api/clusterctl.yaml </dev/null 2>&1; then - echo "[skip] CAPI already initialized" -else - clusterctl init \ - --infrastructure openstack \ - --bootstrap canonical-kubernetes \ - --control-plane canonical-kubernetes +# Wait for controllers to be Ready +kubectl wait --for=condition=Available --timeout=5m \ + deployment --all -n capi-system +kubectl wait --for=condition=Available --timeout=5m \ + deployment --all -n capi-kubeadm-bootstrap-system 2>/dev/null || true +kubectl wait --for=condition=Available --timeout=5m \ + deployment --all -n capo-system +kubectl wait --for=condition=Available --timeout=5m \ + deployment --all -n cert-manager + +# Install ORC +kubectl apply -f 
"https://github.com/k-orc/openstack-resource-controller/releases/download/${ORC_VERSION}/install.yaml" +kubectl wait --for=condition=Available --timeout=5m \ + deployment --all -n orc-system + +# Confirm all controllers +kubectl get pods -A | grep -v "Running\|Completed" | grep -v NAME || true +# Expected: empty output (all pods Running or no abnormal state) +REMOTE_EOF +``` + +> **Gotcha:** the actual namespace names (`capi-system`, `capo-system`, etc.) +> are conventions. If a controller fails to land in the expected namespace, +> `kubectl get deployment -A` lists all deployments — diagnose from there. + +--- + +## 11. Cloud-side prep (Keystone, Nova, Glance) + +Back on the jumphost: + +```bash +source $HOME/admin-openrc + +# Inventory existing resources FIRST (Bobcat lesson: don't create duplicates) +echo "=== Existing images ===" +openstack image list -c ID -c Name -f json | jq -r '.[] | "\(.Name)\t\(.ID)"' +echo "" +echo "=== Existing flavors ===" +openstack flavor list -c Name -c ID -c RAM -c VCPUs -c Disk -f json \ + | jq -r '.[] | "\(.Name)\tRAM=\(.RAM)\tCPU=\(.VCPUs)\tDisk=\(.Disk)\tID=\(.ID)"' +echo "" +echo "=== Existing keypairs ===" +openstack keypair list +echo "" +echo "=== Existing projects in admin_domain ===" +openstack project list --domain admin_domain +``` + +**Create / verify resources:** + +```bash +# Keystone project + user +openstack project show capi-mgmt --domain admin_domain 2>/dev/null \ + || openstack project create capi-mgmt --domain admin_domain --description "CAPI management plane" + +openstack user show capo --domain admin_domain 2>/dev/null \ + || openstack user create capo --domain admin_domain --password-prompt --description "CAPO operator" + +# Role assignments (CAPO needs member + load-balancer_member at minimum; +# admin works for testcloud — Roosevelt should use least-privilege) +openstack role add --user capo --user-domain admin_domain \ + --project capi-mgmt --project-domain admin_domain \ + member + +openstack role add 
--user capo 
--user-domain admin_domain \ + --project capi-mgmt --project-domain admin_domain \ + load-balancer_member 2>/dev/null || \ + echo "(load-balancer_member role may not exist if Octavia not deployed yet)" + +# Application credential — captured to file under $HOME (snap confinement) +APP_CRED_FILE=$WORK/capo-app-cred.json +openstack --os-username capo --os-user-domain-name admin_domain \ + --os-project-name capi-mgmt --os-project-domain-name admin_domain \ + application credential create capo-app-cred \ + --description "CAPO operator app credential" \ + -f json > "$APP_CRED_FILE" +chmod 600 "$APP_CRED_FILE" + +# Extract credential ID + secret +export APP_CRED_ID=$(jq -r '.id' "$APP_CRED_FILE") +export APP_CRED_SECRET=$(jq -r '.secret' "$APP_CRED_FILE") +echo "App cred ID: $APP_CRED_ID" +``` + +**Nova keypair (workload node SSH key):** + +```bash +# Generate fresh keypair locally (do NOT reuse jumphost personal key) +ssh-keygen -t ed25519 -N '' -f "$WORK/capi-workload-key" \ + -C "capi-workload-$(date +%Y%m%d)" +chmod 600 "$WORK/capi-workload-key" + +# Upload public key to Nova as a keypair +openstack keypair create --public-key "$WORK/capi-workload-key.pub" capi-workload-key +openstack keypair show capi-workload-key +``` + +**Workload image:** + +```bash +# Inventory check — use noble-amd64 if it exists (Bobcat lesson: do NOT create ubuntu-24.04-capi as a dup) +NOBLE_IMAGE_ID=$(openstack image show noble-amd64 -c id -f value 2>/dev/null || echo "") + +if [ -z "$NOBLE_IMAGE_ID" ]; then + echo "noble-amd64 image not found — upload required." 
+ echo "(Pull from https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img" + echo " then: openstack image create --disk-format qcow2 --container-format bare \\" + echo " --public --file noble-server-cloudimg-amd64.img noble-amd64)" + exit 1 fi - -# Wait for all controller deployments -for ns in cert-manager capi-system cabpck-system cacpck-system capo-system; do - echo "[wait] ${ns}" - kubectl wait --for=condition=Available deployment --all --namespace "${ns}" --timeout=5m -done - -clusterctl version -kubectl get pods -A -REMOTE +echo "Using image: noble-amd64 ($NOBLE_IMAGE_ID)" +export WORKLOAD_IMAGE_ID=$NOBLE_IMAGE_ID ``` -Expected namespaces (note the abbreviated canonical-kubernetes names): - -- `cert-manager` -- `capi-system` — cluster-api core -- `capo-system` — CAPI provider for OpenStack -- `cabpck-system` — CAPI Bootstrap Provider Canonical Kubernetes -- `cacpck-system` — CAPI Control-Plane Provider Canonical Kubernetes - -## Step 6 — Install ORC (OpenStack Resource Controller) - -Required by CAPO for managing OpenStack resources as Kubernetes objects. -Verify the latest release URL before applying. 
+**Workload flavor:** ```bash -ssh ubuntu@ 'bash -s' << 'REMOTE' -set -euo pipefail -export KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml +openstack flavor show capi-mgmt-node 2>/dev/null \ + || openstack flavor create capi-mgmt-node \ + --vcpus 4 --ram 4096 --disk 30 \ + --description "CAPI workload node (control plane sizing)" -ORC_URL="https://github.com/k-orc/openstack-resource-controller/releases/latest/download/install.yaml" -kubectl apply -f "$ORC_URL" - -# Wait for ORC controller -sleep 5 -for ns in $(kubectl get ns -o name | grep -E '^namespace/(orc|openstack-resource-controller)' | sed 's|namespace/||'); do - echo "[wait] ${ns}" - kubectl wait --for=condition=Available deployment --all --namespace "${ns}" --timeout=5m -done -REMOTE +export WORKLOAD_FLAVOR=capi-mgmt-node ``` -## Step 7 — Cloud-side preparation (run from jumphost) +--- -Inventory existing images and flavors before creating. Lesson from prior -cycles: do not blindly create `ubuntu-24.04-capi` when `noble-amd64` is -already present and suitable. +## 12. clouds.yaml + cloud.conf composition (with Vault CA, no tls-insecure) + +The workload cluster's OCCM (OpenStack Cloud Controller Manager) and CAPO both +need to call OpenStack APIs. 
Two files: + +- `clouds.yaml` — CAPO's view of how to reach OpenStack (used at cluster + creation time on capi-mgmt) +- `cloud.conf` — OCCM's view, injected into the workload cluster's k8s + Secret (used continuously by OCCM running in the workload cluster) + +**Compose clouds.yaml:** ```bash -source ~/admin-openrc -openstack image list | grep -i noble -openstack flavor list -``` - -Create the supporting cloud-side resources for CAPO: - -```bash -# Project -openstack project create --domain admin_domain capi-mgmt \ - --description "CAPI management cluster workloads" - -# User -openstack user create --domain admin_domain --project capi-mgmt \ - --project-domain admin_domain --password-prompt capo - -# Roles -openstack role add --project capi-mgmt --project-domain admin_domain \ - --user capo --user-domain admin_domain member -openstack role add --project capi-mgmt --project-domain admin_domain \ - --user capo --user-domain admin_domain load-balancer_member - -# Switch to capo -unset $(env | awk -F= '/^OS_/{print $1}') -export OS_AUTH_URL= -export OS_IDENTITY_API_VERSION=3 -export OS_USERNAME=capo -export OS_USER_DOMAIN_NAME=admin_domain -export OS_PROJECT_NAME=capi-mgmt -export OS_PROJECT_DOMAIN_NAME=admin_domain -export OS_PASSWORD= -export OS_CACERT= - -# App credential (record id and secret immediately — secret only shown at creation) -openstack application credential create capo-app-cred \ - --description "CAPO authentication" \ - -f yaml > ~/capi-mgmt/capo-app-cred.yaml -chmod 0600 ~/capi-mgmt/capo-app-cred.yaml - -# Nova keypair — generate on capi-mgmt and upload public key -ssh ubuntu@ 'ssh-keygen -t ed25519 -N "" -f ~/.ssh/capi-mgmt-key' -ssh ubuntu@ 'cat ~/.ssh/capi-mgmt-key.pub' > /tmp/capi-mgmt-key.pub -openstack keypair create --public-key /tmp/capi-mgmt-key.pub capi-mgmt-key -# Also pull the private key back to jumphost for post-rebuild access -scp -p ubuntu@:~/.ssh/capi-mgmt-key ~/capi-mgmt/capi-mgmt-key -chmod 0600 ~/capi-mgmt/capi-mgmt-key -``` - -## 
Step 8 — Compose clouds.yaml and cloud.conf - -Use `v3applicationcredential` auth — cleaner than user/password. - -```bash -# Read app credential -APP_CRED_ID=$(yq -r '.id' ~/capi-mgmt/capo-app-cred.yaml) -APP_CRED_SECRET=$(yq -r '.secret' ~/capi-mgmt/capo-app-cred.yaml) - -# Compose clouds.yaml for capi-mgmt -cat > /tmp/clouds.yaml << EOC +cat > "$WORK/clouds.yaml" < - application_credential_id: ${APP_CRED_ID} - application_credential_secret: ${APP_CRED_SECRET} + capi-mgmt: region_name: RegionOne - cacert: /usr/local/share/ca-certificates/vault-ca.crt interface: public identity_api_version: 3 -EOC + auth_type: v3applicationcredential + auth: + auth_url: https://keystone.omega.dc0.vr0.cloud.neumatrix.local:5000/v3 + application_credential_id: $APP_CRED_ID + application_credential_secret: $APP_CRED_SECRET + cacert: /usr/local/share/ca-certificates/vault-ca.crt + verify: true +EOF +chmod 600 "$WORK/clouds.yaml" -scp /tmp/clouds.yaml ubuntu@:/home/ubuntu/clouds.yaml -ssh ubuntu@ 'chmod 0600 ~/clouds.yaml' +# base64-encode for cluster template embedding (no newline wrapping) +base64 -w0 "$WORK/clouds.yaml" > "$WORK/clouds.yaml.b64" +``` -# cloud.conf for OCCM — use tls-insecure=true for v1 testcloud -# (v2: ship Vault CA via CK8sConfig files field instead) -cat > /tmp/cloud.conf << EOC +**Compose cloud.conf** (INI format, NOT YAML): + +```bash +cat > "$WORK/cloud.conf" < -application-credential-id=${APP_CRED_ID} -application-credential-secret=${APP_CRED_SECRET} +auth-url=https://keystone.omega.dc0.vr0.cloud.neumatrix.local:5000/v3 +application-credential-id=$APP_CRED_ID +application-credential-secret=$APP_CRED_SECRET region=RegionOne -tls-insecure=true +domain-name=admin_domain +ca-file=/usr/local/share/ca-certificates/vault-ca.crt [LoadBalancer] -floating-network-id= -EOC +use-octavia=true +EOF +chmod 600 "$WORK/cloud.conf" + +base64 -w0 "$WORK/cloud.conf" > "$WORK/cloud.conf.b64" ``` -## Step 9 — Render and apply the cluster manifest +> **Critical delta from 
Bobcat:** the `ca-file` line replaces `tls-insecure=true`. +> The path `/usr/local/share/ca-certificates/vault-ca.crt` exists on capi-mgmt +> (from §6) AND will be injected into workload nodes via CK8sConfig in §13. -The canonical-kubernetes cluster template takes 18 substitution variables. -Capture them in a `cluster-env` file, then use `envsubst` to render. The -template is fetched from `canonical/cluster-api-k8s`. - -Variables (verify exact names against the template at apply time): - -``` -CLUSTER_NAME=capi-mgmt-cluster -NAMESPACE=default -KUBERNETES_VERSION=v1.32.2 -CONTROL_PLANE_MACHINE_COUNT=1 -WORKER_MACHINE_COUNT=0 -OPENSTACK_CONTROL_PLANE_MACHINE_FLAVOR=capi-mgmt-node -OPENSTACK_NODE_MACHINE_FLAVOR=capi-mgmt-node -OPENSTACK_DNS_NAMESERVERS= -OPENSTACK_EXTERNAL_NETWORK_ID= -OPENSTACK_FAILURE_DOMAIN=nova -OPENSTACK_IMAGE_NAME=noble-amd64 -OPENSTACK_SSH_KEY_NAME=capi-mgmt-key -OPENSTACK_CLOUD_YAML_B64=$(base64 -w0 /tmp/clouds.yaml) -OPENSTACK_CLOUD_CONFIG_B64=$(base64 -w0 /tmp/cloud.conf) -OPENSTACK_CLOUD_CACERT_B64=$(base64 -w0 ) -OPENSTACK_CLOUD=openstack -OPENSTACK_NODE_CIDR=10.6.0.0/24 -KUBE_CONTROL_PLANE_ENDPOINT_PORT=6443 -``` - -Render and apply: +**base64-encode Vault CA for CK8sConfig injection:** ```bash -ssh ubuntu@ 'bash -s' << 'REMOTE' -set -euo pipefail -export KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml - -curl -fLo /tmp/cluster-template.yaml \ - https://github.com/canonical/cluster-api-k8s/releases/latest/download/cluster-template.yaml - -# Source env vars (operator fills in /tmp/cluster-env) -# shellcheck disable=SC1091 -source /tmp/cluster-env - -envsubst < /tmp/cluster-template.yaml > /tmp/cluster-rendered.yaml -kubectl apply -f /tmp/cluster-rendered.yaml -REMOTE +base64 -w0 "$VAULT_CA" > "$WORK/vault-ca.crt.b64" +wc -c "$WORK/vault-ca.crt.b64" ``` -## Step 10 — Poll for cluster Available +--- + +## 13. 
Cluster template rendering (with Vault CA injection) + +The cluster template defines: + +- Cluster object +- OpenStackCluster (CAPO infrastructure) +- CK8sControlPlane +- CK8sConfigTemplate (control plane bootstrap — includes Vault CA injection) +- MachineDeployment + CK8sConfigTemplate (workers — includes Vault CA injection) +- Secrets for clouds.yaml and cloud.conf + +Variables (18 total): ```bash -ssh ubuntu@ 'bash -s' << 'REMOTE' +export CLUSTER_NAME=capi-mgmt-cluster +export CLUSTER_NAMESPACE=default +export KUBERNETES_VERSION=v1.31.4 # adjust to CK8s-supported +export CONTROL_PLANE_MACHINE_COUNT=1 # 3 for HA on Roosevelt +export WORKER_MACHINE_COUNT=2 # 3 on Roosevelt +export OPENSTACK_DNS_NAMESERVERS=10.12.4.227 # designate VIP +export OPENSTACK_FAILURE_DOMAIN=nova +export OPENSTACK_EXTERNAL_NETWORK_ID=$(openstack network show ext_net -c id -f value) +export OPENSTACK_IMAGE_NAME=noble-amd64 +export OPENSTACK_FLAVOR=capi-mgmt-node +export OPENSTACK_SSH_KEY_NAME=capi-workload-key +export POD_CIDR=10.244.0.0/16 +export SERVICE_CIDR=10.96.0.0/12 +export CLOUDS_YAML_B64=$(cat "$WORK/clouds.yaml.b64") +export CLOUD_CONF_B64=$(cat "$WORK/cloud.conf.b64") +export VAULT_CA_B64=$(cat "$WORK/vault-ca.crt.b64") +export CLUSTER_DOMAIN=cluster.local +export OPENSTACK_CLOUD=capi-mgmt + +# Sanity print +env | grep -E "^(CLUSTER|KUBERNETES|CONTROL_PLANE|WORKER|OPENSTACK|POD|SERVICE|VAULT|CLOUD)" \ + | grep -v "B64\|SECRET\|PASS" | sort +``` + +**Render the cluster template:** + +```bash +cat > "$WORK/cluster-template.yaml" <<'TEMPLATE_EOF' +apiVersion: v1 +kind: Secret +metadata: + name: ${CLUSTER_NAME}-cloud-config + namespace: ${CLUSTER_NAMESPACE} +type: Opaque +data: + clouds.yaml: ${CLOUDS_YAML_B64} + cloud.conf: ${CLOUD_CONF_B64} + cacert: ${VAULT_CA_B64} +--- +apiVersion: cluster.x-k8s.io/v1beta1 +kind: Cluster +metadata: + name: ${CLUSTER_NAME} + namespace: ${CLUSTER_NAMESPACE} +spec: + clusterNetwork: + pods: + cidrBlocks: + - ${POD_CIDR} + services: + cidrBlocks: + 
- ${SERVICE_CIDR} + serviceDomain: ${CLUSTER_DOMAIN} + infrastructureRef: + apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 + kind: OpenStackCluster + name: ${CLUSTER_NAME} + controlPlaneRef: + apiVersion: controlplane.cluster.x-k8s.io/v1beta2 + kind: CK8sControlPlane + name: ${CLUSTER_NAME}-control-plane +--- +apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 +kind: OpenStackCluster +metadata: + name: ${CLUSTER_NAME} + namespace: ${CLUSTER_NAMESPACE} +spec: + identityRef: + name: ${CLUSTER_NAME}-cloud-config + cloudName: ${OPENSTACK_CLOUD} + externalNetwork: + id: ${OPENSTACK_EXTERNAL_NETWORK_ID} + managedSecurityGroups: + allowAllInClusterTraffic: true + apiServerLoadBalancer: + enabled: true +--- +apiVersion: controlplane.cluster.x-k8s.io/v1beta2 +kind: CK8sControlPlane +metadata: + name: ${CLUSTER_NAME}-control-plane + namespace: ${CLUSTER_NAMESPACE} +spec: + replicas: ${CONTROL_PLANE_MACHINE_COUNT} + version: ${KUBERNETES_VERSION} + machineTemplate: + infrastructureTemplate: + apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 + kind: OpenStackMachineTemplate + name: ${CLUSTER_NAME}-control-plane + spec: + files: + - path: /usr/local/share/ca-certificates/vault-ca.crt + owner: root:root + permissions: "0644" + contentFrom: + secret: + name: ${CLUSTER_NAME}-cloud-config + key: cacert + preRunCommands: + - update-ca-certificates + extraKubeAPIServerArgs: + "--cloud-provider": external +--- +apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 +kind: OpenStackMachineTemplate +metadata: + name: ${CLUSTER_NAME}-control-plane + namespace: ${CLUSTER_NAMESPACE} +spec: + template: + spec: + flavor: ${OPENSTACK_FLAVOR} + image: + filter: + name: ${OPENSTACK_IMAGE_NAME} + sshKeyName: ${OPENSTACK_SSH_KEY_NAME} + identityRef: + name: ${CLUSTER_NAME}-cloud-config + cloudName: ${OPENSTACK_CLOUD} +--- +apiVersion: cluster.x-k8s.io/v1beta1 +kind: MachineDeployment +metadata: + name: ${CLUSTER_NAME}-md-0 + namespace: ${CLUSTER_NAMESPACE} +spec: + clusterName: 
${CLUSTER_NAME} + replicas: ${WORKER_MACHINE_COUNT} + selector: + matchLabels: {} + template: + spec: + clusterName: ${CLUSTER_NAME} + version: ${KUBERNETES_VERSION} + bootstrap: + configRef: + apiVersion: bootstrap.cluster.x-k8s.io/v1beta2 + kind: CK8sConfigTemplate + name: ${CLUSTER_NAME}-md-0 + infrastructureRef: + apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 + kind: OpenStackMachineTemplate + name: ${CLUSTER_NAME}-md-0 +--- +apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 +kind: OpenStackMachineTemplate +metadata: + name: ${CLUSTER_NAME}-md-0 + namespace: ${CLUSTER_NAMESPACE} +spec: + template: + spec: + flavor: ${OPENSTACK_FLAVOR} + image: + filter: + name: ${OPENSTACK_IMAGE_NAME} + sshKeyName: ${OPENSTACK_SSH_KEY_NAME} + identityRef: + name: ${CLUSTER_NAME}-cloud-config + cloudName: ${OPENSTACK_CLOUD} +--- +apiVersion: bootstrap.cluster.x-k8s.io/v1beta2 +kind: CK8sConfigTemplate +metadata: + name: ${CLUSTER_NAME}-md-0 + namespace: ${CLUSTER_NAMESPACE} +spec: + template: + spec: + files: + - path: /usr/local/share/ca-certificates/vault-ca.crt + owner: root:root + permissions: "0644" + contentFrom: + secret: + name: ${CLUSTER_NAME}-cloud-config + key: cacert + preRunCommands: + - update-ca-certificates +TEMPLATE_EOF + +# envsubst to render +envsubst < "$WORK/cluster-template.yaml" > "$WORK/cluster-rendered.yaml" + +# Validate as YAML +python3 -c "import yaml; list(yaml.safe_load_all(open('$WORK/cluster-rendered.yaml'))); print('YAML OK')" + +# Quick visual check — no leftover ${...} markers +grep -n '\${' "$WORK/cluster-rendered.yaml" || echo "No unsubstituted variables — good" +``` + +> **CK8sConfig field name caveat:** the exact field names (`files`, +> `preRunCommands`) and their `contentFrom.secret` schema are CK8s-version- +> dependent. If `clusterctl init` failed earlier with schema warnings, +> consult the CK8s release notes for the pinned `$CK8S_VERSION`. + +--- + +## 14. 
Apply + poll-to-Ready + +Transfer rendered template to capi-mgmt and apply: + +```bash +scp "$WORK/cluster-rendered.yaml" ubuntu@$CAPI_MGMT_METAL_IP:/home/ubuntu/cluster.yaml + +ssh ubuntu@$CAPI_MGMT_METAL_IP <<'EOF' set -euo pipefail -export KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml +kubectl apply -f /home/ubuntu/cluster.yaml +echo "Applied. Waiting for cluster Available status (15-min timeout)..." -START=$(date +%s) -DEADLINE=$((START + 15*60)) - -while [[ $(date +%s) -lt $DEADLINE ]]; do - PHASE=$(kubectl get cluster capi-mgmt-cluster -o jsonpath='{.status.phase}' 2>/dev/null || echo "?") - AVAILABLE=$(kubectl get cluster capi-mgmt-cluster -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null || echo "?") - ELAPSED=$(($(date +%s) - START)) - printf '[%4ds] Phase=%s Available=%s\n' "$ELAPSED" "$PHASE" "$AVAILABLE" - [[ "$AVAILABLE" == "True" ]] && break - sleep 15 +for i in $(seq 1 90); do + STATUS=$(kubectl get cluster capi-mgmt-cluster -o json 2>/dev/null \ + | jq -r '.status.phase // "Unknown"') + READY=$(kubectl get cluster capi-mgmt-cluster -o json 2>/dev/null \ + | jq -r '.status.conditions[]? 
| select(.type=="Ready") | .status' \ + | head -1) + echo "$(date -Is) phase=$STATUS ready=$READY" + [ "$READY" = "True" ] && { echo "Cluster Ready"; break; } + sleep 10 done -clusterctl describe cluster capi-mgmt-cluster --show-conditions all -REMOTE +kubectl get cluster,machines,ck8scontrolplane,machinedeployment -A +EOF ``` -## Step 11 — Export workload kubeconfig to jumphost +**If the poll times out before Ready,** typical diagnosis: ```bash -ssh ubuntu@ 'bash -s' << 'REMOTE' -set -euo pipefail -export KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml -mkdir -p ~/magnum-capi -clusterctl get kubeconfig capi-mgmt-cluster > ~/magnum-capi/capi-mgmt-cluster.kubeconfig -chmod 0600 ~/magnum-capi/capi-mgmt-cluster.kubeconfig -KUBECONFIG=~/magnum-capi/capi-mgmt-cluster.kubeconfig kubectl get nodes -REMOTE - -# Copy to jumphost for runbook 05 -scp -p ubuntu@:~/magnum-capi/capi-mgmt-cluster.kubeconfig ~/magnum-capi/capi-mgmt-cluster.kubeconfig -chmod 0600 ~/magnum-capi/capi-mgmt-cluster.kubeconfig +ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl describe cluster capi-mgmt-cluster +ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get machines -A +ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl logs -n capo-system deployment/capo-controller-manager --tail=100 ``` -## Exit criteria +Common causes: -- `capi-mgmt.maas` is Deployed in MAAS with k3s + CAPI controllers + ORC running -- `capi-mgmt-cluster` workload cluster is Available -- Workload kubeconfig exists at `~/magnum-capi/capi-mgmt-cluster.kubeconfig` - on the jumphost -- Proceed to `05-magnum-capi-driver.md` +- OpenStack API unreachable from capi-mgmt → check Vault CA install on capi-mgmt (§6) +- Image / flavor / network ID wrong in cluster template → re-check §11 variables +- Security group rules block kube-api LB → CAPO usually handles this; check OpenStackCluster status +- Application credential expired / wrong → re-check `$APP_CRED_ID` -## Recurring pitfalls (apply to execution) +--- -- `juju ssh` HANGS when stdout is redirected — use 
`juju exec --unit X -- 'cmd'` -- MAAS-deployed Ubuntu uses `ubuntu` user, not `jessea123` -- k3s `--bind-address=X` doesn't bind 127.0.0.1 — kubeconfig server URL must be sed-rewritten -- Snap-confined openstack CLI cannot read `/tmp` — paths under `$HOME` only -- `openstack -f value -c X -c Y` outputs in alphabetical column order — use single-column queries -- GitHub API rate limit is 60 unauthenticated requests/hour — cache results, don't refetch on every run -- `.maas` DNS may not resolve from jumphost — use IPs directly +## 15. Extract workload kubeconfig + +```bash +ssh ubuntu@$CAPI_MGMT_METAL_IP -- clusterctl get kubeconfig capi-mgmt-cluster \ + > "$WORK/capi-mgmt-cluster.kubeconfig" +chmod 600 "$WORK/capi-mgmt-cluster.kubeconfig" + +# Sanity-check the workload cluster is reachable +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get nodes +# Expect: 1 control plane + 2 workers, all Ready +``` + +If `get nodes` times out, the cluster's API LB may not have allocated its +external IP yet, or the firewall rules don't permit jumphost → workload API: + +```bash +# What IP is the cluster's API LB on? +ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get openstackcluster capi-mgmt-cluster \ + -o json | jq '.status.externalNetwork, .status.controlPlaneEndpoint' + +# Test reachability +curl -sk --max-time 10 "https://:6443/version" && echo " ← reachable" || echo "API LB unreachable" +``` + +--- + +## 16. `clusterctl init` on target (workload cluster) + +The workload cluster must have the same CAPI providers installed before `move`. 
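Provider parity between the two clusters can be checked mechanically once both inventories have been captured. The sketch below assumes each inventory was saved as one `name version` pair per line (for example, extracted by hand from the provider objects that clusterctl records on each cluster); the function name `provider_drift` and the file layout are illustrative and not part of clusterctl.

```shell
# provider_drift SOURCE_FILE TARGET_FILE
# Each file holds one "<provider-name> <version>" pair per line (hypothetical layout).
# Prints the lines present on only one side and returns non-zero on any drift.
provider_drift() {
  src_sorted=$(mktemp) && tgt_sorted=$(mktemp) || return 2
  sort -u "$1" > "$src_sorted"
  sort -u "$2" > "$tgt_sorted"
  # comm -3 suppresses lines common to both files; what remains is the drift.
  drift=$(comm -3 "$src_sorted" "$tgt_sorted")
  rm -f "$src_sorted" "$tgt_sorted"
  if [ -n "$drift" ]; then
    printf 'Provider drift between clusters:\n%s\n' "$drift"
    return 1
  fi
  echo "Providers match"
}
```

Running it only makes sense after `clusterctl init` has completed on the workload cluster; a non-zero return is a reason to pause before `clusterctl move`.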
+ +```bash +# Run from jumphost using the workload kubeconfig +KUBECONFIG="$WORK/capi-mgmt-cluster.kubeconfig" clusterctl init \ + --core "cluster-api:${CAPI_VERSION}" \ + --infrastructure "openstack:${CAPO_VERSION}" \ + --bootstrap "canonical-kubernetes:${CK8S_VERSION}" \ + --control-plane "canonical-kubernetes:${CK8S_VERSION}" \ + --cert-manager-version "${CERT_MANAGER_VERSION}" + +# ORC into workload cluster too +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" apply \ + -f "https://github.com/k-orc/openstack-resource-controller/releases/${ORC_VERSION}/orc.yaml" + +# Wait for everything Available +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" wait \ + --for=condition=Available --timeout=5m \ + deployment --all -n capi-system +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" wait \ + --for=condition=Available --timeout=5m \ + deployment --all -n capo-system +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" wait \ + --for=condition=Available --timeout=5m \ + deployment --all -n cert-manager +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" wait \ + --for=condition=Available --timeout=5m \ + deployment --all -n orc-system +``` + +> **cert-manager double-install caveat:** if CK8s already installed +> cert-manager during workload bootstrap, the second `clusterctl init` may +> warn or skip. Check existing cert-manager version against `$CERT_MANAGER_VERSION` +> — if they differ, version-skew issues may surface post-pivot. Adjust the +> pin in §4 or accept the existing version. Roosevelt's standard practice +> is to install cert-manager via `clusterctl init` only (don't pre-install +> via CK8s) — same approach valid here if you want clean version control. + +--- + +## 17. 
`clusterctl move` pivot + +Move all CAPI CRs from bootstrap k3s → workload cluster: + +```bash +# Stage the target kubeconfig on capi-mgmt (where clusterctl move runs) +scp "$WORK/capi-mgmt-cluster.kubeconfig" ubuntu@$CAPI_MGMT_METAL_IP:/home/ubuntu/target.kubeconfig + +# Dry-run first to catch issues before commit +ssh ubuntu@$CAPI_MGMT_METAL_IP -- clusterctl move \ + --to-kubeconfig=/home/ubuntu/target.kubeconfig \ + --dry-run + +# Inspect dry-run output: list of objects to be moved. Should include: +# - Cluster, OpenStackCluster, OpenStackClusterTemplate +# - Secrets (cloud-config) +# - Machine objects, OpenStackMachineTemplate +# - CK8sControlPlane, CK8sConfigTemplate +# - MachineDeployment +# Should NOT include cert-manager state (cert-manager manages its own state +# on each cluster independently) +``` + +**If dry-run looks correct, execute the move:** + +```bash +ssh ubuntu@$CAPI_MGMT_METAL_IP -- clusterctl move \ + --to-kubeconfig=/home/ubuntu/target.kubeconfig + +# Move can take several minutes. Output ends with: "moved successfully" +``` + +--- + +## 18. 
Post-pivot verification + +```bash +echo "=== Bootstrap k3s (should now be empty of cluster CRs) ===" +ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get cluster -A +# Expect: No resources found (or only a header) + +ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get machines -A +# Expect: No resources found + +ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get openstackcluster -A +# Expect: No resources found + +echo "" +echo "=== Workload cluster (should now own its own cluster CRs) ===" +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get cluster -A +# Expect: capi-mgmt-cluster shown, phase=Provisioned + +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get machines -A +# Expect: 3 machines (1 control-plane + 2 workers), all Running + +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get openstackcluster -A + +echo "" +echo "=== CAPI controllers in workload ===" +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get pods -A \ + | grep -E "(capi|capo|orc|cert-manager)" | grep -v "Running\|Completed" +# Expect: empty (all controller pods Running) + +echo "" +echo "=== OCCM not crash-looping (CRITICAL — main goal of TLS-verify work) ===" +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get pods -n kube-system \ + -l k8s-app=openstack-cloud-controller-manager +# Expect: 1 pod Running, NOT CrashLoopBackOff + +kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" logs -n kube-system \ + -l k8s-app=openstack-cloud-controller-manager --tail=50 \ + | grep -iE "(tls|cert|error)" | head -20 +# Expect: no TLS/cert errors; OCCM should be healthy +``` + +> **If OCCM crash-loops with "x509: certificate signed by unknown authority":** +> Vault CA distribution failed. Check (a) `/usr/local/share/ca-certificates/vault-ca.crt` +> exists on workload nodes; (b) `update-ca-certificates` ran (check `/etc/ssl/certs/ca-certificates.crt` +> for the Vault CA's subject); (c) the secret reference in CK8sConfigTemplate +> matched the secret name. 
SSH into a worker via the jumphost key (`ssh -i +> $WORK/capi-workload-key ubuntu@`) to diagnose. + +--- + +## 19. Handoff to runbook 05 + +The workload kubeconfig at `$WORK/capi-mgmt-cluster.kubeconfig` is the input to +`runbooks/05-magnum-capi-driver.md`. Copy it to a stable path: + +```bash +mkdir -p $HOME/magnum-capi +cp "$WORK/capi-mgmt-cluster.kubeconfig" $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig +chmod 600 $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig +echo "Workload kubeconfig staged at: $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig" +``` + +> **Important — post-pivot semantic shift from Bobcat:** Magnum's +> `kubeconfig_file` setting (under `[capi_helm]` in +> `/etc/magnum/magnum.conf.d/99-capi.conf`, per D-007) now points to the +> workload cluster, not the bootstrap k3s. Bobcat had Magnum pointing at +> bootstrap k3s because the pivot was never executed. With pivot mandatory, +> Magnum's CAPI calls flow: +> +> ``` +> Magnum/leader → workload cluster API → CAPI controllers (running in workload) +> → create new Cluster CRs (tenant Magnum clusters) +> ``` +> +> The bootstrap k3s on capi-mgmt is now disposable. If you wanted, you could +> destroy capi-mgmt entirely at this point — the workload cluster manages +> itself. (Roosevelt may actually do this for cost savings.) For v1 testcloud, +> leave capi-mgmt running so its k3s can be inspected for diagnostics. + +--- + +## 20. 
Roosevelt deltas (forward-look) + +| Aspect | Testcloud (v1) | Roosevelt | +|---|---|---| +| Workload image | Default `noble-amd64` from cloud-images.ubuntu.com | Custom image baked with Vault CA pre-installed (no runtime install step) | +| Vault CA distribution | CK8sConfig `files:` + `preRunCommands:` (this runbook) | Image-baked + CK8sConfig (defense in depth) | +| App credential lifetime | No expiry set (testcloud) | Short-lived rotating credentials via Vault auth method | +| Workload cluster control plane | 1 node | 3 nodes (HA) | +| Workload cluster workers | 2 nodes | Per-tenant sizing; HPA-driven | +| `clusterctl init --cert-manager-version` | Pin from §4 | Pin to Vault PKI cert-manager profile (separate Roosevelt prep) | +| capi-mgmt VM lifecycle post-pivot | Kept running for diagnostics | Destroyed (cost savings; pivot makes it disposable) | +| Version pinning record | `$HOME/deploy-records//capi-pins/` | Same pattern, captured in Vault as audit artifact | +| Authentication to GitHub API | Optional PAT | Mandatory PAT (avoid rate-limit during automated rebuilds) | + +--- + +## 21. Rotation/refresh of pins + +The pins captured in §4 will age. Recommended cadence: + +- **Per rebuild:** re-discover all pins (Step 1 of next execution will catch + natural drift). +- **Out-of-band patch:** if a CVE drops for any pinned component, run §4 + discovery alone and capture the new pin into `$DEPLOY_RECORD/`. Then for + the affected component only, follow the upgrade procedure from its + upstream docs (does NOT necessarily require this whole runbook re-run). + +For Roosevelt, this becomes a tracked maintenance window task. + +--- + +## 22. Change log + +| Date | Change | Reference | +|---|---|---| +| 2026-05-22 | Document created. Vault CA distribution (no tls-insecure), mandatory `clusterctl move` pivot, pin-at-execution version model. 
| Workstream 3b | diff --git a/runbooks/05-magnum-capi-driver.md b/runbooks/05-magnum-capi-driver.md index ce24090..e1414e1 100644 --- a/runbooks/05-magnum-capi-driver.md +++ b/runbooks/05-magnum-capi-driver.md @@ -1,198 +1,529 @@ -# Runbook 05 — Magnum CAPI Helm Driver Graft +# Runbook 05 — Magnum CAPI Helm driver install -**Reference:** D-007 Layer B (rescoped per D-017). Runs after `04a-capi-bootstrap-cluster.md`. +**Status:** Executes after `04-magnum-domain.md` (Keystone wiring) and +`04a-capi-bootstrap-cluster.md` (workload cluster + kubeconfig staged). +Final post-deploy step to make Magnum capable of creating CAPI-managed +tenant K8s clusters. -**Purpose:** Install the `stackhpc/magnum-capi-helm` driver into the Magnum -charm's Python environment, configure Magnum to load and use it, and verify -end-to-end cluster creation via the driver against the bootstrap k3s -management cluster on `capi-mgmt.maas`. +**Cross-references:** +- D-007 Layer B (Magnum two-layer install) +- D-017 (CAPI bootstrap cluster lifecycle) +- Runbook 04a §19 (workload kubeconfig handoff) +- Workstream 3c decision (2026-05-22): magnum-capi-helm 1.1.0 from PyPI; workload-cluster kubeconfig (NOT bootstrap k3s) -**Prerequisites:** +**Known doc inconsistency (tracked for cleanup):** +D-007's Layer B currently states the kubeconfig points at "capi-mgmt.maas +bootstrap k3s". That language is correct for Bobcat (no pivot) but obsolete +post-workstream-3b (pivot mandatory). This runbook uses the workload cluster +kubeconfig as the canonical target. D-007 patch to follow in a workstream-3 +cleanup commit. 
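Because the kubeconfig target differs between the Bobcat and post-pivot designs, it can help to confirm which API endpoint a given kubeconfig points at before staging it. A minimal sketch, assuming a single-cluster kubeconfig with one `server:` entry; the helper name `kubeconfig_server` is hypothetical.

```shell
# kubeconfig_server FILE
# Prints the first API server URL found in a kubeconfig.
# Assumes a single-cluster file; multi-cluster files need a real YAML parser.
kubeconfig_server() {
  awk '$1 == "server:" { print $2; exit }' "$1"
}
```

For example, `kubeconfig_server "$HOME/magnum-capi/capi-mgmt-cluster.kubeconfig"` should print the workload cluster's API endpoint, not the address of the bootstrap k3s on capi-mgmt.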
-- Runbook 04 complete (Magnum trustee domain created) -- Runbook 04a complete (capi-mgmt bootstrap k3s + CAPI controllers + ORC - running; workload cluster Available; kubeconfig at - `~/magnum-capi/capi-mgmt-cluster.kubeconfig` on jumphost) -- Authenticated Juju session active +--- -**Key constraint:** Charm-magnum's systemd units invoke -`/etc/init.d/magnum-{api,conductor} systemd-start` (SysV-wrapped). Drop-in -config dirs are NOT consumed by the init.d script as shipped. Phase 4 graft -must REPLACE the systemd ExecStart entirely with a wrapper that adds -`--config-dir /etc/magnum/magnum.conf.d/`. This pattern was validated on -Bobcat and is expected to persist on Caracal — verify with `juju exec` at -the start of execution. +## 1. Purpose & scope -## Step 1 — Investigation block (D-017 rehearsal) +Graft the CAPI Helm driver onto the Charmed Magnum deployment so that +`openstack coe cluster create` provisions tenant K8s clusters via CAPI (in +the workload cluster) instead of via the deprecated Heat driver. -Before any grafting, inspect the live charm state. The init.d/systemd -wrapping shape may have shifted between Bobcat and Caracal: +**Output of this runbook:** + +- `magnum-capi-helm==1.1.0` installed on the magnum unit's system Python. +- `/etc/magnum/kubeconfig` populated with the workload cluster's + kubeconfig (post-pivot CAPI controller plane). +- `/etc/magnum/magnum.conf.d/99-capi.conf` configured with + `enabled_drivers = k8s_capi_helm_v1` and `[capi_helm] kubeconfig_file=`. +- Systemd overrides on `magnum-api` and `magnum-conductor` that replace + the init.d wrapper's ExecStart with explicit `--config-dir` invocation. +- Both services running cleanly with the CAPI driver loaded. + +**Scope:** v1 testcloud. Roosevelt deltas in §12. + +**Out of scope:** +- Magnum domain setup (runbook 04) +- Workload cluster lifecycle (runbook 04a) +- Smoketest tenant cluster creation is OPTIONAL (§11) — full validation + framework belongs in runbook 08. + +--- + +## 2. 
Decisions captured + +| Decision | Choice | Reason | +|---|---|---| +| Driver pin | `magnum-capi-helm==1.1.0` from PyPI | D-007 correction (stackhpc fork archived Dec 2024; canonical project on opendev/PyPI; 1.1.0 is last Caracal-cycle release) | +| Install method | `pip3 install --break-system-packages` | PEP 668: enforced on Ubuntu 23.04 and later (including 24.04), where pip refuses system-site-packages installs without this explicit override | +| Install scope | System Python on magnum unit (not venv) | Magnum charm uses system-packaged python at `/usr/lib/python3/dist-packages/magnum/`; driver must import from same site | +| Kubeconfig target | Workload cluster (post-pivot) | Workstream 3b — bootstrap k3s is empty post-pivot; CAPI controllers live in workload | +| Kubeconfig source | `$HOME/magnum-capi/capi-mgmt-cluster.kubeconfig` (staged by 04a §19) | Documented handoff | +| Driver entry-point name | `k8s_capi_helm_v1` | Per upstream magnum-capi-helm 1.1.0; verify in §10 | +| Conf.d filename | `99-capi.conf` | Numeric prefix ensures it loads AFTER any charm-managed conf, so `enabled_drivers` override wins | +| File encoding | ASCII-only | Non-ASCII in conf.d causes silent magnum daemon failures (handoff lesson; cf. Horizon `local_settings.d` issue) | +| Trustee credential | Existing magnum-shared user (charm-managed) | Roosevelt will use app-credential pattern | + +--- + +## 3. 
Prerequisites + +| Prereq | Verification | +|---|---| +| Magnum charm active/idle | `juju status magnum \| grep magnum/0` shows `active idle` | +| Magnum domain setup completed (runbook 04) | `openstack domain show magnum \| grep enabled` returns `True` | +| Workload cluster reachable from jumphost | `kubectl --kubeconfig $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig get nodes` returns Ready nodes | +| CAPI controllers running in workload cluster | `kubectl --kubeconfig $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig get pods -n capi-system \| grep -v Running \| grep -v NAME` empty | +| Workload kubeconfig staged at expected path | `test -r $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig && stat -c %a $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig` shows `600` | +| `juju exec` works to magnum/leader (use exec, NOT ssh, for non-interactive — handoff lesson) | `juju exec --unit magnum/leader -- hostname` returns the unit hostname | + +**Set shell context:** ```bash -juju exec --unit magnum/leader -- 'cat /lib/systemd/system/magnum-api.service' -juju exec --unit magnum/leader -- 'cat /lib/systemd/system/magnum-conductor.service' -juju exec --unit magnum/leader -- 'ls /etc/init.d/ | grep magnum' -juju exec --unit magnum/leader -- 'cat /etc/init.d/magnum-api 2>/dev/null | head -40' -juju exec --unit magnum/leader -- 'ls /etc/default/magnum-* 2>/dev/null' -juju exec --unit magnum/leader -- 'python3 -c "import magnum; print(magnum.__file__)"' +export WORK=$HOME/magnum-capi +export WORKLOAD_KUBECONFIG=$WORK/capi-mgmt-cluster.kubeconfig +export DRIVER_VERSION=magnum-capi-helm==1.1.0 # per D-007 correction +cd "$WORK" ``` -Record results in execution notes. The Python import path tells us where -to pip-install the driver (Bobcat: `/usr/lib/python3/dist-packages/magnum/`). +> **`juju ssh` vs `juju exec` choice:** the handoff lessons explicitly call +> out that `juju ssh` hangs when stdout is redirected (PTY allocation issue). 
+> This runbook uses `juju exec` for all non-interactive command execution and +> reserves `juju ssh` only for cases where you actually want an interactive +> shell. -## Step 2 — Pre-flight: confirm kubeconfig reachability +--- -The Magnum charm unit must be able to reach the k3s API on -`capi-mgmt.maas:6443`. The charm runs in an LXD container on the metal -network; reach is expected via direct L2. +## 4. Pre-flight: capture current state + +Capture the magnum unit's state BEFORE making changes. Useful for diagnosis +if anything goes wrong, and as a record of what was changed. ```bash -juju exec --unit magnum/leader -- "curl -sk --max-time 5 https://$(awk '/server:/ {print $2}' ~/magnum-capi/capi-mgmt-cluster.kubeconfig | head -1 | sed 's|https://||')/healthz" -# Expect: "ok" +mkdir -p "$WORK/pre-state" + +# Service unit files (as managed by charm) +juju exec --unit magnum/leader -- \ + 'sudo systemctl cat magnum-api magnum-conductor 2>&1' \ + > "$WORK/pre-state/systemd-units.txt" + +# Currently-enabled drivers +juju exec --unit magnum/leader -- \ + 'sudo grep -r enabled_drivers /etc/magnum/ 2>/dev/null || echo "(no enabled_drivers found — charm default applies)"' \ + > "$WORK/pre-state/drivers-pre.txt" + +# Python site-packages — see what's already installed +juju exec --unit magnum/leader -- \ + 'sudo pip3 list 2>/dev/null | grep -iE "magnum|cluster|helm|kubernetes" || true' \ + > "$WORK/pre-state/pip-pre.txt" + +# conf.d state +juju exec --unit magnum/leader -- \ + 'sudo ls -la /etc/magnum/magnum.conf.d/ 2>/dev/null || echo "(no conf.d directory)"' \ + > "$WORK/pre-state/confd-pre.txt" + +# Service running state +juju exec --unit magnum/leader -- \ + 'sudo systemctl is-active magnum-api magnum-conductor' \ + > "$WORK/pre-state/service-state-pre.txt" + +# Display the captured state +cat "$WORK/pre-state/"*.txt ``` -## Step 3 — Install the driver into the charm Python environment +> **What to look for in pre-state:** the charm-managed `enabled_drivers` value +> 
probably includes Heat-based drivers (`heat_kubernetes`, etc.). The 99-capi.conf +> override in §7 replaces this with the single CAPI driver. The pre-state +> capture documents what was active before the override took effect. + +--- + +## 5. Install magnum-capi-helm 1.1.0 ```bash -juju ssh magnum/leader -- "sudo pip install --break-system-packages \ - 'git+https://github.com/stackhpc/magnum-capi-helm@v0.13.0'" - -# Verify -juju exec --unit magnum/leader -- 'python3 -c "import magnum_capi_helm; print(magnum_capi_helm.__file__)"' +juju exec --unit magnum/leader -- \ + "sudo pip3 install $DRIVER_VERSION --break-system-packages" ``` -Pin to a specific tag rather than `main` — the driver should not move -under our feet between deploys. Version `v0.13.0` was validated on Bobcat; -verify it remains the chosen tag at Caracal execution time. - -## Step 4 — Deploy the kubeconfig to the charm unit +**Verify install:** ```bash -# Copy from jumphost to magnum/leader -juju scp ~/magnum-capi/capi-mgmt-cluster.kubeconfig magnum/leader:/tmp/capi-kubeconfig -juju ssh magnum/leader -- "sudo install -o root -g magnum -m 0640 /tmp/capi-kubeconfig /etc/magnum/kubeconfig && sudo rm /tmp/capi-kubeconfig" -juju ssh magnum/leader -- "ls -la /etc/magnum/kubeconfig" +juju exec --unit magnum/leader -- \ + 'sudo pip3 show magnum-capi-helm | head -10' +# Expect: Name: magnum-capi-helm +# Version: 1.1.0 +# Location: /usr/lib/python3/dist-packages + +juju exec --unit magnum/leader -- \ + 'sudo python3 -c "import magnum_capi_helm; print(magnum_capi_helm.__file__)"' +# Expect: /usr/lib/python3/dist-packages/magnum_capi_helm/__init__.py ``` -## Step 5 — Configure Magnum to use the CAPI Helm driver - -Create the conf.d directory and drop-in: +**Check that the driver entry point is registered:** ```bash -juju ssh magnum/leader -- "sudo mkdir -p /etc/magnum/magnum.conf.d && sudo chown root:magnum /etc/magnum/magnum.conf.d && sudo chmod 0750 /etc/magnum/magnum.conf.d" +juju exec --unit magnum/leader -- \ 
+ 'sudo python3 -c " +from stevedore import driver +mgr = driver.DriverManager( + namespace=\"magnum.drivers\", + name=\"k8s_capi_helm_v1\", + invoke_on_load=False +) +print(\"Driver class:\", mgr.driver) +"' +# Expect: Driver class: +# (or similar — the actual class path is package-version-dependent) +``` -juju ssh magnum/leader -- "sudo tee /etc/magnum/magnum.conf.d/99-capi.conf > /dev/null" << 'EOC' +> If the entry point check fails with "No 'k8s_capi_helm_v1' driver found", +> the driver name in 1.1.0 may differ from what D-007 documented. Inspect the +> installed package's `entry_points.txt`: +> +> ```bash +> juju exec --unit magnum/leader -- \ +> 'sudo cat /usr/lib/python3/dist-packages/magnum_capi_helm*.dist-info/entry_points.txt 2>/dev/null' +> ``` +> +> Find the entry under `[magnum.drivers]` — use that exact name in §7. + +--- + +## 6. Stage workload kubeconfig on magnum unit + +```bash +# Transfer kubeconfig from jumphost to magnum unit +juju scp "$WORKLOAD_KUBECONFIG" magnum/leader:/tmp/kubeconfig + +# Install with correct ownership/mode in one atomic step +juju exec --unit magnum/leader -- \ + 'sudo install -m 0640 -o root -g magnum /tmp/kubeconfig /etc/magnum/kubeconfig && sudo rm /tmp/kubeconfig' +``` + +**Verify:** + +```bash +juju exec --unit magnum/leader -- \ + 'sudo ls -la /etc/magnum/kubeconfig' +# Expect: -rw-r----- 1 root magnum ... /etc/magnum/kubeconfig + +# Confirm magnum user can read it +juju exec --unit magnum/leader -- \ + 'sudo -u magnum cat /etc/magnum/kubeconfig | head -3' +# Expect: apiVersion: v1 / clusters: / - cluster: + +# Confirm kubectl can use it from the magnum unit (sanity check on API reachability) +juju exec --unit magnum/leader -- \ + 'sudo -u magnum kubectl --kubeconfig /etc/magnum/kubeconfig get nodes 2>&1 | head -10' +# Expect: NAME ... 
STATUS=Ready for control plane + workers +# OR: kubectl not installed (acceptable — magnum-capi-helm uses Python client, not kubectl) +``` + +> **Why mode 0640 and group magnum:** kubeconfig contains auth tokens. Mode +> 0600 (owner-only) wouldn't let the `magnum` system user (which runs +> magnum-api/conductor) read it. Mode 0640 with `group: magnum` is the +> minimum-permission setup that works. NOT 0644 — keeps it off other users +> on the unit. + +--- + +## 7. Configure `/etc/magnum/magnum.conf.d/99-capi.conf` + +Generate the conf locally first (snap confinement does not apply to plain +bash on jumphost, but we keep paths under `$HOME` for consistency), then +transfer. + +**ASCII-only verification is critical** — the handoff documents non-ASCII +characters in `conf.d` files causing silent daemon failures (cf. Horizon +`local_settings.d`). Use plain straight quotes, ASCII dashes, no smart +typography. + +```bash +# Write locally +cat > "$WORK/99-capi.conf" <<'EOF' [DEFAULT] enabled_drivers = k8s_capi_helm_v1 [capi_helm] kubeconfig_file = /etc/magnum/kubeconfig -EOC +EOF -juju ssh magnum/leader -- "sudo chown root:magnum /etc/magnum/magnum.conf.d/99-capi.conf && sudo chmod 0640 /etc/magnum/magnum.conf.d/99-capi.conf" +# Verify it is pure ASCII (no UTF-8 sneakers) +file "$WORK/99-capi.conf" +# Expect: ASCII text +# If it says "UTF-8 Unicode text", STOP and rewrite by hand — even one stray +# em-dash or smart quote will silently break magnum + +# Hex dump check (paranoid mode) +xxd "$WORK/99-capi.conf" | grep -v "^[0-9a-f]*: [0-9a-f ]* [a-zA-Z0-9 \[\]=._/]*$" | head -5 +# Expect: empty output (all bytes are printable ASCII) ``` -## Step 6 — Install the systemd ExecStart override - -Because the charm's systemd units invoke an init.d wrapper that does NOT -honor `--config-dir`, the override must replace the ExecStart entirely -with a wrapper that invokes the Magnum binaries directly with both the -default config file and our config dir. 
+**Stage and install:** ```bash -juju ssh magnum/leader -- "sudo mkdir -p /etc/systemd/system/magnum-api.service.d" -juju ssh magnum/leader -- "sudo tee /etc/systemd/system/magnum-api.service.d/override.conf > /dev/null" << 'EOC' +juju scp "$WORK/99-capi.conf" magnum/leader:/tmp/99-capi.conf + +juju exec --unit magnum/leader -- \ + 'sudo mkdir -p /etc/magnum/magnum.conf.d && sudo install -m 0644 -o root -g root /tmp/99-capi.conf /etc/magnum/magnum.conf.d/99-capi.conf && sudo rm /tmp/99-capi.conf' + +# Verify +juju exec --unit magnum/leader -- \ + 'sudo ls -la /etc/magnum/magnum.conf.d/ && sudo cat /etc/magnum/magnum.conf.d/99-capi.conf' +# Expect: file listed; content matches what was written +``` + +--- + +## 8. Systemd override on magnum-api + magnum-conductor + +The Charmed Magnum unit files use a wrapper pattern: + +``` +ExecStart=/etc/init.d/magnum-api systemd-start +``` + +The wrapper does NOT pass `--config-dir` to magnum-api, so `/etc/magnum/magnum.conf.d/` +is never loaded. The 99-capi.conf would have no effect. + +Override with explicit `--config-file` + `--config-dir` invocation. 
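Before overriding, it may be worth confirming that the wrapper on this release really omits the flag, since charm packaging can change between releases. A minimal sketch, assuming the wrapper script has first been copied to a local file; the function name `passes_config_dir` is hypothetical.

```shell
# passes_config_dir FILE
# Returns 0 if FILE mentions --config-dir anywhere, 1 otherwise.
# This is a plain text search: it cannot tell whether the flag is reached at runtime.
passes_config_dir() {
  grep -q -e '--config-dir' "$1"
}
```

If the function returns 0 for the captured `/etc/init.d/magnum-api`, the wrapper may already load `magnum.conf.d` and the override should be re-evaluated rather than applied blindly.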
+
+**Generate override files locally:**
+
+```bash
+cat > "$WORK/magnum-api-override.conf" <<'EOF'
 [Service]
 ExecStart=
-ExecStart=/usr/bin/magnum-api --config-file /etc/magnum/magnum.conf --config-dir /etc/magnum/magnum.conf.d
-EOC
+ExecStart=/usr/bin/magnum-api --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d
+EOF

-juju ssh magnum/leader -- "sudo mkdir -p /etc/systemd/system/magnum-conductor.service.d"
-juju ssh magnum/leader -- "sudo tee /etc/systemd/system/magnum-conductor.service.d/override.conf > /dev/null" << 'EOC'
+cat > "$WORK/magnum-conductor-override.conf" <<'EOF'
 [Service]
 ExecStart=
-ExecStart=/usr/bin/magnum-conductor --config-file /etc/magnum/magnum.conf --config-dir /etc/magnum/magnum.conf.d
-EOC
+ExecStart=/usr/bin/magnum-conductor --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d
+EOF

-juju ssh magnum/leader -- "sudo systemctl daemon-reload"
-juju ssh magnum/leader -- "sudo systemctl restart magnum-api magnum-conductor"
-juju ssh magnum/leader -- "sudo systemctl status magnum-api magnum-conductor --no-pager"
+# ASCII check
+file "$WORK/magnum-api-override.conf" "$WORK/magnum-conductor-override.conf"
+# Expect: ASCII text x2
 ```

-Verify the override took effect:
+> **The empty `ExecStart=` line is critical.** Systemd accumulates ExecStart
+> directives by default; an empty assignment is required to CLEAR the inherited
+> directive before setting the replacement. Without the empty line, the unit
+> would have BOTH the init.d wrapper AND the new direct invocation, which
+> systemd refuses to load for any service that is not `Type=oneshot`.
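The clear-then-set pattern can be checked mechanically before a drop-in leaves the jumphost. The following is a minimal sketch, not part of the original procedure: `check_override` is a hypothetical helper name, and the demo writes to a throwaway temp directory rather than `$WORK`.

```bash
# Hypothetical helper (an assumption, not from the runbook): succeed only if
# the first ExecStart= line in a drop-in is the empty reset and exactly one
# non-empty ExecStart= line follows it.
check_override() {
    awk '
        /^ExecStart=/ {
            n++
            if (n == 1 && $0 != "ExecStart=") bad = 1   # reset must come first
            if (n > 1 && $0 == "ExecStart=")  bad = 1   # only one reset expected
        }
        END { if (n == 2 && !bad) exit 0; exit 1 }
    ' "$1"
}

# Demo on throwaway files (paths are illustrative only)
demo_dir=$(mktemp -d)
printf '[Service]\nExecStart=\nExecStart=/usr/bin/magnum-api --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d\n' \
    > "$demo_dir/good.conf"
printf '[Service]\nExecStart=/usr/bin/magnum-api --config-dir=/etc/magnum/magnum.conf.d\n' \
    > "$demo_dir/missing-reset.conf"

check_override "$demo_dir/good.conf" && echo "good.conf: reset present"
check_override "$demo_dir/missing-reset.conf" || echo "missing-reset.conf: rejected"
```

Against the real files, the same helper could be pointed at `"$WORK/magnum-api-override.conf"` and `"$WORK/magnum-conductor-override.conf"` right after the `file` check.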
+
+**Install on the unit:**

 ```bash
-juju ssh magnum/leader -- "sudo systemctl cat magnum-api | grep ExecStart"
-juju ssh magnum/leader -- "ps -ef | grep magnum-api | grep -v grep"
-# Expect: /usr/bin/magnum-api with --config-dir flag
+juju scp "$WORK/magnum-api-override.conf" magnum/leader:/tmp/magnum-api-override.conf
+juju scp "$WORK/magnum-conductor-override.conf" magnum/leader:/tmp/magnum-conductor-override.conf
+
+juju exec --unit magnum/leader -- \
+  'sudo mkdir -p /etc/systemd/system/magnum-api.service.d /etc/systemd/system/magnum-conductor.service.d && \
+   sudo install -m 0644 -o root -g root /tmp/magnum-api-override.conf /etc/systemd/system/magnum-api.service.d/override.conf && \
+   sudo install -m 0644 -o root -g root /tmp/magnum-conductor-override.conf /etc/systemd/system/magnum-conductor.service.d/override.conf && \
+   sudo rm /tmp/magnum-api-override.conf /tmp/magnum-conductor-override.conf'
+
+# Reload systemd to pick up the overrides
+juju exec --unit magnum/leader -- 'sudo systemctl daemon-reload'
+
+# Verify the overrides are effective (systemctl cat prints the base unit, then each drop-in)
+juju exec --unit magnum/leader -- 'sudo systemctl cat magnum-api | grep ExecStart'
+# Expect: THREE ExecStart= lines: the base unit's init.d wrapper, then the drop-in's
+# empty clear-line and the new /usr/bin/magnum-api invocation
+juju exec --unit magnum/leader -- 'sudo systemctl cat magnum-conductor | grep ExecStart'
+# Expect: the same three-line pattern for magnum-conductor
 ```

-## Step 7 — Verify driver loaded
+> **Charm reconciliation note:** the Magnum charm may rewrite its own systemd
+> units on config changes or upgrades. The drop-in override at
+> `/etc/systemd/system/magnum-api.service.d/override.conf` is OUTSIDE the
+> charm's writable zone and should survive. Verify after any `juju refresh` or
+> `juju config magnum` command by re-running the `systemctl cat` check above.
+
+---
+
+## 9. Restart services + verify health

 ```bash
-juju ssh magnum/leader -- "sudo tail -100 /var/log/magnum/magnum-conductor.log | grep -i -E 'driver|capi'"
-# Expect: log lines mentioning k8s_capi_helm_v1 driver loaded
+juju exec --unit magnum/leader -- \
+  'sudo systemctl restart magnum-api magnum-conductor'
+
+# Wait briefly for services to initialize
+sleep 5
+
+# Check active state
+juju exec --unit magnum/leader -- \
+  'sudo systemctl is-active magnum-api magnum-conductor'
+# Expect: active (x2)
+
+# Examine recent journal for errors (the critical step — magnum's silent failure
+# mode means we must read logs, not just trust is-active)
+juju exec --unit magnum/leader -- \
+  'sudo journalctl -u magnum-api --since "2 minutes ago" --no-pager | tail -50'
+juju exec --unit magnum/leader -- \
+  'sudo journalctl -u magnum-conductor --since "2 minutes ago" --no-pager | tail -50'
 ```

-## Step 8 — Smoke test
+**Look for these red flags in the logs:**

-Create a cluster template and small cluster to validate end-to-end:
+| Symptom | Likely cause | Remediation |
+|---|---|---|
+| `ImportError: No module named magnum_capi_helm` | §5 pip install failed | Re-run §5; check pip3 output |
+| `EntryPointError: No 'k8s_capi_helm_v1' driver` | Driver entry-point name mismatch | Verify name per §5 footnote; update §7 |
+| Service repeatedly restarts (look for "Started" appearing twice in 10s) | Likely a config error in 99-capi.conf | Re-check ASCII-only; check magnum.conf.d permissions |
+| `kubeconfig_file` not honored | `--config-dir` not being passed | §8 override not active; re-run `systemctl daemon-reload` |
+| Silent: no error but driver also not loading | Non-ASCII char snuck into a conf | `file /etc/magnum/magnum.conf.d/99-capi.conf` — if it says UTF-8, regenerate |
+
+---
+
+## 10. CAPI driver enablement check
+
+Verify the driver is actually loaded by Magnum and reachable via the API.
 ```bash
-source ~/admin-openrc
+source "$HOME/admin-openrc"

-# Cluster template
-openstack coe cluster template create \
-  --name k8s-capi-test \
+# Confirm the Magnum API endpoint responds (this lists cluster templates, not drivers)
+openstack coe cluster template list -f json
+# (empty templates list is fine — we are checking the endpoint responds)
+
+# Direct check on the unit: scan the service's loaded drivers
+juju exec --unit magnum/leader -- \
+  'sudo journalctl -u magnum-conductor --since "5 minutes ago" --no-pager | grep -iE "driver|enabled" | head -20'
+# Expect: a line mentioning k8s_capi_helm_v1 having been loaded
+# (Magnum logs the loaded drivers at startup)
+
+# Definitive check: try creating a cluster template that requires the CAPI driver
+openstack coe cluster template create magnum-capi-driver-check \
   --image noble-amd64 \
-  --keypair capi-mgmt-key \
-  --external-network \
-  --master-flavor m1.medium \
-  --flavor m1.medium \
+  --keypair capi-workload-key \
+  --external-network ext_net \
+  --master-flavor capi-mgmt-node \
+  --flavor capi-mgmt-node \
   --coe kubernetes \
   --network-driver calico \
-  --labels driver=k8s_capi_helm_v1,kube_tag=v1.32.2
+  --labels kube_tag=v1.31.4

-# Cluster
-openstack coe cluster create \
-  --cluster-template k8s-capi-test \
-  --master-count 1 \
-  --node-count 1 \
-  --keypair capi-mgmt-key \
-  k8s-capi-smoke
-
-# Poll
-watch -n 30 'openstack coe cluster show k8s-capi-smoke -c status -c status_reason'
-# Expect CREATE_COMPLETE within 15-20 min
+openstack coe cluster template show magnum-capi-driver-check -c name -c coe -c labels
 ```

-Tear down the smoke cluster after validation:
+> **If template create fails with "driver not enabled" or similar:** the
+> Magnum API process is not loading the conf.d. Verify the systemd override
+> took effect — `sudo systemctl show magnum-api -p ExecStart` on the unit
+> should show the explicit `--config-dir` invocation. If it still shows the
+> init.d wrapper, the daemon-reload + restart did not pick up the override.
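The `systemctl show` comparison can be reduced to a yes/no check. This is a minimal sketch under stated assumptions: `has_config_dir` is a hypothetical helper, and the two sample strings are abbreviated illustrations of what `systemctl show -p ExecStart` prints, not output captured from a real unit.

```bash
# Hypothetical helper (not part of the runbook): succeed if an ExecStart
# value carries the --config-dir flag.
has_config_dir() {
    case "$1" in
        *--config-dir*) return 0 ;;
        *)              return 1 ;;
    esac
}

# Abbreviated, illustrative samples (assumed shape, trailing fields elided)
wrapper='ExecStart={ path=/etc/init.d/magnum-api ; argv[]=/etc/init.d/magnum-api systemd-start ; ... }'
direct='ExecStart={ path=/usr/bin/magnum-api ; argv[]=/usr/bin/magnum-api --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d ; ... }'

has_config_dir "$wrapper" || echo "wrapper: conf.d NOT loaded"
has_config_dir "$direct"  && echo "direct: conf.d loaded"
```

On the unit itself, the argument would be the live value, for example `has_config_dir "$(systemctl show magnum-api -p ExecStart)"`.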
+
+**Clean up the driver-check template:**

 ```bash
-openstack coe cluster delete k8s-capi-smoke
-# Wait for DELETE_COMPLETE
-openstack coe cluster template delete k8s-capi-test
+openstack coe cluster template delete magnum-capi-driver-check
 ```

-## Exit criteria
+---

-- Magnum services running with `--config-dir /etc/magnum/magnum.conf.d`
-  visible in the live process
-- `k8s_capi_helm_v1` driver logged at conductor startup
-- Smoke-test cluster reached `CREATE_COMPLETE` and torn down cleanly
+## 11. Optional smoketest — create a tenant CAPI cluster

-## Idempotency and recovery notes
+This step is **optional**. Full validation belongs in runbook 08. Use this
+smoketest only if you want immediate confirmation that the entire chain
+(Magnum API -> conductor -> magnum-capi-helm -> CAPI controllers in workload
+cluster -> tenant K8s cluster on tenant VMs) works end-to-end.

-- The systemd override survives `charm config-changed` (charm rewrites
-  `magnum.conf` but doesn't touch the conf.d dir or systemd drop-ins)
-- The pip-installed driver may NOT survive a charm `upgrade-charm` — if
-  the venv gets rebuilt, re-run Step 3
-- The kubeconfig at `/etc/magnum/kubeconfig` is operator-managed; survives
-  charm hooks but if Magnum is redeployed, restore it
+```bash
+# Create a cluster template tuned for testcloud smoketest
+openstack coe cluster template create magnum-smoketest-template \
+  --image noble-amd64 \
+  --keypair capi-workload-key \
+  --external-network ext_net \
+  --master-flavor capi-mgmt-node \
+  --flavor capi-mgmt-node \
+  --coe kubernetes \
+  --network-driver calico \
+  --labels boot_volume_size=20,kube_tag=v1.31.4,octavia_provider=ovn

-## Recurring pitfalls
+# Create a 1+1 cluster (minimum for smoketest)
+openstack coe cluster create magnum-smoketest \
+  --cluster-template magnum-smoketest-template \
+  --master-count 1 \
+  --node-count 1

-- `juju ssh` HANGS when stdout is redirected — use `juju exec --unit X -- 'cmd'`
-- Python magnum at `/usr/lib/python3/dist-packages/magnum/` needs `--break-system-packages` for PEP 668
-- Heredoc nesting in `juju ssh` is fragile — keep heredocs simple, single level
-- Non-ASCII characters in conf.d files cause silent daemon failures — ensure ASCII only
+# Poll for status (15-20 min typical; CAPI provisions tenant VMs end-to-end)
+for i in $(seq 1 60); do
+  STATUS=$(openstack coe cluster show magnum-smoketest -c status -f value)
+  echo "$(date -Is) status=$STATUS"
+  case "$STATUS" in
+    CREATE_COMPLETE) echo "Smoketest passed"; break ;;
+    # break, not exit: exit would close the operator's interactive shell
+    CREATE_FAILED) echo "Smoketest FAILED"; openstack coe cluster show magnum-smoketest; break ;;
+  esac
+  sleep 30
+done
+
+# Retrieve the smoketest cluster's kubeconfig
+mkdir -p "$WORK/smoketest-kubeconfig"
+openstack coe cluster config magnum-smoketest --dir "$WORK/smoketest-kubeconfig"
+
+# Sanity-check the smoketest cluster
+KUBECONFIG="$WORK/smoketest-kubeconfig/config" kubectl get nodes
+KUBECONFIG="$WORK/smoketest-kubeconfig/config" kubectl get pods -A | head -20
+
+# Clean up the smoketest cluster
+openstack coe cluster delete magnum-smoketest
+# Wait for the delete to finish before removing the template; a template
+# still referenced by a cluster cannot be deleted
+openstack coe cluster template delete magnum-smoketest-template
+```
+
+> **What success looks like:** the CAPI controllers in the workload cluster
+> receive the new Cluster CR (created by magnum-capi-helm in response to the
+> Magnum API call), CAPO talks to OpenStack to provision tenant VMs, the
+> tenant VMs join the new K8s cluster, and the new cluster has 1 control
+> plane + 1 worker Ready. Octavia provides the API server LB (visible as a
+> Floating IP in the tenant project).
+
+---
+
+## 12. Roosevelt deltas (forward-look)
+
+| Aspect | Testcloud (v1) | Roosevelt |
+|---|---|---|
+| Driver pin source | PyPI `magnum-capi-helm==1.1.0` | Internal mirror with checksum verification |
+| Driver pin record | Implicit in this runbook | Captured in Vault as audit artifact alongside CAPI pins |
+| Kubeconfig source | Workload cluster (post-pivot per 04a §17) | Same |
+| Kubeconfig rotation | Manual on capi-mgmt rebuild | Automated when workload cluster cert rotates |
+| Trustee credential | Charm-default magnum-shared user | Per-tenant app credentials via Vault auth method |
+| Magnum HA | num_units=1 (per D-009 testcloud) | num_units=3 with hacluster + provider VIP |
+| Driver upgrade discipline | Manual re-run of §5 | Tracked maintenance window; Vault audit log |
+| Systemd override | Drop-in at `/etc/systemd/system/magnum-*.service.d/override.conf` | Same — but provided via a charm overlay package, not manual file install |
+| ASCII-only enforcement | Manual check (§7, §8) | Pre-flight lint in `scripts/pre-flight-checks.sh` |
+
+---
+
+## 13. Documented runtime gotchas (carry-forward from handoff)
+
+These gotchas burned cycles during the Bobcat Magnum CAPI work. Each is
+explicitly handled in this runbook; collecting them here for visibility:
+
+1. **PEP 668 `--break-system-packages`** (§5). Ubuntu 23.04+ (including
+   24.04 LTS) refuses `pip install` against system Python by default. The
+   flag is required for the magnum-capi-helm install path used by Charmed Magnum.
+2. **`juju ssh` hangs on stdout redirect.** PTY allocation issue.
+   This runbook uses `juju exec` for all non-interactive command execution.
+3. **Heredoc nesting in `juju ssh` is fragile.** This runbook writes
+   conf files locally first, then transfers them with `juju scp` and installs
+   them with `install` run through `juju exec` — single-level only.
+4. **Non-ASCII characters in `conf.d` files cause silent daemon failures.**
+   §7 and §8 both include a `file` ASCII verification before transfer.
+5. **`openstack -f value -c X -c Y` outputs in alphabetical field order,
+   not flag order.** This runbook uses single-column queries or
+   `-f json | jq` throughout.
+6. **Charm-managed `enabled_drivers` is overridden, not appended.** The
+   `enabled_drivers = k8s_capi_helm_v1` line in 99-capi.conf REPLACES the
+   charm-default value (which would include the deprecated Heat drivers).
+7. **The systemd override empty `ExecStart=` line is required** to clear
+   the inherited ExecStart before setting the replacement (§8).
+8. **Snap-confined `openstack` CLI cannot read `/tmp`.** This runbook stages
+   files under `$WORK` (`$HOME/magnum-capi`). The smoketest in §11 also writes
+   to `$WORK/smoketest-kubeconfig`.
+
+---
+
+## 14. Change log
+
+| Date | Change | Reference |
+|---|---|---|
+| 2026-05-22 | Document created. magnum-capi-helm 1.1.0 from PyPI; workload-cluster kubeconfig (post-pivot per workstream 3b); systemd override pattern; ASCII-only conf.d. | Workstream 3c |