diff --git a/README.md b/README.md
index 3052ddc..8a00b43 100644
--- a/README.md
+++ b/README.md
@@ -51,6 +51,7 @@
 │   ├── 02-deploy.md  # juju deploy + settle wait
 │   ├── 03-vault-init.md  # vault unseal + cert auth
 │   ├── 04-magnum-domain.md  # domain-setup action + keystone wiring
+│   ├── 04a-capi-bootstrap-cluster.md  # capi-mgmt VM deploy + k3s + CAPI + ORC (D-017)
 │   ├── 05-magnum-capi-driver.md  # pip install driver + kubeconfig + systemd
 │   ├── 06-tenant-setup.md  # project, user, openrc, app credentials
 │   ├── 07-dns-zones.md  # Designate zones + API VIP A records (v1)
@@ -77,10 +78,11 @@
 5. Deploy new bundle (`runbooks/02-deploy.md`)
 6. Initialize Vault (`runbooks/03-vault-init.md`)
 7. Set up Magnum domain (`runbooks/04-magnum-domain.md`)
-8. Install Magnum CAPI Helm driver (`runbooks/05-magnum-capi-driver.md`)
-9. Recreate tenant resources (`runbooks/06-tenant-setup.md`)
-10. Populate DNS zones (`runbooks/07-dns-zones.md`)
-11. Run validation (`runbooks/08-validate.md` + `scripts/validate.sh`)
+8. Stand up CAPI bootstrap cluster on `capi-mgmt.maas` (`runbooks/04a-capi-bootstrap-cluster.md`)
+9. Install Magnum CAPI Helm driver (`runbooks/05-magnum-capi-driver.md`)
+10. Recreate tenant resources (`runbooks/06-tenant-setup.md`)
+11. Populate DNS zones (`runbooks/07-dns-zones.md`)
+12. Run validation (`runbooks/08-validate.md` + `scripts/validate.sh`)

 ## v1-specific design decisions (summary; see docs/design-decisions.md for full record)

diff --git a/docs/design-decisions.md b/docs/design-decisions.md
index df57d6d..98cf902 100644
--- a/docs/design-decisions.md
+++ b/docs/design-decisions.md
@@ -167,8 +167,14 @@
 `enabled_drivers=k8s_capi_helm_v1` and
 `[capi_helm] kubeconfig_file=/etc/magnum/kubeconfig`

-**CAPI mgmt plane:** stays on `capi-mgmt.maas` bootstrap k3s. Not in-cloud.
-This pattern transfers to Roosevelt unchanged.
+**CAPI mgmt plane:** `capi-mgmt.maas` bootstrap k3s. Per **D-017**, this
+cluster is rebuilt from scratch every deployment cycle — there is no
+preserved-across-rebuild artifact. The install procedure for the bootstrap
+cluster lives in `runbooks/04a-capi-bootstrap-cluster.md` and runs **before**
+`runbooks/05-magnum-capi-driver.md`. This pattern transfers to Roosevelt unchanged.
+
+**Superseded portions:** The "preserved across rebuild" stance in earlier
+drafts of this decision is **superseded by D-017**. See D-017 for rationale.

 ---

@@ -274,18 +280,22 @@

 ---

-## D-013: Clean teardown of existing capi-mgmt
+## D-013: ~~Clean teardown of existing capi-mgmt~~ (SUPERSEDED by D-018)

-**Decision:** Before destroying the OpenStack model, gracefully delete the
-CAPI workload cluster on capi-mgmt.maas to allow OpenStack resources (LBs,
+**Original decision:** Before destroying the OpenStack model, gracefully delete
+the CAPI workload cluster on capi-mgmt.maas to allow OpenStack resources (LBs,
 FIPs, volumes) to be cleaned up properly by CAPI controllers.

-**Steps:** `kubectl delete cluster capi-mgmt-cluster` → wait for CAPI to
-clean up tenant-side OpenStack resources → `juju destroy-model openstack
+**Original steps:** `kubectl delete cluster capi-mgmt-cluster` → wait for CAPI
+to clean up tenant-side OpenStack resources → `juju destroy-model openstack
 --destroy-storage --no-prompt`.

-**Preserved across rebuild:** capi-mgmt.maas bootstrap k3s + CAPI controllers
-themselves. Re-used as the Magnum CAPI mgmt plane post-deploy.
+**Original "preserved across rebuild" claim:** capi-mgmt.maas bootstrap k3s +
+CAPI controllers re-used as the Magnum CAPI mgmt plane post-deploy.
+
+**Status:** Superseded. See **D-018** for the replacement teardown strategy
+(MAAS-release-direct, skip graceful) and **D-017** for the replacement
+bootstrap cluster lifecycle (full rebuild every cycle, nothing preserved).

 ---

@@ -378,7 +388,84 @@

 ---

-## Known bugs to avoid in bundle drafting
+## D-017: CAPI bootstrap cluster lifecycle
+
+**Decision:** L3 full teardown and rebuild every deployment cycle. The
+`capi-mgmt.maas` MAAS VM is released back to Ready state on teardown; on
+rebuild, it is re-deployed from scratch with Ubuntu 24.04, k3s, CAPI
+controllers, and ORC. **Nothing is preserved across cycles.**
+
+**Rationale:**
+
+- Rehearsal-first principle. If the bootstrap-cluster install procedure
+  isn't documented and rehearsed, the runbook doesn't exist; if the runbook
+  doesn't exist, surprises surface on Roosevelt.
+- Self-imposed forcing function. Every rebuild exercises the full path:
+  MAAS deploy → Ubuntu cloud-init → Vault CA install → k3s install with
+  correct bind-address/SAN flags → kubeconfig server-URL rewrite → helm +
+  clusterctl install → clusterctl init with canonical-kubernetes provider
+  URLs → ORC install → cloud-side prep → cluster manifest render → apply
+  → poll-to-Ready → kubeconfig copy.
+- Disposability test. The Bobcat experience proved no critical state lives
+  on capi-mgmt that isn't reproducible from the runbook and the OpenStack
+  cloud. Wiping is safe.
+
+**Runbook:** `runbooks/04a-capi-bootstrap-cluster.md` documents the install
+sequence in full. It runs **after** `02-deploy.md` (OpenStack cloud up) and
+**before** `05-magnum-capi-driver.md` (driver graft, which needs the
+bootstrap k3s kubeconfig).
+
+**Supersedes:** the "preserved across rebuild" stance in earlier drafts of
+D-007 and D-013.
+
+**Alternatives considered:**
+
+- L1: Wipe just the cluster CRs, keep k3s + controllers. Rejected: skips
+  the install rehearsal that's the whole point.
+- L2: Wipe just the controllers, keep k3s. Rejected: same reason; the
+  `clusterctl init` step is exactly the surface that needs rehearsing.
+- L3 (chosen): Full wipe including the VM.
+
+---
+
+## D-018: Teardown strategy — skip graceful, release MAAS directly
+
+**Decision:** On teardown, do not pursue graceful CAPI workload deletion or
+graceful OpenStack model destroy. Instead:
+
+1. (Optional) Capture pre-destroy state for reference
+2. `juju destroy-model openstack --force --no-wait --destroy-storage --no-prompt` (background)
+3. MAAS release all 5 VMs (openstack0, openstack1, openstack2, openstack3, capi-mgmt) → Ready (parallel)
+4. Verify both sides
+
+**Rationale:**
+
+- The rebuild's goal is rehearsing the Roosevelt deploy path. Roosevelt
+  starts from MAAS-Ready bare-metal machines. The most faithful rehearsal
+  is teardown-to-MAAS-Ready.
+- Graceful CAPI workload teardown rehearses a different procedure
+  (production cluster decommissioning) that doesn't transfer to Roosevelt's
+  initial deploy.
+- `juju destroy-model --destroy-storage` can hang on stuck hooks and leave
+  partial state. `--force --no-wait` plus MAAS release is more reliable.
+- Cloud-side OpenStack data (Keystone projects, Neutron networks, Glance
+  images, app credentials) lives in MySQL on the openstack0-3 hosts. MAAS
+  release wipes those hosts, so no separate cloud-side cleanup is needed.
+
+**What is lost vs. graceful path:** verified-clean release path for CAPI
+workload resources (Octavia LBs, FIPs, CAPO-managed networks). All of these
+are destined for obliteration anyway; the loss is theoretical.
+
+**What is gained:** ~30+ minutes saved; cleaner end-state guarantee; better
+Roosevelt rehearsal fidelity.
+
+**Supersedes:** D-013.
+
+**Runbook:** `runbooks/01-destroy-model.md` documents the four phases.
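
The MAAS half of step 4 ("Verify both sides") can be checked mechanically. The following is a minimal sketch, assuming input lines of the form `hostname status` such as the `jq` filter in `runbooks/01-destroy-model.md` prints; the function name `all_ready` and its count argument are hypothetical and not part of this change.

```bash
# Hypothetical helper (not in the repo): reads "hostname status" lines on
# stdin and succeeds only if exactly $1 machines are listed and every one
# of them reports Ready in its last field.
all_ready() {
  local expected="$1"
  awk -v want="$expected" '
    NF >= 2 { total++; if ($NF == "Ready") ready++ }
    END     { exit !(total == want && ready == want) }
  '
}
```

It could be fed from the Phase D lookup, for example `... | jq -r '...' | all_ready 5`, so that a machine still in another state makes the check fail instead of relying on a visual scan.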
+
+---
+
+## Known bugs to avoid in bundle drafting

 From prior bundle review work — these are anti-patterns:

@@ -406,4 +493,5 @@
 | Date | Change | Reference |
 |---|---|---|
 | 2026-05-22 | Initial document captured | Caracal rebuild planning session |
-| 2026-05-22 | D-015 v1/v2 fork added; D-004 and D-004a marked v2-scope; D-016 IPv4 tenant pool hybrid model added; D-014 updated with new repo name | v1/v2 fork session (this update) |
+| 2026-05-22 | D-015 v1/v2 fork added; D-004 and D-004a marked v2-scope; D-016 IPv4 tenant pool hybrid model added; D-014 updated with new repo name | v1/v2 fork session |
+| 2026-05-22 | D-017 CAPI bootstrap full-rebuild lifecycle added; D-018 MAAS-release-direct teardown added. D-013 marked superseded by D-018. D-007 Layer B updated to reference D-017 and `runbooks/04a-capi-bootstrap-cluster.md`. | Teardown planning + handoff session |
diff --git a/runbooks/01-destroy-model.md b/runbooks/01-destroy-model.md
index 5a41968..d158b84 100644
--- a/runbooks/01-destroy-model.md
+++ b/runbooks/01-destroy-model.md
@@ -1,23 +1,99 @@
-# Runbook 01 — Destroy Existing OpenStack Model
+# Runbook 01 — Teardown of existing testcloud

-**STATUS: PLACEHOLDER** — drafted alongside bundle drafting.
+**Reference:** D-018 (skip graceful, MAAS-release-direct). Supersedes the
+graceful-teardown approach formerly in D-013.

-## Purpose
+**Pre-conditions:**

-Cleanly destroy the existing Bobcat `openstack` model, freeing the Juju
-controller to host the new Caracal model.
+- KVM snapshots of openstack0–3 exist as the safety net (pre-Magnum
+  baseline). With L3 full rebuild (D-017) we should not need them, but they
+  remain valid disaster recovery.
+- Run from jumphost `vopenstack-jesse` as user `jessea123`.
+- Authenticated Juju session active (`juju whoami` returns identity).
+- MAAS CLI profile configured OR access to MAAS UI for releasing machines.
+- This procedure destroys the entire `openstack` Juju model and wipes all 5
+  MAAS-managed VMs. There is no undo short of restoring from snapshot.
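
The tool-related pre-conditions above can be checked before Phase A starts. This is a minimal sketch, assuming the CLI path (`juju`, `maas`, `jq`) is the one in use; the helper name `require_tools` is hypothetical and not part of the runbook.

```bash
# Hypothetical preflight helper: reports each named command that is absent
# from PATH and returns non-zero if any is missing.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf 'missing: %s\n' "$tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `require_tools juju maas jq && juju whoami`, which stops early when a tool is missing and otherwise confirms the Juju session.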

-## Prerequisites
+**Phase A — Pre-destroy capture (~30 sec)**

-- All steps in `00-pre-deploy.md` completed including go/no-go checklist
-- Vault unseal keys backed up
-- KVM snapshots in place
-- CAPI workload cluster gracefully torn down
+```bash
+BACKUP_DIR=~/backups/pre-caracal-destroy-$(date -u +%Y%m%dT%H%M%SZ)
+mkdir -p "$BACKUP_DIR"
+juju export-bundle > "$BACKUP_DIR/bundle-pre-destroy.yaml"
+juju status --format=yaml > "$BACKUP_DIR/juju-status-pre-destroy.yaml"
+juju models --format=yaml > "$BACKUP_DIR/juju-models-pre-destroy.yaml"
+ls -la "$BACKUP_DIR"
+```

-## TODO
+This is reference material for diff-checking against the new Caracal bundle
+later. Not used for restore.

-- [ ] `juju destroy-model openstack --destroy-storage --no-prompt`
-- [ ] Verify storage cleanup (no orphaned LXD storage pools, no orphaned
-      volumes on Ceph)
-- [ ] Verify MAAS-side machine state (machines back to Ready, not Deployed)
-- [ ] Clean up any stale Juju agent state on KVM hosts if needed
+**Phase B — Force-destroy the Juju model (~1-2 min to return; ~5-10 min to fully reap in background)**
+
+```bash
+juju destroy-model openstack --force --no-wait --destroy-storage --no-prompt
+```
+
+Flags:
+
+- `--force` — ignore charm hooks; don't wait for graceful shutdown
+- `--no-wait` — return immediately; reaping continues in the background
+- `--destroy-storage` — mark Juju-tracked persistent storage for cleanup
+- `--no-prompt` — non-interactive
+
+The Juju controller on `juju.maas` is untouched. Only the `openstack` model
+is destroyed.
+
+**Phase C — Release MAAS machines (parallel with Phase B; ~5 min)**
+
+Either path is acceptable. UI is faster for visual confirmation; CLI is
+script-documented for Roosevelt.
+
+**Path 1 — MAAS UI:** Machines → select `openstack0`, `openstack1`,
+`openstack2`, `openstack3`, `capi-mgmt` → Take action → Release.
+
+**Path 2 — MAAS CLI:**
+
+```bash
+# Replace $PROFILE with your MAAS CLI profile name (e.g. "admin")
+PROFILE=admin
+
+# Look up system IDs
+maas $PROFILE machines read 2>/dev/null \
+  | jq -r '.[] | select(.hostname | test("^(openstack[0-3]|capi-mgmt)$")) | "\(.hostname) \(.system_id) \(.status_name)"'
+
+# Release each by system_id (replace SYSTEM_IDS with the IDs printed above)
+for SID in SYSTEM_IDS; do
+  maas $PROFILE machine release "$SID" comment="Caracal rebuild teardown"
+done
+```
+
+LXD VMs managed by MAAS are destroyed on release; the VMs go away and the
+machine entries return to Ready state.
+
+**Phase D — Verification (~1 min)**
+
+```bash
+# Juju side
+juju models
+# Expect: openstack model not listed
+
+# MAAS side — all 5 hostnames must report Ready
+maas $PROFILE machines read 2>/dev/null \
+  | jq -r '.[] | select(.hostname | test("^(openstack[0-3]|capi-mgmt)$")) | "\(.hostname) \(.status_name)"'
+# Expect five lines, each ending in "Ready"
+```
+
+**If the Juju model is still listed as "destroying" after 10 minutes:**
+
+```bash
+# Force-clean any orphan machine entries
+juju machines -m openstack --format=yaml 2>/dev/null
+# For each lingering machine (replace MACHINE_ID with the ID listed above):
+juju remove-machine -m openstack --force MACHINE_ID
+# Then attempt model removal again
+juju destroy-model openstack --force --no-wait --no-prompt
+```
+
+**Exit criteria:** `juju models` does not show `openstack`. All 5 VMs show
+`Ready` in MAAS. Proceed to `02-deploy.md`.
diff --git a/runbooks/04a-capi-bootstrap-cluster.md b/runbooks/04a-capi-bootstrap-cluster.md
new file mode 100644
index 0000000..8183b50
--- /dev/null
+++ b/runbooks/04a-capi-bootstrap-cluster.md
@@ -0,0 +1,412 @@
+# Runbook 04a — CAPI bootstrap cluster install on capi-mgmt.maas
+
+**Reference:** D-017 (full rebuild every cycle). Runs after `04-magnum-domain.md`
+and before `05-magnum-capi-driver.md`.
+
+**Goal:** From a MAAS-Ready `capi-mgmt` VM, produce a single-node k3s
+running cluster-api, CAPO, canonical-kubernetes providers, cert-manager,
+and ORC, with a workload-cluster kubeconfig delivered to the jumphost
+for use by the Magnum CAPI driver in runbook 05.
+
+**Pre-conditions:**
+
+- OpenStack cloud is up and stable (`02-deploy.md` complete, all units
+  active/idle)
+- Magnum trustee domain is created (`04-magnum-domain.md` complete)
+- `capi-mgmt` MAAS machine is in **Ready** state (released after teardown,
+  not yet deployed)
+- Jumphost has `~/admin-openrc` sourced and an authenticated `openstack`
+  CLI working against the new Caracal cloud
+- Vault CA bundle is available on the jumphost at a known path
+  (issued by the Caracal Vault during `03-vault-init.md`)
+
+**Network preconditions:**
+
+- `capi-mgmt` machine should be configured in MAAS with two interfaces:
+  - `eth0` on the metal fabric (DHCP from MAAS) — used for k3s API bind
+  - `eth1` on the provider fabric (static IP, no DHCP) — used for
+    workload-cluster FIP reach. This IP must NOT fall inside the Neutron
+    FIP allocation pool on the ext_net subnet.
+- Verify the eth1 IP is outside the FIP pool before deploy (replace
+  `EXT_NET_SUBNET` with the ext_net subnet name or ID):
+  ```bash
+  openstack subnet show EXT_NET_SUBNET -c allocation_pools -c gateway_ip
+  ```
+
+## Step 1 — Deploy Ubuntu 24.04 to capi-mgmt via MAAS
+
+Use MAAS UI: Machines → capi-mgmt → Take action → Deploy → Ubuntu 24.04
+LTS (Noble) → Deploy machine. Wait for Deployed status (~10 min).
+
+Verify SSH reachability once Deployed (note: SSH user is `ubuntu`, not
+`jessea123`; MAAS cloud-init pattern). Replace `CAPI_MGMT_IP` with the
+deployed address:
+
+```bash
+ssh ubuntu@CAPI_MGMT_IP 'hostname; uname -a; ip -br a'
+```
+
+Verify both interfaces show their expected IPs.
+
+## Step 2 — Install Vault CA on the bootstrap host
+
+The bootstrap host must trust the Caracal Vault root CA so that `openstack`
+CLI calls and CAPO authentication to Keystone succeed over HTTPS.
+
+```bash
+# From jumphost — replace CAPI_MGMT_IP with the deployed capi-mgmt IP,
+# VAULT_CA_DIR with the directory holding the CA, KEYSTONE_VIP with the Keystone endpoint host
+scp VAULT_CA_DIR/vault-ca.crt ubuntu@CAPI_MGMT_IP:/tmp/vault-ca.crt
+
+ssh ubuntu@CAPI_MGMT_IP << 'REMOTE'
+sudo install -m 0644 /tmp/vault-ca.crt /usr/local/share/ca-certificates/vault-ca.crt
+sudo update-ca-certificates
+# Verify Keystone reachable with TLS
+curl --cacert /etc/ssl/certs/ca-certificates.crt https://KEYSTONE_VIP:5000/v3 -s -o /dev/null -w "%{http_code}\n"
+# Expect: 200
+REMOTE
+```
+
+## Step 3 — Install k3s
+
+k3s defaults to binding 0.0.0.0:6443. Bind to the metal-network IP only to
+keep the management API off the provider network. The TLS-SAN flags must
+include both the IP and the FQDN. k3s does NOT auto-add 127.0.0.1 to the
+SAN list; if 127.0.0.1 needs to be in the kubeconfig, add it explicitly as
+a `--tls-san`. We do not — we rewrite the kubeconfig server URL instead.
+
+```bash
+ssh ubuntu@CAPI_MGMT_IP 'bash -s' << 'REMOTE'
+set -euo pipefail
+BIND_ADDR=$(ip -4 -br a show eth0 | awk '{print $3}' | cut -d/ -f1)
+echo "bind addr: $BIND_ADDR"
+
+if systemctl is-active --quiet k3s; then
+  echo "[skip] k3s already running"
+else
+  curl -sfL https://get.k3s.io | \
+    INSTALL_K3S_EXEC="server \
+      --bind-address=${BIND_ADDR} \
+      --advertise-address=${BIND_ADDR} \
+      --node-ip=${BIND_ADDR} \
+      --tls-san=${BIND_ADDR} \
+      --tls-san=capi-mgmt.maas \
+      --write-kubeconfig-mode=0644 \
+      --disable=traefik" \
+    sh -
+fi
+
+# Wait for Ready
+for i in $(seq 1 30); do
+  if sudo k3s kubectl get nodes 2>/dev/null | awk 'NR>1 && $2=="Ready"{n++} END{exit n<1}'; then
+    echo "[ok] node Ready after ${i} polls"
+    break
+  fi
+  sleep 2
+done
+
+# Copy and rewrite kubeconfig
+sudo install -o ubuntu -g ubuntu -m 0600 /etc/rancher/k3s/k3s.yaml /home/ubuntu/.kube-bootstrap.yaml
+sed -i "s|server: https://127\\.0\\.0\\.1:6443|server: https://${BIND_ADDR}:6443|" /home/ubuntu/.kube-bootstrap.yaml
+# The server: key is indented in the kubeconfig, so allow leading spaces
+grep '^ *server:' /home/ubuntu/.kube-bootstrap.yaml
+
+KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml kubectl get nodes
+REMOTE
+```
+
+## Step 4 — Install helm and clusterctl
+
+kubectl is provided by k3s as a symlink; do not re-install.
+
+```bash
+ssh ubuntu@CAPI_MGMT_IP 'bash -s' << 'REMOTE'
+set -euo pipefail
+
+# helm
+if ! command -v helm >/dev/null 2>&1; then
+  curl -fL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
+fi
+helm version --short
+
+# clusterctl: fetch the latest release tag from the GitHub API. No pinned
+# fallback is implemented here; if the API call fails, set CLUSTERCTL_VER by hand.
+if ! command -v clusterctl >/dev/null 2>&1; then
+  CLUSTERCTL_VER=$(curl -fsSL --max-time 15 \
+    https://api.github.com/repos/kubernetes-sigs/cluster-api/releases/latest \
+    | python3 -c 'import json,sys; print(json.load(sys.stdin)["tag_name"])')
+  curl -fLo /tmp/clusterctl --max-time 60 \
+    "https://github.com/kubernetes-sigs/cluster-api/releases/download/${CLUSTERCTL_VER}/clusterctl-linux-amd64"
+  sudo install -o root -g root -m 0755 /tmp/clusterctl /usr/local/bin/clusterctl
+  rm /tmp/clusterctl
+fi
+clusterctl version
+REMOTE
+```
+
+## Step 5 — clusterctl init with canonical-kubernetes providers
+
+```bash
+ssh ubuntu@CAPI_MGMT_IP 'bash -s' << 'REMOTE'
+set -euo pipefail
+
+mkdir -p ~/.cluster-api
+cat > ~/.cluster-api/clusterctl.yaml << 'CONFIG'
+providers:
+  - name: "canonical-kubernetes"
+    url: "https://github.com/canonical/cluster-api-k8s/releases/latest/download/bootstrap-components.yaml"
+    type: "BootstrapProvider"
+  - name: "canonical-kubernetes"
+    url: "https://github.com/canonical/cluster-api-k8s/releases/latest/download/control-plane-components.yaml"
+    type: "ControlPlaneProvider"
+CONFIG
+
+export KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml
+
+if kubectl get namespace capi-system >/dev/null 2>&1; then
+  echo "[skip] CAPI already initialized"
+else
+  clusterctl init \
+    --infrastructure openstack \
+    --bootstrap canonical-kubernetes \
+    --control-plane canonical-kubernetes
+fi
+
+# Wait for all controller deployments
+for ns in cert-manager capi-system cabpck-system cacpck-system capo-system; do
+  echo "[wait] ${ns}"
+  kubectl wait --for=condition=Available deployment --all --namespace "${ns}" --timeout=5m
+done
+
+clusterctl version
+kubectl get pods -A
+REMOTE
+```
+
+Expected namespaces (note the abbreviated canonical-kubernetes names):
+
+- `cert-manager`
+- `capi-system` — cluster-api core
+- `capo-system` — CAPI provider for OpenStack
+- `cabpck-system` — CAPI Bootstrap Provider Canonical Kubernetes
+- `cacpck-system` — CAPI Control-Plane Provider Canonical Kubernetes
+
+## Step 6 — Install ORC (OpenStack Resource Controller)
+
+Required by CAPO for managing OpenStack resources as Kubernetes objects.
+Verify the latest release URL before applying.
+
+```bash
+ssh ubuntu@CAPI_MGMT_IP 'bash -s' << 'REMOTE'
+set -euo pipefail
+export KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml
+
+ORC_URL="https://github.com/k-orc/openstack-resource-controller/releases/latest/download/install.yaml"
+kubectl apply -f "$ORC_URL"
+
+# Wait for ORC controller
+sleep 5
+for ns in $(kubectl get ns -o name | grep -E '^namespace/(orc|openstack-resource-controller)' | sed 's|namespace/||'); do
+  echo "[wait] ${ns}"
+  kubectl wait --for=condition=Available deployment --all --namespace "${ns}" --timeout=5m
+done
+REMOTE
+```
+
+## Step 7 — Cloud-side preparation (run from jumphost)
+
+Inventory existing images and flavors before creating. Lesson from prior
+cycles: do not blindly create `ubuntu-24.04-capi` when `noble-amd64` is
+already present and suitable.
+
+```bash
+source ~/admin-openrc
+openstack image list | grep -i noble
+openstack flavor list
+```
+
+Create the supporting cloud-side resources for CAPO:
+
+```bash
+# Project
+openstack project create --domain admin_domain capi-mgmt \
+  --description "CAPI management cluster workloads"
+
+# User
+openstack user create --domain admin_domain --project capi-mgmt \
+  --project-domain admin_domain --password-prompt capo
+
+# Roles
+openstack role add --project capi-mgmt --project-domain admin_domain \
+  --user capo --user-domain admin_domain member
+openstack role add --project capi-mgmt --project-domain admin_domain \
+  --user capo --user-domain admin_domain load-balancer_member
+
+# Switch to capo (replace the ALL-CAPS placeholders with site values)
+unset $(env | awk -F= '/^OS_/{print $1}')
+export OS_AUTH_URL=KEYSTONE_AUTH_URL
+export OS_IDENTITY_API_VERSION=3
+export OS_USERNAME=capo
+export OS_USER_DOMAIN_NAME=admin_domain
+export OS_PROJECT_NAME=capi-mgmt
+export OS_PROJECT_DOMAIN_NAME=admin_domain
+export OS_PASSWORD=CAPO_PASSWORD
+export OS_CACERT=PATH_TO_VAULT_CA
+
+# App credential (record id and secret immediately — secret only shown at creation)
+mkdir -p ~/capi-mgmt
+openstack application credential create capo-app-cred \
+  --description "CAPO authentication" \
+  -f yaml > ~/capi-mgmt/capo-app-cred.yaml
+chmod 0600 ~/capi-mgmt/capo-app-cred.yaml
+
+# Nova keypair — generate on capi-mgmt and upload public key
+# Keep the public key under $HOME: the snap-confined openstack CLI cannot read /tmp
+ssh ubuntu@CAPI_MGMT_IP 'ssh-keygen -t ed25519 -N "" -f ~/.ssh/capi-mgmt-key'
+ssh ubuntu@CAPI_MGMT_IP 'cat ~/.ssh/capi-mgmt-key.pub' > ~/capi-mgmt/capi-mgmt-key.pub
+openstack keypair create --public-key ~/capi-mgmt/capi-mgmt-key.pub capi-mgmt-key
+# Also pull the private key back to jumphost for post-rebuild access
+scp -p ubuntu@CAPI_MGMT_IP:~/.ssh/capi-mgmt-key ~/capi-mgmt/capi-mgmt-key
+chmod 0600 ~/capi-mgmt/capi-mgmt-key
+```
+
+## Step 8 — Compose clouds.yaml and cloud.conf
+
+Use `v3applicationcredential` auth — cleaner than user/password.
+
+```bash
+# Read app credential
+APP_CRED_ID=$(yq -r '.id' ~/capi-mgmt/capo-app-cred.yaml)
+APP_CRED_SECRET=$(yq -r '.secret' ~/capi-mgmt/capo-app-cred.yaml)
+
+# Compose clouds.yaml for capi-mgmt (replace KEYSTONE_AUTH_URL with the Keystone v3 URL)
+cat > /tmp/clouds.yaml << EOC
+clouds:
+  openstack:
+    auth_type: v3applicationcredential
+    auth:
+      auth_url: KEYSTONE_AUTH_URL
+      application_credential_id: ${APP_CRED_ID}
+      application_credential_secret: ${APP_CRED_SECRET}
+    region_name: RegionOne
+    cacert: /usr/local/share/ca-certificates/vault-ca.crt
+    interface: public
+    identity_api_version: 3
+EOC
+
+scp /tmp/clouds.yaml ubuntu@CAPI_MGMT_IP:/home/ubuntu/clouds.yaml
+ssh ubuntu@CAPI_MGMT_IP 'chmod 0600 ~/clouds.yaml'
+
+# cloud.conf for OCCM — use tls-insecure=true for v1 testcloud
+# (v2: ship Vault CA via CK8sConfig files field instead)
+# Replace EXT_NET_ID with the ID of the external (FIP) network
+cat > /tmp/cloud.conf << EOC
+[Global]
+auth-url=KEYSTONE_AUTH_URL
+application-credential-id=${APP_CRED_ID}
+application-credential-secret=${APP_CRED_SECRET}
+region=RegionOne
+tls-insecure=true
+
+[LoadBalancer]
+floating-network-id=EXT_NET_ID
+EOC
+```
+
+## Step 9 — Render and apply the cluster manifest
+
+The canonical-kubernetes cluster template takes 18 substitution variables.
+Capture them in a `cluster-env` file, then use `envsubst` to render. The
+template is fetched from `canonical/cluster-api-k8s`.
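
`envsubst` replaces any variable that is missing from the environment with an empty string and reports nothing, so a typo in `cluster-env` only surfaces later as a broken manifest. A minimal pre-render check, assuming the template references variables in the plain `${NAME}` form; the function name `missing_template_vars` is hypothetical and not part of the runbook.

```bash
# Hypothetical helper: prints the name of every ${NAME} placeholder in the
# template ($1) that is unset or empty in the exported environment.
# Only the plain ${NAME} form is recognised; bare $NAME references and
# shell-style defaults such as ${NAME:-x} are not detected.
missing_template_vars() {
  local template="$1" name
  grep -o '\${[A-Za-z_][A-Za-z0-9_]*}' "$template" | sort -u | tr -d '${}' |
  while read -r name; do
    # printenv sees only exported variables, which is also what envsubst sees
    [ -n "$(printenv "$name")" ] || printf '%s\n' "$name"
  done
}
```

Running `missing_template_vars /tmp/cluster-template.yaml` after sourcing `cluster-env` and before `envsubst` should print nothing; any name it prints would otherwise be rendered as an empty value.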
+
+Variables (verify exact names against the template at apply time; replace
+the ALL-CAPS placeholder values with site values):
+
+```
+CLUSTER_NAME=capi-mgmt-cluster
+NAMESPACE=default
+KUBERNETES_VERSION=v1.32.2
+CONTROL_PLANE_MACHINE_COUNT=1
+WORKER_MACHINE_COUNT=0
+OPENSTACK_CONTROL_PLANE_MACHINE_FLAVOR=capi-mgmt-node
+OPENSTACK_NODE_MACHINE_FLAVOR=capi-mgmt-node
+OPENSTACK_DNS_NAMESERVERS=DNS_NAMESERVER_IPS
+OPENSTACK_EXTERNAL_NETWORK_ID=EXT_NET_ID
+OPENSTACK_FAILURE_DOMAIN=nova
+OPENSTACK_IMAGE_NAME=noble-amd64
+OPENSTACK_SSH_KEY_NAME=capi-mgmt-key
+OPENSTACK_CLOUD_YAML_B64=$(base64 -w0 /tmp/clouds.yaml)
+OPENSTACK_CLOUD_CONFIG_B64=$(base64 -w0 /tmp/cloud.conf)
+OPENSTACK_CLOUD_CACERT_B64=$(base64 -w0 PATH_TO_VAULT_CA)
+OPENSTACK_CLOUD=openstack
+OPENSTACK_NODE_CIDR=10.6.0.0/24
+KUBE_CONTROL_PLANE_ENDPOINT_PORT=6443
+```
+
+Render and apply:
+
+```bash
+ssh ubuntu@CAPI_MGMT_IP 'bash -s' << 'REMOTE'
+set -euo pipefail
+export KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml
+
+curl -fLo /tmp/cluster-template.yaml \
+  https://github.com/canonical/cluster-api-k8s/releases/latest/download/cluster-template.yaml
+
+# Source env vars (operator fills in /tmp/cluster-env)
+# set -a exports every variable the file defines; envsubst only sees exported variables
+set -a
+# shellcheck disable=SC1091
+source /tmp/cluster-env
+set +a
+
+envsubst < /tmp/cluster-template.yaml > /tmp/cluster-rendered.yaml
+kubectl apply -f /tmp/cluster-rendered.yaml
+REMOTE
+```
+
+## Step 10 — Poll for cluster Available
+
+```bash
+ssh ubuntu@CAPI_MGMT_IP 'bash -s' << 'REMOTE'
+set -euo pipefail
+export KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml
+
+START=$(date +%s)
+DEADLINE=$((START + 15*60))
+
+while [[ $(date +%s) -lt $DEADLINE ]]; do
+  PHASE=$(kubectl get cluster capi-mgmt-cluster -o jsonpath='{.status.phase}' 2>/dev/null || echo "?")
+  AVAILABLE=$(kubectl get cluster capi-mgmt-cluster -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null || echo "?")
+  ELAPSED=$(($(date +%s) - START))
+  printf '[%4ds] Phase=%s Available=%s\n' "$ELAPSED" "$PHASE" "$AVAILABLE"
+  [[ "$AVAILABLE" == "True" ]] && break
+  sleep 15
+done
+
+clusterctl describe cluster capi-mgmt-cluster --show-conditions all
+REMOTE
+```
+
+## Step 11 — Export workload kubeconfig to jumphost
+
+```bash
+ssh ubuntu@CAPI_MGMT_IP 'bash -s' << 'REMOTE'
+set -euo pipefail
+export KUBECONFIG=/home/ubuntu/.kube-bootstrap.yaml
+mkdir -p ~/magnum-capi
+clusterctl get kubeconfig capi-mgmt-cluster > ~/magnum-capi/capi-mgmt-cluster.kubeconfig
+chmod 0600 ~/magnum-capi/capi-mgmt-cluster.kubeconfig
+KUBECONFIG=~/magnum-capi/capi-mgmt-cluster.kubeconfig kubectl get nodes
+REMOTE
+
+# Copy to jumphost for runbook 05 (create the target directory first)
+mkdir -p ~/magnum-capi
+scp -p ubuntu@CAPI_MGMT_IP:~/magnum-capi/capi-mgmt-cluster.kubeconfig ~/magnum-capi/capi-mgmt-cluster.kubeconfig
+chmod 0600 ~/magnum-capi/capi-mgmt-cluster.kubeconfig
+```
+
+## Exit criteria
+
+- `capi-mgmt.maas` is Deployed in MAAS with k3s + CAPI controllers + ORC running
+- `capi-mgmt-cluster` workload cluster is Available
+- Workload kubeconfig exists at `~/magnum-capi/capi-mgmt-cluster.kubeconfig`
+  on the jumphost
+- Proceed to `05-magnum-capi-driver.md`
+
+## Recurring pitfalls (apply to execution)
+
+- `juju ssh` HANGS when stdout is redirected — use `juju exec --unit X -- 'cmd'`
+- MAAS-deployed Ubuntu uses `ubuntu` user, not `jessea123`
+- k3s `--bind-address=X` doesn't bind 127.0.0.1 — kubeconfig server URL must be sed-rewritten
+- Snap-confined openstack CLI cannot read `/tmp` — paths under `$HOME` only
+- `openstack -f value -c X -c Y` outputs in alphabetical column order — use single-column queries
+- GitHub API rate limit is 60 unauthenticated requests/hour — cache results, don't refetch on every run
+- `.maas` DNS may not resolve from jumphost — use IPs directly
diff --git a/runbooks/05-magnum-capi-driver.md b/runbooks/05-magnum-capi-driver.md
index cbfe158..ce24090 100644
--- a/runbooks/05-magnum-capi-driver.md
+++ b/runbooks/05-magnum-capi-driver.md
@@ -1,35 +1,198 @@
 # Runbook 05 — Magnum CAPI Helm Driver Graft

-**STATUS: PLACEHOLDER** — drafted post-deploy. Per D-007 Layer B.
+**Reference:** D-007 Layer B (rescoped per D-017). Runs after `04a-capi-bootstrap-cluster.md`.

-## Purpose
+**Purpose:** Install the `stackhpc/magnum-capi-helm` driver into the Magnum
+charm's Python environment, configure Magnum to load and use it, and verify
+end-to-end cluster creation via the driver against the bootstrap k3s
+management cluster on `capi-mgmt.maas`.

-Install the stackhpc/magnum-capi-helm driver into the magnum charm venv,
-configure Magnum to use it, and verify cluster-template creation succeeds.
+**Prerequisites:**

-## Prerequisites
+- Runbook 04 complete (Magnum trustee domain created)
+- Runbook 04a complete (capi-mgmt bootstrap k3s + CAPI controllers + ORC
+  running; workload cluster Available; kubeconfig at
+  `~/magnum-capi/capi-mgmt-cluster.kubeconfig` on jumphost)
+- Authenticated Juju session active

-- Runbook 04 complete (magnum domain setup done)
-- capi-mgmt.maas k3s cluster healthy (CAPI/CAPO/cert-manager/ORC pods Running)
-- Per D-007 Layer B, kubeconfig from capi-mgmt.maas accessible
+**Key constraint:** Charm-magnum's systemd units invoke
+`/etc/init.d/magnum-{api,conductor} systemd-start` (SysV-wrapped). Drop-in
+config dirs are NOT consumed by the init.d script as shipped. The Phase 4
+graft (Step 6 below) must REPLACE the systemd ExecStart entirely with a
+wrapper that adds `--config-dir /etc/magnum/magnum.conf.d/`. This pattern was
+validated on Bobcat and is expected to persist on Caracal — verify with
+`juju exec` at the start of execution.

-## TODO
+## Step 1 — Investigation block (D-017 rehearsal)

-- [ ] `juju ssh magnum/leader` and `pip install --break-system-packages \
-  "git+https://github.com/stackhpc/magnum-capi-helm@v0.13.0"`
-  into the charm venv
-- [ ] Place `/etc/magnum/kubeconfig` pointing at capi-mgmt.maas bootstrap k3s
-- [ ] Systemd override for magnum services to load `--config-dir /etc/magnum/magnum.conf.d/`
-- [ ] Create `/etc/magnum/magnum.conf.d/99-capi.conf`:
-  ```
-  [DEFAULT]
-  enabled_drivers = k8s_capi_helm_v1
+Before any grafting, inspect the live charm state. The init.d/systemd
+wrapping shape may have shifted between Bobcat and Caracal:

-  [capi_helm]
-  kubeconfig_file = /etc/magnum/kubeconfig
-  ```
-- [ ] Restart magnum-api and magnum-conductor
-- [ ] Verify driver loaded: `openstack coe cluster template list`
-  should show capi_helm_v1 driver option available
-- [ ] Smoke test: create a test cluster template + 1-node cluster;
-  verify it reaches CREATE_COMPLETE
+```bash
+juju exec --unit magnum/leader -- 'cat /lib/systemd/system/magnum-api.service'
+juju exec --unit magnum/leader -- 'cat /lib/systemd/system/magnum-conductor.service'
+juju exec --unit magnum/leader -- 'ls /etc/init.d/ | grep magnum'
+juju exec --unit magnum/leader -- 'cat /etc/init.d/magnum-api 2>/dev/null | head -40'
+juju exec --unit magnum/leader -- 'ls /etc/default/magnum-* 2>/dev/null'
+juju exec --unit magnum/leader -- 'python3 -c "import magnum; print(magnum.__file__)"'
+```
+
+Record results in execution notes. The Python import path tells us where
+to pip-install the driver (Bobcat: `/usr/lib/python3/dist-packages/magnum/`).
+
+## Step 2 — Pre-flight: confirm kubeconfig reachability
+
+The Magnum charm unit must be able to reach the k3s API on
+`capi-mgmt.maas:6443`. The charm runs in an LXD container on the metal
+network; reach is expected via direct L2.
+
+```bash
+juju exec --unit magnum/leader -- "curl -sk --max-time 5 https://$(awk '/server:/ {print $2}' ~/magnum-capi/capi-mgmt-cluster.kubeconfig | head -1 | sed 's|https://||')/healthz"
+# Expect: "ok"
+```
+
+## Step 3 — Install the driver into the charm Python environment
+
+```bash
+juju ssh magnum/leader -- "sudo pip install --break-system-packages \
+  'git+https://github.com/stackhpc/magnum-capi-helm@v0.13.0'"
+
+# Verify
+juju exec --unit magnum/leader -- 'python3 -c "import magnum_capi_helm; print(magnum_capi_helm.__file__)"'
+```
+
+Pin to a specific tag rather than `main` — the driver should not move
+under our feet between deploys. Version `v0.13.0` was validated on Bobcat;
+verify it remains the chosen tag at Caracal execution time.
+
+## Step 4 — Deploy the kubeconfig to the charm unit
+
+```bash
+# Copy from jumphost to magnum/leader
+juju scp ~/magnum-capi/capi-mgmt-cluster.kubeconfig magnum/leader:/tmp/capi-kubeconfig
+juju ssh magnum/leader -- "sudo install -o root -g magnum -m 0640 /tmp/capi-kubeconfig /etc/magnum/kubeconfig && sudo rm /tmp/capi-kubeconfig"
+juju ssh magnum/leader -- "ls -la /etc/magnum/kubeconfig"
+```
+
+## Step 5 — Configure Magnum to use the CAPI Helm driver
+
+Create the conf.d directory and drop-in:
+
+```bash
+juju ssh magnum/leader -- "sudo mkdir -p /etc/magnum/magnum.conf.d && sudo chown root:magnum /etc/magnum/magnum.conf.d && sudo chmod 0750 /etc/magnum/magnum.conf.d"
+
+juju ssh magnum/leader -- "sudo tee /etc/magnum/magnum.conf.d/99-capi.conf > /dev/null" << 'EOC'
+[DEFAULT]
+enabled_drivers = k8s_capi_helm_v1
+
+[capi_helm]
+kubeconfig_file = /etc/magnum/kubeconfig
+EOC
+
+juju ssh magnum/leader -- "sudo chown root:magnum /etc/magnum/magnum.conf.d/99-capi.conf && sudo chmod 0640 /etc/magnum/magnum.conf.d/99-capi.conf"
+```
+
+## Step 6 — Install the systemd ExecStart override
+
+Because the charm's systemd units invoke an init.d wrapper that does NOT
+honor `--config-dir`, the override must replace the ExecStart entirely
+with a wrapper that invokes the Magnum binaries directly with both the
+default config file and our config dir.
+
+```bash
+juju ssh magnum/leader -- "sudo mkdir -p /etc/systemd/system/magnum-api.service.d"
+juju ssh magnum/leader -- "sudo tee /etc/systemd/system/magnum-api.service.d/override.conf > /dev/null" << 'EOC'
+[Service]
+ExecStart=
+ExecStart=/usr/bin/magnum-api --config-file /etc/magnum/magnum.conf --config-dir /etc/magnum/magnum.conf.d
+EOC
+
+juju ssh magnum/leader -- "sudo mkdir -p /etc/systemd/system/magnum-conductor.service.d"
+juju ssh magnum/leader -- "sudo tee /etc/systemd/system/magnum-conductor.service.d/override.conf > /dev/null" << 'EOC'
+[Service]
+ExecStart=
+ExecStart=/usr/bin/magnum-conductor --config-file /etc/magnum/magnum.conf --config-dir /etc/magnum/magnum.conf.d
+EOC
+
+juju ssh magnum/leader -- "sudo systemctl daemon-reload"
+juju ssh magnum/leader -- "sudo systemctl restart magnum-api magnum-conductor"
+juju ssh magnum/leader -- "sudo systemctl status magnum-api magnum-conductor --no-pager"
+```
+
+Verify the override took effect:
+
+```bash
+juju ssh magnum/leader -- "sudo systemctl cat magnum-api | grep ExecStart"
+juju ssh magnum/leader -- "ps -ef | grep magnum-api | grep -v grep"
+# Expect: /usr/bin/magnum-api with --config-dir flag
+```
+
+## Step 7 — Verify driver loaded
+
+```bash
+juju ssh magnum/leader -- "sudo tail -100 /var/log/magnum/magnum-conductor.log | grep -i -E 'driver|capi'"
+# Expect: log lines mentioning k8s_capi_helm_v1 driver loaded
+```
+
+## Step 8 — Smoke test
+
+Create a cluster template and small cluster to validate end-to-end:
+
+```bash
+source ~/admin-openrc
+
+# Cluster template
+# --external-network needs a value: set EXTERNAL_NETWORK to the external
+# network name or ID before running.
+openstack coe cluster template create \
+  --name k8s-capi-test \
+  --image noble-amd64 \
+  --keypair capi-mgmt-key \
+  --external-network "$EXTERNAL_NETWORK" \
+  --master-flavor m1.medium \
+  --flavor m1.medium \
+  --coe kubernetes \
+  --network-driver calico \
+  --labels driver=k8s_capi_helm_v1,kube_tag=v1.32.2
+
+# Cluster
+openstack coe cluster create \
+  --cluster-template k8s-capi-test \
+  --master-count 1 \
+  --node-count 1 \
+  --keypair
capi-mgmt-key \
+  k8s-capi-smoke
+
+# Poll
+watch -n 30 'openstack coe cluster show k8s-capi-smoke -c status -c status_reason'
+# Expect CREATE_COMPLETE within 15-20 min
+```
+
+Tear down the smoke cluster after validation:
+
+```bash
+openstack coe cluster delete k8s-capi-smoke
+# Wait for DELETE_COMPLETE
+openstack coe cluster template delete k8s-capi-test
+```
+
+## Exit criteria
+
+- Magnum services running with `--config-dir /etc/magnum/magnum.conf.d`
+  visible in the live process
+- `k8s_capi_helm_v1` driver logged at conductor startup
+- Smoke-test cluster reached `CREATE_COMPLETE` and was torn down cleanly
+
+## Idempotency and recovery notes
+
+- The systemd override survives `charm config-changed` (charm rewrites
+  `magnum.conf` but doesn't touch the conf.d dir or systemd drop-ins)
+- The pip-installed driver may NOT survive a charm `upgrade-charm`; if
+  the Python environment gets rebuilt, re-run Step 3
+- The kubeconfig at `/etc/magnum/kubeconfig` is operator-managed; it
+  survives charm hooks, but must be restored if Magnum is redeployed
+
+## Recurring pitfalls
+
+- `juju ssh` HANGS when stdout is redirected; use `juju exec --unit X -- 'cmd'` instead
+- Magnum's Python package lives in the system site-packages (`/usr/lib/python3/dist-packages/magnum/`), so `pip install` needs `--break-system-packages` to get past PEP 668
+- Heredoc nesting in `juju ssh` is fragile; keep heredocs simple, single level
+- Non-ASCII characters in conf.d files cause silent daemon failures; ensure ASCII only
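
The ASCII-only pitfall can be checked mechanically before a drop-in is pushed to the unit. The following is a minimal sketch, assuming GNU or BSD `grep` and assuming the drop-in content has first been saved to a local file. The function name `check_ascii` is hypothetical and not part of this runbook.

```bash
#!/usr/bin/env bash
# check_ascii FILE...  (hypothetical helper, not part of the runbook)
# Returns 0 if every file contains only printable ASCII and whitespace,
# 1 otherwise. Offending lines are printed with their line numbers.
check_ascii() {
  local rc=0 f
  for f in "$@"; do
    # In the C locale, bytes >= 0x80 are neither printable nor whitespace,
    # so this bracket expression matches any non-ASCII byte (and any
    # non-whitespace control character).
    if LC_ALL=C grep -n '[^[:print:][:space:]]' "$f"; then
      echo "non-ASCII or control bytes found in $f" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

One possible use is `check_ascii ./99-capi.conf && echo "ASCII only"` on the jumphost, where `./99-capi.conf` is an assumed local copy of the Step 5 drop-in. A non-zero exit means the file should be cleaned up before it is written to `/etc/magnum/magnum.conf.d/`.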