diff --git a/runbooks/v1-ops-capi-recovery-procedure-20260610.md b/runbooks/v1-ops-capi-recovery-procedure-20260610.md new file mode 100644 index 0000000..01e6886 --- /dev/null +++ b/runbooks/v1-ops-capi-recovery-procedure-20260610.md @@ -0,0 +1,241 @@ +# v1 ops -- CAPI/Magnum stack recovery procedure (parking, restart, LB repair) + +Status: blocks below are AS-EXECUTED-VERIFIED 2026-06-10 (this is their first +formal consolidation). Destination: runbooks/ as an ops companion to the +phase-NN deploy runbook, cross-referenced from appendix-A and from +OpenStack_Test_Deployment-restart-procedure.md. + +Applies when: capi-mgmt-v2 has been stopped (parking, host event, OOM) and the +CAPI/Magnum stack must be returned to service. ORDER MATTERS: repair from the +bottom up (VM -> k8s -> CAPI controllers -> Octavia LB -> CAPO conditions -> +Magnum health). Everything upstream stays red until the layer below is green. + +Scope-hygiene preambles are the canonical ones from the 2026-06-09 as-executed +log. ENV literals: project capi-mgmt 674171fd28d446d3a37073b6a761e910; mgmt FIP +10.12.7.40; kube-api LB 0f968008-...; regenerate per site on rebuild. + +--- + +## 0. Expectations table (read FIRST; saves an hour of false alarms) + +| Observation | Meaning | +|---|---| +| Magnum UNHEALTHY, reason EMPTY | Conductor cannot reach the mgmt API (VM down / booting). Not D-042. | +| Magnum UNHEALTHY, reason populated, all components 'Ready', infrastructure 'Infrastructure resource not found.' | D-042 cosmetic false-negative. Known good. | +| Horizon Container Infra 504 right after mgmt VM start | Conductor stalled mid-reconnect; nginx proxy timeout. Retry after Step 3. | +| k8sd control.socket deadline / apiserver TLS handshake timeout / mount failures during first ~20 min after boot | Cold-start convergence noise on gp.mid (2 vCPU). Judge by load trend + `k8s status`, not by these. 
| +| Cluster Available=False with InfrastructureReady LB-timeout message after a cold start | CAPO reconcile raced the storm. Check the LB (Step 4) BEFORE blaming CAPI. | +| LB provisioning ERROR, operating ONLINE | Control-plane op failed; dataplane fine. Needs admin failover (Step 5). No urgency. | +| openstack server list empty in Horizon/CLI | Wrong project scope. CAPI VMs live in capi-mgmt. | +| juju ssh: "cannot get discharge ... EOF" | Stale macaroon + `/dev/null) + echo "[$i] status=$ST" + [ "$ST" = ACTIVE ] && break + sleep 10 + done + echo "=== TCP probe loop: FIP :22 (sshd lags ACTIVE by ~3 min) ===" + for i in $(seq 1 18); do + timeout 5 bash -c 'exec 3<>/dev/tcp/10.12.7.40/22' 2>/dev/null \ + && { echo "[$i] SSH-PORT-OK"; break; } || echo "[$i] not yet" + sleep 10 + done +} ) +------------------------------------------------------------------------ +END runbook block +------------------------------------------------------------------------ +``` +GATE: SSH-PORT-OK. Timing (verified, gp.mid): ACTIVE ~20 s; sshd ~3.5 min. + +## 3. k8s-snap readiness (PATIENCE GATE) + +``` +------------------------------------------------------------------------ +BEGIN runbook block: mgmt k8s readiness poll (cold-start aware) +------------------------------------------------------------------------ +( { + for i in $(seq 1 15); do + echo "--- [$i] $(date -u +%H:%M:%S) ---" + ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \ + 'uptime; sudo k8s status 2>&1 100 on 2 vCPUs. Do NOT restart services or re-bootstrap inside +this window; the Section-0 noise is expected. (On the phase-06-spec gp.large, +expect substantially faster.) + +## 4. 
CAPI stack + LB verification (read-only; decides Step 5) + +``` +------------------------------------------------------------------------ +BEGIN runbook block: post-start CAPI + LB verify +------------------------------------------------------------------------ +( { + export KUBECONFIG="$HOME/capi-mgmt.kubeconfig" + kubectl get nodes -o wide + kubectl get pods -A | grep -E 'capi-|capo-|cert-manager|orc-system|janitor|addon' + NS=magnum-674171fd28d446d3a37073b6a761e910 + kubectl -n "$NS" get cluster,openstackcluster,machines +} ) +# kubeconfig missing? Re-emit (phase-06 Step 6.5, verbatim): +# ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ +# -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \ +# "sudo k8s config server=https://10.12.7.40:6443 ~/capi-mgmt.kubeconfig +( { + source ~/admin-openrc + unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID + export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 + unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID + openstack loadbalancer list -f yaml +} ) +------------------------------------------------------------------------ +END runbook block +------------------------------------------------------------------------ +``` +DECISION: controllers Running + Machines Running + every LB provisioning=ACTIVE +-> skip to Step 6. Any LB provisioning=ERROR (operating ONLINE is typical) +-> Step 5. Cluster Available=False with an LB-timeout message -> the LB is the +cause; fix it first, the condition clears itself afterward. + +## 5. LB repair: zombie sweep, headroom, sequential failover + +5a. ZOMBIE/ORPHAN SWEEP (admin scope). Confirmed pattern, twice in one day: +failed failovers leave amphora servers with no Octavia DB row. Two variants: +ERROR server (failed spawn) and ACTIVE heartbeating zombie (health-manager logs +"missing from the DB ... An operator must manually delete it" every 10 s). 
+ +``` +------------------------------------------------------------------------ +BEGIN runbook block: amphora orphan/zombie sweep (admin scope; verify-then-delete) +------------------------------------------------------------------------ +( { + source ~/admin-openrc + unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME + echo "=== octavia's amphora inventory (the DB truth) ===" + openstack loadbalancer amphora list -f yaml + echo "=== nova's amphora servers (compare; extras are orphans) ===" + openstack server list --all-projects --long -f yaml \ + | grep -B6 -A4 'amphora-haproxy' | grep -E '^(- | (ID|Name|Status)):' +} ) +# For each server whose amphora-NAME-uuid is ABSENT from the amphora list: +# 1) re-grep the amphora list for the uuid (ABORT if present) +# 2) openstack server delete SERVER_UUID # by UUID; name lookup is project-scoped +# Each deletion frees one amphora slot (charm-octavia: 1024 MB / 1 vCPU / 8 GB). +------------------------------------------------------------------------ +END runbook block +------------------------------------------------------------------------ +``` + +5b. HEADROOM CHECK. Failover transiently needs +1 amphora placement (replacement +is built BEFORE the old one is reaped). Scheduler ceiling per host = +physical_MB * ram_allocation_ratio(1.5) - reserved_host_memory(8192, D-040). +Verify at least one host clears Used + 1024 <= ceiling: +`openstack hypervisor list --long -f yaml | grep -E 'Hostname|Memory MB'`. +If no host clears: free 1024+ MB first (zombie sweep usually suffices; else +power off a disposable VM, e.g. a backend-* test instance). DO NOT retry +failover against NoValidHost -- each attempt mints another zombie. + +5c. FAILOVER, STRICTLY SEQUENTIAL (one slot of headroom = one failover at a +time; completion of each reaps its old amphora and re-frees the slot). 
+ +``` +------------------------------------------------------------------------ +BEGIN runbook block: LB failover + poll (admin scope; v4 Arc D pattern) +------------------------------------------------------------------------ +( { + source ~/admin-openrc + unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME + LB="${LB:?set LB to the UUID of the load balancer in provisioning ERROR}" + openstack loadbalancer failover "$LB" + sleep 2 + for i in $(seq 1 60); do + prov=$(openstack loadbalancer show "$LB" -f value -c provisioning_status 2>/dev/null) + op=$( openstack loadbalancer show "$LB" -f value -c operating_status 2>/dev/null) + printf '%s prov=%s op=%s\n' "$(date +%T)" "${prov:-?}" "${op:-?}" + case "$prov" in + ACTIVE) echo "failover succeeded"; break ;; + ERROR) echo "failover FAILED -- read octavia-worker.log; do NOT retry blind"; break ;; + esac + sleep 10 + done +} ) +------------------------------------------------------------------------ +END runbook block +------------------------------------------------------------------------ +``` +Verified timing: ~108 s to ACTIVE; op holds ONLINE; VIP+FIP preserved (VIP port +is Octavia-owned). A 10-20 s fast-fail to ERROR = early-flow failure (usually +NoValidHost; see 5b). STANDALONE amphora = brief kube-api endpoint blip +mid-failover; nodes/pods unaffected. + +## 6. 
Top-of-stack verification + +``` +------------------------------------------------------------------------ +BEGIN runbook block: final verify (amphorae, CAPO condition, magnum health) +------------------------------------------------------------------------ +( { + source ~/admin-openrc + unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME + openstack loadbalancer amphora list -f yaml # all ALLOCATED + export KUBECONFIG="$HOME/capi-mgmt.kubeconfig" + NS=magnum-674171fd28d446d3a37073b6a761e910 + kubectl -n "$NS" get cluster,openstackcluster # Available=True (allow ~10 min post-failover for CAPO resync) + source ~/admin-openrc + unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID + export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 + unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID + openstack coe cluster show capi-test-1 -f value -c health_status + openstack coe cluster show capi-test-1 -f value -c health_status_reason +} ) +------------------------------------------------------------------------ +END runbook block +------------------------------------------------------------------------ +``` +SUCCESS = amphorae ALLOCATED; Cluster Available=True; Magnum reason POPULATED +with the D-042 cosmetic signature (or HEALTHY post-D-042-fix). Reload Horizon +Container Infra last. Workload check if desired: +`KUBECONFIG=~/capi-test-1-kc/config kubectl get nodes -o wide`.
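Worked sketch of the 5b headroom rule (illustrative only, not part of the as-executed log). It applies the formula exactly as 5b states it: ceiling = physical_MB * 1.5 - 8192, and a host clears when Used + 1024 <= ceiling. The function names `ceiling_mb` and `host_clears` and the sample host sizes are hypothetical; take the real physical and used values from the `openstack hypervisor list --long` output.

```shell
#!/usr/bin/env bash
# Sketch of the 5b headroom rule. Function names and sample numbers are
# illustrative assumptions; the constants are the ones stated in 5b.
RATIO_NUM=3; RATIO_DEN=2   # ram_allocation_ratio 1.5 as an integer fraction
RESERVED_MB=8192           # reserved_host_memory (D-040)
AMPHORA_MB=1024            # one amphora slot (charm-octavia flavor)

# ceiling_mb PHYSICAL_MB -> prints the scheduler ceiling for that host, in MB
ceiling_mb() {
  echo $(( $1 * RATIO_NUM / RATIO_DEN - RESERVED_MB ))
}

# host_clears PHYSICAL_MB USED_MB -> exit 0 if Used + 1024 <= ceiling
host_clears() {
  local ceiling
  ceiling=$(ceiling_mb "$1")
  [ $(( $2 + AMPHORA_MB )) -le "$ceiling" ]
}
```

Usage, with made-up numbers: `host_clears 65536 89088 && echo "headroom OK: failover may proceed"`. If no host passes, go back to the 5a sweep before attempting any failover.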