openstack-caracal-ipv4 / runbooks / v1-ops-capi-recovery-procedure-20260610.md

v1 ops -- CAPI/Magnum stack recovery procedure (parking, restart, LB repair)

Status: blocks below are AS-EXECUTED-VERIFIED 2026-06-10 (this is their first formal consolidation). Destination: runbooks/ as an ops companion to the phase-NN deploy runbook, cross-referenced from appendix-A and from OpenStack_Test_Deployment-restart-procedure.md.

Applies when: capi-mgmt-v2 has been stopped (parking, host event, OOM) and the CAPI/Magnum stack must be returned to service. ORDER MATTERS: repair from the bottom up (VM -> k8s -> CAPI controllers -> Octavia LB -> CAPO conditions -> Magnum health). Everything upstream stays red until the layer below is green.

Scope-hygiene preambles are the canonical ones from the 2026-06-09 as-executed log. ENV literals: project capi-mgmt 674171fd28d446d3a37073b6a761e910; mgmt FIP 10.12.7.40; kube-api LB 0f968008-...; regenerate per site on rebuild.


0. Expectations table (read FIRST; saves an hour of false alarms)

| Observation | Meaning |
|---|---|
| Magnum UNHEALTHY, reason EMPTY | Conductor cannot reach the mgmt API (VM down / booting). Not D-042. |
| Magnum UNHEALTHY, reason populated, all components 'Ready', infrastructure 'Infrastructure resource not found.' | D-042 cosmetic false-negative. Known good. |
| Horizon Container Infra 504 right after mgmt VM start | Conductor stalled mid-reconnect; nginx proxy timeout. Retry after Step 3. |
| k8sd control.socket deadline / apiserver TLS handshake timeout / mount failures during first ~20 min after boot | Cold-start convergence noise on gp.mid (2 vCPU). Judge by load trend + k8s status, not by these. |
| Cluster Available=False with InfrastructureReady LB-timeout message after a cold start | CAPO reconcile raced the storm. Check the LB (Step 4) BEFORE blaming CAPI. |
| LB provisioning ERROR, operating ONLINE | Control-plane op failed; dataplane fine. Needs admin failover (Step 5). No urgency. |
| openstack server list empty in Horizon/CLI | Wrong project scope. CAPI VMs live in capi-mgmt. |
| juju ssh: "cannot get discharge ... EOF" | Stale macaroon + </dev/null ate the password prompt. Use </dev/tty or re-login. NOT a controller outage if juju status works interactively. |
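
The two Magnum rows above are the ones most often confused, so a minimal sketch of the triage may help. It assumes the two values come from the `openstack coe cluster show ... -c health_status` and `-c health_status_reason` calls used in Step 6. `classify_magnum_health` is a hypothetical helper name, and treating `{}` or `None` as an empty reason is an assumption about how the CLI prints it. The sketch checks only the infrastructure message; confirming that every component reads 'Ready' stays with the operator.

```shell
# Sketch only: classify_magnum_health is a hypothetical helper, not a Magnum tool.
# $1 = health_status, $2 = health_status_reason (as printed by -f value).
classify_magnum_health() {
  status="$1"; reason="$2"
  # Assumption: an empty reason may print as nothing, {} or None.
  case "$reason" in ''|'{}'|None) reason='' ;; esac
  case "$status" in
    HEALTHY)   echo healthy; return ;;
    UNHEALTHY) ;;                          # fall through to the reason check
    *)         echo investigate; return ;;
  esac
  case "$reason" in
    '') echo mgmt-api-unreachable ;;       # VM down or still booting; not D-042
    *'Infrastructure resource not found'*) echo d042-cosmetic ;;  # known good
    *)  echo investigate ;;
  esac
}
```

Anything that prints `investigate` falls outside the table and should be read by hand.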

1. Parking (deliberate stop) -- forward procedure

------------------------------------------------------------------------
BEGIN runbook block: capi-mgmt parking (pre-maintenance / pre-teardown)
------------------------------------------------------------------------
# capi-mgmt scope
source ~/admin-openrc
unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID
export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910
unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID
openstack server stop capi-mgmt-v2
# NOTE: Nova ACPI stop does NOT produce a clean guest shutdown on this VM
# (no wtmp shutdown entry; verified 2026-06-10). Accepted for this VM class.
# If filing jumphost secrets, record the destination IN THIS LOG, e.g.:
#   ~/sweep-YYYYMMDD/secrets/{capi-mgmt.kubeconfig, capi-test-1-kc/config}
# EXPECT while parked: Magnum UNHEALTHY with EMPTY reason; Container Infra
# panel may 504; workload cluster keeps running (no runtime dependency).
------------------------------------------------------------------------
END runbook block
------------------------------------------------------------------------

2. Start + boot gate

------------------------------------------------------------------------
BEGIN runbook block: capi-mgmt-v2 start + ssh-port gate (D-041 manual start)
------------------------------------------------------------------------
( {
  source ~/admin-openrc
  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID
  export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910
  unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID
  openstack server start capi-mgmt-v2
  for i in $(seq 1 20); do
    ST=$(openstack server show capi-mgmt-v2 -f value -c status 2>/dev/null)
    echo "[$i] status=$ST"
    [ "$ST" = ACTIVE ] && break
    sleep 10
  done
  echo "=== TCP probe loop: FIP :22 (sshd lags ACTIVE by ~3 min) ==="
  for i in $(seq 1 18); do
    timeout 5 bash -c 'exec 3<>/dev/tcp/10.12.7.40/22' 2>/dev/null \
      && { echo "[$i] SSH-PORT-OK"; break; } || echo "[$i] not yet"
    sleep 10
  done
} )
------------------------------------------------------------------------
END runbook block
------------------------------------------------------------------------

GATE: SSH-PORT-OK. Timing (verified, gp.mid): ACTIVE ~20 s; sshd ~3.5 min.

3. k8s-snap readiness (PATIENCE GATE)

------------------------------------------------------------------------
BEGIN runbook block: mgmt k8s readiness poll (cold-start aware)
------------------------------------------------------------------------
( {
  for i in $(seq 1 15); do
    echo "--- [$i] $(date -u +%H:%M:%S) ---"
    ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \
        -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \
        'uptime; sudo k8s status 2>&1 </dev/null | head -4'
    sleep 120
  done
} )
------------------------------------------------------------------------
END runbook block
------------------------------------------------------------------------

GATE: cluster status: ready. Verified convergence on gp.mid: ~20-21 min from boot, load peak >100 on 2 vCPUs. Do NOT restart services or re-bootstrap inside this window; the Section-0 noise is expected. (On the phase-06-spec gp.large, expect substantially faster.)
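
The gate can be tested mechanically instead of by eye. The sketch below assumes `k8s status` prints a line of the form `cluster status: ready` (with variable spacing), which matches the gate text above; `k8s_is_ready` is a hypothetical helper name, and the exact output format should be confirmed on the VM before relying on it.

```shell
# Sketch only: k8s_is_ready is a hypothetical helper.
# Reads `k8s status` output on stdin; exits 0 only if a line reads
# "cluster status: ready" (any amount of whitespace after the colon).
# Cold-start noise (socket deadlines, TLS timeouts) exits non-zero.
k8s_is_ready() {
  grep -Eq '^cluster status:[[:space:]]+ready[[:space:]]*$'
}
```

One way to use it is to pipe the remote `sudo k8s status 2>&1 </dev/null` output into `k8s_is_ready && break` inside the poll loop, so the loop stops at the gate rather than running all 15 iterations.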

4. CAPI stack + LB verification (read-only; decides Step 5)

------------------------------------------------------------------------
BEGIN runbook block: post-start CAPI + LB verify
------------------------------------------------------------------------
( {
  export KUBECONFIG="$HOME/capi-mgmt.kubeconfig"
  kubectl get nodes -o wide
  kubectl get pods -A | grep -E 'capi-|capo-|cert-manager|orc-system|janitor|addon'
  NS=magnum-674171fd28d446d3a37073b6a761e910
  kubectl -n "$NS" get cluster,openstackcluster,machines
} )
# kubeconfig missing? Re-emit (phase-06 Step 6.5, verbatim):
#   ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \
#       -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \
#       "sudo k8s config server=https://10.12.7.40:6443 </dev/null" > ~/capi-mgmt.kubeconfig
( {
  source ~/admin-openrc
  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID
  export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910
  unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID
  openstack loadbalancer list -f yaml
} )
------------------------------------------------------------------------
END runbook block
------------------------------------------------------------------------

DECISION: controllers Running + Machines Running + every LB provisioning=ACTIVE -> skip to Step 6. Any LB provisioning=ERROR (operating ONLINE is typical) -> Step 5. Cluster Available=False with an LB-timeout message -> the LB is the cause; fix it first, the condition clears itself afterward.
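
The LB half of the decision reduces to one filter. This is a minimal sketch, assuming input lines of the form `<lb-id> <provisioning_status>`; `lbs_needing_failover` is a hypothetical helper name. It does not check controllers or Machines, which still need the kubectl output above.

```shell
# Sketch only: lbs_needing_failover is a hypothetical helper.
# stdin: one "<lb-id> <provisioning_status>" pair per line.
# stdout: IDs of load balancers whose provisioning status is ERROR.
lbs_needing_failover() {
  awk '$2 == "ERROR" { print $1 }'
}
```

It can be fed with `openstack loadbalancer list -f value -c id -c provisioning_status` in capi-mgmt scope. Empty output means the LB check passes; each printed ID is a candidate for Step 5, one at a time.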

5. LB repair: zombie sweep, headroom, sequential failover

5a. ZOMBIE/ORPHAN SWEEP (admin scope). Confirmed pattern, twice in one day: failed failovers leave amphora servers with no Octavia DB row. Two variants: ERROR server (failed spawn) and ACTIVE heartbeating zombie (health-manager logs "missing from the DB ... An operator must manually delete it" every 10 s).

------------------------------------------------------------------------
BEGIN runbook block: amphora orphan/zombie sweep (admin scope; verify-then-delete)
------------------------------------------------------------------------
( {
  source ~/admin-openrc
  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME
  echo "=== octavia's amphora inventory (the DB truth) ==="
  openstack loadbalancer amphora list -f yaml
  echo "=== nova's amphora servers (compare; extras are orphans) ==="
  openstack server list --all-projects --long -f yaml \
    | grep -B6 -A4 'amphora-haproxy' | grep -E '^(- |  (ID|Name|Status)):'
} )
# For each server whose amphora-NAME-uuid is ABSENT from the amphora list:
#   1) re-grep the amphora list for the uuid (ABORT if present)
#   2) openstack server delete <SERVER-UUID>   # by UUID; name lookup is project-scoped
# Each deletion frees one amphora slot (charm-octavia: 1024 MB / 1 vCPU / 8 GB).
------------------------------------------------------------------------
END runbook block
------------------------------------------------------------------------
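
The comparison in the sweep is a set difference, sketched below for review purposes only (it prints candidates and deletes nothing). It assumes each amphora server is named `amphora-<amphora-id>`, and that the two inputs were saved to files beforehand; `find_orphan_amphora_servers` and both file layouts are hypothetical. Step 1 of the procedure above (re-grep, ABORT if present) still applies to every candidate.

```shell
# Sketch only: find_orphan_amphora_servers is a hypothetical helper.
# $1: file with one Octavia amphora ID per line (the DB truth).
# $2: file with "<server-uuid> <server-name>" per line (Nova's view).
# Prints servers named amphora-<id> whose <id> is absent from $1.
find_orphan_amphora_servers() {
  awk 'FILENAME == ARGV[1] { known[$1] = 1; next }
       $2 ~ /^amphora-/ {
         id = substr($2, 9)              # strip the 8-char "amphora-" prefix
         if (!(id in known)) print $1, $2
       }' "$1" "$2"
}
```

The inputs could be produced with `openstack loadbalancer amphora list -f value -c id` and `openstack server list --all-projects -f value -c ID -c Name`, both in admin scope.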

5b. HEADROOM CHECK. Failover transiently needs +1 amphora placement (the replacement is built BEFORE the old one is reaped). Scheduler ceiling per host = physical_MB * ram_allocation_ratio (1.5) - reserved_host_memory (8192, D-040). Verify that at least one host satisfies Used + 1024 <= ceiling: openstack hypervisor list --long -f yaml | grep -E 'Hostname|Memory MB'. If no host clears: free 1024+ MB first (the zombie sweep usually suffices; otherwise power off a disposable VM, e.g. a backend-test instance). DO NOT retry failover against NoValidHost -- each attempt mints another zombie.
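
A worked version of the 5b arithmetic, as a sketch. It reads the ceiling as physical MB times the allocation ratio, minus the reserved memory, and takes 1024 MB as the amphora size from the sweep notes. `host_ceiling_mb` and `has_amphora_headroom` are hypothetical helper names, and the 65536 MB host in the usage note is an illustrative figure, not a measured value from this site.

```shell
# Sketch only: both helpers are hypothetical.
# host_ceiling_mb <physical_MB> <ram_allocation_ratio> <reserved_MB>
# Prints the per-host ceiling in MB (awk handles the fractional ratio).
host_ceiling_mb() {
  awk -v p="$1" -v r="$2" -v s="$3" 'BEGIN { printf "%d\n", p * r - s }'
}

# has_amphora_headroom <used_MB> <ceiling_MB>
# Exits 0 if one more 1024 MB amphora fits under the ceiling.
has_amphora_headroom() {
  [ $(( $1 + 1024 )) -le "$2" ]
}
```

For a hypothetical 65536 MB host, `host_ceiling_mb 65536 1.5 8192` gives 90112. With 89000 MB used the check passes (90024 <= 90112); with 89500 MB used it fails, and a failover attempt would be expected to hit NoValidHost on that host.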

5c. FAILOVER, STRICTLY SEQUENTIAL (one slot of headroom = one failover at a time; completion of each reaps its old amphora and re-frees the slot).

------------------------------------------------------------------------
BEGIN runbook block: LB failover + poll (admin scope; v4 Arc D pattern)
------------------------------------------------------------------------
( {
  source ~/admin-openrc
  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME
  LB=<LB-ID>   # placeholder: substitute the ID of the LB in provisioning ERROR (from Step 4)
  openstack loadbalancer failover "$LB"
  sleep 2
  for i in $(seq 1 60); do
    prov=$(openstack loadbalancer show "$LB" -f value -c provisioning_status 2>/dev/null)
    op=$(  openstack loadbalancer show "$LB" -f value -c operating_status    2>/dev/null)
    printf '%s  prov=%s  op=%s\n' "$(date +%T)" "${prov:-?}" "${op:-?}"
    case "$prov" in
      ACTIVE) echo "failover succeeded"; break ;;
      ERROR)  echo "failover FAILED -- read octavia-worker.log; do NOT retry blind"; break ;;
    esac
    sleep 10
  done
} )
------------------------------------------------------------------------
END runbook block
------------------------------------------------------------------------

Verified timing: ~108 s to ACTIVE; op holds ONLINE; VIP+FIP preserved (VIP port is Octavia-owned). A 10-20 s fast-fail to ERROR = early-flow failure (usually NoValidHost; see 5b). STANDALONE amphora = brief kube-api endpoint blip mid-failover; nodes/pods unaffected.

6. Top-of-stack verification

------------------------------------------------------------------------
BEGIN runbook block: final verify (amphorae, CAPO condition, magnum health)
------------------------------------------------------------------------
( {
  source ~/admin-openrc
  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME
  openstack loadbalancer amphora list -f yaml          # all ALLOCATED
  export KUBECONFIG="$HOME/capi-mgmt.kubeconfig"
  NS=magnum-674171fd28d446d3a37073b6a761e910
  kubectl -n "$NS" get cluster,openstackcluster        # Available=True (allow ~10 min post-failover for CAPO resync)
  source ~/admin-openrc
  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID
  export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910
  unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID
  openstack coe cluster show capi-test-1 -f value -c health_status
  openstack coe cluster show capi-test-1 -f value -c health_status_reason
} )
------------------------------------------------------------------------
END runbook block
------------------------------------------------------------------------

SUCCESS = amphorae ALLOCATED; Cluster Available=True; Magnum reason POPULATED with the D-042 cosmetic signature (or HEALTHY post-D-042-fix). Reload Horizon Container Infra last. Workload check if desired: KUBECONFIG=~/capi-test-1-kc/config kubectl get nodes -o wide.