Newer
Older
openstack-caracal-ipv4 / runbooks / ops-restart-procedure.md
@JANeumatrix JANeumatrix 12 hours ago 6 KB Patches

Ops -- Full-Cloud Restart / Recovery Procedure

STATUS: committed to the repo per DOCFIX-075 (previously lived only as an operator-local document + a jumphost-local health script -- unreachable by a Roosevelt team and by any fresh clone). Procedure lineage: authored and validated 2026-05-20 on the prior build; adapted 2026-07-03 to current repo references (appendix-A, cloud-assert, D-043). Stages whose mechanics depend on build state carry [REVALIDATE] markers -- clear them on the first executed restart of the current build and record the as-executed date here.

Scope: PLANNED maintenance windows taking all four hosts down, or UNPLANNED recovery from full power/network loss that took MySQL or Vault offline together. For single-host partial failures, use the targeted entries in runbooks/appendix-A-troubleshooting.md instead.

Conventions: RUN/CHECK/GATE labels per runbooks/README.md. One gated step at a time. Invoke scripts as bash scripts/<name>.sh (no exec bits in the repo).


Pre-shutdown (planned only)

RUN -- jumphost -- baseline capture (reference, not a restore path)

( {
  juju switch | grep -q 'openstack' || { echo "FAIL: switch to the openstack model first"; exit 1; }
  BAK=~/openstack-baseline/$(date -u +%Y%m%d-%H%M%S)-pre-maintenance
  mkdir -p "$BAK"
  juju export-bundle --filename "$BAK/bundle.yaml"
  juju status --format=yaml > "$BAK/juju-status.yaml"
  for v in $(env | awk -F= '/^OS_/{print $1}'); do unset "$v"; done
  # shellcheck source=/dev/null
  source ~/admin-openrc
  openstack server list --all-projects     </dev/null > "$BAK/servers.txt"
  openstack loadbalancer list              </dev/null > "$BAK/loadbalancers.txt"
  openstack loadbalancer amphora list      </dev/null > "$BAK/amphorae.txt"
  echo "Pre-maintenance snapshot at $BAK"
} )

Guest auto-resume policy (D-043, ADOPTED): nova-compute resume-guests-state-on-host-boot=true is charm config in the bundle. Running guests resume automatically on host boot, honoring per-guest pre-shutdown state -- so a guest you want to stay down for forensics must be openstack server stopped BEFORE the window. NOTE the D-043 caveat: the in-cloud CAPI mgmt VM (capi-mgmt-v2) is a control-plane component riding a tenant VM; it WILL auto-resume with the rest. Its manual-start policy (D-041) now governs deliberate stops only.

Shutdown sequence

  1. (Optional) stop guests you want held down post-maintenance.
  2. Power down hosts via MAAS/IPMI/virsh. Order not significant among openstack0-3 (functional peers).

Power-on / recovery

Stage 1 -- foundation

GATE: every juju machine agent reports started (typically 2-5 min).

juju machines --format=json | jq -r '.machines | to_entries[]
  | "\(.key)  agent=\(.value."juju-status".current)"' | grep -v 'agent=started' \
  || echo "all machines started"

Stage 2 -- MySQL InnoDB cluster

Check FIRST; act only on a confirmed outage. reboot-cluster-from-complete-outage is DESTRUCTIVE against an already-reformed cluster (it rewrites topology from the unit it runs on). See appendix-A "mysql-innodb-cluster recovery" for the signature table (D-062 material).

CHECK (read-only) -- jumphost

juju status mysql-innodb-cluster --format=json | jq -r \
  '.applications."mysql-innodb-cluster".units | to_entries[]
   | "\(.key)  \(.value."workload-status".message // "")"'

GATE: three units Cluster is ONLINE, exactly one Mode: R/W. If ALL units report a complete outage (and only then):

juju run mysql-innodb-cluster/leader reboot-cluster-from-complete-outage

Stage 3 -- Vault unseal (sealed-after-restart is BY DESIGN)

Vault stores in MySQL and always restarts sealed; manual 3-of-5 unseal is the v1 standard (D-011 amendment). Keys are NOT in version control -- custody per D-069 / docs/security-ledger.md. Missing keys = recovery BLOCKER: escalate, never improvise (re-init wipes the TLS plane).

RUN -- jumphost (hidden prompts; keys never in argv/history)

( {
  for i in 1 2 3; do
    read -s -p "Unseal key $i: " K; echo
    juju ssh vault/0 -- "VAULT_ADDR=http://127.0.0.1:8200 vault operator unseal $K" </dev/null >/dev/null
    unset K
  done
  juju ssh vault/0 -- 'VAULT_ADDR=http://127.0.0.1:8200 vault status 2>&1 | grep -E "Sealed|Initialized"' </dev/null
} )

GATE: Sealed=false.

Stage 4 -- OVN chassis

CHECK: connection-status on every chassis (cloud-assert A4 runs this). If any is not connected, run the post-vault TLS restart sweep (appendix-A; restart ovn-controller / ovs-vswitchd / neutron-server on affected units), then re-probe. [REVALIDATE: sweep unit list against current topology]

Stage 5 -- guests

With D-043 in force, pre-shutdown-running guests should be ACTIVE.

openstack server list --all-projects -c Name -c Status </dev/null

Unexpectedly SHUTOFF guests: openstack server start <name>; if MANY are down, verify juju config nova-compute resume-guests-state-on-host-boot (a regressed flag takes effect only on the NEXT boot -- start manually now).

Stage 6 -- Octavia LBs

Expect auto-recovery once amphora VMs resume. Any LB stuck provisioning_status=ERROR after ~5 min: `openstack loadbalancer failover

` and poll to ACTIVE (VIP + FIP are preserved). [REVALIDATE: timing budget on current build]

Stage 7 -- end-to-end behavioral gate

source ~/admin-openrc
bash scripts/cloud-assert.sh

GATE: CLOUD-ASSERT: PASS. Any FAIL routes to appendix-A by its message. This replaces the old jumphost-local post-maintenance-health-check.sh.


Quick reference (symptom -> fix)

Symptom Fix
Units in error (stuck relation-broken hook) juju resolved --no-retry <unit>
MySQL will not reform Stage 2 -- action ONLY on confirmed complete outage
Vault sealed Expected; Stage 3 manual unseal
Chassis not connected Stage 4 TLS sweep
Hypervisor down, juju green restart nova-compute on the host; recheck in 60s
LB provisioning_status=ERROR loadbalancer failover; appendix-A LB entries
Horizon login cookie failure restore _99_internal_http_cookies.py (D-044)

Open question (carried for Roosevelt)

Amphora cert validity across long outages: a resumed amphora keeps pre-shutdown cert material; if the amphora-control-plane CA rotated during the window it will fail to health-manager and be failed-over. Theoretical under default TTLs for short outages; validate explicitly on bare metal (rotate CA, hard-stop an amphora host, observe resume vs ERROR->failover) before relying on resume.