# Ops -- Full-Cloud Restart / Recovery Procedure STATUS: committed to the repo per DOCFIX-075 (previously lived only as an operator-local document + a jumphost-local health script -- unreachable by a Roosevelt team and by any fresh clone). Procedure lineage: authored and validated 2026-05-20 on the prior build; adapted 2026-07-03 to current repo references (appendix-A, cloud-assert, D-043). Stages whose mechanics depend on build state carry [REVALIDATE] markers -- clear them on the first executed restart of the current build and record the as-executed date here. Scope: PLANNED maintenance windows taking all four hosts down, or UNPLANNED recovery from full power/network loss that took MySQL or Vault offline together. For single-host partial failures, use the targeted entries in `runbooks/appendix-A-troubleshooting.md` instead. Conventions: RUN/CHECK/GATE labels per runbooks/README.md. One gated step at a time. Invoke scripts as `bash scripts/.sh` (no exec bits in the repo). --- ## Pre-shutdown (planned only) **RUN -- jumphost** -- baseline capture (reference, not a restore path) ```bash ( { juju switch | grep -q 'openstack' || { echo "FAIL: switch to the openstack model first"; exit 1; } BAK=~/openstack-baseline/$(date -u +%Y%m%d-%H%M%S)-pre-maintenance mkdir -p "$BAK" juju export-bundle --filename "$BAK/bundle.yaml" juju status --format=yaml > "$BAK/juju-status.yaml" for v in $(env | awk -F= '/^OS_/{print $1}'); do unset "$v"; done # shellcheck source=/dev/null source ~/admin-openrc openstack server list --all-projects "$BAK/servers.txt" openstack loadbalancer list "$BAK/loadbalancers.txt" openstack loadbalancer amphora list "$BAK/amphorae.txt" echo "Pre-maintenance snapshot at $BAK" } ) ``` **Guest auto-resume policy (D-043, ADOPTED):** `nova-compute resume-guests-state-on-host-boot=true` is charm config in the bundle. Running guests resume automatically on host boot, honoring per-guest pre-shutdown state -- so a guest you want to stay down for forensics must be `openstack server stop`ped BEFORE the window. NOTE the D-043 caveat: the in-cloud CAPI mgmt VM (capi-mgmt-v2) is a control-plane component riding a tenant VM; it WILL auto-resume with the rest. Its manual-start policy (D-041) now governs deliberate stops only. ## Shutdown sequence 1. (Optional) stop guests you want held down post-maintenance. 2. Power down hosts via MAAS/IPMI/virsh. Order not significant among openstack0-3 (functional peers). ## Power-on / recovery ### Stage 1 -- foundation **GATE:** every juju machine agent reports `started` (typically 2-5 min). ```bash juju machines --format=json | jq -r '.machines | to_entries[] | "\(.key) agent=\(.value."juju-status".current)"' | grep -v 'agent=started' \ || echo "all machines started" ``` ### Stage 2 -- MySQL InnoDB cluster Check FIRST; act only on a confirmed outage. `reboot-cluster-from-complete-outage` is DESTRUCTIVE against an already-reformed cluster (it rewrites topology from the unit it runs on). See appendix-A "mysql-innodb-cluster recovery" for the signature table (D-062 material). **CHECK (read-only) -- jumphost** ```bash juju status mysql-innodb-cluster --format=json | jq -r \ '.applications."mysql-innodb-cluster".units | to_entries[] | "\(.key) \(.value."workload-status".message // "")"' ``` **GATE:** three units `Cluster is ONLINE`, exactly one `Mode: R/W`. If ALL units report a complete outage (and only then): ```bash juju run mysql-innodb-cluster/leader reboot-cluster-from-complete-outage ``` ### Stage 3 -- Vault unseal (sealed-after-restart is BY DESIGN) Vault stores in MySQL and always restarts sealed; manual 3-of-5 unseal is the v1 standard (D-011 amendment). Keys are NOT in version control -- custody per D-069 / docs/security-ledger.md. Missing keys = recovery BLOCKER: escalate, never improvise (re-init wipes the TLS plane). **RUN -- jumphost** (hidden prompts; keys never in argv/history) ```bash ( { for i in 1 2 3; do read -s -p "Unseal key $i: " K; echo juju ssh vault/0 -- "VAULT_ADDR=http://127.0.0.1:8200 vault operator unseal $K" /dev/null unset K done juju ssh vault/0 -- 'VAULT_ADDR=http://127.0.0.1:8200 vault status 2>&1 | grep -E "Sealed|Initialized"' `; if MANY are down, verify `juju config nova-compute resume-guests-state-on-host-boot` (a regressed flag takes effect only on the NEXT boot -- start manually now). ### Stage 6 -- Octavia LBs Expect auto-recovery once amphora VMs resume. Any LB stuck `provisioning_status=ERROR` after ~5 min: `openstack loadbalancer failover ` and poll to ACTIVE (VIP + FIP are preserved). [REVALIDATE: timing budget on current build] ### Stage 7 -- end-to-end behavioral gate ```bash source ~/admin-openrc bash scripts/cloud-assert.sh ``` **GATE:** `CLOUD-ASSERT: PASS`. Any FAIL routes to appendix-A by its message. This replaces the old jumphost-local `post-maintenance-health-check.sh`. --- ## Quick reference (symptom -> fix) | Symptom | Fix | |---|---| | Units in `error` (stuck relation-broken hook) | `juju resolved --no-retry ` | | MySQL will not reform | Stage 2 -- action ONLY on confirmed complete outage | | Vault sealed | Expected; Stage 3 manual unseal | | Chassis `not connected` | Stage 4 TLS sweep | | Hypervisor `down`, juju green | restart nova-compute on the host; recheck in 60s | | LB `provisioning_status=ERROR` | `loadbalancer failover`; appendix-A LB entries | | Horizon login cookie failure | restore `_99_internal_http_cookies.py` (D-044) | ## Open question (carried for Roosevelt) Amphora cert validity across long outages: a resumed amphora keeps pre-shutdown cert material; if the amphora-control-plane CA rotated during the window it will fail to health-manager and be failed-over. Theoretical under default TTLs for short outages; validate explicitly on bare metal (rotate CA, hard-stop an amphora host, observe resume vs ERROR->failover) before relying on resume.