STATUS: committed to the repo per DOCFIX-075 (previously lived only as an operator-local document + a jumphost-local health script -- unreachable by a Roosevelt team and by any fresh clone). Procedure lineage: authored and validated 2026-05-20 on the prior build; adapted 2026-07-03 to current repo references (appendix-A, cloud-assert, D-043). Stages whose mechanics depend on build state carry [REVALIDATE] markers -- clear them on the first executed restart of the current build and record the as-executed date here.
Scope: PLANNED maintenance windows taking all four hosts down, or UNPLANNED recovery from full power/network loss that took MySQL or Vault offline together. For single-host partial failures, use the targeted entries in runbooks/appendix-A-troubleshooting.md instead.
Conventions: RUN/CHECK/GATE labels per runbooks/README.md. One gated step at a time. Invoke scripts as bash scripts/<name>.sh (no exec bits in the repo).
RUN -- jumphost -- baseline capture (reference, not a restore path)
( {
juju switch | grep -q 'openstack' || { echo "FAIL: switch to the openstack model first"; exit 1; }
BAK=~/openstack-baseline/$(date -u +%Y%m%d-%H%M%S)-pre-maintenance
mkdir -p "$BAK"
juju export-bundle --filename "$BAK/bundle.yaml"
juju status --format=yaml > "$BAK/juju-status.yaml"
for v in $(env | awk -F= '/^OS_/{print $1}'); do unset "$v"; done
# shellcheck source=/dev/null
source ~/admin-openrc
openstack server list --all-projects </dev/null > "$BAK/servers.txt"
openstack loadbalancer list </dev/null > "$BAK/loadbalancers.txt"
openstack loadbalancer amphora list </dev/null > "$BAK/amphorae.txt"
echo "Pre-maintenance snapshot at $BAK"
} )
Guest auto-resume policy (D-043, ADOPTED): nova-compute resume-guests-state-on-host-boot=true is charm config in the bundle. Running guests resume automatically on host boot, honoring per-guest pre-shutdown state -- so a guest you want to stay down for forensics must be openstack server stopped BEFORE the window. NOTE the D-043 caveat: the in-cloud CAPI mgmt VM (capi-mgmt-v2) is a control-plane component riding a tenant VM; it WILL auto-resume with the rest. Its manual-start policy (D-041) now governs deliberate stops only.
GATE: every juju machine agent reports started (typically 2-5 min).
juju machines --format=json | jq -r '.machines | to_entries[] | "\(.key) agent=\(.value."juju-status".current)"' | grep -v 'agent=started' \ || echo "all machines started"
Check FIRST; act only on a confirmed outage. reboot-cluster-from-complete-outage is DESTRUCTIVE against an already-reformed cluster (it rewrites topology from the unit it runs on). See appendix-A "mysql-innodb-cluster recovery" for the signature table (D-062 material).
CHECK (read-only) -- jumphost
juju status mysql-innodb-cluster --format=json | jq -r \ '.applications."mysql-innodb-cluster".units | to_entries[] | "\(.key) \(.value."workload-status".message // "")"'
GATE: three units Cluster is ONLINE, exactly one Mode: R/W. If ALL units report a complete outage (and only then):
juju run mysql-innodb-cluster/leader reboot-cluster-from-complete-outage
Vault stores in MySQL and always restarts sealed; manual 3-of-5 unseal is the v1 standard (D-011 amendment). Keys are NOT in version control -- custody per D-069 / docs/security-ledger.md. Missing keys = recovery BLOCKER: escalate, never improvise (re-init wipes the TLS plane).
RUN -- jumphost (hidden prompts; keys never in argv/history)
( {
for i in 1 2 3; do
read -s -p "Unseal key $i: " K; echo
juju ssh vault/0 -- "VAULT_ADDR=http://127.0.0.1:8200 vault operator unseal $K" </dev/null >/dev/null
unset K
done
juju ssh vault/0 -- 'VAULT_ADDR=http://127.0.0.1:8200 vault status 2>&1 | grep -E "Sealed|Initialized"' </dev/null
} )
GATE: Sealed=false.
CHECK: connection-status on every chassis (cloud-assert A4 runs this). If any is not connected, run the post-vault TLS restart sweep (appendix-A; restart ovn-controller / ovs-vswitchd / neutron-server on affected units), then re-probe. [REVALIDATE: sweep unit list against current topology]
With D-043 in force, pre-shutdown-running guests should be ACTIVE.
openstack server list --all-projects -c Name -c Status </dev/null
Unexpectedly SHUTOFF guests: openstack server start <name>; if MANY are down, verify juju config nova-compute resume-guests-state-on-host-boot (a regressed flag takes effect only on the NEXT boot -- start manually now).
Expect auto-recovery once amphora VMs resume. Any LB stuck provisioning_status=ERROR after ~5 min: `openstack loadbalancer failover
` and poll to ACTIVE (VIP + FIP are preserved). [REVALIDATE: timing budget on current build]
source ~/admin-openrc bash scripts/cloud-assert.sh
GATE: CLOUD-ASSERT: PASS. Any FAIL routes to appendix-A by its message. This replaces the old jumphost-local post-maintenance-health-check.sh.
| Symptom | Fix |
|---|---|
Units in error (stuck relation-broken hook) |
juju resolved --no-retry <unit> |
| MySQL will not reform | Stage 2 -- action ONLY on confirmed complete outage |
| Vault sealed | Expected; Stage 3 manual unseal |
Chassis not connected |
Stage 4 TLS sweep |
Hypervisor down, juju green |
restart nova-compute on the host; recheck in 60s |
LB provisioning_status=ERROR |
loadbalancer failover; appendix-A LB entries |
| Horizon login cookie failure | restore _99_internal_http_cookies.py (D-044) |
Amphora cert validity across long outages: a resumed amphora keeps pre-shutdown cert material; if the amphora-control-plane CA rotated during the window it will fail to health-manager and be failed-over. Theoretical under default TTLs for short outages; validate explicitly on bare metal (rotate CA, hard-stop an amphora host, observe resume vs ERROR->failover) before relying on resume.