# Ops -- Full-Cloud Restart / Recovery Procedure

STATUS: committed to the repo per DOCFIX-075 (previously lived only as an
operator-local document + a jumphost-local health script -- unreachable by a
Roosevelt team and by any fresh clone). Procedure lineage: authored and
validated 2026-05-20 on the prior build; adapted 2026-07-03 to current repo
references (appendix-A, cloud-assert, D-043). Stages whose mechanics depend on
build state carry [REVALIDATE] markers -- clear them on the first executed
restart of the current build and record the as-executed date here.

Scope: PLANNED maintenance windows taking all four hosts down, or UNPLANNED
recovery from full power/network loss that took MySQL or Vault offline
together. For single-host partial failures, use the targeted entries in
`runbooks/appendix-A-troubleshooting.md` instead.

Conventions: RUN/CHECK/GATE labels per runbooks/README.md. One gated step at a
time. Invoke scripts as `bash scripts/<name>.sh` (no exec bits in the repo).

---

## Pre-shutdown (planned only)

**RUN -- jumphost** -- baseline capture (reference, not a restore path)
```bash
( {
  juju switch | grep -q 'openstack' || { echo "FAIL: switch to the openstack model first"; exit 1; }
  BAK=~/openstack-baseline/$(date -u +%Y%m%d-%H%M%S)-pre-maintenance
  mkdir -p "$BAK"
  juju export-bundle --filename "$BAK/bundle.yaml"
  juju status --format=yaml > "$BAK/juju-status.yaml"
  for v in $(env | awk -F= '/^OS_/{print $1}'); do unset "$v"; done
  # shellcheck source=/dev/null
  source ~/admin-openrc
  openstack server list --all-projects     </dev/null > "$BAK/servers.txt"
  openstack loadbalancer list              </dev/null > "$BAK/loadbalancers.txt"
  openstack loadbalancer amphora list      </dev/null > "$BAK/amphorae.txt"
  echo "Pre-maintenance snapshot at $BAK"
} )
```

**Guest auto-resume policy (D-043, ADOPTED):** `nova-compute
resume-guests-state-on-host-boot=true` is charm config in the bundle. Running
guests resume automatically on host boot, honoring per-guest pre-shutdown
state -- so a guest you want to stay down for forensics must be
`openstack server stop`ped BEFORE the window. NOTE the D-043 caveat: the
in-cloud CAPI mgmt VM (capi-mgmt-v2) is a control-plane component riding a
tenant VM; it WILL auto-resume with the rest. Its manual-start policy (D-041)
now governs deliberate stops only.

## Shutdown sequence

1. (Optional) stop guests you want held down post-maintenance.
2. Power down hosts via MAAS/IPMI/virsh. Order not significant among
   openstack0-3 (functional peers).

## Power-on / recovery

### Stage 1 -- foundation
**GATE:** every juju machine agent reports `started` (typically 2-5 min).
```bash
juju machines --format=json | jq -r '.machines | to_entries[]
  | "\(.key)  agent=\(.value."juju-status".current)"' | grep -v 'agent=started' \
  || echo "all machines started"
```

### Stage 2 -- MySQL InnoDB cluster
Check FIRST; act only on a confirmed outage. `reboot-cluster-from-complete-outage`
is DESTRUCTIVE against an already-reformed cluster (it rewrites topology from
the unit it runs on). See appendix-A "mysql-innodb-cluster recovery" for the
signature table (D-062 material).

**CHECK (read-only) -- jumphost**
```bash
juju status mysql-innodb-cluster --format=json | jq -r \
  '.applications."mysql-innodb-cluster".units | to_entries[]
   | "\(.key)  \(.value."workload-status".message // "")"'
```
**GATE:** three units `Cluster is ONLINE`, exactly one `Mode: R/W`. If ALL
units report a complete outage (and only then):
```bash
juju run mysql-innodb-cluster/leader reboot-cluster-from-complete-outage
```

### Stage 3 -- Vault unseal (sealed-after-restart is BY DESIGN)
Vault stores in MySQL and always restarts sealed; manual 3-of-5 unseal is the
v1 standard (D-011 amendment). Keys are NOT in version control -- custody per
D-069 / docs/security-ledger.md. Missing keys = recovery BLOCKER: escalate,
never improvise (re-init wipes the TLS plane).

**RUN -- jumphost** (hidden prompts; keys never in argv/history)
```bash
( {
  for i in 1 2 3; do
    read -s -p "Unseal key $i: " K; echo
    juju ssh vault/0 -- "VAULT_ADDR=http://127.0.0.1:8200 vault operator unseal $K" </dev/null >/dev/null
    unset K
  done
  juju ssh vault/0 -- 'VAULT_ADDR=http://127.0.0.1:8200 vault status 2>&1 | grep -E "Sealed|Initialized"' </dev/null
} )
```
**GATE:** `Sealed=false`.

### Stage 4 -- OVN chassis
**CHECK:** `connection-status` on every chassis (cloud-assert A4 runs this).
If any is not `connected`, run the post-vault TLS restart sweep (appendix-A;
restart ovn-controller / ovs-vswitchd / neutron-server on affected units),
then re-probe. [REVALIDATE: sweep unit list against current topology]

### Stage 5 -- guests
With D-043 in force, pre-shutdown-running guests should be ACTIVE.
```bash
openstack server list --all-projects -c Name -c Status </dev/null
```
Unexpectedly SHUTOFF guests: `openstack server start <name>`; if MANY are
down, verify `juju config nova-compute resume-guests-state-on-host-boot`
(a regressed flag takes effect only on the NEXT boot -- start manually now).

### Stage 6 -- Octavia LBs
Expect auto-recovery once amphora VMs resume. Any LB stuck
`provisioning_status=ERROR` after ~5 min: `openstack loadbalancer failover
<lb>` and poll to ACTIVE (VIP + FIP are preserved). [REVALIDATE: timing
budget on current build]

### Stage 7 -- end-to-end behavioral gate
```bash
source ~/admin-openrc
bash scripts/cloud-assert.sh
```
**GATE:** `CLOUD-ASSERT: PASS`. Any FAIL routes to appendix-A by its message.
This replaces the old jumphost-local `post-maintenance-health-check.sh`.

---

## Quick reference (symptom -> fix)

| Symptom | Fix |
|---|---|
| Units in `error` (stuck relation-broken hook) | `juju resolved --no-retry <unit>` |
| MySQL will not reform | Stage 2 -- action ONLY on confirmed complete outage |
| Vault sealed | Expected; Stage 3 manual unseal |
| Chassis `not connected` | Stage 4 TLS sweep |
| Hypervisor `down`, juju green | restart nova-compute on the host; recheck in 60s |
| LB `provisioning_status=ERROR` | `loadbalancer failover`; appendix-A LB entries |
| Horizon login cookie failure | restore `_99_internal_http_cookies.py` (D-044) |

## Open question (carried for Roosevelt)
Amphora cert validity across long outages: a resumed amphora keeps pre-shutdown
cert material; if the amphora-control-plane CA rotated during the window it
will fail to health-manager and be failed-over. Theoretical under default TTLs
for short outages; validate explicitly on bare metal (rotate CA, hard-stop an
amphora host, observe resume vs ERROR->failover) before relying on resume.
