# Ops -- Routine Update Procedure (Juju controller, model agents, in-channel charm refreshes) STATUS: authored 2026-07-04 per DOCFIX-086; AS-EXECUTED 2026-07-05 (ops-update-20260705: controller+agents 3.6.24 -> 3.6.25, 17 in-channel charm refreshes; changelog addenda 13-16). [REVALIDATE] markers cleared and live divergences folded back per DOCFIX-088. Policy companion: D-071 (PROPOSED -- update cadence + controller patch policy; amended 2026-07-05: supported controller backup EXISTS on juju 3.6). Scope: a PLANNED maintenance window applying three update layers, in order: (1) Juju controller patch upgrade (single, non-HA controller), (2) model agent upgrade to match, (3) charm refreshes to newer revisions WITHIN their pinned channels (the appendix-B-sanctioned update type). This is explicitly NOT an OpenStack series/track upgrade -- no charm changes channel here, ever. For full-cloud power maintenance use `runbooks/ops-restart-procedure.md`; for incident response use `runbooks/appendix-A-troubleshooting.md`. Conventions: RUN/CHECK/GATE labels per runbooks/README.md. One gated mutation at a time; read-only verification precedes every mutation. Invoke scripts as `bash scripts/.sh` (no exec bits in the repo). Run the whole window inside `bash scripts/run-logged.sh ops-update-` and add the index row to `logs/as-executed-index.md`. --- ## 0. Exclusions (read before anything else) - **VAULT IS OUT OF SCOPE.** Live vault stays on `1.8/stable`. Do not refresh `vault` or `vault-mysql-router` in this procedure. > CAUTION: `bundle.yaml` pins vault `1.16/stable` (D-068 / BUNDLEFIX-007) > while live runs `1.8/stable`. A naive "sync live to the bundle" or a > blanket refresh sweep would attempt a multi-minor major Vault upgrade -- > exactly what D-068 (PROPOSED) says is NOT a casual `juju refresh` (unseal > keys in hand, storage-format compatibility, rehearsal first). Until D-068 > is ruled and rehearsed, vault is untouchable here. - **No channel changes.** D-002 pins channels; this procedure only moves revisions WITHIN a pinned channel. A desired channel change is a D-NNN/BUNDLEFIX proposal, not an update-window action. - Out of scope: MAAS, host OS packages, jumphost snaps (the juju client snap updates itself on `3/stable`), and the CAPI/driver layer (appendix-B B.2/B.3; governed by D-034/D-042 -- their update is a separate procedure). - Apps with no update available at authoring (ceph-*, ovn-*, mysql-*, rabbitmq-server, hacluster, memcached): re-measured at pre-flight. If one NEWLY shows an update, it is LOGGED for the next window, not refreshed in this one (hard rule 1 -- no added scope mid-window). ## 0b. Expectations table (read FIRST; saves false alarms) | Observation | Meaning | |---|---| | `juju status` / API errors for ~1-5 min right after `upgrade-controller` | EXPECTED. Single non-HA controller; jujud restarts. Data plane and workloads unaffected; model MANAGEMENT is blind. Wait, do not react. | | Units cycling `maintenance`/`executing` for minutes after a refresh | Expected settle arc (upgrade-charm hooks). Judge by the settle gate, not by transient states. | | `can-upgrade-to` values differ from this runbook's planning table | EXPECTED. Channels float (appendix-B policy). The live measurement is the worklist; any table here is planning reference only. | | An app's `can-upgrade-to` names a DIFFERENT charm than the app runs | Anomaly (seen once for magnum on 2026-07-04: `ch:amd64/magnum-dashboard-122`). EXCLUDE the app, capture the raw JSON, log the finding. Never refresh across a name mismatch. | | Vault flips to sealed during this window | NOT expected -- nothing here restarts vault. That is an incident: stop, appendix-A. | | An EXCLUDED app's `can-upgrade-to` target changes mid-window | Observed 2026-07-05 (magnum's target moved during the window -- Charmhub republish in flight). Confirms the rule: targets are re-measured per app at refresh time; excluded apps stay excluded until the next window's pre-flight. | ## 1. Pre-flight and baseline ### 1.1 Session bootstrap **RUN -- jumphost** ```bash git -C ~/openstack-caracal-ipv4 pull bash scripts/repo-lint.sh bash scripts/run-logged.sh ops-update-$(date -u +%Y%m%d) ``` **Expect:** lint 0 fail (1 legacy WARN documented); logged subshell open. Add the session row to `logs/as-executed-index.md`. ### 1.2 Measure versions -- never assume **CHECK (read-only) -- jumphost** ```bash juju version juju show-controller --format=json | jq -r 'to_entries[] | "\(.key) agent-version=\(.value.details."agent-version")"' juju status -m openstack --format=json | jq -r ' [.. | objects | select(has("juju-status")) | ."juju-status".version | select(. != null)] | .[]' | sort | uniq -c ``` **Expect:** client at the target patch version; controller and ALL machine, container, and unit agents at ONE uniform current version (agents carry version under `juju-status`, NOT `agent-status` -- verified 2026-07-05; this cloud = 91 agents). Record both values. Any skew among agents = STOP and investigate before adding an upgrade on top. ### 1.3 Measure the refresh worklist (with the charm-name gate) **CHECK (read-only) -- jumphost** ```bash juju status -m openstack --format=json | jq -r ' .applications | to_entries[] | select((.value."can-upgrade-to" // "") != "") | .key as $app | (.value."charm-name") as $name | (.value."can-upgrade-to" | sub("^ch:[^/]*/"; "")) as $t | [$app, $name, (.value."charm-rev"|tostring), ($t | sub("-[0-9]+$"; "")), ($t | capture("-(?[0-9]+)$").r), (if ($t | sub("-[0-9]+$"; "")) == $name then "OK" else "NAME-MISMATCH" end)] | @tsv' | column -t ``` **GATE:** every row `OK`. Any `NAME-MISMATCH` row: capture the app's raw `juju status --format=json`, EXCLUDE it from this window's worklist, and log the finding (appendix-A/DOCFIX material). Record the surviving rows as the window's worklist AND revert table: (app, current-rev, target-rev). Cross-check current revs against appendix-B B.1; any pre-existing divergence is logged (it means the last re-baseline was missed), not corrected here. ### 1.4 Health gate + pre-change BOM **RUN -- jumphost** (writes only `asbuilt//`) ```bash source ~/admin-openrc bash scripts/cloud-assert.sh --capture ``` **GATE:** `CLOUD-ASSERT: PASS`. WARN/HOLD is a no-go: do not update an unhealthy or unverified cloud. Commit the `asbuilt//` BOM as the pre-change baseline before the first mutation. ### 1.5 Quiesce check **CHECK (read-only) -- jumphost** ```bash juju status -m openstack --format=json | jq -r ' .. | objects | select(has("agent-status")) | select(."agent-status".current as $c | ["executing","error","failed"] | index($c)) | ."agent-status".current' | sort | uniq -c openstack coe cluster list -f value -c name -c status /controller`. **CHECK (read-only) -- jumphost** -- reference capture + flag verification ```bash juju controllers --format=json > ~/openstack-baseline/controller-pre-$(date -u +%Y%m%d).json juju help create-backup ``` **RUN -- jumphost** -- controller state backup (gated; jumphost-local archive) ```bash juju create-backup -m admin/controller \ --filename ~/openstack-baseline/juju-controller-backup-pre--$(date -u +%Y%m%d).tar.gz \ "pre ops-update- controller to " ``` **GATE:** archive downloaded, checksum printed, size sane (this cloud: ~900MB). The archive stays on the jumphost (secret-adjacent; never committed). The D-070 rebuild posture remains the documented restore path of last resort. ### 2.2 Upgrade the controller > CAUTION: there is NO in-band downgrade of a Juju controller. The > compensating controls are: patch-level jump only (D-071), healthy > pre-state proven at 1.4, and the D-070 rebuild posture. If the target is > more than a patch jump, STOP -- that is not this runbook. **RUN -- jumphost** (flags per 1.6; form verified as-executed 2026-07-05) ```bash juju upgrade-controller --agent-version ``` **Expect:** command accepted, then the 0b blind window (~1-5 min) while jujud restarts. Do not run other juju commands until it clears. **GATE:** controller at target and the model reachable again: ```bash juju show-controller --format=json | jq -r 'to_entries[] | "\(.key) agent-version=\(.value.details."agent-version")"' juju status -m openstack --format=json | jq -r '.machines | to_entries[] | "\(.key) \(.value."juju-status".current)"' | grep -v started \ || echo "all machines started" ``` Poll up to ~10 min. Controller at target + all machine agents `started`. Beyond budget: STOP, appendix-A; never re-bootstrap inside the window. ### 2.3 Post-controller spot check **CHECK (read-only) -- jumphost** ```bash bash scripts/cloud-assert.sh ``` **GATE:** PASS (A5-A7 need `source ~/admin-openrc` in scope). Do not start the agent stage on a controller that cannot pass the behavioral sweep. ## 3. Model agent stage **RUN -- jumphost** (form verified as-executed 2026-07-05) ```bash juju upgrade-model -m openstack --agent-version ``` **Expect:** agents upgrade rolling; workloads are NOT restarted. The controller model needs NO separate upgrade-model -- `juju upgrade-controller` already brought machine 0's agent to target (verified live; check with `juju status -m admin/controller`). Explicit `--agent-version` is used for determinism (default would pick the controller's version). **GATE:** every machine and unit agent at target, settled: ```bash juju status -m openstack --format=json | jq -r ' [(.machines | to_entries[] | .value."juju-status".version), (.. | objects | select(has("agent-status")) | ."agent-status".version)] | .[] | select(. != null)' | sort | uniq -c ``` **Expect:** ONE version, the target, on every line; no unit stuck in `upgrading`. An agent stuck beyond ~15 min = STOP, appendix-A. ## 4. Charm refresh stage ### 4.0 Rules for EVERY app in this stage 1. Re-verify THAT app immediately before refreshing it (the 1.3 worklist is stale the moment the first refresh lands), including the name gate: **CHECK (read-only) -- jumphost** ```bash APP= juju status "$APP" --format=json | jq -r --arg a "$APP" ' .applications[$a] | [$a, ."charm-name", (."charm-rev"|tostring), (."can-upgrade-to" // "NONE")] | @tsv' ``` `NONE` = already current, skip forward. Name mismatch = exclude + log. 2. The current revision just read IS the revert value -- record it. 3. ONE app per approval: **RUN -- jumphost** ```bash juju refresh ``` 4. Per-app settle gate before the next app: **CHECK (read-only) -- jumphost** ```bash juju status "$APP" --format=json | jq -r '.applications[] | .units // {} | .. | objects | select(has("workload-status")) | "\(."workload-status".current)/\(."agent-status".current // "?")"' \ | sort | uniq -c juju status -m openstack --format=json | jq -r ' .. | objects | select(has("workload-status")) | select(."workload-status".current == "error") | ."workload-status".message' \ | sed 's/^/ERROR: /' ; true ``` **GATE:** all units of the app AND its subordinates `active/idle`, the new revision visible, and NO unit anywhere in error. Budget ~15 min. (`bash scripts/deploy-watch.sh` in a side window is the signal view, not the gate.) 5. Run the group's behavioral probe (below); full `bash scripts/cloud-assert.sh` at GROUP boundaries only. 6. Revert for any app: `juju refresh --revision `. > CAUTION: an explicit `--revision` refresh PINS the app (it stops tracking > the channel). Any revert row therefore carries a follow-up > `juju refresh --channel ` to resume tracking once > the cause is understood -- record both in the revert table. ### 4.1 Group 0 -- keystone (alone, first) Identity underpins every other service; charm-guide practice is keystone first. Refresh `keystone` per 4.0. **Probe:** `openstack token issue `nova-cloud-controller` -> `neutron-api` -> `neutron-api-plugin-ovn` -> `glance` -> `glance-simplestreams-sync` -> `octavia-diskimage-retrofit` -> `cinder` -> `cinder-ceph` -> `barbican` -> `barbican-vault`. Probes after the relevant principal settles: ```bash openstack compute service list `magnum-dashboard` -> `octavia-dashboard` (the latter two are subordinates riding openstack-dashboard -- verify placement live in status before ordering). **Probe (corrected 2026-07-05, DOCFIX-088):** Horizon over the dashboard VIP on the PLAIN-HTTP leg -- `curl http:///horizon/auth/login/` expects 200 -- plus the D-044 override file present at `/usr/share/openstack-dashboard/openstack_dashboard/local/local_settings.d/` (canonical path; the /etc path is a decoy) and a browser login working. Do NOT gate on https://: dashboard VIP TLS has been dead SINCE DEPLOY on this build (haproxy 443 backend targets a vhost-less internal address, masked by its L4-only check; addendum 15 RCA) -- an https probe failure here is the pre-existing defect, not a refresh regression. **GATE:** probe green; cloud-assert not required mid-group here, PASS at group end. ### 4.5 Group 4 -- nova-compute (LAST) Data-plane adjacent (all hypervisor hosts). A charm refresh does NOT restart guests, but this runs last, with everything else proven green. Refresh `nova-compute` per 4.0. **Probe:** ```bash openstack hypervisor list /`) ```bash source ~/admin-openrc bash scripts/cloud-assert.sh --capture ``` **GATE:** `CLOUD-ASSERT: PASS`; this capture is the post-change BOM. **CHECK (read-only) -- jumphost** -- version coherence + BOM diff ```bash juju show-controller --format=json | jq -r 'to_entries[] | .value.details."agent-version"' diff <(sort asbuilt//bundle-exported.yaml) \ <(sort asbuilt//bundle-exported.yaml) | grep -E '^[<>]' | sort ``` **Expect:** controller == agents == target (1.2 query re-run); every worklist app at its recorded target revision; the bundle diff shows ONLY the expected charm revision lines. ANY config/channel/placement delta = stop and explain before closing the window. Behavioral spot set (beyond cloud-assert): `openstack token issue`, `openstack server list --all-projects`, `openstack loadbalancer list`, `openstack coe cluster list`, Horizon login. ## 6. Re-baseline and documentation (window close) 1. `runbooks/appendix-B-asbuilt-version-lock.md` B.1: update as-built revisions to the measured post-state; bump the header date/source line. (This is exactly the appendix-B "refresh the table on a successful validated state" event.) 2. Commit the post-change `asbuilt//` BOM. 3. `docs/v1-redeploy-changelog.md`: as-executed addendum -- what moved (controller x.y.z -> x.y.z', per-app rev table), why, and the revert table (per-app `--revision` + `--channel` re-track pairs; controller = none in-band, D-070 posture). 4. Close the `logs/as-executed-index.md` row; update `docs/session-ledger.md`. 5. First execution only: clear this runbook's [REVALIDATE] markers and set the as-executed date in the STATUS header. ## 7. Revert reference | Layer | In-band revert | Posture if none | |---|---|---| | Controller upgrade | NONE (no downgrade) | Patch-jump-only + healthy pre-state + D-070 rebuild-from-runbooks | | Model agents | NONE (no downgrade) | Same as controller | | Charm refresh (per app) | `juju refresh --revision ` then later `juju refresh --channel ` | Appendix-B B.1 holds the last validated revisions | | Docs / BOM re-baseline | `git revert` of the re-baseline commit | -- | --- ## Quick reference (symptom -> fix) | Symptom | Fix | |---|---| | Agent stuck `upgrading` past budget | STOP; appendix-A; do not stack further mutations | | Unit `error` mid-refresh | Understand the hook error FIRST; `juju resolved --no-retry ` only with cause known; else revert the app (4.0.6) | | `can-upgrade-to` names a different charm | Exclude app + capture JSON + log finding (0b table) | | Controller unreachable past 2.2 budget | Escalate; NEVER re-bootstrap inside the window | | Horizon login cookie failure after dashboard refresh | Restore `_99_internal_http_cookies.py` (D-044; appendix-A) | | Vault sealed mid-window | Incident, not expected here -- appendix-A / restart-procedure Stage 3 | ## Open questions (carried for Roosevelt) - HA controllers change the 2.2 blind-window math (rolling controller upgrade, no full API blackout) -- revalidate gates on bare metal. - Controller backup story on bare metal: evaluate supported juju-db dump tooling as part of the Roosevelt controller design, not improvised here. - Cadence policy and window sizing: D-071 (PROPOSED) -- rule before Roosevelt operations begin.