STATUS: authored 2026-07-04 per DOCFIX-086; NOT yet as-executed. Policy companion: D-071 (PROPOSED -- update cadence + controller patch policy). Steps whose exact mechanics could not be verified against the live client at authoring time carry [REVALIDATE] markers -- clear them on the first executed update window and record the as-executed date here.
Scope: a PLANNED maintenance window applying three update layers, in order: (1) Juju controller patch upgrade (single, non-HA controller), (2) model agent upgrade to match, (3) charm refreshes to newer revisions WITHIN their pinned channels (the appendix-B-sanctioned update type). This is explicitly NOT an OpenStack series/track upgrade -- no charm changes channel here, ever. For full-cloud power maintenance use runbooks/ops-restart-procedure.md; for incident response use runbooks/appendix-A-troubleshooting.md.
Conventions: RUN/CHECK/GATE labels per runbooks/README.md. One gated mutation at a time; read-only verification precedes every mutation. Invoke scripts as bash scripts/<name>.sh (no exec bits in the repo). Run the whole window inside bash scripts/run-logged.sh ops-update-<date> and add the index row to logs/as-executed-index.md.
1.8/stable. Do not refresh vault or vault-mysql-router in this procedure.CAUTION:
bundle.yamlpins vault1.16/stable(D-068 / BUNDLEFIX-007) while live runs1.8/stable. A naive "sync live to the bundle" or a blanket refresh sweep would attempt a multi-minor major Vault upgrade -- exactly what D-068 (PROPOSED) says is NOT a casualjuju refresh(unseal keys in hand, storage-format compatibility, rehearsal first). Until D-068 is ruled and rehearsed, vault is untouchable here.
3/stable), and the CAPI/driver layer (appendix-B B.2/B.3; governed by D-034/D-042 -- their update is a separate procedure).| Observation | Meaning |
|---|---|
juju status / API errors for ~1-5 min right after upgrade-controller |
EXPECTED. Single non-HA controller; jujud restarts. Data plane and workloads unaffected; model MANAGEMENT is blind. Wait, do not react. |
Units cycling maintenance/executing for minutes after a refresh |
Expected settle arc (upgrade-charm hooks). Judge by the settle gate, not by transient states. |
can-upgrade-to values differ from this runbook's planning table |
EXPECTED. Channels float (appendix-B policy). The live measurement is the worklist; any table here is planning reference only. |
An app's can-upgrade-to names a DIFFERENT charm than the app runs |
Anomaly (seen once for magnum on 2026-07-04: ch:amd64/magnum-dashboard-122). EXCLUDE the app, capture the raw JSON, log the finding. Never refresh across a name mismatch. |
| Vault flips to sealed during this window | NOT expected -- nothing here restarts vault. That is an incident: stop, appendix-A. |
RUN -- jumphost
git -C ~/openstack-caracal-ipv4 pull bash scripts/repo-lint.sh bash scripts/run-logged.sh ops-update-$(date -u +%Y%m%d)
Expect: lint 0 fail (1 legacy WARN documented); logged subshell open. Add the session row to logs/as-executed-index.md.
CHECK (read-only) -- jumphost
juju version
juju show-controller --format=json | jq -r 'to_entries[]
| "\(.key) agent-version=\(.value.details."agent-version")"'
juju status -m openstack --format=json | jq -r '
[(.machines | to_entries[] | .value."juju-status".version),
(.. | objects | select(has("agent-status")) | ."agent-status".version)]
| .[] | select(. != null)' | sort | uniq -c
Expect: client at the target patch version; controller and ALL machine + unit agents at ONE uniform current version. Record both values. Any skew among agents = STOP and investigate before adding an upgrade on top. [REVALIDATE: unit agent version field path on current juju]
CHECK (read-only) -- jumphost
juju status -m openstack --format=json | jq -r '
.applications | to_entries[]
| select((.value."can-upgrade-to" // "") != "")
| .key as $app | (.value."charm-name") as $name
| (.value."can-upgrade-to" | sub("^ch:[^/]*/"; "")) as $t
| [$app, $name, (.value."charm-rev"|tostring),
($t | sub("-[0-9]+$"; "")), ($t | capture("-(?<r>[0-9]+)$").r),
(if ($t | sub("-[0-9]+$"; "")) == $name then "OK" else "NAME-MISMATCH" end)]
| @tsv' | column -t
GATE: every row OK. Any NAME-MISMATCH row: capture the app's raw juju status <app> --format=json, EXCLUDE it from this window's worklist, and log the finding (appendix-A/DOCFIX material).
Record the surviving rows as the window's worklist AND revert table: (app, current-rev, target-rev). Cross-check current revs against appendix-B B.1; any pre-existing divergence is logged (it means the last re-baseline was missed), not corrected here.
RUN -- jumphost (writes only asbuilt/<ts>/)
source ~/admin-openrc bash scripts/cloud-assert.sh --capture
GATE: CLOUD-ASSERT: PASS. WARN/HOLD is a no-go: do not update an unhealthy or unverified cloud. Commit the asbuilt/<ts>/ BOM as the pre-change baseline before the first mutation.
CHECK (read-only) -- jumphost
juju status -m openstack --format=json | jq -r '
.. | objects | select(has("agent-status"))
| select(."agent-status".current as $c | ["executing","error","failed"] | index($c))
| ."agent-status".current' | sort | uniq -c
openstack coe cluster list -f value -c name -c status </dev/null
openstack loadbalancer list -f value -c name -c provisioning_status </dev/null
Expect: no units executing/error; no magnum cluster in *_IN_PROGRESS; no LB in PENDING_*. In-flight tenant operations and an update window do not mix.
CHECK (read-only) -- jumphost
juju help upgrade-controller juju help upgrade-model juju help commands | grep -i backup || echo "no backup commands on this client"
GATE: the exact flag names for the two upgrade commands are read from THIS output before composing any mutation below. Nothing in this runbook's candidate invocations overrides what the live client says.
juju create-backup was removed in Juju 3.0. If 1.6 confirmed no backup command exists on this client (expected for 3.6), the accepted posture per D-071 is: patch-level jumps only, proven-healthy pre-state (1.4), and the D-070 restore path (re-bootstrap + rebuild-from-runbooks). The pre-change BOM already holds the exported bundle and status captures.
CHECK (read-only) -- jumphost -- reference captures + optional tooling probe
juju controllers --format=json > ~/openstack-baseline/controller-pre-$(date -u +%Y%m%d).json juju ssh -m controller 0 -- snap list </dev/null
Expect: the juju-db snap listed. IF the operator wants a database-level capture and the juju-db snap ships a dump tool, that is an operator's-call gated extra step -- verify the tool exists on the controller machine first; do not improvise one. Absent tooling, proceed on the documented posture. [REVALIDATE: juju-db snap contents on current controller]
CAUTION: there is NO in-band downgrade of a Juju controller. The compensating controls are: patch-level jump only (D-071), healthy pre-state proven at 1.4, and the D-070 rebuild posture. If the target is more than a patch jump, STOP -- that is not this runbook.
RUN -- jumphost (flags per 1.6; candidate form below) [REVALIDATE]
juju upgrade-controller --agent-version <target>
Expect: command accepted, then the 0b blind window (~1-5 min) while jujud restarts. Do not run other juju commands until it clears.
GATE: controller at target and the model reachable again:
juju show-controller --format=json | jq -r 'to_entries[] | "\(.key) agent-version=\(.value.details."agent-version")"' juju status -m openstack --format=json | jq -r '.machines | to_entries[] | "\(.key) \(.value."juju-status".current)"' | grep -v started \ || echo "all machines started"
Poll up to ~10 min. Controller at target + all machine agents started. Beyond budget: STOP, appendix-A; never re-bootstrap inside the window.
CHECK (read-only) -- jumphost
bash scripts/cloud-assert.sh
GATE: PASS (A5-A7 need source ~/admin-openrc in scope). Do not start the agent stage on a controller that cannot pass the behavioral sweep.
RUN -- jumphost (flags per 1.6) [REVALIDATE]
juju upgrade-model -m controller juju upgrade-model -m openstack
Expect: default target is the controller's version (verify in the 1.6 help output whether an explicit --agent-version is required). Agents upgrade rolling; workloads are NOT restarted.
GATE: every machine and unit agent at target, settled:
juju status -m openstack --format=json | jq -r '
[(.machines | to_entries[] | .value."juju-status".version),
(.. | objects | select(has("agent-status")) | ."agent-status".version)]
| .[] | select(. != null)' | sort | uniq -c
Expect: ONE version, the target, on every line; no unit stuck in upgrading. An agent stuck beyond ~15 min = STOP, appendix-A.
APP=<app> juju status "$APP" --format=json | jq -r --arg a "$APP" ' .applications[$a] | [$a, ."charm-name", (."charm-rev"|tostring), (."can-upgrade-to" // "NONE")] | @tsv'
NONE = already current, skip forward. Name mismatch = exclude + log.juju refresh <app>
juju status "$APP" --format=json | jq -r '.applications[] | .units // {}
| .. | objects | select(has("workload-status"))
| "\(."workload-status".current)/\(."agent-status".current // "?")"' \
| sort | uniq -c
juju status -m openstack --format=json | jq -r '
.. | objects | select(has("workload-status"))
| select(."workload-status".current == "error") | ."workload-status".message' \
| sed 's/^/ERROR: /' ; trueGATE: all units of the app AND its subordinates active/idle, the new revision visible, and NO unit anywhere in error. Budget ~15 min. (bash scripts/deploy-watch.sh in a side window is the signal view, not the gate.)bash scripts/cloud-assert.sh at GROUP boundaries only.juju refresh <app> --revision <recorded-rev>.CAUTION: an explicit
--revisionrefresh PINS the app (it stops tracking the channel). Any revert row therefore carries a follow-upjuju refresh <app> --channel <pinned-channel>to resume tracking once the cause is understood -- record both in the revert table.
Identity underpins every other service; charm-guide practice is keystone first. Refresh keystone per 4.0. Probe: openstack token issue </dev/null succeeds. GATE: full bash scripts/cloud-assert.sh PASS before Group 1.
Order (subordinate immediately after its principal): placement -> nova-cloud-controller -> neutron-api -> neutron-api-plugin-ovn -> glance -> glance-simplestreams-sync -> octavia-diskimage-retrofit -> cinder -> cinder-ceph -> barbican -> barbican-vault.
Probes after the relevant principal settles:
openstack compute service list </dev/null # placement / n-c-c openstack network agent list </dev/null # neutron-api (+ plugin) openstack image list </dev/null # glance openstack volume service list </dev/null # cinder (+ cinder-ceph) openstack secret list </dev/null # barbican (+ barbican-vault)
GATE: full bash scripts/cloud-assert.sh PASS at group end.
Owns the amphora control plane; refresh alone, watch A6 specifically. Probe: openstack loadbalancer list </dev/null -- every LB ACTIVE/ONLINE (compare against the 1.4 baseline inventory). GATE: full bash scripts/cloud-assert.sh PASS.
openstack-dashboard -> magnum-dashboard -> octavia-dashboard (the latter two are subordinates riding openstack-dashboard -- verify placement live in status before ordering). Probe: Horizon over the dashboard VIP answers HTTP 200 and login works; the D-044 secure-cookie override survives the refresh (appendix-A entry if login cookies fail). GATE: probe green; cloud-assert not required mid-group here, PASS at group end.
Data-plane adjacent (all hypervisor hosts). A charm refresh does NOT restart guests, but this runs last, with everything else proven green. Refresh nova-compute per 4.0. Probe:
openstack hypervisor list </dev/null openstack compute service list </dev/null openstack server list --all-projects -c Name -c Status </dev/null
Expect: all hypervisors up, compute services up, guests unchanged (compare the 1.4 baseline server list). GATE: full bash scripts/cloud-assert.sh PASS.
CHECK (read-only) -- jumphost -- re-run the 1.3 worklist query. Expect: empty (or only documented exclusions: vault). Anything new is LOGGED for the next window, not refreshed now.
RUN -- jumphost (writes only asbuilt/<ts>/)
source ~/admin-openrc bash scripts/cloud-assert.sh --capture
GATE: CLOUD-ASSERT: PASS; this capture is the post-change BOM.
CHECK (read-only) -- jumphost -- version coherence + BOM diff
juju show-controller --format=json | jq -r 'to_entries[]
| .value.details."agent-version"'
diff <(sort asbuilt/<pre-ts>/bundle-exported.yaml) \
<(sort asbuilt/<post-ts>/bundle-exported.yaml) | grep -E '^[<>]' | sort
Expect: controller == agents == target (1.2 query re-run); every worklist app at its recorded target revision; the bundle diff shows ONLY the expected charm revision lines. ANY config/channel/placement delta = stop and explain before closing the window.
Behavioral spot set (beyond cloud-assert): openstack token issue, openstack server list --all-projects, openstack loadbalancer list, openstack coe cluster list, Horizon login.
runbooks/appendix-B-asbuilt-version-lock.md B.1: update as-built revisions to the measured post-state; bump the header date/source line. (This is exactly the appendix-B "refresh the table on a successful validated state" event.)asbuilt/<ts>/ BOM.docs/v1-redeploy-changelog.md: as-executed addendum -- what moved (controller x.y.z -> x.y.z', per-app rev table), why, and the revert table (per-app --revision + --channel re-track pairs; controller = none in-band, D-070 posture).logs/as-executed-index.md row; update docs/session-ledger.md.| Layer | In-band revert | Posture if none |
|---|---|---|
| Controller upgrade | NONE (no downgrade) | Patch-jump-only + healthy pre-state + D-070 rebuild-from-runbooks |
| Model agents | NONE (no downgrade) | Same as controller |
| Charm refresh (per app) | juju refresh <app> --revision <old> then later juju refresh <app> --channel <pinned> |
Appendix-B B.1 holds the last validated revisions |
| Docs / BOM re-baseline | git revert of the re-baseline commit |
-- |
| Symptom | Fix |
|---|---|
Agent stuck upgrading past budget |
STOP; appendix-A; do not stack further mutations |
Unit error mid-refresh |
Understand the hook error FIRST; juju resolved --no-retry <unit> only with cause known; else revert the app (4.0.6) |
can-upgrade-to names a different charm |
Exclude app + capture JSON + log finding (0b table) |
| Controller unreachable past 2.2 budget | Escalate; NEVER re-bootstrap inside the window |
| Horizon login cookie failure after dashboard refresh | Restore _99_internal_http_cookies.py (D-044; appendix-A) |
| Vault sealed mid-window | Incident, not expected here -- appendix-A / restart-procedure Stage 3 |