# Ops -- Routine Update Procedure (Juju controller, model agents, in-channel charm refreshes)

STATUS: authored 2026-07-04 per DOCFIX-086; NOT yet as-executed. Policy
companion: D-071 (PROPOSED -- update cadence + controller patch policy).
Steps whose exact mechanics could not be verified against the live client at
authoring time carry [REVALIDATE] markers -- clear them on the first executed
update window and record the as-executed date here.

Scope: a PLANNED maintenance window applying three update layers, in order:
(1) Juju controller patch upgrade (single, non-HA controller), (2) model
agent upgrade to match, (3) charm refreshes to newer revisions WITHIN their
pinned channels (the appendix-B-sanctioned update type). This is explicitly
NOT an OpenStack series/track upgrade -- no charm changes channel here, ever.
For full-cloud power maintenance use `runbooks/ops-restart-procedure.md`; for
incident response use `runbooks/appendix-A-troubleshooting.md`.

Conventions: RUN/CHECK/GATE labels per runbooks/README.md. One gated mutation
at a time; read-only verification precedes every mutation. Invoke scripts as
`bash scripts/<name>.sh` (no exec bits in the repo). Run the whole window
inside `bash scripts/run-logged.sh ops-update-<date>` and add the index row
to `logs/as-executed-index.md`.

---

## 0. Exclusions (read before anything else)

- **VAULT IS OUT OF SCOPE.** Live vault stays on `1.8/stable`. Do not
  refresh `vault` or `vault-mysql-router` in this procedure.

> CAUTION: `bundle.yaml` pins vault `1.16/stable` (D-068 / BUNDLEFIX-007)
> while live runs `1.8/stable`. A naive "sync live to the bundle" or a
> blanket refresh sweep would attempt a multi-minor major Vault upgrade --
> exactly what D-068 (PROPOSED) says is NOT a casual `juju refresh` (unseal
> keys in hand, storage-format compatibility, rehearsal first). Until D-068
> is ruled and rehearsed, vault is untouchable here.

- **No channel changes.** D-002 pins channels; this procedure only moves
  revisions WITHIN a pinned channel. A desired channel change is a
  D-NNN/BUNDLEFIX proposal, not an update-window action.
- Out of scope: MAAS, host OS packages, jumphost snaps (the juju client snap
  updates itself on `3/stable`), and the CAPI/driver layer (appendix-B
  B.2/B.3; governed by D-034/D-042 -- their update is a separate procedure).
- Apps with no update available at authoring (ceph-*, ovn-*, mysql-*,
  rabbitmq-server, hacluster, memcached): re-measured at pre-flight. If one
  NEWLY shows an update, it is LOGGED for the next window, not refreshed in
  this one (hard rule 1 -- no added scope mid-window).

## 0b. Expectations table (read FIRST; saves false alarms)

| Observation | Meaning |
|---|---|
| `juju status` / API errors for ~1-5 min right after `upgrade-controller` | EXPECTED. Single non-HA controller; jujud restarts. Data plane and workloads unaffected; model MANAGEMENT is blind. Wait, do not react. |
| Units cycling `maintenance`/`executing` for minutes after a refresh | Expected settle arc (upgrade-charm hooks). Judge by the settle gate, not by transient states. |
| `can-upgrade-to` values differ from this runbook's planning table | EXPECTED. Channels float (appendix-B policy). The live measurement is the worklist; any table here is planning reference only. |
| An app's `can-upgrade-to` names a DIFFERENT charm than the app runs | Anomaly (seen once for magnum on 2026-07-04: `ch:amd64/magnum-dashboard-122`). EXCLUDE the app, capture the raw JSON, log the finding. Never refresh across a name mismatch. |
| Vault flips to sealed during this window | NOT expected -- nothing here restarts vault. That is an incident: stop, appendix-A. |

## 1. Pre-flight and baseline

### 1.1 Session bootstrap

**RUN -- jumphost**
```bash
git -C ~/openstack-caracal-ipv4 pull
bash scripts/repo-lint.sh
bash scripts/run-logged.sh ops-update-$(date -u +%Y%m%d)
```
**Expect:** lint 0 fail (1 legacy WARN documented); logged subshell open.
Add the session row to `logs/as-executed-index.md`.

### 1.2 Measure versions -- never assume

**CHECK (read-only) -- jumphost**
```bash
juju version
juju show-controller --format=json | jq -r 'to_entries[]
  | "\(.key)  agent-version=\(.value.details."agent-version")"'
juju status -m openstack --format=json | jq -r '
  [(.machines | to_entries[] | .value."juju-status".version),
   (.. | objects | select(has("agent-status")) | ."agent-status".version)]
  | .[] | select(. != null)' | sort | uniq -c
```
**Expect:** client at the target patch version; controller and ALL machine +
unit agents at ONE uniform current version. Record both values. Any skew
among agents = STOP and investigate before adding an upgrade on top.
[REVALIDATE: unit agent version field path on current juju]

### 1.3 Measure the refresh worklist (with the charm-name gate)

**CHECK (read-only) -- jumphost**
```bash
juju status -m openstack --format=json | jq -r '
  .applications | to_entries[]
  | select((.value."can-upgrade-to" // "") != "")
  | .key as $app | (.value."charm-name") as $name
  | (.value."can-upgrade-to" | sub("^ch:[^/]*/"; "")) as $t
  | [$app, $name, (.value."charm-rev"|tostring),
     ($t | sub("-[0-9]+$"; "")), ($t | capture("-(?<r>[0-9]+)$").r),
     (if ($t | sub("-[0-9]+$"; "")) == $name then "OK" else "NAME-MISMATCH" end)]
  | @tsv' | column -t
```
**GATE:** every row `OK`. Any `NAME-MISMATCH` row: capture the app's raw
`juju status <app> --format=json`, EXCLUDE it from this window's worklist,
and log the finding (appendix-A/DOCFIX material).

Record the surviving rows as the window's worklist AND revert table:
(app, current-rev, target-rev). Cross-check current revs against appendix-B
B.1; any pre-existing divergence is logged (it means the last re-baseline
was missed), not corrected here.

### 1.4 Health gate + pre-change BOM

**RUN -- jumphost** (writes only `asbuilt/<ts>/`)
```bash
source ~/admin-openrc
bash scripts/cloud-assert.sh --capture
```
**GATE:** `CLOUD-ASSERT: PASS`. WARN/HOLD is a no-go: do not update an
unhealthy or unverified cloud. Commit the `asbuilt/<ts>/` BOM as the
pre-change baseline before the first mutation.

### 1.5 Quiesce check

**CHECK (read-only) -- jumphost**
```bash
juju status -m openstack --format=json | jq -r '
  .. | objects | select(has("agent-status"))
  | select(."agent-status".current as $c | ["executing","error","failed"] | index($c))
  | ."agent-status".current' | sort | uniq -c
openstack coe cluster list -f value -c name -c status </dev/null
openstack loadbalancer list -f value -c name -c provisioning_status </dev/null
```
**Expect:** no units executing/error; no magnum cluster in `*_IN_PROGRESS`;
no LB in `PENDING_*`. In-flight tenant operations and an update window do
not mix.

### 1.6 Command-surface verification (anti-fabrication gate)

**CHECK (read-only) -- jumphost**
```bash
juju help upgrade-controller
juju help upgrade-model
juju help commands | grep -i backup || echo "no backup commands on this client"
```
**GATE:** the exact flag names for the two upgrade commands are read from
THIS output before composing any mutation below. Nothing in this runbook's
candidate invocations overrides what the live client says.

## 2. Controller stage

### 2.1 Controller-state posture (reference captures; not a restore path)

`juju create-backup` was removed in Juju 3.0. If 1.6 confirmed no backup
command exists on this client (expected for 3.6), the accepted posture per
D-071 is: patch-level jumps only, proven-healthy pre-state (1.4), and the
D-070 restore path (re-bootstrap + rebuild-from-runbooks). The pre-change
BOM already holds the exported bundle and status captures.

**CHECK (read-only) -- jumphost** -- reference captures + optional tooling probe
```bash
juju controllers --format=json > ~/openstack-baseline/controller-pre-$(date -u +%Y%m%d).json
juju ssh -m controller 0 -- snap list </dev/null
```
**Expect:** the juju-db snap listed. IF the operator wants a database-level
capture and the juju-db snap ships a dump tool, that is an operator's-call
gated extra step -- verify the tool exists on the controller machine first;
do not improvise one. Absent tooling, proceed on the documented posture.
[REVALIDATE: juju-db snap contents on current controller]

### 2.2 Upgrade the controller

> CAUTION: there is NO in-band downgrade of a Juju controller. The
> compensating controls are: patch-level jump only (D-071), healthy
> pre-state proven at 1.4, and the D-070 rebuild posture. If the target is
> more than a patch jump, STOP -- that is not this runbook.

**RUN -- jumphost** (flags per 1.6; candidate form below) [REVALIDATE]
```bash
juju upgrade-controller --agent-version <target>
```
**Expect:** command accepted, then the 0b blind window (~1-5 min) while
jujud restarts. Do not run other juju commands until it clears.

**GATE:** controller at target and the model reachable again:
```bash
juju show-controller --format=json | jq -r 'to_entries[]
  | "\(.key)  agent-version=\(.value.details."agent-version")"'
juju status -m openstack --format=json | jq -r '.machines | to_entries[]
  | "\(.key)  \(.value."juju-status".current)"' | grep -v started \
  || echo "all machines started"
```
Poll up to ~10 min. Controller at target + all machine agents `started`.
Beyond budget: STOP, appendix-A; never re-bootstrap inside the window.

### 2.3 Post-controller spot check

**CHECK (read-only) -- jumphost**
```bash
bash scripts/cloud-assert.sh
```
**GATE:** PASS (A5-A7 need `source ~/admin-openrc` in scope). Do not start
the agent stage on a controller that cannot pass the behavioral sweep.

## 3. Model agent stage

**RUN -- jumphost** (flags per 1.6) [REVALIDATE]
```bash
juju upgrade-model -m controller
juju upgrade-model -m openstack
```
**Expect:** default target is the controller's version (verify in the 1.6
help output whether an explicit `--agent-version` is required). Agents
upgrade rolling; workloads are NOT restarted.

**GATE:** every machine and unit agent at target, settled:
```bash
juju status -m openstack --format=json | jq -r '
  [(.machines | to_entries[] | .value."juju-status".version),
   (.. | objects | select(has("agent-status")) | ."agent-status".version)]
  | .[] | select(. != null)' | sort | uniq -c
```
**Expect:** ONE version, the target, on every line; no unit stuck in
`upgrading`. An agent stuck beyond ~15 min = STOP, appendix-A.

## 4. Charm refresh stage

### 4.0 Rules for EVERY app in this stage

1. Re-verify THAT app immediately before refreshing it (the 1.3 worklist is
   stale the moment the first refresh lands), including the name gate:

   **CHECK (read-only) -- jumphost**
   ```bash
   APP=<app>
   juju status "$APP" --format=json | jq -r --arg a "$APP" '
     .applications[$a] | [$a, ."charm-name", (."charm-rev"|tostring),
     (."can-upgrade-to" // "NONE")] | @tsv'
   ```
   `NONE` = already current, skip forward. Name mismatch = exclude + log.
2. The current revision just read IS the revert value -- record it.
3. ONE app per approval:

   **RUN -- jumphost**
   ```bash
   juju refresh <app>
   ```
4. Per-app settle gate before the next app:

   **CHECK (read-only) -- jumphost**
   ```bash
   juju status "$APP" --format=json | jq -r '.applications[] | .units // {}
     | .. | objects | select(has("workload-status"))
     | "\(."workload-status".current)/\(."agent-status".current // "?")"' \
     | sort | uniq -c
   juju status -m openstack --format=json | jq -r '
     .. | objects | select(has("workload-status"))
     | select(."workload-status".current == "error") | ."workload-status".message' \
     | sed 's/^/ERROR: /' ; true
   ```
   **GATE:** all units of the app AND its subordinates `active/idle`, the
   new revision visible, and NO unit anywhere in error. Budget ~15 min.
   (`bash scripts/deploy-watch.sh` in a side window is the signal view,
   not the gate.)
5. Run the group's behavioral probe (below); full
   `bash scripts/cloud-assert.sh` at GROUP boundaries only.
6. Revert for any app: `juju refresh <app> --revision <recorded-rev>`.

> CAUTION: an explicit `--revision` refresh PINS the app (it stops tracking
> the channel). Any revert row therefore carries a follow-up
> `juju refresh <app> --channel <pinned-channel>` to resume tracking once
> the cause is understood -- record both in the revert table.

### 4.1 Group 0 -- keystone (alone, first)

Identity underpins every other service; charm-guide practice is keystone
first. Refresh `keystone` per 4.0.
**Probe:** `openstack token issue </dev/null` succeeds.
**GATE:** full `bash scripts/cloud-assert.sh` PASS before Group 1.

### 4.2 Group 1 -- control-plane API services (one at a time)

Order (subordinate immediately after its principal):
`placement` -> `nova-cloud-controller` -> `neutron-api` ->
`neutron-api-plugin-ovn` -> `glance` -> `glance-simplestreams-sync` ->
`octavia-diskimage-retrofit` -> `cinder` -> `cinder-ceph` -> `barbican` ->
`barbican-vault`.

Probes after the relevant principal settles:
```bash
openstack compute service list </dev/null      # placement / n-c-c
openstack network agent list </dev/null        # neutron-api (+ plugin)
openstack image list </dev/null                # glance
openstack volume service list </dev/null       # cinder (+ cinder-ceph)
openstack secret list </dev/null               # barbican (+ barbican-vault)
```
**GATE:** full `bash scripts/cloud-assert.sh` PASS at group end.

### 4.3 Group 2 -- octavia (alone)

Owns the amphora control plane; refresh alone, watch A6 specifically.
**Probe:** `openstack loadbalancer list </dev/null` -- every LB
`ACTIVE`/`ONLINE` (compare against the 1.4 baseline inventory).
**GATE:** full `bash scripts/cloud-assert.sh` PASS.

### 4.4 Group 3 -- dashboards (lowest risk, user-facing)

`openstack-dashboard` -> `magnum-dashboard` -> `octavia-dashboard`
(the latter two are subordinates riding openstack-dashboard -- verify
placement live in status before ordering).
**Probe:** Horizon over the dashboard VIP answers HTTP 200 and login works;
the D-044 secure-cookie override survives the refresh (appendix-A entry if
login cookies fail).
**GATE:** probe green; cloud-assert not required mid-group here, PASS at
group end.

### 4.5 Group 4 -- nova-compute (LAST)

Data-plane adjacent (all hypervisor hosts). A charm refresh does NOT
restart guests, but this runs last, with everything else proven green.
Refresh `nova-compute` per 4.0.
**Probe:**
```bash
openstack hypervisor list </dev/null
openstack compute service list </dev/null
openstack server list --all-projects -c Name -c Status </dev/null
```
**Expect:** all hypervisors up, compute services up, guests unchanged
(compare the 1.4 baseline server list).
**GATE:** full `bash scripts/cloud-assert.sh` PASS.

### 4.6 Skip-list re-check

**CHECK (read-only) -- jumphost** -- re-run the 1.3 worklist query.
**Expect:** empty (or only documented exclusions: vault). Anything new is
LOGGED for the next window, not refreshed now.

## 5. Post-verification

**RUN -- jumphost** (writes only `asbuilt/<ts>/`)
```bash
source ~/admin-openrc
bash scripts/cloud-assert.sh --capture
```
**GATE:** `CLOUD-ASSERT: PASS`; this capture is the post-change BOM.

**CHECK (read-only) -- jumphost** -- version coherence + BOM diff
```bash
juju show-controller --format=json | jq -r 'to_entries[]
  | .value.details."agent-version"'
diff <(sort asbuilt/<pre-ts>/bundle-exported.yaml) \
     <(sort asbuilt/<post-ts>/bundle-exported.yaml) | grep -E '^[<>]' | sort
```
**Expect:** controller == agents == target (1.2 query re-run); every
worklist app at its recorded target revision; the bundle diff shows ONLY
the expected charm revision lines. ANY config/channel/placement delta =
stop and explain before closing the window.

Behavioral spot set (beyond cloud-assert): `openstack token issue`,
`openstack server list --all-projects`, `openstack loadbalancer list`,
`openstack coe cluster list`, Horizon login.

## 6. Re-baseline and documentation (window close)

1. `runbooks/appendix-B-asbuilt-version-lock.md` B.1: update as-built
   revisions to the measured post-state; bump the header date/source line.
   (This is exactly the appendix-B "refresh the table on a successful
   validated state" event.)
2. Commit the post-change `asbuilt/<ts>/` BOM.
3. `docs/v1-redeploy-changelog.md`: as-executed addendum -- what moved
   (controller x.y.z -> x.y.z', per-app rev table), why, and the revert
   table (per-app `--revision` + `--channel` re-track pairs; controller =
   none in-band, D-070 posture).
4. Close the `logs/as-executed-index.md` row; update
   `docs/session-ledger.md`.
5. First execution only: clear this runbook's [REVALIDATE] markers and set
   the as-executed date in the STATUS header.

## 7. Revert reference

| Layer | In-band revert | Posture if none |
|---|---|---|
| Controller upgrade | NONE (no downgrade) | Patch-jump-only + healthy pre-state + D-070 rebuild-from-runbooks |
| Model agents | NONE (no downgrade) | Same as controller |
| Charm refresh (per app) | `juju refresh <app> --revision <old>` then later `juju refresh <app> --channel <pinned>` | Appendix-B B.1 holds the last validated revisions |
| Docs / BOM re-baseline | `git revert` of the re-baseline commit | -- |

---

## Quick reference (symptom -> fix)

| Symptom | Fix |
|---|---|
| Agent stuck `upgrading` past budget | STOP; appendix-A; do not stack further mutations |
| Unit `error` mid-refresh | Understand the hook error FIRST; `juju resolved --no-retry <unit>` only with cause known; else revert the app (4.0.6) |
| `can-upgrade-to` names a different charm | Exclude app + capture JSON + log finding (0b table) |
| Controller unreachable past 2.2 budget | Escalate; NEVER re-bootstrap inside the window |
| Horizon login cookie failure after dashboard refresh | Restore `_99_internal_http_cookies.py` (D-044; appendix-A) |
| Vault sealed mid-window | Incident, not expected here -- appendix-A / restart-procedure Stage 3 |

## Open questions (carried for Roosevelt)

- HA controllers change the 2.2 blind-window math (rolling controller
  upgrade, no full API blackout) -- revalidate gates on bare metal.
- Controller backup story on bare metal: evaluate supported juju-db dump
  tooling as part of the Roosevelt controller design, not improvised here.
- Cadence policy and window sizing: D-071 (PROPOSED) -- rule before
  Roosevelt operations begin.
