openstack-caracal-ipv4 / runbooks / ops-update-procedure.md

Ops -- Routine Update Procedure (Juju controller, model agents, in-channel charm refreshes)

STATUS: authored 2026-07-04 per DOCFIX-086; AS-EXECUTED 2026-07-05 (ops-update-20260705: controller+agents 3.6.24 -> 3.6.25, 17 in-channel charm refreshes; changelog addenda 13-16). [REVALIDATE] markers cleared and live divergences folded back per DOCFIX-088. Policy companion: D-071 (PROPOSED -- update cadence + controller patch policy; amended 2026-07-05: supported controller backup EXISTS on juju 3.6).

Scope: a PLANNED maintenance window applying three update layers, in order: (1) Juju controller patch upgrade (single, non-HA controller), (2) model agent upgrade to match, (3) charm refreshes to newer revisions WITHIN their pinned channels (the appendix-B-sanctioned update type). This is explicitly NOT an OpenStack series/track upgrade -- no charm changes channel here, ever. For full-cloud power maintenance use runbooks/ops-restart-procedure.md; for incident response use runbooks/appendix-A-troubleshooting.md.

Conventions: RUN/CHECK/GATE labels per runbooks/README.md. One gated mutation at a time; read-only verification precedes every mutation. Invoke scripts as bash scripts/<name>.sh (no exec bits in the repo). Run the whole window inside bash scripts/run-logged.sh ops-update-<date> and add the index row to logs/as-executed-index.md.


0. Exclusions (read before anything else)

  • VAULT IS OUT OF SCOPE. Live vault stays on 1.8/stable. Do not refresh vault or vault-mysql-router in this procedure.

CAUTION: bundle.yaml pins vault 1.16/stable (D-068 / BUNDLEFIX-007) while live runs 1.8/stable. A naive "sync live to the bundle" or a blanket refresh sweep would attempt a Vault upgrade spanning multiple minor versions (1.8 -> 1.16) -- exactly what D-068 (PROPOSED) says is NOT a casual juju refresh (it needs unseal keys in hand, a storage-format compatibility check, and a rehearsal first). Until D-068 is ruled and rehearsed, vault is untouchable here.

  • No channel changes. D-002 pins channels; this procedure only moves revisions WITHIN a pinned channel. A desired channel change is a D-NNN/BUNDLEFIX proposal, not an update-window action.
  • Out of scope: MAAS, host OS packages, jumphost snaps (the juju client snap updates itself on 3/stable), and the CAPI/driver layer (appendix-B B.2/B.3; governed by D-034/D-042 -- their update is a separate procedure).
  • Apps with no update available at authoring (ceph-*, ovn-*, mysql-*, rabbitmq-server, hacluster, memcached): re-measured at pre-flight. If one NEWLY shows an update, it is LOGGED for the next window, not refreshed in this one (hard rule 1 -- no added scope mid-window).

0b. Expectations table (read FIRST; saves false alarms)

| Observation | Meaning |
|---|---|
| juju status / API errors for ~1-5 min right after upgrade-controller | EXPECTED. Single non-HA controller; jujud restarts. Data plane and workloads unaffected; model MANAGEMENT is blind. Wait, do not react. |
| Units cycling maintenance/executing for minutes after a refresh | Expected settle arc (upgrade-charm hooks). Judge by the settle gate, not by transient states. |
| can-upgrade-to values differ from this runbook's planning table | EXPECTED. Channels float (appendix-B policy). The live measurement is the worklist; any table here is planning reference only. |
| An app's can-upgrade-to names a DIFFERENT charm than the app runs | Anomaly (seen once for magnum on 2026-07-04: ch:amd64/magnum-dashboard-122). EXCLUDE the app, capture the raw JSON, log the finding. Never refresh across a name mismatch. |
| Vault flips to sealed during this window | NOT expected -- nothing here restarts vault. That is an incident: stop, appendix-A. |
| An EXCLUDED app's can-upgrade-to target changes mid-window | Observed 2026-07-05 (magnum's target moved during the window -- Charmhub republish in flight). Confirms the rule: targets are re-measured per app at refresh time; excluded apps stay excluded until the next window's pre-flight. |

1. Pre-flight and baseline

1.1 Session bootstrap

RUN -- jumphost

git -C ~/openstack-caracal-ipv4 pull
bash scripts/repo-lint.sh
bash scripts/run-logged.sh ops-update-$(date -u +%Y%m%d)

Expect: lint 0 fail (1 legacy WARN documented); logged subshell open. Add the session row to logs/as-executed-index.md.

1.2 Measure versions -- never assume

CHECK (read-only) -- jumphost

juju version
juju show-controller --format=json | jq -r 'to_entries[]
  | "\(.key)  agent-version=\(.value.details."agent-version")"'
juju status -m openstack --format=json | jq -r '
  [.. | objects | select(has("juju-status")) | ."juju-status".version
   | select(. != null)] | .[]' | sort | uniq -c

Expect: client at the target patch version; controller and ALL machine, container, and unit agents at ONE uniform current version (agents carry version under juju-status, NOT agent-status -- verified 2026-07-05; this cloud = 91 agents). Record both values. Any skew among agents = STOP and investigate before adding an upgrade on top.
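The "ONE uniform version" expectation can be checked mechanically instead of by eye. The following is a minimal sketch, not a repo script: assert_uniform_version is a hypothetical helper name, and it assumes the version list from the query above is piped in one value per line (that is, before the sort | uniq -c stage).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of scripts/): succeed only if every version
# read from stdin is identical AND equals the expected value passed as $1.
assert_uniform_version() {
  local expected="$1" distinct
  # Collapse the input to its distinct, non-empty values.
  distinct="$(grep -v '^$' | sort -u)"
  if [ "$distinct" = "$expected" ]; then
    echo "UNIFORM: $expected"
  else
    # Any second value (or no value at all) is skew: stop and investigate.
    echo "SKEW: expected $expected, found: $(printf '%s' "$distinct" | tr '\n' ' ')" >&2
    return 1
  fi
}
```

Usage would be to append `| assert_uniform_version <recorded-version>` to the jq stage of the status query; a non-zero exit is the STOP condition described above.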

1.3 Measure the refresh worklist (with the charm-name gate)

CHECK (read-only) -- jumphost

juju status -m openstack --format=json | jq -r '
  .applications | to_entries[]
  | select((.value."can-upgrade-to" // "") != "")
  | .key as $app | (.value."charm-name") as $name
  | (.value."can-upgrade-to" | sub("^ch:[^/]*/"; "")) as $t
  | [$app, $name, (.value."charm-rev"|tostring),
     ($t | sub("-[0-9]+$"; "")), ($t | capture("-(?<r>[0-9]+)$").r),
     (if ($t | sub("-[0-9]+$"; "")) == $name then "OK" else "NAME-MISMATCH" end)]
  | @tsv' | column -t

GATE: every row OK. Any NAME-MISMATCH row: capture the app's raw juju status <app> --format=json, EXCLUDE it from this window's worklist, and log the finding (appendix-A/DOCFIX material).

Record the surviving rows as the window's worklist AND revert table: (app, current-rev, target-rev). Cross-check current revs against appendix-B B.1; any pre-existing divergence is logged (it means the last re-baseline was missed), not corrected here.
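The name gate depends on splitting a can-upgrade-to value into a charm name and a revision. As an illustration of what the jq sub/capture steps do, here is a sketch of the same split in plain bash. split_target and name_gate are hypothetical helper names, and the input shape ch:<arch>/<name>-<rev> is assumed from the single example recorded in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helpers mirroring the jq logic of the 1.3 query.

# Split "ch:<arch>/<name>-<rev>" into "<name><TAB><rev>".
split_target() {
  local t="${1#ch:*/}"   # drop the shortest leading "ch:<arch>/" prefix
  local rev="${t##*-}"   # text after the LAST hyphen = revision
  local name="${t%-*}"   # text before the LAST hyphen = charm name
  printf '%s\t%s\n' "$name" "$rev"
}

# $1 = charm-name the app runs, $2 = its can-upgrade-to value.
# Prints OK only when the revision is numeric and the names match exactly.
name_gate() {
  local name rev
  IFS=$'\t' read -r name rev < <(split_target "$2")
  if [[ "$rev" =~ ^[0-9]+$ && "$name" == "$1" ]]; then
    echo OK
  else
    echo NAME-MISMATCH
  fi
}
```

With the recorded anomaly, `name_gate magnum ch:amd64/magnum-dashboard-122` prints NAME-MISMATCH, because the last-hyphen split yields the name magnum-dashboard rather than magnum.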

1.4 Health gate + pre-change BOM

RUN -- jumphost (writes only asbuilt/<ts>/)

source ~/admin-openrc
bash scripts/cloud-assert.sh --capture

GATE: CLOUD-ASSERT: PASS. WARN/HOLD is a no-go: do not update an unhealthy or unverified cloud. Commit the asbuilt/<ts>/ BOM as the pre-change baseline before the first mutation.

1.5 Quiesce check

CHECK (read-only) -- jumphost

juju status -m openstack --format=json | jq -r '
  .. | objects | select(has("juju-status"))
  | select(."juju-status".current as $c | ["executing","error","failed"] | index($c))
  | ."juju-status".current' | sort | uniq -c
openstack coe cluster list -f value -c name -c status </dev/null
openstack loadbalancer list -f value -c name -c provisioning_status </dev/null

Expect: no units executing/error; no magnum cluster in *_IN_PROGRESS; no LB in PENDING_*. In-flight tenant operations and an update window do not mix.
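The two openstack listings can be screened for in-flight operations with a small filter. This is a sketch under stated assumptions: assert_quiesced is a hypothetical name, and it assumes the -f value output shape used above, where the status is the last whitespace-separated field of each row.

```bash
#!/usr/bin/env bash
# Hypothetical filter: read "<name> <status>" rows on stdin, print every row
# whose status is transitional, and return non-zero if any such row exists.
assert_quiesced() {
  awk '
    $NF ~ /_IN_PROGRESS$/ || $NF ~ /^PENDING_/ { print; found = 1 }
    END { exit found ? 1 : 0 }
  '
}
```

For example, `openstack coe cluster list -f value -c name -c status </dev/null | assert_quiesced` prints nothing and exits 0 on a quiet cloud; any printed row is a reason to postpone the window.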

1.6 Command-surface verification (anti-fabrication gate)

CHECK (read-only) -- jumphost

juju help upgrade-controller
juju help upgrade-model
juju help commands | grep -i backup || echo "no backup commands on this client"

GATE: the exact flag names for the two upgrade commands are read from THIS output before composing any mutation below. Nothing in this runbook's candidate invocations overrides what the live client says.

2. Controller stage

2.1 Controller-state backup (supported; STANDARD step per D-071 amendment)

CORRECTION (2026-07-05 as-executed, DOCFIX-088): juju create-backup EXISTS on juju 3.6 (the "removed in 3.0" authoring assumption was wrong for this series) and was exercised in the first window (902MB archive, ~35s). Taking the backup is the STANDARD pre-upgrade step; it converts the D-071 no-revert risk into a real controller-state restore artifact. NOTE: the controller model is admin/controller on this cloud, not <user>/controller.

CHECK (read-only) -- jumphost -- reference capture + flag verification

juju controllers --format=json > ~/openstack-baseline/controller-pre-$(date -u +%Y%m%d).json
juju help create-backup

RUN -- jumphost -- controller state backup (gated; jumphost-local archive)

juju create-backup -m admin/controller \
  --filename ~/openstack-baseline/juju-controller-backup-pre-<target>-$(date -u +%Y%m%d).tar.gz \
  "pre ops-update-<date> controller <old> to <target>"

GATE: archive downloaded, checksum printed, size sane (this cloud: ~900MB). The archive stays on the jumphost (secret-adjacent; never committed). The D-070 rebuild posture remains the documented restore path of last resort.
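The "size sane, checksum printed" part of this gate can be scripted. A minimal sketch follows; check_archive is a hypothetical helper, and the minimum size is an operator-chosen floor passed as an argument, not a value defined anywhere in this repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify a backup archive exists, is at least
# $2 bytes long, and print its SHA-256 for the window log.
check_archive() {
  local file="$1" min_bytes="$2" size
  [ -f "$file" ] || { echo "MISSING: $file" >&2; return 1; }
  size="$(wc -c < "$file")"
  size="${size//[[:space:]]/}"   # some wc builds pad the count with spaces
  if [ "$size" -lt "$min_bytes" ]; then
    echo "TOO-SMALL: $size bytes (< $min_bytes)" >&2
    return 1
  fi
  sha256sum "$file"              # record this line in the session log
}
```

Given the ~900MB archive observed on this cloud, a floor of a few hundred megabytes (for example `$((500 * 1024 * 1024))`, an illustrative figure) would catch a truncated download without being brittle.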

2.2 Upgrade the controller

CAUTION: there is NO in-band downgrade of a Juju controller. The compensating controls are: patch-level jump only (D-071), healthy pre-state proven at 1.4, and the D-070 rebuild posture. If the target is more than a patch jump, STOP -- that is not this runbook.

RUN -- jumphost (flags per 1.6; form verified as-executed 2026-07-05)

juju upgrade-controller --agent-version <target>

Expect: command accepted, then the 0b blind window (~1-5 min) while jujud restarts. Do not run other juju commands until it clears.

GATE: controller at target and the model reachable again:

juju show-controller --format=json | jq -r 'to_entries[]
  | "\(.key)  agent-version=\(.value.details."agent-version")"'
juju status -m openstack --format=json | jq -r '.machines | to_entries[]
  | "\(.key)  \(.value."juju-status".current)"' | grep -v started \
  || echo "all machines started"

Poll up to ~10 min. Controller at target + all machine agents started. Beyond budget: STOP, appendix-A; never re-bootstrap inside the window.
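Polling against a fixed budget is easy to get wrong by hand, so a bounded loop helps. The sketch below is illustrative only: poll_until is a hypothetical helper, and the condition it polls is whatever command the operator supplies.

```bash
#!/usr/bin/env bash
# Hypothetical helper: re-run a command until it succeeds or a time budget
# is spent. $1 = budget (seconds), $2 = interval (seconds), rest = command.
poll_until() {
  local budget="$1" interval="$2" start=$SECONDS
  shift 2
  while true; do
    if "$@"; then return 0; fi          # condition met: gate passes
    if (( SECONDS - start >= budget )); then
      echo "BUDGET-EXCEEDED after ${budget}s: $*" >&2
      return 1                           # beyond budget: STOP, appendix-A
    fi
    sleep "$interval"
  done
}
```

A call such as `poll_until 600 15 controller_at_target` would cover the ~10 min budget, where controller_at_target is an assumed operator-written function wrapping the two gate queries above. A non-zero return is the STOP signal, never a cue to re-bootstrap.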

2.3 Post-controller spot check

CHECK (read-only) -- jumphost

bash scripts/cloud-assert.sh

GATE: PASS (A5-A7 need source ~/admin-openrc in scope). Do not start the agent stage on a controller that cannot pass the behavioral sweep.

3. Model agent stage

RUN -- jumphost (form verified as-executed 2026-07-05)

juju upgrade-model -m openstack --agent-version <target>

Expect: agents upgrade in a rolling fashion; workloads are NOT restarted. The controller model needs NO separate upgrade-model -- juju upgrade-controller already brought machine 0's agent to target (verified live; check with juju status -m admin/controller). Explicit --agent-version is used for determinism (the default would pick the controller's version).

GATE: every machine and unit agent at target, settled:

juju status -m openstack --format=json | jq -r '
  [.. | objects | select(has("juju-status")) | ."juju-status".version
   | select(. != null)] | .[]' | sort | uniq -c

Expect: ONE version, the target, on every line; no unit stuck in upgrading. An agent stuck beyond ~15 min = STOP, appendix-A.

4. Charm refresh stage

4.0 Rules for EVERY app in this stage

  1. Re-verify THAT app immediately before refreshing it (the 1.3 worklist is stale the moment the first refresh lands), including the name gate:
    CHECK (read-only) -- jumphost
    APP=<app>
    juju status "$APP" --format=json | jq -r --arg a "$APP" '
      .applications[$a] | [$a, ."charm-name", (."charm-rev"|tostring),
      (."can-upgrade-to" // "NONE")] | @tsv'
    NONE = already current, skip forward. Name mismatch = exclude + log.
  2. The current revision just read IS the revert value -- record it.
  3. ONE app per approval:
    RUN -- jumphost
    juju refresh <app>
  4. Per-app settle gate before the next app:
    CHECK (read-only) -- jumphost
    juju status "$APP" --format=json | jq -r '.applications[] | .units // {}
      | .. | objects | select(has("workload-status"))
      | "\(."workload-status".current)/\(."juju-status".current // "?")"' \
      | sort | uniq -c
    juju status -m openstack --format=json | jq -r '
      .. | objects | select(has("workload-status"))
      | select(."workload-status".current == "error") | ."workload-status".message' \
      | sed 's/^/ERROR: /' ; true
    GATE: all units of the app AND its subordinates active/idle, the new revision visible, and NO unit anywhere in error. Budget ~15 min. (bash scripts/deploy-watch.sh in a side window is the signal view, not the gate.)
  5. Run the group's behavioral probe (below); full bash scripts/cloud-assert.sh at GROUP boundaries only.
  6. Revert for any app: juju refresh <app> --revision <recorded-rev>.

CAUTION: an explicit --revision refresh PINS the app (it stops tracking the channel). Any revert row therefore carries a follow-up juju refresh <app> --channel <pinned-channel> to resume tracking once the cause is understood -- record both in the revert table.
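Because every revert is a pair (pin now, re-track later), it helps to generate both lines together when filling in the revert table. The sketch below only PRINTS the commands for the record and executes nothing; revert_commands is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the revert pair for one revert-table row.
# $1 = app, $2 = recorded (pre-refresh) revision, $3 = pinned channel.
# Nothing is executed; the output is meant to be pasted into the table.
revert_commands() {
  local app="$1" rev="$2" channel="$3"
  # Step 1: roll back to the recorded revision (this PINS the app).
  printf 'juju refresh %s --revision %s\n' "$app" "$rev"
  # Step 2 (later, once the cause is understood): resume channel tracking.
  printf 'juju refresh %s --channel %s\n' "$app" "$channel"
}
```

Recording both lines at refresh time means the re-track step cannot be forgotten after a revert; the channel argument should be read from the D-002 pin for that app, not guessed.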

4.1 Group 0 -- keystone (alone, first)

Identity underpins every other service; charm-guide practice is keystone first. Refresh keystone per 4.0. Probe: openstack token issue </dev/null succeeds. GATE: full bash scripts/cloud-assert.sh PASS before Group 1.

4.2 Group 1 -- control-plane API services (one at a time)

Order (subordinate immediately after its principal): placement -> nova-cloud-controller -> neutron-api -> neutron-api-plugin-ovn -> glance -> glance-simplestreams-sync -> octavia-diskimage-retrofit -> cinder -> cinder-ceph -> barbican -> barbican-vault.

Probes after the relevant principal settles:

openstack compute service list </dev/null      # placement / n-c-c
openstack network agent list </dev/null        # neutron-api (+ plugin)
openstack image list </dev/null                # glance
openstack volume service list </dev/null       # cinder (+ cinder-ceph)
openstack secret list </dev/null               # barbican (+ barbican-vault)

GATE: full bash scripts/cloud-assert.sh PASS at group end.
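The Group 1 order only applies to apps that are actually on the window's worklist. As a sketch, the intersection can be computed so that the refresh queue is never typed by hand. group1_queue and GROUP1_ORDER are hypothetical names; the input is assumed to be worklist rows on stdin with the app name as the first field.

```bash
#!/usr/bin/env bash
# Group 1 order as stated in this section (principal before its subordinate).
GROUP1_ORDER=(placement nova-cloud-controller neutron-api neutron-api-plugin-ovn
  glance glance-simplestreams-sync octavia-diskimage-retrofit
  cinder cinder-ceph barbican barbican-vault)

# Hypothetical helper: read worklist rows on stdin (app name = first field)
# and print the Group 1 apps that are on the worklist, in refresh order.
group1_queue() {
  local -A in_worklist=()
  local app
  while read -r app _; do
    [ -n "$app" ] && in_worklist["$app"]=1
  done
  for app in "${GROUP1_ORDER[@]}"; do
    [ -n "${in_worklist[$app]:-}" ] && echo "$app"
  done
  return 0
}
```

Apps outside Group 1 (keystone, octavia, dashboards, nova-compute) are ignored by this helper on purpose; each printed app still goes through the per-app rules in 4.0, one approval at a time.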

4.3 Group 2 -- octavia (alone)

Owns the amphora control plane; refresh alone, watch A6 specifically. Probe: openstack loadbalancer list </dev/null -- every LB ACTIVE/ONLINE (compare against the 1.4 baseline inventory). GATE: full bash scripts/cloud-assert.sh PASS.

4.4 Group 3 -- dashboards (lowest risk, user-facing)

openstack-dashboard -> magnum-dashboard -> octavia-dashboard (the latter two are subordinates riding openstack-dashboard -- verify placement live in status before ordering). Probe (corrected 2026-07-05, DOCFIX-088): Horizon over the dashboard VIP on the PLAIN-HTTP leg -- curl http://<dashboard-vip>/horizon/auth/login/ expects 200 -- plus the D-044 override file present at /usr/share/openstack-dashboard/openstack_dashboard/local/local_settings.d/ (canonical path; the /etc path is a decoy) and a browser login working. Do NOT gate on https://: dashboard VIP TLS has been dead SINCE DEPLOY on this build (haproxy 443 backend targets a vhost-less internal address, masked by its L4-only check; addendum 15 RCA) -- an https probe failure here is the pre-existing defect, not a refresh regression. GATE: probe green; cloud-assert not required mid-group here, PASS at group end.

4.5 Group 4 -- nova-compute (LAST)

Data-plane adjacent (all hypervisor hosts). A charm refresh does NOT restart guests, but this runs last, with everything else proven green. Refresh nova-compute per 4.0. Probe:

openstack hypervisor list </dev/null
openstack compute service list </dev/null
openstack server list --all-projects -c Name -c Status </dev/null

Expect: all hypervisors up, compute services up, guests unchanged (compare the 1.4 baseline server list). GATE: full bash scripts/cloud-assert.sh PASS.
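The "guests unchanged" comparison can be done as an order-insensitive diff of two saved listings. This is a sketch under assumptions: guests_unchanged is a hypothetical helper, and both files are assumed to hold the output of the same server-list command, one saved at baseline time and one saved now (the file locations are the operator's choice).

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare two saved server listings regardless of row
# order. $1 = baseline listing file, $2 = current listing file.
guests_unchanged() {
  local delta
  if delta="$(diff <(sort "$1") <(sort "$2"))"; then
    echo "UNCHANGED: $(grep -c . "$1") rows"
  else
    # Any added, removed, or status-changed guest shows up here.
    echo "CHANGED:" >&2
    echo "$delta" >&2
    return 1
  fi
}
```

Sorting both sides first avoids false alarms from list ordering; a real delta (a guest gone, or its status changed) still fails the check.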

4.6 Skip-list re-check

CHECK (read-only) -- jumphost -- re-run the 1.3 worklist query. Expect: empty, or only documented exclusions (vault, plus any app excluded by the name gate earlier in this window). Anything new is LOGGED for the next window, not refreshed now.

5. Post-verification

RUN -- jumphost (writes only asbuilt/<ts>/)

source ~/admin-openrc
bash scripts/cloud-assert.sh --capture

GATE: CLOUD-ASSERT: PASS; this capture is the post-change BOM.

CHECK (read-only) -- jumphost -- version coherence + BOM diff

juju show-controller --format=json | jq -r 'to_entries[]
  | .value.details."agent-version"'
diff <(sort asbuilt/<pre-ts>/bundle-exported.yaml) \
     <(sort asbuilt/<post-ts>/bundle-exported.yaml) | grep -E '^[<>]' | sort

Expect: controller == agents == target (1.2 query re-run); every worklist app at its recorded target revision; the bundle diff shows ONLY the expected charm revision lines. ANY config/channel/placement delta = stop and explain before closing the window.
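The "ONLY the expected charm revision lines" rule can be enforced with a filter on the diff output. The sketch below is an assumption-laden illustration: unexpected_deltas is a hypothetical helper, and it assumes the exported bundle records charm revisions as lines of the form `revision: <integer>`; check that against a real asbuilt capture before relying on it.

```bash
#!/usr/bin/env bash
# Hypothetical filter: read "< ..." / "> ..." diff lines on stdin, print
# every changed line that is NOT a bare "revision: <int>" line, and return
# non-zero if any such line exists.
unexpected_deltas() {
  awk '
    !/^[<>][ \t]+revision:[ \t]*[0-9]+[ \t]*$/ { print; bad = 1 }
    END { exit bad ? 1 : 0 }
  '
}
```

Appending `| unexpected_deltas` to the diff pipeline above leaves an empty output and exit 0 for a clean window; any printed line is a config, channel, or placement delta that must be explained before closing.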

Behavioral spot set (beyond cloud-assert): openstack token issue, openstack server list --all-projects, openstack loadbalancer list, openstack coe cluster list, Horizon login.

6. Re-baseline and documentation (window close)

  1. runbooks/appendix-B-asbuilt-version-lock.md B.1: update as-built revisions to the measured post-state; bump the header date/source line. (This is exactly the appendix-B "refresh the table on a successful validated state" event.)
  2. Commit the post-change asbuilt/<ts>/ BOM.
  3. docs/v1-redeploy-changelog.md: as-executed addendum -- what moved (controller x.y.z -> x.y.z', per-app rev table), why, and the revert table (per-app --revision + --channel re-track pairs; controller = none in-band, D-070 posture).
  4. Close the logs/as-executed-index.md row; update docs/session-ledger.md.
  5. First execution only: clear this runbook's [REVALIDATE] markers and set the as-executed date in the STATUS header.

7. Revert reference

| Layer | In-band revert | Posture if none |
|---|---|---|
| Controller upgrade | NONE (no downgrade) | Patch-jump-only + healthy pre-state + D-070 rebuild-from-runbooks |
| Model agents | NONE (no downgrade) | Same as controller |
| Charm refresh (per app) | juju refresh <app> --revision <old> then later juju refresh <app> --channel <pinned> | Appendix-B B.1 holds the last validated revisions |
| Docs / BOM re-baseline | git revert of the re-baseline commit | -- |

Quick reference (symptom -> fix)

| Symptom | Fix |
|---|---|
| Agent stuck upgrading past budget | STOP; appendix-A; do not stack further mutations |
| Unit error mid-refresh | Understand the hook error FIRST; juju resolved --no-retry <unit> only with cause known; else revert the app (4.0.6) |
| can-upgrade-to names a different charm | Exclude app + capture JSON + log finding (0b table) |
| Controller unreachable past 2.2 budget | Escalate; NEVER re-bootstrap inside the window |
| Horizon login cookie failure after dashboard refresh | Restore _99_internal_http_cookies.py (D-044; appendix-A) |
| Vault sealed mid-window | Incident, not expected here -- appendix-A / restart-procedure Stage 3 |

Open questions (carried for Roosevelt)

  • HA controllers change the 2.2 blind-window math (rolling controller upgrade, no full API blackout) -- revalidate gates on bare metal.
  • Controller backup story on bare metal: evaluate supported juju-db dump tooling as part of the Roosevelt controller design, not improvised here.
  • Cadence policy and window sizing: D-071 (PROPOSED) -- rule before Roosevelt operations begin.