diff --git a/docs/design-decisions.md b/docs/design-decisions.md index 0b13264..590a375 100644 --- a/docs/design-decisions.md +++ b/docs/design-decisions.md @@ -1597,3 +1597,50 @@ **Related:** D-012 (superseded), D-017/D-018 (the actual restore path), DOCFIX-075 (cloud-assert --capture). + +## D-071: Routine update cadence and Juju controller patch policy + +**Status:** PROPOSED 2026-07-04 (filed by the jumphost controller-update workstream; +number relinquished by the main stream per the 2026-07-03 addendum-10 contention note). +Mechanism: `runbooks/ops-update-procedure.md` (DOCFIX-086). Operator to rule on the +policy points below; the runbook is usable per-window under individual gating meanwhile. + +**Problem:** the cloud accumulates available updates on two layers with no standing +policy for either: (a) the Juju controller/agents (patch releases on the 3.6 track; +live measured 3.6.24 with client already 3.6.25), and (b) charm revisions floating +within their D-002-pinned channels (~20 apps showed `can-upgrade-to` on 2026-07-04). +Appendix-B says revision float is EXPECTED and re-baselines after validated state, but +nothing says WHEN updates are applied, in what order, or what controller jumps are +acceptable on a single non-HA controller. + +**Proposed policy (rule each point):** +1. **Cadence trigger.** Updates are applied in deliberate maintenance windows (not on + discovery). Proposed trigger: a monthly review of `can-upgrade-to` + controller + patch delta, executed via ops-update-procedure when non-empty; security-driven + charm revisions may pull a window forward. +2. **Patch-only controller rule.** A routine window may move the controller by PATCH + level only within the pinned 3.6 track (e.g. 3.6.24 -> 3.6.25). Minor/major jumps + (3.6 -> 3.7+) are each their own D-NNN with rehearsal, never a routine-window action. +3. 
**Standing order.** Controller -> model agents -> charm refreshes, keystone first, + nova-compute last (the ops-update-procedure grouping). Never interleave layers. +4. **In-channel-only refreshes.** A routine window never changes a charm channel + (D-002); channel moves are per-decision (see D-068 for the live example). + +**Risk (explicit, for future reviewers): no-backup single-controller patch acceptance.** +The testcloud controller is SINGLE (non-HA) and Juju 3.6 ships no supported controller +backup (`create-backup` removed in 3.0). There is NO in-band downgrade of a controller. +Accepting routine controller patches therefore rests entirely on three compensating +controls: (a) patch-level-only jumps within 3.6/stable, (b) a proven-healthy pre-state +(`cloud-assert.sh --capture` PASS committed as the pre-change BOM), and (c) the D-070 +restore posture -- a failed controller means re-bootstrap + rebuild-from-runbooks, i.e. +hours of rebuild, not minutes of restore. This risk is accepted for the virtual +rehearsal phase ONLY; Roosevelt must revisit it with HA controllers and a real +controller-state backup story before production operations. + +**Roosevelt:** HA controllers change the upgrade blind-window and the risk calculus; +cadence and window sizing need an SLA-aware ruling; controller backup tooling is a +design item, not an ops improvisation. + +**Related:** D-002 (channel pins), D-068 (Vault channel move, the non-routine +counter-example), D-070 (restore posture), DOCFIX-086 (ops-update-procedure runbook), +appendix-B (revision re-baseline policy). 
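A minimal sketch of the patch-only rule in policy point 2 of D-071, assuming plain `X.Y.Z` version strings such as the 3.6.24 -> 3.6.25 example. The function name `is_patch_only_jump` is hypothetical and appears nowhere in the repo; this is illustration only, not part of the runbook's gated steps.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not in the repo): succeed only when TARGET
# is a strictly higher patch level on the same major.minor track as CURRENT.
is_patch_only_jump() {
  local current="$1" target="$2"
  local re='^([0-9]+)\.([0-9]+)\.([0-9]+)$'
  local cmaj cmin cpat
  # Unparseable input is never a "go": return 2 so callers can tell it apart.
  [[ "$current" =~ $re ]] || return 2
  cmaj="${BASH_REMATCH[1]}" cmin="${BASH_REMATCH[2]}" cpat="${BASH_REMATCH[3]}"
  [[ "$target" =~ $re ]] || return 2
  # A different major or minor is a track change, i.e. not a routine window.
  [[ "${BASH_REMATCH[1]}" == "$cmaj" && "${BASH_REMATCH[2]}" == "$cmin" ]] || return 1
  # Force base 10 so a patch number with a leading zero is not read as octal.
  (( 10#${BASH_REMATCH[3]} > 10#$cpat ))
}
```

Usage would look like `is_patch_only_jump 3.6.24 3.6.25 || echo "STOP: not a patch-only jump"`. Returning non-zero for equal or lower targets keeps a no-op or a downgrade attempt out of a routine window as well.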
diff --git a/docs/session-ledger.md b/docs/session-ledger.md index a07f283..4e8a4cd 100644 --- a/docs/session-ledger.md +++ b/docs/session-ledger.md @@ -20,16 +20,20 @@ ## Machine-derived (re-seed from `scripts/ledger-scan.sh`; do not hand-edit) -_As of repo HEAD around 2026-07-03 (re-run the scan to refresh):_ +_As of repo HEAD 2026-07-04 (re-run the scan to refresh):_ - **PROPOSED / OPEN decisions:** D-050 (keystone policyd override, no policy zip), - D-068 (Vault substrate hardening, Roosevelt). _(D-063 is CLOSED as of 2026-07-03.)_ + D-068 (Vault substrate hardening, Roosevelt), D-071 (routine update cadence + + controller patch policy -- filed 2026-07-04 by the jumphost stream, operator to rule). + _(D-063 is CLOSED as of 2026-07-03.)_ - **OPEN security rows:** SEC-001 (libvirt cred rotate), SEC-003 (Vault unseal custody / second-person rehearsal), SEC-004 (repo public -> private at v1 close). -- **Next-free numbers:** D = 071, DOCFIX = 086, BUNDLEFIX = 010. - **CONTENDED:** D-071 is being filed by the parallel Claude Code (jumphost) stream - (Juju controller update cadence / patch policy). This repo relinquishes D-071 and - resumes at D-072+ AFTER that stream's push lands and the scan is re-run. +- **Next-free numbers:** D = 072, DOCFIX = 087, BUNDLEFIX = 010. + The 2026-07-03 D-071 contention is RESOLVED: the jumphost stream filed D-071 + + DOCFIX-086 (ops-update-procedure) in changelog addendum 13 (2026-07-04); main + stream numbering resumes at the values above. The wrapped-pointer scan artifact + was fixed by the main stream (addendum 11) and independently confirmed at merge + (addendum 13 FINDING 2); the scan and these numbers now agree. --- @@ -46,9 +50,7 @@ - Batch 3 DRAFTED (mock-tested only, NEEDS LIVE VALIDATION): `d011-04-octavia-lb` (RR+member additive; amphora failover --disruptive, N+1 headroom-guarded) + `d011-05-magnum-e2e` (wraps tenant-acceptance + timing). full-d011 profile now COMPLETE. 
- - **Next:** live-validate batch 3 (see Verify-live queue), then item-3/#5-#8 backlog (2-backend round-robin -> member failover -> AMPHORA - failover with an N+1 scheduler-headroom HOLD-guard -> recovery -> self-cleanup; - --disruptive-gated) and `d011-05-magnum-e2e` (wrap `tenant-acceptance.sh` + timing). + - **Next:** live-validate batch 3 (see Verify-live queue), then item-3/#5-#8 backlog. - D-011 AMENDED bar: 1 charms; 2 VIP jumphost; 3 VIP tenant; 4 octavia RR/failover/recovery; 5 magnum e2e + OCCM; 6 second-person manual unseal (attestation, D-069); 7 DROPPED (D-070 supersedes D-012 snapshots); 8 DROPPED (D-019, no Designate). @@ -99,7 +101,10 @@ - beta cluster left at **node_count=2** (deliberate; bonus resize acceptance coverage). - repo is temporarily **PUBLIC** for Claude web_fetch (SEC-004) -- flip private at v1 close. -- Parallel Claude Code stream active on the jumphost (controller-update workstream, filing D-071). +- Parallel Claude Code stream on the jumphost: D-071 + DOCFIX-086 FILED (addendum 13, + 2026-07-04). Its next step is the LIVE EXECUTION of runbooks/ops-update-procedure.md + (controller 3.6.24 -> 3.6.25 + ~20 in-channel charm refreshes), operator-gated, in a + run-logged session. Vault stays 1.8/stable (D-068 unruled). ## Project-completion (execute after D-011 passes) diff --git a/docs/v1-redeploy-changelog.md b/docs/v1-redeploy-changelog.md index 56b7f6c..e0e7ec2 100644 --- a/docs/v1-redeploy-changelog.md +++ b/docs/v1-redeploy-changelog.md @@ -1790,3 +1790,59 @@ beta sits at node_count=2 (adequate for a 2-replica service). Next-free: D-071 (CONTENDED -- parallel stream), DOCFIX-086, BUNDLEFIX-010. + +### 2026-07-04 (addendum 13, jumphost stream) -- ops-update-procedure runbook (DOCFIX-086) + D-071 filed + +Authored by the parallel jumphost workstream (controller-update); identifiers consumed +per the addendum-10 contention note: D-071 and DOCFIX-086. No BUNDLEFIX consumed. 
+ +Trigger: live measurement 2026-07-04 -- controller 3.6.24 (single, non-HA) with client +snap already 3.6.25 (3/stable), and ~20 apps showing can-upgrade-to newer revisions +WITHIN their pinned channels. The repo had no update runbook and no update policy. + +NEW runbooks/ops-update-procedure.md (DOCFIX-086) -- routine update window in three +gated layers: controller patch -> model agents -> in-channel charm refreshes (keystone +first, control-plane APIs, octavia, dashboards, nova-compute LAST; full cloud-assert at +group boundaries, per-app settle gates between). Vault EXPLICITLY excluded (D-068 +PROPOSED; bundle pins 1.16/stable while live is 1.8/stable -- the runbook carries the +CAUTION so a naive bundle-sync cannot trip the major upgrade). Anti-fabrication gate: +upgrade-command flags are read from live `juju help` before composing mutations; the +2026-07-04 magnum can-upgrade-to anomaly (target named ch:amd64/magnum-dashboard-122, +a DIFFERENT charm) is generalized into a charm-name self-consistency GATE that excludes +and logs any mismatched app. Not yet as-executed; carries [REVALIDATE] markers. +REVERT: git checkout HEAD~ -- runbooks/ops-update-procedure.md (plus the README index +line below). + +NEW docs/design-decisions.md D-071 (PROPOSED) -- routine update cadence + controller +patch policy: monthly-review cadence trigger, PATCH-only controller jumps within 3.6, +standing controller->agents->charms order, in-channel-only refreshes. Carries an +explicit Risk section: single non-HA controller, no supported backup in Juju 3.6, no +in-band downgrade -- acceptance rests on patch-only jumps + proven-healthy pre-state +(committed pre-change BOM) + D-070 rebuild posture; rehearsal-phase acceptance only, +Roosevelt must revisit with HA + a real backup story. Operator to rule. 
+REVERT: git checkout HEAD~ -- docs/design-decisions.md + +runbooks/README.md -- indexed ops-update-procedure.md AND fixed a found gap: +ops-restart-procedure.md was never added to the runbook index (only ops-capi-recovery +was listed). Folded under DOCFIX-086 delivery, no separate number consumed. +REVERT: git checkout HEAD~ -- runbooks/README.md + +FINDING 1 (logged, not fixed -- hard rule 1): repo-lint L5 "next-free identifiers" +and ledger-scan.sh disagree (L5 claimed BUNDLEFIX 051 / D 073 / DOCFIX 100, written +here WITHOUT the hyphenated token shape so this very line cannot inflate the scan). +Cause read from repo_lint.py: L5 derives next-free from ANY textual mention of an +identifier across all live docs (max+1), so prose forward references inflate it; +ledger-scan derives D from headers and drops lines containing "next-free". The scan +is authoritative per operator ruling. DOCFIX candidate for a future window: make +L5's next-free output header-based or drop it in favor of the scan. + +FINDING 2 (CONVERGED with the main stream at merge): this stream independently hit +the same ledger-scan limitation the main stream found and fixed in addendum 11 -- +a "Next-free:" pointer that word-wraps escapes the scan's line-based exclusion and +falsely inflates the count (the addendum-10 wrapped pointer briefly showed BUNDLEFIX +one higher than truth). The main stream unwrapped that pointer and set the +convention: keep "Next-free:" pointer lines on ONE line. This entry follows it; +no scan over-report remains after this merge. Still open as a DOCFIX candidate: +make the exclusion paragraph-aware so a future wrap cannot regress this. + +Next-free after this push (per scan, keep this line unwrapped): D-072, DOCFIX-087, BUNDLEFIX-010. diff --git a/runbooks/README.md b/runbooks/README.md index d0a2cf4..a0a9233 100644 --- a/runbooks/README.md +++ b/runbooks/README.md @@ -56,3 +56,5 @@ sanitation sweep. Git history preserves them. 
- ops-capi-recovery.md -- parking, restart, and LB repair for the CAPI/Magnum stack (post-deploy operations companion; not a deploy phase). Added 2026-06-10. +- ops-restart-procedure.md -- full-cloud restart / recovery for planned windows and full power/network-loss events (DOCFIX-075). Indexed 2026-07-04 (was missing from this list since commit). +- ops-update-procedure.md -- routine update window: Juju controller patch, model agents, in-channel charm refreshes (DOCFIX-086; policy: D-071). Added 2026-07-04. diff --git a/runbooks/ops-update-procedure.md b/runbooks/ops-update-procedure.md new file mode 100644 index 0000000..de112f0 --- /dev/null +++ b/runbooks/ops-update-procedure.md @@ -0,0 +1,407 @@ +# Ops -- Routine Update Procedure (Juju controller, model agents, in-channel charm refreshes) + +STATUS: authored 2026-07-04 per DOCFIX-086; NOT yet as-executed. Policy +companion: D-071 (PROPOSED -- update cadence + controller patch policy). +Steps whose exact mechanics could not be verified against the live client at +authoring time carry [REVALIDATE] markers -- clear them on the first executed +update window and record the as-executed date here. + +Scope: a PLANNED maintenance window applying three update layers, in order: +(1) Juju controller patch upgrade (single, non-HA controller), (2) model +agent upgrade to match, (3) charm refreshes to newer revisions WITHIN their +pinned channels (the appendix-B-sanctioned update type). This is explicitly +NOT an OpenStack series/track upgrade -- no charm changes channel here, ever. +For full-cloud power maintenance use `runbooks/ops-restart-procedure.md`; for +incident response use `runbooks/appendix-A-troubleshooting.md`. + +Conventions: RUN/CHECK/GATE labels per runbooks/README.md. One gated mutation +at a time; read-only verification precedes every mutation. Invoke scripts as +`bash scripts/.sh` (no exec bits in the repo). 
Run the whole window +inside `bash scripts/run-logged.sh ops-update-` and add the index row +to `logs/as-executed-index.md`. + +--- + +## 0. Exclusions (read before anything else) + +- **VAULT IS OUT OF SCOPE.** Live vault stays on `1.8/stable`. Do not + refresh `vault` or `vault-mysql-router` in this procedure. + +> CAUTION: `bundle.yaml` pins vault `1.16/stable` (D-068 / BUNDLEFIX-007) +> while live runs `1.8/stable`. A naive "sync live to the bundle" or a +> blanket refresh sweep would attempt a multi-minor major Vault upgrade -- +> exactly what D-068 (PROPOSED) says is NOT a casual `juju refresh` (unseal +> keys in hand, storage-format compatibility, rehearsal first). Until D-068 +> is ruled and rehearsed, vault is untouchable here. + +- **No channel changes.** D-002 pins channels; this procedure only moves + revisions WITHIN a pinned channel. A desired channel change is a + D-NNN/BUNDLEFIX proposal, not an update-window action. +- Out of scope: MAAS, host OS packages, jumphost snaps (the juju client snap + updates itself on `3/stable`), and the CAPI/driver layer (appendix-B + B.2/B.3; governed by D-034/D-042 -- their update is a separate procedure). +- Apps with no update available at authoring (ceph-*, ovn-*, mysql-*, + rabbitmq-server, hacluster, memcached): re-measured at pre-flight. If one + NEWLY shows an update, it is LOGGED for the next window, not refreshed in + this one (hard rule 1 -- no added scope mid-window). + +## 0b. Expectations table (read FIRST; saves false alarms) + +| Observation | Meaning | +|---|---| +| `juju status` / API errors for ~1-5 min right after `upgrade-controller` | EXPECTED. Single non-HA controller; jujud restarts. Data plane and workloads unaffected; model MANAGEMENT is blind. Wait, do not react. | +| Units cycling `maintenance`/`executing` for minutes after a refresh | Expected settle arc (upgrade-charm hooks). Judge by the settle gate, not by transient states. 
| +| `can-upgrade-to` values differ from this runbook's planning table | EXPECTED. Channels float (appendix-B policy). The live measurement is the worklist; any table here is planning reference only. | +| An app's `can-upgrade-to` names a DIFFERENT charm than the app runs | Anomaly (seen once for magnum on 2026-07-04: `ch:amd64/magnum-dashboard-122`). EXCLUDE the app, capture the raw JSON, log the finding. Never refresh across a name mismatch. | +| Vault flips to sealed during this window | NOT expected -- nothing here restarts vault. That is an incident: stop, appendix-A. | + +## 1. Pre-flight and baseline + +### 1.1 Session bootstrap + +**RUN -- jumphost** +```bash +git -C ~/openstack-caracal-ipv4 pull +bash scripts/repo-lint.sh +bash scripts/run-logged.sh ops-update-$(date -u +%Y%m%d) +``` +**Expect:** lint 0 fail (1 legacy WARN documented); logged subshell open. +Add the session row to `logs/as-executed-index.md`. + +### 1.2 Measure versions -- never assume + +**CHECK (read-only) -- jumphost** +```bash +juju version +juju show-controller --format=json | jq -r 'to_entries[] + | "\(.key) agent-version=\(.value.details."agent-version")"' +juju status -m openstack --format=json | jq -r ' + [(.machines | to_entries[] | .value."juju-status".version), + (.. | objects | select(has("agent-status")) | ."agent-status".version)] + | .[] | select(. != null)' | sort | uniq -c +``` +**Expect:** client at the target patch version; controller and ALL machine + +unit agents at ONE uniform current version. Record both values. Any skew +among agents = STOP and investigate before adding an upgrade on top. 
+[REVALIDATE: unit agent version field path on current juju] + +### 1.3 Measure the refresh worklist (with the charm-name gate) + +**CHECK (read-only) -- jumphost** +```bash +juju status -m openstack --format=json | jq -r ' + .applications | to_entries[] + | select((.value."can-upgrade-to" // "") != "") + | .key as $app | (.value."charm-name") as $name + | (.value."can-upgrade-to" | sub("^ch:[^/]*/"; "")) as $t + | [$app, $name, (.value."charm-rev"|tostring), + ($t | sub("-[0-9]+$"; "")), ($t | capture("-(?<r>[0-9]+)$").r), + (if ($t | sub("-[0-9]+$"; "")) == $name then "OK" else "NAME-MISMATCH" end)] + | @tsv' | column -t +``` +**GATE:** every row `OK`. Any `NAME-MISMATCH` row: capture the app's raw +`juju status --format=json`, EXCLUDE it from this window's worklist, +and log the finding (appendix-A/DOCFIX material). + +Record the surviving rows as the window's worklist AND revert table: +(app, current-rev, target-rev). Cross-check current revs against appendix-B +B.1; any pre-existing divergence is logged (it means the last re-baseline +was missed), not corrected here. + +### 1.4 Health gate + pre-change BOM + +**RUN -- jumphost** (writes only `asbuilt//`) +```bash +source ~/admin-openrc +bash scripts/cloud-assert.sh --capture +``` +**GATE:** `CLOUD-ASSERT: PASS`. WARN/HOLD is a no-go: do not update an +unhealthy or unverified cloud. Commit the `asbuilt//` BOM as the +pre-change baseline before the first mutation. + +### 1.5 Quiesce check + +**CHECK (read-only) -- jumphost** +```bash +juju status -m openstack --format=json | jq -r ' + .. | objects | select(has("agent-status")) + | select(."agent-status".current as $c | ["executing","error","failed"] | index($c)) + | ."agent-status".current' | sort | uniq -c +openstack coe cluster list -f value -c name -c status ~/openstack-baseline/controller-pre-$(date -u +%Y%m%d).json +juju ssh -m controller 0 -- snap list CAUTION: there is NO in-band downgrade of a Juju controller. 
The +> compensating controls are: patch-level jump only (D-071), healthy +> pre-state proven at 1.4, and the D-070 rebuild posture. If the target is +> more than a patch jump, STOP -- that is not this runbook. + +**RUN -- jumphost** (flags per 1.6; candidate form below) [REVALIDATE] +```bash +juju upgrade-controller --agent-version +``` +**Expect:** command accepted, then the 0b blind window (~1-5 min) while +jujud restarts. Do not run other juju commands until it clears. + +**GATE:** controller at target and the model reachable again: +```bash +juju show-controller --format=json | jq -r 'to_entries[] + | "\(.key) agent-version=\(.value.details."agent-version")"' +juju status -m openstack --format=json | jq -r '.machines | to_entries[] + | "\(.key) \(.value."juju-status".current)"' | grep -v started \ + || echo "all machines started" +``` +Poll up to ~10 min. Controller at target + all machine agents `started`. +Beyond budget: STOP, appendix-A; never re-bootstrap inside the window. + +### 2.3 Post-controller spot check + +**CHECK (read-only) -- jumphost** +```bash +bash scripts/cloud-assert.sh +``` +**GATE:** PASS (A5-A7 need `source ~/admin-openrc` in scope). Do not start +the agent stage on a controller that cannot pass the behavioral sweep. + +## 3. Model agent stage + +**RUN -- jumphost** (flags per 1.6) [REVALIDATE] +```bash +juju upgrade-model -m controller +juju upgrade-model -m openstack +``` +**Expect:** default target is the controller's version (verify in the 1.6 +help output whether an explicit `--agent-version` is required). Agents +upgrade rolling; workloads are NOT restarted. + +**GATE:** every machine and unit agent at target, settled: +```bash +juju status -m openstack --format=json | jq -r ' + [(.machines | to_entries[] | .value."juju-status".version), + (.. | objects | select(has("agent-status")) | ."agent-status".version)] + | .[] | select(. 
!= null)' | sort | uniq -c +``` +**Expect:** ONE version, the target, on every line; no unit stuck in +`upgrading`. An agent stuck beyond ~15 min = STOP, appendix-A. + +## 4. Charm refresh stage + +### 4.0 Rules for EVERY app in this stage + +1. Re-verify THAT app immediately before refreshing it (the 1.3 worklist is + stale the moment the first refresh lands), including the name gate: + + **CHECK (read-only) -- jumphost** + ```bash + APP= + juju status "$APP" --format=json | jq -r --arg a "$APP" ' + .applications[$a] | [$a, ."charm-name", (."charm-rev"|tostring), + (."can-upgrade-to" // "NONE")] | @tsv' + ``` + `NONE` = already current, skip forward. Name mismatch = exclude + log. +2. The current revision just read IS the revert value -- record it. +3. ONE app per approval: + + **RUN -- jumphost** + ```bash + juju refresh + ``` +4. Per-app settle gate before the next app: + + **CHECK (read-only) -- jumphost** + ```bash + juju status "$APP" --format=json | jq -r '.applications[] | .units // {} + | .. | objects | select(has("workload-status")) + | "\(."workload-status".current)/\(."agent-status".current // "?")"' \ + | sort | uniq -c + juju status -m openstack --format=json | jq -r ' + .. | objects | select(has("workload-status")) + | select(."workload-status".current == "error") | ."workload-status".message' \ + | sed 's/^/ERROR: /' ; true + ``` + **GATE:** all units of the app AND its subordinates `active/idle`, the + new revision visible, and NO unit anywhere in error. Budget ~15 min. + (`bash scripts/deploy-watch.sh` in a side window is the signal view, + not the gate.) +5. Run the group's behavioral probe (below); full + `bash scripts/cloud-assert.sh` at GROUP boundaries only. +6. Revert for any app: `juju refresh --revision `. + +> CAUTION: an explicit `--revision` refresh PINS the app (it stops tracking +> the channel). 
Any revert row therefore carries a follow-up +> `juju refresh --channel ` to resume tracking once +> the cause is understood -- record both in the revert table. + +### 4.1 Group 0 -- keystone (alone, first) + +Identity underpins every other service; charm-guide practice is keystone +first. Refresh `keystone` per 4.0. +**Probe:** `openstack token issue `nova-cloud-controller` -> `neutron-api` -> +`neutron-api-plugin-ovn` -> `glance` -> `glance-simplestreams-sync` -> +`octavia-diskimage-retrofit` -> `cinder` -> `cinder-ceph` -> `barbican` -> +`barbican-vault`. + +Probes after the relevant principal settles: +```bash +openstack compute service list `magnum-dashboard` -> `octavia-dashboard` +(the latter two are subordinates riding openstack-dashboard -- verify +placement live in status before ordering). +**Probe:** Horizon over the dashboard VIP answers HTTP 200 and login works; +the D-044 secure-cookie override survives the refresh (appendix-A entry if +login cookies fail). +**GATE:** probe green; cloud-assert not required mid-group here, PASS at +group end. + +### 4.5 Group 4 -- nova-compute (LAST) + +Data-plane adjacent (all hypervisor hosts). A charm refresh does NOT +restart guests, but this runs last, with everything else proven green. +Refresh `nova-compute` per 4.0. +**Probe:** +```bash +openstack hypervisor list /`) +```bash +source ~/admin-openrc +bash scripts/cloud-assert.sh --capture +``` +**GATE:** `CLOUD-ASSERT: PASS`; this capture is the post-change BOM. + +**CHECK (read-only) -- jumphost** -- version coherence + BOM diff +```bash +juju show-controller --format=json | jq -r 'to_entries[] + | .value.details."agent-version"' +diff <(sort asbuilt//bundle-exported.yaml) \ + <(sort asbuilt//bundle-exported.yaml) | grep -E '^[<>]' | sort +``` +**Expect:** controller == agents == target (1.2 query re-run); every +worklist app at its recorded target revision; the bundle diff shows ONLY +the expected charm revision lines. 
ANY config/channel/placement delta = +stop and explain before closing the window. + +Behavioral spot set (beyond cloud-assert): `openstack token issue`, +`openstack server list --all-projects`, `openstack loadbalancer list`, +`openstack coe cluster list`, Horizon login. + +## 6. Re-baseline and documentation (window close) + +1. `runbooks/appendix-B-asbuilt-version-lock.md` B.1: update as-built + revisions to the measured post-state; bump the header date/source line. + (This is exactly the appendix-B "refresh the table on a successful + validated state" event.) +2. Commit the post-change `asbuilt//` BOM. +3. `docs/v1-redeploy-changelog.md`: as-executed addendum -- what moved + (controller x.y.z -> x.y.z', per-app rev table), why, and the revert + table (per-app `--revision` + `--channel` re-track pairs; controller = + none in-band, D-070 posture). +4. Close the `logs/as-executed-index.md` row; update + `docs/session-ledger.md`. +5. First execution only: clear this runbook's [REVALIDATE] markers and set + the as-executed date in the STATUS header. + +## 7. 
Revert reference + +| Layer | In-band revert | Posture if none | +|---|---|---| +| Controller upgrade | NONE (no downgrade) | Patch-jump-only + healthy pre-state + D-070 rebuild-from-runbooks | +| Model agents | NONE (no downgrade) | Same as controller | +| Charm refresh (per app) | `juju refresh --revision ` then later `juju refresh --channel ` | Appendix-B B.1 holds the last validated revisions | +| Docs / BOM re-baseline | `git revert` of the re-baseline commit | -- | + +--- + +## Quick reference (symptom -> fix) + +| Symptom | Fix | +|---|---| +| Agent stuck `upgrading` past budget | STOP; appendix-A; do not stack further mutations | +| Unit `error` mid-refresh | Understand the hook error FIRST; `juju resolved --no-retry ` only with cause known; else revert the app (4.0.6) | +| `can-upgrade-to` names a different charm | Exclude app + capture JSON + log finding (0b table) | +| Controller unreachable past 2.2 budget | Escalate; NEVER re-bootstrap inside the window | +| Horizon login cookie failure after dashboard refresh | Restore `_99_internal_http_cookies.py` (D-044; appendix-A) | +| Vault sealed mid-window | Incident, not expected here -- appendix-A / restart-procedure Stage 3 | + +## Open questions (carried for Roosevelt) + +- HA controllers change the 2.2 blind-window math (rolling controller + upgrade, no full API blackout) -- revalidate gates on bare metal. +- Controller backup story on bare metal: evaluate supported juju-db dump + tooling as part of the Roosevelt controller design, not improvised here. +- Cadence policy and window sizing: D-071 (PROPOSED) -- rule before + Roosevelt operations begin.
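A minimal sketch of the charm-name self-consistency gate from section 1.3, written in plain bash so it can be exercised without a live model. It assumes `can-upgrade-to` values shaped like `ch:amd64/magnum-dashboard-122` (the form recorded for the 2026-07-04 anomaly). The function name `name_gate` and the keystone revision used in the usage note are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not in the repo): classify one
# (charm-name, can-upgrade-to) pair the way the 1.3 gate does.
name_gate() {
  local charm_name="$1" target="$2"
  local stripped="${target#ch:*/}"     # drop the "ch:<arch>/" prefix
  local target_name="${stripped%-*}"   # drop the trailing "-<revision>"
  local target_rev="${stripped##*-}"   # keep only the revision token
  # OK only when the revision is numeric AND the target names the same charm.
  if [[ "$target_rev" =~ ^[0-9]+$ && "$target_name" == "$charm_name" ]]; then
    echo "OK"
  else
    echo "NAME-MISMATCH"
  fi
}
```

For example, `name_gate magnum ch:amd64/magnum-dashboard-122` classifies the recorded anomaly as a mismatch, while `name_gate keystone ch:amd64/keystone-700` (illustrative revision) passes. The live gate remains the jq query in 1.3; this only mirrors its comparison logic.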