diff --git a/docs/design-decisions.md b/docs/design-decisions.md
index 85cb97e..3098d6f 100644
--- a/docs/design-decisions.md
+++ b/docs/design-decisions.md
@@ -1671,3 +1671,17 @@
 **Related:** D-068 (this amends item 1), D-002 amendment (vault row), D-067/BUNDLEFIX-007
 (vault-kv), D-069 (unseal).
 **Evidence:** `docs/D-068-vault-1.8-vs-1.16-analysis.md`.
+
+## D-071 -- AMENDMENT (2026-07-05): controller backup EXISTS on juju 3.6; risk section corrected
+
+First execution of ops-update-procedure (ops-update-20260705) disproved the risk section's
+premise "no supported backup in Juju 3.6": `juju create-backup` / `download-backup` exist on
+the 3.6 client and were exercised live (902MB checksummed archive of the single controller,
+~35s, downloaded to the jumphost pre-upgrade). The compensating controls for routine
+controller patches are therefore FOUR, not three: patch-only jumps, proven-healthy pre-state
+(committed pre-change BOM), a fresh `juju create-backup` archive (now the STANDARD 2.1 step,
+DOCFIX-088), and the D-070 rebuild posture as last resort. Policy points 1-4 unchanged and
+still PROPOSED for operator ruling. Restore-from-archive remains UNREHEARSED -- rehearsing a
+controller restore is added to the Roosevelt open questions.
+
+**Related:** D-071 (amends the Risk section), DOCFIX-088 (runbook 2.1 correction).
diff --git a/docs/session-ledger.md b/docs/session-ledger.md
index 91db2af..c3e4b8a 100644
--- a/docs/session-ledger.md
+++ b/docs/session-ledger.md
@@ -28,8 +28,8 @@
 _(D-063 is CLOSED as of 2026-07-03.)_
 - **OPEN security rows:** SEC-001 (libvirt cred rotate), SEC-003 (Vault unseal custody /
   second-person rehearsal), SEC-004 (repo public -> private at v1 close).
-- **Next-free numbers:** D = 072, DOCFIX = 088, BUNDLEFIX = 011. (BUNDLEFIX-010
-  consumed by the jumphost stream, addendum 15: vault bundle revert.)
+- **Next-free numbers:** D = 072, DOCFIX = 089, BUNDLEFIX = 011.
(BUNDLEFIX-010 +
+  DOCFIX-088 consumed by the jumphost stream, addenda 15-16.)
   The 2026-07-03 D-071 contention is RESOLVED: the jumphost stream filed D-071 +
   DOCFIX-086 (ops-update-procedure) in changelog addendum 13 (2026-07-04); main stream
   numbering resumes at the values above. The wrapped-pointer scan artifact
@@ -99,7 +99,7 @@
 remain PROPOSED/OPEN, not actioned.

-## Active window (jumphost stream) -- ops-update-20260705, IN FLIGHT
+## Jumphost stream -- ops-update-20260705: WINDOW CLOSED 2026-07-05 (addenda 13-16)

 First execution of runbooks/ops-update-procedure.md (DOCFIX-086; addendum 13).
 Logged session ops-update-20260705; checkpointed here at each milestone for
@@ -133,8 +133,16 @@
 - **DONE (coordinated):** vault bundle revert 1.16->1.8/stable (BUNDLEFIX-010) +
   D-068 AMENDMENT recorded; 1.16 ruled out (certs V0/V1, Raft-only, Ceph, BUSL);
   "off EOL 1.8" remains OPEN. Evidence: docs/D-068-vault-1.8-vs-1.16-analysis.md.
-- **NEXT:** G4 nova-compute 827->894 (approved); then post-verify (S5),
-  re-baseline + close-out (S6).
+- **DONE Section 4 G4:** nova-compute 827->894; hypervisors up; guests
+  byte-identical pre/post.
+- **DONE Sections 5-6:** post cloud-assert PASS; post BOM asbuilt/20260705-194617
+  committed; version coherence exact (91 agents @ 3.6.25); BOM diff = 17 expected
+  rev pairs + 1 explained metadata delta (cinder secrets-storage endpoint);
+  appendix-B B.1 fully re-baselined; runbook as-executed corrections DOCFIX-088;
+  D-071 amended (backup exists). WINDOW COMPLETE.
+- **POST-WINDOW QUEUE:** upstream dashboard-TLS bug; phase-03 3.3 fail-open DOCFIX;
+  dashboard-TLS long-term ruling; magnum re-check next window; Vault off-EOL path
+  (1.16 ruled out per D-068 amendment).
 - **Window findings logged for close-out (do not action mid-window):**
   1.
magnum can-upgrade-to anomaly REPRODUCED (points at magnum-dashboard-122;
      evidence ~/openstack-baseline/magnum-can-upgrade-anomaly-20260705.json) AND
diff --git a/docs/v1-redeploy-changelog.md b/docs/v1-redeploy-changelog.md
index e46c97f..5d795b3 100644
--- a/docs/v1-redeploy-changelog.md
+++ b/docs/v1-redeploy-changelog.md
@@ -1920,3 +1920,54 @@
 REVERT: git checkout HEAD~ -- bundle.yaml docs/design-decisions.md

 Next-free after this push (per scan, keep this line unwrapped): D-072, DOCFIX-088, BUNDLEFIX-011.
+
+### 2026-07-05 (addendum 16, jumphost stream) -- ops-update-20260705 WINDOW CLOSED (DOCFIX-088)
+
+Identifier consumed: DOCFIX-088 (runbook as-executed corrections). D-071 AMENDMENT
+recorded (no new number).
+
+WINDOW RESULT: COMPLETE, all gates green. Controller 3.6.24 -> 3.6.25 (single,
+non-HA; blind window ~1 min); all 91 model agents + controller machine at 3.6.25;
+17 apps refreshed in-channel (G0 keystone 817; G1 placement 154, n-c-c 823,
+neutron-api 710, n-a-p-ovn 215, glance 681, gss 152, o-d-retrofit 232, cinder 820,
+cinder-ceph 568, barbican 265, barbican-vault 99; G2 octavia 542; G3 dashboards
+750/122/168; G4 nova-compute 894). Every refresh individually settle-gated;
+cloud-assert PASS at every group boundary and at close; guests byte-identical
+pre/post (6). Post BOM asbuilt/20260705-194617 committed; appendix-B B.1 fully
+re-baselined from it (also absorbing stale pre-window rows -- hacluster 131->166,
+ceph-mon 268->491, mysql-innodb 159->164, router rows -- the post-rebuild
+re-baseline had been missed; that gap is exactly what the 1.3 divergence check
+caught). BOM diff vs pre: 17 expected revision pairs + ONE explained metadata
+delta (cinder rev 820 adds an optional `secrets-storage` endpoint; unbound ->
+inherits default space metal-admin in the export; no relation, no live change;
+if ever related, bind deliberately).
+
+EXCLUSIONS HELD: vault untouched (1.8/stable rev 372; its in-channel 372->714
+offer logged only -- D-068/BUNDLEFIX-010 posture). magnum untouched (name-mismatch
+gate + s390x-only channel state at pre-flight; NOTE the Charmhub target CHANGED
+mid-window to a name-consistent magnum-96 -- republish in flight upstream;
+re-evaluate at the NEXT window's pre-flight, do not chase mid-window).
+
+DOCFIX-088 (runbook ops-update-procedure, as-executed fold-back): STATUS set
+as-executed; [REVALIDATE] markers cleared; 1.2 version sweep corrected to
+`juju-status` (units do NOT carry `agent-status.version` on juju 3.6); 2.1
+REWRITTEN -- `juju create-backup` EXISTS on 3.6 and is now the STANDARD gated
+pre-upgrade step (902MB/~35s measured; admin/controller namespace note); 3
+corrected (no separate upgrade-model for the controller model; explicit
+--agent-version); G3 probe corrected to the D-044 http leg + canonical override
+path + browser login, with the VIP-TLS since-deploy defect cross-referenced
+(addendum 15 RCA); 0b gains the mid-window-target-drift expectation row.
+REVERT: git checkout HEAD~ -- runbooks/ops-update-procedure.md
+
+D-071 AMENDMENT (2026-07-05): risk-section premise corrected (backup exists;
+four compensating controls); controller-restore rehearsal added to Roosevelt
+open questions; policy still PROPOSED.
+REVERT: git checkout HEAD~ -- docs/design-decisions.md
+
+STILL QUEUED POST-WINDOW (logged, not actioned): upstream bug to OpenStack
+Charmers (dashboard haproxy TLS backend / L4 mask, addendum 15 evidence);
+DOCFIX candidate phase-03 3.3 fail-open https check; operator ruling on the
+dashboard-TLS long-term fix (rebind vs Roosevelt edge-TLS); magnum channel
+re-check next window; Vault off-EOL-1.8 path (D-068 amendment, 1.16 ruled out).
+
+Next-free after this push (per scan, keep this line unwrapped): D-072, DOCFIX-089, BUNDLEFIX-011.
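Reviewer sketch (not part of the patch): the changelog's close-out claim "BOM diff vs pre: 17 expected revision pairs" is a comparison that can be mechanised. The following is a minimal illustration only. It ASSUMES a hypothetical flattening of each BOM into one `app revision` pair per line and a worklist file with one approved app name per line; the real BOM layout is not shown in this diff, and the function names `bom_rev_diff` and `unexpected_changes` are invented for the example. It covers revision pairs only, not metadata deltas such as the cinder `secrets-storage` endpoint.

```shell
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: each BOM input holds one "app revision" pair per
# line (hypothetical flattening; the real BOM format is not shown in the diff).

# bom_rev_diff PRE POST
# Prints "app pre->post" for every app present in both files whose revision changed.
bom_rev_diff() {
  awk -v pre="$1" '
    BEGIN { while ((getline line < pre) > 0) { split(line, f, " "); rev[f[1]] = f[2] } }
    ($1 in rev) && rev[$1] != $2 { print $1, rev[$1] "->" $2 }
  ' "$2"
}

# unexpected_changes WORKLIST PRE POST
# WORKLIST holds one approved app name per line. Prints every changed app that
# is NOT on the worklist and returns 1; returns 0 when all changes were approved.
unexpected_changes() {
  local out
  out=$(bom_rev_diff "$2" "$3" | awk -v wl="$1" '
    BEGIN { while ((getline line < wl) > 0) ok[line] = 1 }
    !($1 in ok)')
  [ -z "$out" ] && return 0
  printf '%s\n' "$out"
  return 1
}
```

Under those assumptions, `unexpected_changes worklist pre.txt post.txt` staying silent with exit status 0 would correspond to "every revision change was on the approved worklist"; any printed line is a change to explain before close-out.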
diff --git a/runbooks/ops-update-procedure.md b/runbooks/ops-update-procedure.md
index de112f0..35077b3 100644
--- a/runbooks/ops-update-procedure.md
+++ b/runbooks/ops-update-procedure.md
@@ -1,10 +1,11 @@
 # Ops -- Routine Update Procedure (Juju controller, model agents, in-channel charm refreshes)

-STATUS: authored 2026-07-04 per DOCFIX-086; NOT yet as-executed. Policy
-companion: D-071 (PROPOSED -- update cadence + controller patch policy).
-Steps whose exact mechanics could not be verified against the live client at
-authoring time carry [REVALIDATE] markers -- clear them on the first executed
-update window and record the as-executed date here.
+STATUS: authored 2026-07-04 per DOCFIX-086; AS-EXECUTED 2026-07-05
+(ops-update-20260705: controller+agents 3.6.24 -> 3.6.25, 17 in-channel charm
+refreshes; changelog addenda 13-16). [REVALIDATE] markers cleared and live
+divergences folded back per DOCFIX-088. Policy companion: D-071 (PROPOSED --
+update cadence + controller patch policy; amended 2026-07-05: supported
+controller backup EXISTS on juju 3.6).

 Scope: a PLANNED maintenance window applying three update layers, in order:
 (1) Juju controller patch upgrade (single, non-HA controller), (2) model
@@ -54,6 +55,7 @@
 | `can-upgrade-to` values differ from this runbook's planning table | EXPECTED. Channels float (appendix-B policy). The live measurement is the worklist; any table here is planning reference only. |
 | An app's `can-upgrade-to` names a DIFFERENT charm than the app runs | Anomaly (seen once for magnum on 2026-07-04: `ch:amd64/magnum-dashboard-122`). EXCLUDE the app, capture the raw JSON, log the finding. Never refresh across a name mismatch. |
 | Vault flips to sealed during this window | NOT expected -- nothing here restarts vault. That is an incident: stop, appendix-A. |
+| An EXCLUDED app's `can-upgrade-to` target changes mid-window | Observed 2026-07-05 (magnum's target moved during the window -- Charmhub republish in flight).
Confirms the rule: targets are re-measured per app at refresh time; excluded apps stay excluded until the next window's pre-flight. |

 ## 1. Pre-flight and baseline

@@ -76,14 +78,14 @@
 juju show-controller --format=json | jq -r 'to_entries[] | "\(.key) agent-version=\(.value.details."agent-version")"'
 juju status -m openstack --format=json | jq -r '
-  [(.machines | to_entries[] | .value."juju-status".version),
-   (.. | objects | select(has("agent-status")) | ."agent-status".version)]
-  | .[] | select(. != null)' | sort | uniq -c
+  [.. | objects | select(has("juju-status")) | ."juju-status".version
+   | select(. != null)] | .[]' | sort | uniq -c
 ```
-**Expect:** client at the target patch version; controller and ALL machine +
-unit agents at ONE uniform current version. Record both values. Any skew
-among agents = STOP and investigate before adding an upgrade on top.
-[REVALIDATE: unit agent version field path on current juju]
+**Expect:** client at the target patch version; controller and ALL machine,
+container, and unit agents at ONE uniform current version (agents carry
+version under `juju-status`, NOT `agent-status` -- verified 2026-07-05;
+this cloud = 91 agents). Record both values. Any skew among agents = STOP
+and investigate before adding an upgrade on top.

 ### 1.3 Measure the refresh worklist (with the charm-name gate)

@@ -148,24 +150,30 @@

 ## 2. Controller stage

-### 2.1 Controller-state posture (reference captures; not a restore path)
+### 2.1 Controller-state backup (supported; STANDARD step per D-071 amendment)

-`juju create-backup` was removed in Juju 3.0. If 1.6 confirmed no backup
-command exists on this client (expected for 3.6), the accepted posture per
-D-071 is: patch-level jumps only, proven-healthy pre-state (1.4), and the
-D-070 restore path (re-bootstrap + rebuild-from-runbooks). The pre-change
-BOM already holds the exported bundle and status captures.
+CORRECTION (2026-07-05 as-executed, DOCFIX-088): `juju create-backup` EXISTS
+on juju 3.6 (the "removed in 3.0" authoring assumption was wrong for this
+series) and was exercised in the first window (902MB archive, ~35s). Taking
+the backup is the STANDARD pre-upgrade step; it converts the D-071 no-revert
+risk into a real controller-state restore artifact. NOTE: the controller
+model is `admin/controller` on this cloud, not `/controller`.

-**CHECK (read-only) -- jumphost** -- reference captures + optional tooling probe
+**CHECK (read-only) -- jumphost** -- reference capture + flag verification
 ```bash
 juju controllers --format=json > ~/openstack-baseline/controller-pre-$(date -u +%Y%m%d).json
-juju ssh -m controller 0 -- snap list -$(date -u +%Y%m%d).tar.gz \
+  "pre ops-update- controller to "
+```
+**GATE:** archive downloaded, checksum printed, size sane (this cloud: ~900MB).
+The archive stays on the jumphost (secret-adjacent; never committed). The
+D-070 rebuild posture remains the documented restore path of last resort.

 ### 2.2 Upgrade the controller

@@ -174,7 +182,7 @@
 > pre-state proven at 1.4, and the D-070 rebuild posture. If the target is
 > more than a patch jump, STOP -- that is not this runbook.

-**RUN -- jumphost** (flags per 1.6; candidate form below) [REVALIDATE]
+**RUN -- jumphost** (flags per 1.6; form verified as-executed 2026-07-05)
 ```bash
 juju upgrade-controller --agent-version
 ```
@@ -203,14 +211,15 @@

 ## 3. Model agent stage

-**RUN -- jumphost** (flags per 1.6) [REVALIDATE]
+**RUN -- jumphost** (form verified as-executed 2026-07-05)
 ```bash
-juju upgrade-model -m controller
-juju upgrade-model -m openstack
+juju upgrade-model -m openstack --agent-version
 ```
-**Expect:** default target is the controller's version (verify in the 1.6
-help output whether an explicit `--agent-version` is required). Agents
-upgrade rolling; workloads are NOT restarted.
+**Expect:** agents upgrade rolling; workloads are NOT restarted.
The
+controller model needs NO separate upgrade-model -- `juju upgrade-controller`
+already brought machine 0's agent to target (verified live; check with
+`juju status -m admin/controller`). Explicit `--agent-version` is used for
+determinism (default would pick the controller's version).

 **GATE:** every machine and unit agent at target, settled:
 ```bash
@@ -307,9 +316,15 @@
 `openstack-dashboard` -> `magnum-dashboard` -> `octavia-dashboard` (the
 latter two are subordinates riding openstack-dashboard -- verify placement
 live in status before ordering).
-**Probe:** Horizon over the dashboard VIP answers HTTP 200 and login works;
-the D-044 secure-cookie override survives the refresh (appendix-A entry if
-login cookies fail).
+**Probe (corrected 2026-07-05, DOCFIX-088):** Horizon over the dashboard VIP
+on the PLAIN-HTTP leg -- `curl http:///horizon/auth/login/`
+expects 200 -- plus the D-044 override file present at
+`/usr/share/openstack-dashboard/openstack_dashboard/local/local_settings.d/`
+(canonical path; the /etc path is a decoy) and a browser login working.
+Do NOT gate on https://: dashboard VIP TLS has been dead SINCE DEPLOY
+on this build (haproxy 443 backend targets a vhost-less internal address,
+masked by its L4-only check; addendum 15 RCA) -- an https probe failure here
+is the pre-existing defect, not a refresh regression.

 **GATE:** probe green; cloud-assert not required mid-group here, PASS at
 group end.
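
Reviewer sketch (not part of the patch): the corrected G3 probe combines two machine-checkable conditions (plain-HTTP login page returns 200, override directory populated) with one manual step (browser login). A minimal illustration of the scriptable half follows. It is an assumption-laden sketch, not the runbook's tooling: the helper names `http_code`, `dir_has_files` and `g3_probe_verdict` are hypothetical, the VIP address must be supplied by the operator, and the directory check would have to run on the dashboard unit itself.

```shell
#!/usr/bin/env bash
# Sketch only. The URL and directory are supplied by the caller; none of these
# helper names exist in the runbook (all hypothetical).

# http_code URL -> prints the numeric HTTP status of one GET
# (redirects are not followed; curl prints 000 when the connection fails).
http_code() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# dir_has_files DIR -> succeeds when DIR exists and contains at least one entry.
dir_has_files() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# g3_probe_verdict CODE OVERRIDE_OK -> prints PASS only for status 200 together
# with OVERRIDE_OK=yes; anything else prints FAIL.
g3_probe_verdict() {
  if [ "$1" = "200" ] && [ "$2" = "yes" ]; then
    echo PASS
  else
    echo FAIL
  fi
}
```

A possible use on the jumphost, with `VIP` as an operator-set variable (an assumption): `g3_probe_verdict "$(http_code "http://$VIP/horizon/auth/login/")" yes`, where the second argument records the result of running `dir_has_files` against the canonical `local_settings.d/` path on the unit. The https leg is deliberately left out, matching the probe text's instruction not to gate on it, and the browser login stays a manual check.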