diff --git a/docs/v1-redeploy-changelog.md b/docs/v1-redeploy-changelog.md new file mode 100644 index 0000000..082f50d --- /dev/null +++ b/docs/v1-redeploy-changelog.md @@ -0,0 +1,121 @@ +# v1 Redeploy -- Running Change Log + +**Purpose:** Living log of design decisions, doc fixes, and runbook edits discovered +DURING the v1 redeploy rehearsal that must be folded into `docs/design-decisions.md` +and the phase runbooks UPON COMPLETION. This is the staging list for the completion +consolidation -- nothing here is applied to the runbooks or design-decisions yet. + +**Status:** OPEN -- accumulating. Append-only. ASCII + LF. + +**Session opened:** 2026-06-26 (redeploy from clean teardown; D-052/D-053 plane set). + +**Next free numbers at session open:** design decision D-054; doc fix DOCFIX-039. +(Verified by grep of design-decisions.md: max D-053, max DOCFIX-038.) + +--- + +## Verified-state checkpoint (measured this session -- authoritative as-built) + +`scripts/pre-flight-checks.sh` @ commit 40e3f9e -- ALL PASS, exit 0, 2026-06-26: + +Six MAAS planes resolved BY CIDR (subnet IDs are post-D-052-cutover, NOT the old map): + + provider-public 10.12.4.0/22 id=1 vid=0 gw=10.12.4.1 dns=[10.12.4.1] + metal-admin 10.12.8.0/22 id=2 vid=0 gw=10.12.8.1 dns=[10.12.8.1] + metal-internal 10.12.12.0/22 id=10 vid=103 gw=none dns=[10.12.8.1] (bridged br-internal) + data-tenant 10.12.16.0/22 id=6 vid=0 gw=none dns=[10.12.8.1] + storage 10.12.32.0/22 id=7 vid=0 gw=none dns=[10.12.8.1] + replication 10.12.36.0/22 id=8 vid=0 gw=none dns=[10.12.8.1] + +Per-host data/storage NIC links by CIDR, octets .40-.43, all four hosts: +br-internal -> .12, enp8s0 -> .16, enp9s0 -> .32, enp10s0 -> .36. + +Nodes openstack0-3 (4na83t / qdbqd6 / h8frng / tmsafc): all Ready, power off. +OSD secondary disks (`osd-blank-check.sh`): all four 512 GiB / 200 KiB blank, RC=0. +Bundle VIPs: 11 triple-column VIPs, aligned, .50-.60 band, OK=11 bad=0. +octavia-pki overlay: present, 5 lb-mgmt-* keys, ASCII clean. + +--- + +## Pending design-decisions.md appends + +### D-054 -- Reusable tested scripts in scripts/; runbooks reference them (ADOPTED in practice; formal append pending) + +**What:** Repeated discovery/verify logic lives in `scripts/`, authored and tested in a +sandbox against synthetic fixtures, committed to the repo, and referenced by the runbooks. +Runbooks document expected output and remain the gate authority; the scripts are the +executable truth. All pinned network values live once in `scripts/lib-net.sh` (single +source of truth), resolved BY CIDR (subnet IDs drift across cutovers). + +**Delivery workflow:** author + test in sandbox -> publish file + sha256 -> commit from +Windows -> jumphost `git pull` -> `sha256sum` match -> run via `bash scripts/X.sh`. + +**Convention:** ASCII + LF (`.gitattributes` `*.sh eol=lf`); `set -euo pipefail` + +`shopt -s inherit_errexit` + `IFS=$'\n\t'`; `fail`/`warn`/`pass`/`note` helpers with +exit 0 (pass) / 1 (fatal) / 2 (warning) for gate scripts; read-only discovery kept +separate from gated mutation; `lib-net.sh` is sourced, never executed (direct-run guard). + +**Why:** Eliminates the paste-corruption failure class (see Findings below) and turns +repeated discovery -- polled every redeploy cycle -- into a one-liner with a byte-identity +guarantee (sha256) instead of a fragile copy-paste block. + +**Scripts added this session:** `lib-net.sh` (new), `pre-flight-checks.sh` (implemented the +placeholder), `juju-spaces-check.sh` (new), `osd-blank-check.sh` (new). All tested +end-to-end against mock `maas`/`juju` + fixtures (positive + 7 negative fault injections +for pre-flight; 4 scenarios for spaces). Committed at 40e3f9e. + +--- + +## Pending DOCFIX entries + +### DOCFIX-039 -- phase-01-bundle-deploy.md gate reconciliation (PROPOSED) + +The phase-01 pre-deploy GATES encode the OLD plane layout (pre-D-052 CIDR->role map); the +deploy COMMANDS are fine. Superseded by `scripts/pre-flight-checks.sh`. Five stale items: + +1. Constants: hardcoded subnet ids `1 2 6 7 8 9` + old CIDR->role map -> resolve BY CIDR + (now in `lib-net.sh`; metal-internal is id=10 post-cutover, not id=6). +2. CHECK 1 / Step 1.3 deploy guard: provider-column-only VIP check -> triple-column + validator (provider/admin/internal, aligned, .50-.60). +3. CHECK 2: `enp8s0` + `10.12.12.0/22` (old "data") -> links BY CIDR; `enp8s0` now carries + `10.12.16.0/22` (data-tenant), metal-internal is on `br-internal`. +4. CHECK 3: hardcoded ids/DNS -> subnets BY CIDR. +5. EXIT GATE binding plane map (old: ceph->.16 / octavia->.12.1 / nova->.12.4x / vault->.8) + -> corrected per D-052: ceph public/osd/mon->storage(.32); octavia overlay->data-tenant + (.16); nova-compute neutron-plugin->data-tenant(.16); vault default->metal-admin(.8) + + cluster->metal-internal(.12). + +**Action at completion:** replace the inline CHECK blocks in phase-01 with +`bash scripts/pre-flight-checks.sh` (document expected PASS output) and add a post-add-model +`bash scripts/juju-spaces-check.sh openstack` as the per-model space gate (the old inline +CHECK 5 ran `juju spaces` pre-model and failed "model not found"; spaces are per-model). + +--- + +## Pending runbook / file edits (apply at completion) + +1. `runbooks/phase-01-bundle-deploy.md` -- DOCFIX-039 (above): swap inline pre-flight blocks + for `bash scripts/pre-flight-checks.sh`; add post-add-model `bash scripts/juju-spaces-check.sh + openstack`; fix the 5 stale gate items; document expected output. +2. `scripts/validate.sh` -- convert UTF-8 to ASCII when implementing the D-011 runner + (phase-08). `file` reports "Unicode text, UTF-8 text" (em-dashes from the placeholder); + violates the ASCII-only convention. Currently a placeholder, not yet run. +3. Teardown runbook -- reference `scripts/osd-blank-check.sh` for the OSD-blank verification + step (replaces the inline qemu-img loop). +4. `runbooks/` README / pre-flight references -- point at the new scripts where the old + inline discovery blocks were described. + +--- + +## Findings / process learnings (this session) + +- **Paste-corruption failure class.** A hand-built base64 pre-flight block shipped two + transcription defects: `[:space:]` (single bracket, must be `[[:space:]]`) on the grep + count line, and `ENV{` instead of `END{` on the awk tally (so the summary silently never + printed). Root cause: the base64 was hand-edited AFTER testing a clean version -- the + bytes sent were never round-tripped through the sandbox. Mitigation is now standard + practice (D-054): tested scripts committed to the repo, verified by sha256 on the jumphost. + +- **Juju spaces are per-model.** `juju spaces` / `juju reload-spaces` cannot run until after + `juju add-model`; the old phase-01 CHECK 5 ran pre-model and failed with "model not found". + Split into `juju-spaces-check.sh`, gated to run post-add-model.