# v1 Redeploy -- Running Change Log

**Purpose:** Living log of design decisions, doc fixes, and runbook edits discovered
DURING the v1 redeploy rehearsal that must be folded into `docs/design-decisions.md`
and the phase runbooks UPON COMPLETION. This is the staging list for the completion
consolidation -- nothing here is applied to the runbooks or design-decisions yet.

**Status:** OPEN -- accumulating. Append-only. ASCII + LF.

**Session opened:** 2026-06-26 (redeploy from clean teardown; D-052/D-053 plane set).

**Next free numbers at session open:** design decision D-054; doc fix DOCFIX-039.
(Verified by grep of design-decisions.md: max D-053, max DOCFIX-038.)

---

## Verified-state checkpoint (measured this session -- authoritative as-built)

`scripts/pre-flight-checks.sh` @ commit 40e3f9e -- ALL PASS, exit 0, 2026-06-26:

Six MAAS planes resolved BY CIDR (subnet IDs are post-D-052-cutover, NOT the old map):

    provider-public  10.12.4.0/22   id=1   vid=0    gw=10.12.4.1   dns=[10.12.4.1]
    metal-admin      10.12.8.0/22   id=2   vid=0    gw=10.12.8.1   dns=[10.12.8.1]
    metal-internal   10.12.12.0/22  id=10  vid=103  gw=none        dns=[10.12.8.1]  (bridged br-internal)
    data-tenant      10.12.16.0/22  id=6   vid=0    gw=none        dns=[10.12.8.1]
    storage          10.12.32.0/22  id=7   vid=0    gw=none        dns=[10.12.8.1]
    replication      10.12.36.0/22  id=8   vid=0    gw=none        dns=[10.12.8.1]

Per-host data/storage NIC links by CIDR, octets .40-.43, all four hosts:
br-internal -> .12, enp8s0 -> .16, enp9s0 -> .32, enp10s0 -> .36.

Nodes openstack0-3 (4na83t / qdbqd6 / h8frng / tmsafc): all Ready, power off.
OSD secondary disks (`osd-blank-check.sh`): all four 512 GiB / 200 KiB blank, RC=0.
Bundle VIPs: 11 triple-column VIPs, aligned, .50-.60 band, OK=11 bad=0.
octavia-pki overlay: present, 5 lb-mgmt-* keys, ASCII clean.
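
The per-host NIC links above follow one pattern: plane network base plus a fixed host octet. A minimal sketch of that derivation, assuming openstack0..3 map to octets 40..43 in order (the log gives the octet range but not the per-host assignment) and using a hypothetical helper name `host_ip`:

```bash
#!/usr/bin/env bash
# Sketch only: derive a host's static address on a plane from the plane CIDR
# and the host's fixed last octet. Assumes the address is the network base
# with its last octet replaced, which holds for octets 40-43 in these /22s.

host_ip() {
  local cidr=$1 octet=$2
  local base=${cidr%/*}                      # 10.12.16.0/22 -> 10.12.16.0
  printf '%s.%s\n' "${base%.*}" "$octet"     # 10.12.16.0    -> 10.12.16.<octet>
}

# ASSUMPTION: openstack0 -> .40 ... openstack3 -> .43 (ordering not stated in the log).
for n in 0 1 2 3; do
  printf 'openstack%s enp8s0 %s\n' "$n" "$(host_ip 10.12.16.0/22 "$((40 + n))")"
done
```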

---

## Pending design-decisions.md appends

### D-054 -- Reusable tested scripts in scripts/; runbooks reference them (ADOPTED in practice; formal append pending)

**What:** Repeated discovery/verify logic lives in `scripts/`, authored and tested in a
sandbox against synthetic fixtures, committed to the repo, and referenced by the runbooks.
Runbooks document expected output and remain the gate authority; the scripts are the
executable truth. All pinned network values live once in `scripts/lib-net.sh` (single
source of truth), resolved BY CIDR (subnet IDs drift across cutovers).
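
As an illustration of "resolved BY CIDR", the sketch below pins each plane to its CIDR and looks the subnet id up at runtime. It is not the contents of `lib-net.sh`: `PLANE_CIDR`, `subnet_listing`, and `subnet_id_for_plane` are hypothetical names, and `subnet_listing` is a canned stand-in for a live MAAS subnet query, filled with the values from the verified-state checkpoint.

```bash
#!/usr/bin/env bash
# Sketch: pin planes by CIDR, resolve the current subnet id at runtime.
# All names here are illustrative, not taken from lib-net.sh.

declare -A PLANE_CIDR=(
  [provider-public]=10.12.4.0/22
  [metal-admin]=10.12.8.0/22
  [metal-internal]=10.12.12.0/22
  [data-tenant]=10.12.16.0/22
  [storage]=10.12.32.0/22
  [replication]=10.12.36.0/22
)

# Stand-in for live MAAS output, flattened to one "cidr id" pair per line.
subnet_listing() {
  printf '%s\n' \
    '10.12.4.0/22 1' '10.12.8.0/22 2' '10.12.12.0/22 10' \
    '10.12.16.0/22 6' '10.12.32.0/22 7' '10.12.36.0/22 8'
}

# Print the subnet id for a plane name; return 1 if the plane or CIDR is unknown.
subnet_id_for_plane() {
  local plane=$1 cidr
  cidr=${PLANE_CIDR[$plane]:-}
  [[ -n $cidr ]] || return 1
  subnet_listing | awk -v c="$cidr" '$1 == c { print $2; found = 1 } END { exit !found }'
}

subnet_id_for_plane metal-internal
```

Because the id is looked up rather than stored, a cutover that renumbers subnets changes only the listing, not the pinned map.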

**Delivery workflow:** author + test in sandbox -> publish file + sha256 -> commit from
Windows -> jumphost `git pull` -> `sha256sum` match -> run via `bash scripts/X.sh`.
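
The `sha256sum` match step of this workflow could be wrapped roughly as follows. This is a sketch, not a committed script; `verify_sha256` is a hypothetical helper name, and the expected hash is whatever was published alongside the file.

```bash
#!/usr/bin/env bash
# Sketch: refuse to run a script unless its sha256 matches the published value.

verify_sha256() {
  local file=$1 expected=$2 actual
  # A missing file leaves "actual" empty, which falls through to MISMATCH.
  actual=$(sha256sum "$file" 2>/dev/null | awk '{ print $1 }')
  if [[ -n $actual && $actual == "$expected" ]]; then
    printf 'sha256 OK  %s\n' "$file"
  else
    printf 'sha256 MISMATCH  %s\n  expected %s\n  actual   %s\n' \
      "$file" "$expected" "${actual:-<unreadable>}" >&2
    return 1
  fi
}

# Usage on the jumphost after git pull (placeholder hash, not a real value):
#   verify_sha256 scripts/pre-flight-checks.sh "<published sha256>" \
#     && bash scripts/pre-flight-checks.sh
```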

**Convention:** ASCII + LF (`.gitattributes` `*.sh eol=lf`); `set -euo pipefail` +
`shopt -s inherit_errexit` + `IFS=$'\n\t'`; `fail`/`warn`/`pass`/`note` helpers with
exit 0 (pass) / 1 (fatal) / 2 (warning) for gate scripts; read-only discovery kept
separate from gated mutation; `lib-net.sh` is sourced, never executed (direct-run guard).
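
One possible shape for a gate script that follows this convention is sketched below. The helper bodies and the `gate_rc` function are assumptions (the log names the helpers and exit codes but not their implementation); here failures and warnings are tallied and the exit code is decided once at the end.

```bash
#!/usr/bin/env bash
# Sketch of the gate-script convention; helper bodies are illustrative.
set -euo pipefail
shopt -s inherit_errexit
IFS=$'\n\t'

FAILS=0
WARNS=0
pass() { printf 'PASS  %s\n' "$*"; }
note() { printf 'NOTE  %s\n' "$*"; }
warn() { printf 'WARN  %s\n' "$*" >&2; WARNS=$((WARNS + 1)); }
fail() { printf 'FAIL  %s\n' "$*" >&2; FAILS=$((FAILS + 1)); }

# 0 = pass, 1 = fatal, 2 = warning; fatal outranks warning.
gate_rc() {
  if (( FAILS > 0 )); then echo 1
  elif (( WARNS > 0 )); then echo 2
  else echo 0
  fi
}

# A sourced-only library could start with a direct-run guard of this shape:
#   if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
#     echo "this file is a library: source it, do not execute it" >&2; exit 1
#   fi

pass "example check"
note "read-only discovery, no mutation"
printf 'gate rc: %s\n' "$(gate_rc)"
# A real gate script would finish with: exit "$(gate_rc)"
```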

**Why:** Eliminates the paste-corruption failure class (see Findings below) and turns
repeated discovery -- polled every redeploy cycle -- into a one-liner with a byte-identity
guarantee (sha256) instead of a fragile copy-paste block.

**Scripts added this session:** `lib-net.sh` (new), `pre-flight-checks.sh` (implemented the
placeholder), `juju-spaces-check.sh` (new), `osd-blank-check.sh` (new). All tested
end-to-end against mock `maas`/`juju` + fixtures (positive + 7 negative fault injections
for pre-flight; 4 scenarios for spaces). Committed at 40e3f9e.

---

## Pending DOCFIX entries

### DOCFIX-039 -- phase-01-bundle-deploy.md gate reconciliation (PROPOSED)

The phase-01 pre-deploy GATES encode the OLD plane layout (pre-D-052 CIDR->role map); the
deploy COMMANDS are fine. Superseded by `scripts/pre-flight-checks.sh`. Five stale items:

1. Constants: hardcoded subnet ids `1 2 6 7 8 9` + old CIDR->role map -> resolve BY CIDR
   (now in `lib-net.sh`; metal-internal is id=10 post-cutover, not id=6).
2. CHECK 1 / Step 1.3 deploy guard: provider-column-only VIP check -> triple-column
   validator (provider/admin/internal, aligned, .50-.60).
3. CHECK 2: `enp8s0` + `10.12.12.0/22` (old "data") -> links BY CIDR; `enp8s0` now carries
   `10.12.16.0/22` (data-tenant), metal-internal is on `br-internal`.
4. CHECK 3: hardcoded ids/DNS -> subnets BY CIDR.
5. EXIT GATE binding plane map (old: ceph->.16 / octavia->.12.1 / nova->.12.4x / vault->.8)
   -> corrected per D-052: ceph public/osd/mon->storage(.32); octavia overlay->data-tenant
   (.16); nova-compute neutron-plugin->data-tenant(.16); vault default->metal-admin(.8) +
   cluster->metal-internal(.12).
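
For item 2, a triple-column VIP check might look like the sketch below. Assumptions, none confirmed by this log: "aligned" is read as all three columns sharing one host octet, the column order is provider / admin / internal on 10.12.4.x / 10.12.8.x / 10.12.12.x, and `check_vip_triple` is a hypothetical name. The sample VIP strings are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch of a triple-column VIP validator under the assumptions stated above.

check_vip_triple() {
  local vip=$1 a b c extra
  IFS=' ' read -r a b c extra <<< "$vip"
  [[ -n $c && -z $extra ]] || return 1                  # exactly three columns
  [[ $a == 10.12.4.* && $b == 10.12.8.* && $c == 10.12.12.* ]] || return 1
  local oa=${a##*.} ob=${b##*.} oc=${c##*.}
  [[ $oa =~ ^[0-9]+$ ]] || return 1                     # numeric host octet
  [[ $oa == "$ob" && $ob == "$oc" ]] || return 1        # aligned across columns
  (( 10#$oa >= 50 && 10#$oa <= 60 ))                    # inside the .50-.60 band
}

ok=0 bad=0
for vip in "10.12.4.50 10.12.8.50 10.12.12.50" "10.12.4.51"; do
  if check_vip_triple "$vip"; then ok=$((ok + 1)); else bad=$((bad + 1)); fi
done
printf 'OK=%d bad=%d\n' "$ok" "$bad"
```

The second sample is a provider-column-only value, the shape the old CHECK 1 would have accepted and this check rejects.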

**Action at completion:** replace the inline CHECK blocks in phase-01 with
`bash scripts/pre-flight-checks.sh` (document expected PASS output) and add a post-add-model
`bash scripts/juju-spaces-check.sh openstack` as the per-model space gate (the old inline
CHECK 5 ran `juju spaces` pre-model and failed "model not found"; spaces are per-model).

---

## Pending runbook / file edits (apply at completion)

1. `runbooks/phase-01-bundle-deploy.md` -- DOCFIX-039 (above): swap inline pre-flight blocks
   for `bash scripts/pre-flight-checks.sh`; add post-add-model `bash scripts/juju-spaces-check.sh
   openstack`; fix the 5 stale gate items; document expected output.
2. `scripts/validate.sh` -- convert UTF-8 to ASCII when implementing the D-011 runner
   (phase-08). `file` reports "Unicode text, UTF-8 text" (em-dashes from the placeholder);
   violates the ASCII-only convention. Currently a placeholder, not yet run.
3. Teardown runbook -- reference `scripts/osd-blank-check.sh` for the OSD-blank verification
   step (replaces the inline qemu-img loop).
4. `runbooks/` README / pre-flight references -- point at the new scripts where the old
   inline discovery blocks were described.

---

## Findings / process learnings (this session)

- **Paste-corruption failure class.** A hand-built base64 pre-flight block shipped two
  transcription defects: `[:space:]` (single bracket, must be `[[:space:]]`) on the grep
  count line, and `ENV{` instead of `END{` on the awk tally (so the summary silently never
  printed). Root cause: the base64 was hand-edited AFTER testing a clean version -- the
  bytes sent were never round-tripped through the sandbox. Mitigation is now standard
  practice (D-054): tested scripts committed to the repo, verified by sha256 on the jumphost.

- **Juju spaces are per-model.** `juju spaces` / `juju reload-spaces` cannot run until after
  `juju add-model`; the old phase-01 CHECK 5 ran pre-model and failed with "model not found".
  Split into `juju-spaces-check.sh`, gated to run post-add-model.

- **Default-space globally poisons network-get (deploy root cause).** The full D-052
  binding deploy failed universally (`network-get ... ERROR space "metal" not found`,
  install hook dies on nearly every charm). Every static layer was correct -- bundle,
  model bindings, MAAS spaces/VLANs/per-NIC space tags all read `metal-internal`. The
  single stale value was controller `model-defaults default-space = metal` (a dead
  pre-D-052 name). An INVALID default-space poisons `network-get` for ALL endpoints
  regardless of their explicit binding. Fix: run
  `juju model-defaults default-space=metal-admin` (a live space) before `juju add-model`.
  A gate asserting that `default-space` resolves to a live space is to be added to
  `pre-flight-checks.sh`.

- **Teardown --destroy-storage on virsh DELETES machine objects (does NOT release).**
  The phase-00 teardown (`juju destroy-model openstack --force --destroy-storage` then
  per-host `maas machine release`) assumes release-to-Ready. On a virsh/KVM MAAS,
  `--destroy-storage` DECOMPOSES (deletes) the VM-backed machine objects. All four
  openstack hosts were removed from MAAS. Recoverable only because the libvirt domains
  + disks (incl. the blank OSD vdb) survived. See D-055.
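
The default-space finding suggests a gate of roughly this shape. It is a sketch of the membership test only: `default_space_is_live` is a hypothetical name, the inputs are plain arguments so the logic runs without a controller, and the space list is assumed to mirror the six plane names (the log confirms only `metal-admin` and `metal-internal` as live spaces and `metal` as dead).

```bash
#!/usr/bin/env bash
# Sketch: fail when the configured default-space is not among the live spaces.
# In a real gate both inputs would be read from MAAS/juju; that wiring is omitted.

default_space_is_live() {
  local default_space=$1; shift
  local space
  for space in "$@"; do
    [[ $space == "$default_space" ]] && return 0
  done
  return 1
}

# ASSUMPTION: space names mirror the plane names.
live_spaces=(provider-public metal-admin metal-internal data-tenant storage replication)

for candidate in metal metal-admin; do
  if default_space_is_live "$candidate" "${live_spaces[@]}"; then
    printf 'PASS  default-space=%s is a live space\n' "$candidate"
  else
    printf 'FAIL  default-space=%s is not a live space\n' "$candidate"
  fi
done
```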

---

## Pending design-decisions.md appends (continued)

### D-055 -- virsh teardown defect + host re-enrollment procedure (ADOPTED)

**Defect:** `juju destroy-model --destroy-storage` against virsh-power MAAS machines
deletes (decomposes) the machine objects rather than releasing them to Ready. The
phase-00 teardown must NOT pass `--destroy-storage` for virsh hosts; release to Ready
without it.

**Recovery (now a reusable procedure):** the libvirt domains survive, so re-enroll via
`maas admin machines create` per host with virsh power + the boot NIC MAC (NOT add-chassis
-- it would re-grab juju/lxd/tailscale). `machines create` auto-commissions
(New->Commissioning->Ready) by PXE off the 2_metal boot NIC. Then re-tag `openstack`,
then reconstruct the host interface tree (Strategy-B carve, from the captured as-built),
then verify (pre-flight), then redeploy with the default-space fix.

**Artifacts:** `scripts/lib-hosts.sh`, `scripts/reenroll-hosts.sh`,
`docs/maas-as-built-reference.md`. Proven live on openstack0 (2026-06-26): created with
virsh power, commissioned, reached Ready, all six NICs discovered, boot NIC on 2_metal.

### DOCFIX-040 -- host identity must be hostname-keyed, not system_id-keyed

`lib-net.sh` lines 45-47 key the host maps (`SYSIDS`, `SYSID_HOST`, `SYSID_OCTET`) on
the system_ids 4na83t/qdbqd6/h8frng/tmsafc -- which DIED on re-enrollment (new random
ids). Any script keyed on them silently breaks. New `scripts/lib-hosts.sh` keys all host
identity on hostname (stable) and resolves system_id at runtime (`host_sysid`). At
completion: retire the SYSID-keyed maps from lib-net.sh (or repoint them to lib-hosts).
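
A sketch of the hostname-keyed pattern, not the actual `lib-hosts.sh`: only the function name `host_sysid` comes from this log. `machine_listing` is a hypothetical stand-in for a live MAAS query, and the ids in it are placeholders rather than real system_ids.

```bash
#!/usr/bin/env bash
# Sketch: key host identity on hostname; resolve system_id only when needed.

HOSTS=(openstack0 openstack1 openstack2 openstack3)

# Stand-in for live MAAS output, one "hostname system_id" pair per line.
# The ids are PLACEHOLDERS for illustration only.
machine_listing() {
  printf '%s\n' \
    'openstack0 aaaaaa' 'openstack1 bbbbbb' \
    'openstack2 cccccc' 'openstack3 dddddd'
}

# Print the current system_id for a hostname; return 1 if it is not enrolled.
host_sysid() {
  local host=$1
  machine_listing | awk -v h="$host" '$1 == h { print $2; found = 1 } END { exit !found }'
}

for h in "${HOSTS[@]}"; do
  if sid=$(host_sysid "$h"); then
    printf '%s -> %s\n' "$h" "$sid"
  else
    printf '%s -> NOT-ENROLLED\n' "$h"
  fi
done
```

Nothing in the caller stores a system_id, so a re-enrollment that mints new ids needs no script edits.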

---

## Security note (action required)

The libvirt SSH password for `logxen@10.12.64.1` was printed in plaintext on 2026-06-26 by
`maas admin machine power-parameters` during virsh power-template discovery. Treat as
exposed: **rotate the libvirt SSH credential after the rebuild** and scrub terminal
scrollback. Runbook rule added: never use `machine power-parameters` for templating; read
`power_type` and reconstruct the address pattern instead. `reenroll-hosts.sh` reads the
password interactively (never a CLI arg, never logged, never in the repo).
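
An interactive read of this kind can be sketched as below. `read_secret` is a hypothetical helper, not the code in `reenroll-hosts.sh`; it shows only the no-echo read into a named variable.

```bash
#!/usr/bin/env bash
# Sketch: read a secret without echo, without argv, and without printing it.

read_secret() {
  local __name=$1 __value
  # -s suppresses echo, -r keeps backslashes literal; bash shows the prompt
  # only when stdin is a terminal.
  IFS= read -rs -p 'libvirt SSH password: ' __value || return 1
  if [[ -t 0 ]]; then printf '\n' >&2; fi    # finish the prompt line on a tty
  [[ -n $__value ]] || return 1              # reject an empty secret
  printf -v "$__name" '%s' "$__value"        # assign to the caller's variable
}

# Usage: read_secret LIBVIRT_PASS   (then use "$LIBVIRT_PASS"; never echo it)
```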

---

## Scripts / docs added (this batch)

- `scripts/lib-hosts.sh` -- hostname-keyed host identity + virsh power constants (no secret).
- `scripts/reenroll-hosts.sh` -- gated/idempotent re-enrollment (auto-commission, poll Ready,
  boot-NIC-on-2_metal verify; --check read-only mode). Tested: bash -n, shellcheck clean,
  mock-maas behavior test of --check (discover-by-hostname, NOT-ENROLLED detection, exit 0).
- `docs/maas-as-built-reference.md` -- captured MAAS substrate + per-host NIC inventory +
  interface-carve target + virsh template, for DC-DC replay.
- Pending next artifact: the Strategy-B interface-carve script (built once all four are Ready;
  bridge_type pulled verbatim from captured release JSON) -> then consolidate into
  `runbooks/phase-00b-host-reenrollment.md`.

### DOCFIX-041 -- as-built reference: br-ex is charm-built, not a MAAS bridge

Correction to `docs/maas-as-built-reference.md` (first committed this session). The
bundle's ovn-chassis `bridge-interface-mappings` maps `br-ex:<provider-MAC>` for all
four hosts -> **br-ex is built by the ovn-chassis charm at deploy (OVS), enslaving the
provider NIC by MAC; it is NOT a MAAS interface.** The MAAS carve therefore:
- provider plane = **raw enp1s0 + static 10.12.4.N** (MAAS leaves it raw; the charm
  enslaves it into br-ex at deploy). MAAS does NOT create br-ex.
- storage/replication = raw enp9s0/enp10s0 + statics; Juju auto-bridges them
  (br-enp9s0/br-enp10s0, Linux) at deploy.
- the ONLY MAAS-built bridges are the metal-internal stack:
  enp7s0 -> br-metal -> br-metal.103 (VID 103) -> br-internal.

bridge_type: br-internal = standard (confirmed, D-052 command). br-metal = standard
(RECOMMENDED, reasoned-not-measured -- original bring-up predates the repo and the
capture did not preserve bridge_type; pending confirmation before the carve). The
deployed-host `ip`-level read that showed br-metal/br-internal "OVS" was taken during
the FAILED deploy and is reclassified UNRELIABLE.

### Carve script added + MAAS interface CLI confirmations

- `scripts/carve-host-interfaces.sh <hostname> [--apply]` -- Strategy-B per-host
  interface carve. Default DRY-RUN (resolves every id live, prints each mutation it
  WOULD run, changes nothing); --apply executes. Idempotent (skips existing
  bridge/vlan/link), resolves system_id by hostname / interface id by name / subnet
  id + VLAN object id by CIDR, asserts metal-internal is VID 103, requires Ready.
  Builds: enp1s0 raw+static (provider); enp7s0 -> br-metal(std) -> br-metal.103(VID
  103) -> br-internal(std); enp8/9/10 raw+static (data/storage/repl); enp11s0 idle.
  Does NOT create br-ex (charm-built). Tested: bash -n, shellcheck clean, mock-MAAS
  dry-run (full id resolution + command preview), input guards.

- MAAS 3.7 interface CLI confirmed (canonical.com/maas/docs/3.7 reference):
  create-bridge takes `bridge_type=standard|ovs parent=<ifid> vlan=<vlan-obj-id>`;
  create-vlan takes `vlan=<VLAN-OBJECT-ID> parent=<ifid>` (NOT the VID tag -- resolve
  the object id via the metal-internal subnet); link-subnet `mode=STATIC
  subnet=<id> ip_address=<ip>`; a NIC is moved to a plane's fabric via `interface
  update <sid> <ifid> vlan=<vlan-obj-id>` before link-subnet (re-enrolled raw NICs
  sit on transient auto-fabrics).

- FINDING (teardown runbook bug): `runbooks/phase-00-teardown-maas-reset.md`
  "Phase 3" link-subnet block uses PRE-D-052 CIDRs
  (`enp8s0=10.12.12.0/22 enp9s0=10.12.16.0/22 enp10s0=10.12.20.0/22`) and dead
  system_ids -- it would link NICs to the WRONG subnets (10.12.12 is now
  metal-internal, 10.12.16 is now data-tenant, 10.12.20 no longer exists). Must be
  rewritten against the current planes and keyed on hostname before that runbook is
  trusted. Note: the normal release-to-Ready path PRESERVES host interfaces, so that
  block has only ever run after a normal teardown, against hosts whose bridges were
  still intact; the full carve (this script) is needed only after a decompose, which
  is why the bridges were never scripted before.
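
The dry-run-by-default behavior described for `carve-host-interfaces.sh` could be structured as in this sketch. It is not the real script: `parse_args` and `run_mutation` are hypothetical names, `echo` stands in for the `maas` CLI, and the subnet id and address in the sample call are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: mutations are previewed unless --apply was given.

APPLY=0

parse_args() {
  local arg
  for arg in "$@"; do
    if [[ $arg == --apply ]]; then APPLY=1; fi
  done
}

# Execute the command when APPLY=1; otherwise print what would run.
run_mutation() {
  if (( APPLY )); then
    "$@"
  else
    printf 'DRY-RUN would run:'
    printf ' %q' "$@"
    printf '\n'
  fi
}

parse_args "$@"
# "echo" is a stand-in for the real CLI call; values are illustrative.
run_mutation echo link-subnet mode=STATIC subnet=6 ip_address=10.12.16.40
```

Routing every mutating call through one wrapper keeps the preview and the applied command identical, since both are built from the same argument list.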
