Purpose: Living log of design decisions, doc fixes, and runbook edits discovered DURING the v1 redeploy rehearsal that must be folded into docs/design-decisions.md and the phase runbooks UPON COMPLETION. This is the staging list for the completion consolidation -- nothing here is applied to the runbooks or design-decisions yet.
Status: OPEN -- accumulating. Append-only. ASCII + LF.
Session opened: 2026-06-26 (redeploy from clean teardown; D-052/D-053 plane set).
Next free numbers at session open: design decision D-054; doc fix DOCFIX-039. (Verified by grep of design-decisions.md: max D-053, max DOCFIX-038.)
scripts/pre-flight-checks.sh @ commit 40e3f9e -- ALL PASS, exit 0, 2026-06-26:
Six MAAS planes resolved BY CIDR (subnet IDs are post-D-052-cutover, NOT the old map):
  provider-public  10.12.4.0/22   id=1   vid=0    gw=10.12.4.1  dns=[10.12.4.1]
  metal-admin      10.12.8.0/22   id=2   vid=0    gw=10.12.8.1  dns=[10.12.8.1]
  metal-internal   10.12.12.0/22  id=10  vid=103  gw=none       dns=[10.12.8.1]  (bridged br-internal)
  data-tenant      10.12.16.0/22  id=6   vid=0    gw=none       dns=[10.12.8.1]
  storage          10.12.32.0/22  id=7   vid=0    gw=none       dns=[10.12.8.1]
  replication      10.12.36.0/22  id=8   vid=0    gw=none       dns=[10.12.8.1]
Per-host data/storage NIC links by CIDR, octets .40-.43, all four hosts: br-internal -> .12, enp8s0 -> .16, enp9s0 -> .32, enp10s0 -> .36.
Nodes openstack0-3 (4na83t / qdbqd6 / h8frng / tmsafc): all Ready, power off. OSD secondary disks (osd-blank-check.sh): all four 512 GiB / 200 KiB blank, RC=0. Bundle VIPs: 11 triple-column VIPs, aligned, .50-.60 band, OK=11 bad=0. octavia-pki overlay: present, 5 lb-mgmt-* keys, ASCII clean.
What: Repeated discovery/verify logic lives in scripts/, authored and tested in a sandbox against synthetic fixtures, committed to the repo, and referenced by the runbooks. Runbooks document expected output and remain the gate authority; the scripts are the executable truth. All pinned network values live once in scripts/lib-net.sh (single source of truth), resolved BY CIDR (subnet IDs drift across cutovers).
Delivery workflow: author + test in sandbox -> publish file + sha256 -> commit from Windows -> jumphost git pull -> sha256sum match -> run via bash scripts/X.sh.
Convention: ASCII + LF (.gitattributes *.sh eol=lf); set -euo pipefail + shopt -s inherit_errexit + IFS=$'\n\t'; fail/warn/pass/note helpers with exit 0 (pass) / 1 (fatal) / 2 (warning) for gate scripts; read-only discovery kept separate from gated mutation; lib-net.sh is sourced, never executed (direct-run guard).
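A minimal sketch of the gate-script convention above. Only the helper names (fail/warn/pass/note), the shell options, and the 0/1/2 exit codes come from this log; the helper bodies and the gate_rc function are assumptions for illustration, not the committed code.

```shell
#!/usr/bin/env bash
# Sketch only: helper bodies and gate_rc are assumptions, not the repo code.
set -euo pipefail
shopt -s inherit_errexit   # needs bash >= 4.4
IFS=$'\n\t'

FAILS=0
WARNS=0

# "$1" rather than "$*": with IFS set as above, "$*" would join on newline.
pass() { printf 'PASS  %s\n' "$1"; }
note() { printf 'NOTE  %s\n' "$1"; }
warn() { printf 'WARN  %s\n' "$1"; WARNS=$((WARNS + 1)); }
fail() { printf 'FAIL  %s\n' "$1"; FAILS=$((FAILS + 1)); }

# Map the tallies onto the documented codes: 0 pass / 1 fatal / 2 warning.
gate_rc() {
  if (( FAILS > 0 )); then
    echo 1
  elif (( WARNS > 0 )); then
    echo 2
  else
    echo 0
  fi
}

pass "example check"
echo "rc=$(gate_rc)"
```

A gate script built this way would presumably end with exit "$(gate_rc)" so that a fatal finding always outranks a warning.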
Why: Eliminates the paste-corruption failure class (see Findings below) and turns repeated discovery -- polled every redeploy cycle -- into a one-liner with a byte-identity guarantee (sha256) instead of a fragile copy-paste block.
Scripts added this session: lib-net.sh (new), pre-flight-checks.sh (implemented the placeholder), juju-spaces-check.sh (new), osd-blank-check.sh (new). All tested end-to-end against mock maas/juju + fixtures (positive + 7 negative fault injections for pre-flight; 4 scenarios for spaces). Committed at 40e3f9e.
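The sha256sum-match step of the delivery workflow can be sketched as below. verify_sha256 is a hypothetical helper (not one of the committed scripts), and the demo hashes a throwaway file rather than a real published script.

```shell
#!/usr/bin/env bash
# Sketch only: verify_sha256 is a hypothetical helper, not a repo script.
set -euo pipefail

# Return 0 when FILE's sha256 equals EXPECTED, 1 otherwise.
verify_sha256() {
  local file=$1 expected=$2 actual
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [[ "$actual" == "$expected" ]]; then
    printf 'PASS  %s sha256 matches\n' "$file"
  else
    printf 'FAIL  %s sha256 mismatch (got %s)\n' "$file" "$actual"
    return 1
  fi
}

# Demo on a throwaway file; the real inputs are the pulled script and the
# hash published alongside it.
demo=$(mktemp)
printf 'echo hello\n' > "$demo"
published=$(sha256sum "$demo" | awk '{print $1}')
verify_sha256 "$demo" "$published"
rm -f "$demo"
```

Comparing against a hash published out-of-band is what gives the byte-identity guarantee; a hash computed on the jumphost alone would only prove the file equals itself.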
The phase-01 pre-deploy GATES encode the OLD plane layout (pre-D-052 CIDR->role map); the deploy COMMANDS are fine. Superseded by scripts/pre-flight-checks.sh. Five stale items:
  1 2 6 7 8 9 + old CIDR->role map -> resolve BY CIDR (now in lib-net.sh; metal-internal is id=10 post-cutover, not id=6).
  enp8s0 + 10.12.12.0/22 (old "data") -> links BY CIDR; enp8s0 now carries 10.12.16.0/22 (data-tenant), metal-internal is on br-internal.
Action at completion: replace the inline CHECK blocks in phase-01 with bash scripts/pre-flight-checks.sh (document expected PASS output) and add a post-add-model bash scripts/juju-spaces-check.sh openstack as the per-model space gate (the old inline CHECK 5 ran juju spaces pre-model and failed "model not found"; spaces are per-model).
runbooks/phase-01-bundle-deploy.md -- DOCFIX-039 (above): swap inline pre-flight blocks for bash scripts/pre-flight-checks.sh; add post-add-model bash scripts/juju-spaces-check.sh openstack; fix the 5 stale gate items; document expected output.
scripts/validate.sh -- convert UTF-8 to ASCII when implementing the D-011 runner (phase-08). file reports "Unicode text, UTF-8 text" (em-dashes from the placeholder); violates the ASCII-only convention. Currently a placeholder, not yet run.
scripts/osd-blank-check.sh for the OSD-blank verification step (replaces the inline qemu-img loop).
runbooks/ README / pre-flight references -- point at the new scripts where the old inline discovery blocks were described.
Paste-corruption failure class. A hand-built base64 pre-flight block shipped two transcription defects: [:space:] (single bracket, must be [[:space:]]) on the grep count line, and ENV{ instead of END{ on the awk tally (so the summary silently never printed). Root cause: the base64 was hand-edited AFTER testing a clean version -- the bytes sent were never round-tripped through the sandbox. Mitigation is now standard practice (D-054): tested scripts committed to the repo, verified by sha256 on the jumphost.
Juju spaces are per-model. juju spaces / juju reload-spaces cannot run until after juju add-model; the old phase-01 CHECK 5 ran pre-model and failed with "model not found". Split into juju-spaces-check.sh, gated to run post-add-model.
Default-space globally poisons network-get (deploy root cause). The full D-052 binding deploy failed universally (network-get ... ERROR space "metal" not found, install hook dies on nearly every charm). Every static layer was correct -- bundle, model bindings, MAAS spaces/VLANs/per-NIC space tags all read metal-internal. The single stale value was controller model-defaults default-space = metal (a dead pre-D-052 name). An INVALID default-space poisons network-get for ALL endpoints regardless of their explicit binding. Fix: set juju model-defaults default-space=metal-admin (a live space) before add-model. A default-space-resolves-to-a-live-space gate is to be added to pre-flight-checks.sh.
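One possible shape for the default-space gate, as a sketch under stated assumptions: default_space_is_live is a hypothetical function, and the space list is synthetic (it reuses the plane names from this log). A real gate would read both the default-space value and the live space list from Juju.

```shell
#!/usr/bin/env bash
# Sketch only: default_space_is_live is a hypothetical gate function.
set -euo pipefail

# Usage: default_space_is_live DEFAULT_SPACE LIVE_SPACE...
# Returns 0 only when DEFAULT_SPACE names one of the live spaces.
default_space_is_live() {
  local want=$1 space
  shift
  for space in "$@"; do
    if [[ "$space" == "$want" ]]; then
      return 0
    fi
  done
  return 1
}

# Synthetic space list; a real gate would read it from the controller.
live_spaces=(provider-public metal-admin metal-internal data-tenant storage replication)

for candidate in metal metal-admin; do
  if default_space_is_live "$candidate" "${live_spaces[@]}"; then
    echo "PASS  default-space=$candidate resolves to a live space"
  else
    echo "FAIL  default-space=$candidate is not a live space"
  fi
done
```

The dead name metal fails the check while metal-admin passes, which is the distinction that would have caught the deploy root cause before add-model.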
Teardown --destroy-storage on virsh DELETES machine objects (does NOT release). The phase-00 teardown (juju destroy-model openstack --force --destroy-storage then per-host maas machine release) assumes release-to-Ready. On a virsh/KVM MAAS, --destroy-storage DECOMPOSES (deletes) the VM-backed machine objects. All four openstack hosts were removed from MAAS. Recoverable only because the libvirt domains survived.
Defect: juju destroy-model --destroy-storage against virsh-power MAAS machines deletes (decomposes) the machine objects rather than releasing them to Ready. The phase-00 teardown must NOT pass --destroy-storage for virsh hosts; release to Ready without it.
Recovery (now a reusable procedure): the libvirt domains survive, so re-enroll via maas admin machines create per host with virsh power + the boot NIC MAC (NOT add-chassis -- it would re-grab juju/lxd/tailscale). machines create auto-commissions (New->Commissioning->Ready) by PXE off the 2_metal boot NIC. Then re-tag openstack, then reconstruct the host interface tree (Strategy-B carve, from the captured as-built), then verify (pre-flight), then redeploy with the default-space fix.
Artifacts: scripts/lib-hosts.sh, scripts/reenroll-hosts.sh, docs/maas-as-built-reference.md. Proven live on openstack0 (2026-06-26): created virsh, commissioned, Ready, all six NICs discovered, boot NIC on 2_metal.
lib-net.sh lines 45-47 key the host maps (SYSIDS, SYSID_HOST, SYSID_OCTET) on the system_ids 4na83t/qdbqd6/h8frng/tmsafc -- which DIED on re-enrollment (new random ids). Any script keyed on them silently breaks. New scripts/lib-hosts.sh keys all host identity on hostname (stable) and resolves system_id at runtime (host_sysid). At completion: retire the SYSID-keyed maps from lib-net.sh (or repoint them to lib-hosts).
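A minimal sketch of hostname-keyed host identity. HOSTS and HOST_OCTET are names this log attributes to lib-hosts.sh, but their real definitions may differ; host_ip is a hypothetical helper. The runtime system_id lookup (host_sysid) is omitted because it needs a live MAAS.

```shell
#!/usr/bin/env bash
# Sketch only: the real lib-hosts.sh may differ; host_ip is hypothetical.
set -euo pipefail

# Stable identity: hostname -> host octet (.40-.43 per this log).
declare -A HOST_OCTET=(
  [openstack0]=40 [openstack1]=41 [openstack2]=42 [openstack3]=43
)
HOSTS=(openstack0 openstack1 openstack2 openstack3)

# Derive a host's static IP on a plane from the plane CIDR and the octet.
# Assumes the host address sits in the CIDR's first /24 (true for .40-.43).
host_ip() {
  local cidr=$1 host=$2
  local net=${cidr%/*}          # 10.12.16.0/22 -> 10.12.16.0
  printf '%s.%s\n' "${net%.*}" "${HOST_OCTET[$host]}"
}

for h in "${HOSTS[@]}"; do
  echo "$h data-tenant=$(host_ip 10.12.16.0/22 "$h")"
done
```

Because nothing here depends on a system_id, the map survives a re-enrollment unchanged.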
The libvirt SSH password (logxen@10.12.64.1) was printed in plaintext on 2026-06-26 by maas admin machine power-parameters during virsh power-template discovery. Treat as exposed: rotate the libvirt SSH credential after the rebuild and scrub terminal scrollback. Runbook rule added: never use machine power-parameters for templating; read power_type and reconstruct the address pattern instead. reenroll-hosts.sh reads the password interactively (never a CLI arg, never logged, never in the repo).
scripts/lib-hosts.sh -- hostname-keyed host identity + virsh power constants (no secret).
scripts/reenroll-hosts.sh -- gated/idempotent re-enrollment (auto-commission, poll Ready, boot-NIC-on-2_metal verify; --check read-only mode). Tested: bash -n, shellcheck clean, mock-maas behavior test of --check (discover-by-hostname, NOT-ENROLLED detection, exit 0).
docs/maas-as-built-reference.md -- captured MAAS substrate + per-host NIC inventory + interface-carve target + virsh template, for DC-DC replay.
runbooks/phase-00b-host-reenrollment.md.
Correction to docs/maas-as-built-reference.md (first committed this session). The bundle's ovn-chassis bridge-interface-mappings maps br-ex:<provider-MAC> for all four hosts -> br-ex is built by the ovn-chassis charm at deploy (OVS), enslaving the provider NIC by MAC; it is NOT a MAAS interface. The MAAS carve therefore:
bridge_type: br-internal = standard (confirmed, D-052 command). br-metal = standard (RECOMMENDED, reasoned-not-measured -- original bring-up predates the repo and the capture did not preserve bridge_type; pending confirm before carve). The deployed-host ip-level read that showed br-metal/br-internal "OVS" was taken during the FAILED deploy and is reclassified UNRELIABLE.
scripts/carve-host-interfaces.sh <hostname> [--apply] -- Strategy-B per-host interface carve. Default DRY-RUN (resolves every id live, prints each mutation it WOULD run, changes nothing); --apply executes. Idempotent (skips existing bridge/vlan/link), resolves system_id by hostname / interface id by name / subnet id + VLAN object id by CIDR, asserts metal-internal is VID 103, requires Ready. Builds: enp1s0 raw+static (provider); enp7s0 -> br-metal(std) -> br-metal.103(VID 103) -> br-internal(std); enp8/9/10 raw+static (data/storage/repl); enp11s0 idle. Does NOT create br-ex (charm-built). Tested: bash -n, shellcheck clean, mock-MAAS dry-run (full id resolution + command preview), input guards.
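The dry-run-by-default behavior can be sketched as follows. This emit is an assumption modelled on the description in this log (preview in dry-run, execute under --apply, surface the error text on failure); it is not the committed function, and echo stands in for a real maas CLI call.

```shell
#!/usr/bin/env bash
# Sketch only: this emit is an assumption modelled on the log's description.
set -euo pipefail

APPLY=0   # 0 = dry-run (default), 1 = --apply

# Print the mutation in dry-run; execute it under --apply and surface any
# error text instead of discarding it.
emit() {
  if (( APPLY == 0 )); then
    printf 'DRY-RUN would run:'
    printf ' %q' "$@"
    printf '\n'
    return 0
  fi
  local out
  if ! out=$("$@" 2>&1); then
    printf 'FAIL  %s\n%s\n' "$1" "$out"
    return 1
  fi
  printf 'OK    %s\n' "$1"
}

# Stand-in command; the real script would pass a maas CLI invocation.
emit echo link-subnet mode=STATIC
APPLY=1
emit echo link-subnet mode=STATIC
```

Routing every mutation through one function keeps the preview and the executed command identical, so a dry-run cannot drift from what --apply does.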
MAAS 3.7 interface CLI confirmed (canonical.com/maas/docs/3.7 reference): create-bridge takes bridge_type=standard|ovs parent=<ifid> vlan=<vlan-obj-id>; create-vlan takes vlan=<VLAN-OBJECT-ID> parent=<ifid> (NOT the VID tag -- resolve the object id via the metal-internal subnet); link-subnet mode=STATIC subnet=<id> ip_address=<ip>; a NIC is moved to a plane's fabric via interface update <sid> <ifid> vlan=<vlan-obj-id> before link-subnet (re-enrolled raw NICs sit on transient auto-fabrics).
FINDING (teardown runbook bug): runbooks/phase-00-teardown-maas-reset.md "Phase 3" link-subnet block uses PRE-D-052 CIDRs (enp8s0=10.12.12.0/22 enp9s0=10.12.16.0/22 enp10s0=10.12.20.0/22) and dead system_ids -- it would link NICs to the WRONG subnets (10.12.12 is now metal-internal, 10.12.16 is now data-tenant, 10.12.20 no longer exists). Must be rewritten to current planes + hostname-keyed before that runbook is trusted. Note: the normal release-to-Ready path PRESERVES host interfaces, so that block only ran on a normal teardown; the full carve (this script) is needed only after a decompose, which is why the bridges were never scripted before.
Root cause (cost several diagnostic rounds): after re-enrollment each host PXE-leases its own metal IP (10.12.8.4N) at commission. MAAS records this as a StaticIPAddress of alloc_type 6 (DISCOVERED) tied to the node via its boot NIC. This is a SEPARATE object from the network-discovery table (discoveries clear-by-mac-and-ip does NOT clear it) and from user allocations (ipaddresses read user-scope does NOT show it). It causes link-subnet ... ip_address=10.12.8.4N to fail with the misleading "IP address is already in use".
Authoritative read (the lesson): maas admin subnet ip-addresses <subnet_id> reports every in-use IP WITH its alloc_type and owning node -- this is the single correct "who holds this IP and why" query. Lead with it; do not probe ipaddresses/discovery/leases piecemeal.
Release: maas admin ipaddresses release ip=<ip> force=true discovered=true (BOTH flags required; force alone returns "does not exist" for a discovered address).
Script fix (carve-host-interfaces.sh): release_self_discovered() runs before every STATIC link -- releases an alloc_type-6 record for the target IP ONLY when its owning node == this host (node_summary.system_id), and REFUSES (fatal) if a different node discovered it (a real conflict). Plus emit now captures and prints the MAAS error on a failed mutation instead of discarding it to /dev/null (the discard hid the real message and prolonged diagnosis). Only the metal plane (dhcp_on=true) is affected; the no-DHCP planes never produced a self-lease. Verified: mock self-release path + foreign-node refuse gate.
NOTE (design consistency, not a blocker): host statics .40-.43 sit inside the metal-admin/provider/internal VIP+mgmt reserve band (.2-.100). A reserved range blocks AUTO assignment, not explicit STATIC, so it did not break the carve -- but host octets arguably belong outside the VIP band. Log for the reserve-layout review.
Reviewed all MAAS scripts against what this session actually hit, so the DC-DC build replays cleanly instead of re-deriving the metal-IP archaeology.
carve gate rewrite (the big one). release_self_discovered keyed on node_summary.system_id, which is EMPTY on a fresh discovered record -> it silently no-op'd and the metal static (.8.41/.42/.43) had to be released by hand on three hosts. Replaced with release_self_indexed: the target is this host's architecturally-indexed metal IP (10.12.8.<octet>, from HOST_OCTET), so a DISCOVERED observation on it is this host's own commissioning ghost. SAFETY: refuses if the record's system_id (when present) OR the discoveries-table MAC (when present) identifies a DIFFERENT host; releases otherwise. Removed the (unneeded) release call from carve_raw -- the no-DHCP planes never produce discovered records. Tested: 5 branches (foreign-sysid refuse, foreign-MAC refuse, indexed-basis release, MAC-basis release, no-record no-op).
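The five branches listed above can be sketched as a pure decision function. release_decision is hypothetical and only mirrors the stated safety rule; it is not the committed release_self_indexed, and the ids and MACs are synthetic.

```shell
#!/usr/bin/env bash
# Sketch only: release_decision is hypothetical; it mirrors the five branches
# described in this log but is not the committed release_self_indexed.
set -euo pipefail

# Args: HAS_RECORD(0|1) OWN_SYSID REC_SYSID OWN_MAC REC_MAC
# Empty REC_SYSID / REC_MAC mean "not present on the record".
# Prints one of: noop | refuse | release
release_decision() {
  local has_record=$1 own_sysid=$2 rec_sysid=$3 own_mac=$4 rec_mac=$5
  if [[ "$has_record" != 1 ]]; then
    echo noop; return 0
  fi
  if [[ -n "$rec_sysid" && "$rec_sysid" != "$own_sysid" ]]; then
    echo refuse; return 0
  fi
  if [[ -n "$rec_mac" && "$rec_mac" != "$own_mac" ]]; then
    echo refuse; return 0
  fi
  echo release
}

# Synthetic ids and MACs, for illustration only.
release_decision 0 aaa111 ""     52:54:00:00:00:01 ""                  # no record
release_decision 1 aaa111 bbb222 52:54:00:00:00:01 ""                  # foreign sysid
release_decision 1 aaa111 ""     52:54:00:00:00:01 52:54:00:00:00:02   # foreign MAC
release_decision 1 aaa111 ""     52:54:00:00:00:01 ""                  # indexed basis
release_decision 1 aaa111 ""     52:54:00:00:00:01 52:54:00:00:00:01   # MAC basis
```

Keeping the decision separate from the MAAS calls is what makes each branch testable offline; a real implementation would also need to normalize MAC case before comparing.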
missing step added: openstack tag. reenroll-hosts.sh now ensures the openstack tag exists and applies it to all four hosts after the Ready/boot-NIC gate (idempotent; --check-aware). Without it the bundle cannot place units (constraint tags=openstack). Was a manual step every rebuild.
DOCFIX-040 COMPLETE. pre-flight-checks.sh and osd-blank-check.sh both looped over the dead system_ids (4na83t...) via lib-net's SYSID maps -- broken for any rebuilt/DC-DC cluster. Migrated both to hostname-keyed (lib-hosts HOSTS / HOST_OCTET / host_sysid). Retired the SYSID/SYSID_HOST/SYSID_OCTET maps from lib-net.sh and added its sourced-library shellcheck directive. osd-blank verified via mock (iterates the four hostnames, RC=0).
validate.sh: em-dashes -> ASCII (the silent-UnicodeDecodeError class; ASCII-only rule for all scripts). Still a placeholder body otherwise.
REMAINING DC-DC scope (done MANUALLY this session; scripting them would make the bring-up fully hands-off -- NOT yet built):
Session: 2026-06-27. Origin/jumphost HEAD at phase-02 start: 1a103f5 ("Create phase-02-vault-preflight.sh"; was 68a0bd5 at the redeploy handoff). Model: openstack. Next free numbers at section open: design decision D-056; doc fix DOCFIX-042 (verified by grep: changelog max D-055 / DOCFIX-041; design-decisions max D-053 / DOCFIX-038).
Manual A-E audit on the jumphost cleared all gates; the new scripts/phase-02-vault-preflight.sh then reproduced it identically with REAL jq:
A auth jessea123 / juju-controller / model openstack; no macaroon EOF.
B machines 4/4 started.
C mysql mysql-innodb-cluster 3/3 -- /0 R/W, /1 R/O, /2 R/O, all
"Cluster is ONLINE and can tolerate up to ONE failure." (vault backend OK)
D vault vault/0 [blocked] "Vault needs to be initialized" -- FRESH
(irreversibility guard satisfied).
E census units=63 workload-error=0 agent-error(hook)=0
blocked=2 (vault + octavia "Awaiting configure-resources")
waiting=9 active=51 unknown=1 (glance-simplestreams-sync).
Census 63 vs the handoff's 31 is NOT a discrepancy: the handoff counted PRINCIPALS (active=25/blocked=2/waiting=3/unknown=1); the script recurses into subordinates. waiting 3->9 reconciles against the handoff prose (ovn-central x3 principals + ovn-chassis x3 + ovn-chassis-octavia + neutron-api-plugin-ovn + nova-compute certs); active 25->51 is the hacluster/mysql-router/filebeat subordinate layer. blocked=2 and unknown=1 match exactly.
Committed bundle.yaml declares SYMBOLIC machine IDs "8"/"9"/"10"/"11" (machines: section, constraints tags=openstack). Juju treats bundle machine keys as PLACEHOLDERS, not real IDs; deployed into a fresh model they map in order to real IDs 0/1/2/3. Live (confirmed by the preflight script's machine display lines):
  real m0 = openstack0 = 10.12.12.40 (bundle "8")   control-only, 7 LXD: 0/lxd/0..6
  real m1 = openstack1 = 10.12.12.41 (bundle "9")
  real m2 = openstack3 = 10.12.12.43 (bundle "10")
  real m3 = openstack2 = 10.12.12.42 (bundle "11")  holds vault/0 (juju-f5a310-3-lxd-5)
The openstack2/openstack3 <-> m3/m2 "swap" is MAAS tag-based allocation (hosts pinned by tag=openstack, NOT by system_id), so the host->machine binding floats per deploy. nova-compute to: ["9","10","11"] (symbolic) therefore landed on real m1/m2/m3 = openstack1/openstack3/openstack2, leaving m0/openstack0 control-only -- CONSISTENT with the handoff's intended role split. ceph to: ["8","9","10","11"] -> all four real machines.
IMPACT: zero on phase-02 (vault/0 resolves by unit name). The handoff text "= bundle machines 8/9/10/11" is stale on LIVE ids. RULE to fold into the runbook: resolve everything by unit name / hostname / CIDR, NEVER by machine ID; document the bundle-symbolic vs live-real mapping so a future operator does not mistake it for a deploy fault. Phase-03 host-role verify (open-item 2) confirms which units run on which machine definitively.
Read-only verify-before-mutate gate packaging the A-E audit into one re-runnable command. Mutates NOTHING; the vault init/unseal/authorize MUTATIONS stay gated human steps (item-8 principle: scripts own the deterministic/read-only/repeated; the human gate owns the consequential mutation + secret custody). Gates: B all machines started; C mysql 3 units / all active+ONLINE / exactly 1 R/W; D vault fresh ([blocked] "needs to be initialized" -- REFUSES and escalates if not, since a non-fresh vault may already hold keys); E zero workload-error AND zero agent-error(hook), subordinates included. Exit 0 PROCEED / 1 HOLD / 2 precondition. Sources lib-net.sh (need_jq); whoami-direct-first so a stale-macaroon prompt reaches the tty before captured calls; single juju-status snapshot; one jq metrics pass (eval'd key=value); dynamic lookups, nothing host/IP/ID hardcoded. ASCII+LF, bash -n clean.
Testing: shellcheck + jq both ABSENT from Claude's sandbox -> behavior-tested with juju+jq shims across 5 fixtures (1 healthy + 4 single-fault: vault-already-initialized D, mysql OFFLINE C, hook-failure E, machine-down B); each produced the correct exit code and gate attribution. jq metrics algorithm mirrored/validated in Python. REAL-jq/REAL-data confirmation on the jumphost first run reproduced the manual audit EXACTLY (units=63, errors 0, PROCEED, EXIT=0); Windows -> GitHub Desktop -> push -> jumphost-pull preserved LF/ASCII/parse. FOLD INTO phase-02 do-doc: invoke this script as the Step 2.1 pre-flight gate.
run-tests.sh + make_fixtures.py + fakebin/{juju,jq} shims. Offline regression for the preflight script: drives the REAL script's decision/exit logic against the 5 generated fixtures; touches NO live infra (fake juju emits fixtures, fake jq mirrors the metrics in Python); runs anywhere with python3 + bash (no real jq needed); re-asserts shim exec bits so the Windows -> git round trip dropping them will not break it. Sandbox run: ALL PASS / exit 0. Target paths: tests/phase-02/{run-tests.sh, make_fixtures.py, fakebin/juju, fakebin/jq}.
The do-doc presents Step 2.1's in-session block as one paste (env-setup; vault status; vault operator init | tee; grep -c; grep -q). Split at the verify/mutate boundary into two gated pastes:
  2.1a (read-only verify): export VAULT_ADDR...; umask 077; mkdir -p ~/vault-init + vault status 2>&1 | grep -E 'Initialized|Sealed|Storage Type|HA Enabled' || true
  2.1b (irreversible): vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt + grep -c '^Unseal Key' ... + grep -q '^Initial Root Token:' ...
Rationale: the vault status line exists to be OBSERVED before the irreversible init; a single paste runs init before it can be read, defeating verify-before-mutate. Commands are verbatim/unchanged -- only the paste boundary moves. Amend phase-02 do-doc Step 2.1 to present 2.1a/2.1b as two gated pastes.
Session opened on vault/0: juju ssh -m openstack vault/0 -> ubuntu@juju-f5a310-3-lxd-5 (= real machine 3 = openstack2, LXD container 5). 2.1a output:
  Initialized   false   uninitialized; safe to init
  Sealed        true
  Storage Type  mysql   vault-on-mysql backend (mysql-innodb-cluster)
  HA Enabled    false   CORRECT for vault-on-mysql (R3); NOT a defect
Vault's own status agrees with the Juju workload-status. Cleared for 2.1b (vault init).
vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt ran once. Token gate: grep -q '^Initial Root Token:' -> TOKEN_OK (root token line captured in init.txt). Unseal-Key count gate (grep -c '^Unseal Key' MUST = 5): = 5 (operator confirmed); not inferred. Operator confirmed all key material (5 shares + root token) saved OFF cloud/host; ~/vault-init/init.txt on the unit is the only on-unit copy (dies with the unit).
Post-init expected state: vault Initialized true / Sealed true (init does NOT unseal a vault-on-mysql; unseal is the separate 2.2 step).
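The two capture gates from the init step can be packaged as one check, sketched below. check_init_capture is a hypothetical helper; the grep patterns are the ones quoted in this log. The demo runs against a synthetic file with placeholder values, never real key material.

```shell
#!/usr/bin/env bash
# Sketch only: check_init_capture is hypothetical; the grep patterns are the
# ones quoted in this log. It reports counts only and never prints the file.
set -euo pipefail

check_init_capture() {
  local file=$1 keys
  # grep -c exits 1 on zero matches; tolerate that and judge the count.
  keys=$(grep -c '^Unseal Key' "$file" || true)
  if [[ "$keys" != 5 ]]; then
    echo "HOLD  expected 5 unseal keys, found $keys"
    return 1
  fi
  if ! grep -q '^Initial Root Token:' "$file"; then
    echo "HOLD  root token line missing"
    return 1
  fi
  echo "OK    5 unseal keys + root token line present"
}

# Synthetic capture with placeholder values (mktemp creates it mode 0600).
demo=$(mktemp)
for i in 1 2 3 4 5; do
  printf 'Unseal Key %s: PLACEHOLDER\n' "$i"
done > "$demo"
printf 'Initial Root Token: PLACEHOLDER\n' >> "$demo"
check_init_capture "$demo"
rm -f "$demo"
```

This only verifies that the capture is complete; custody of the key material stays a human step, as the log requires.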
3-of-5 via vault operator unseal (no arg, vault's own hidden prompt; keys never on argv/history -- L4). Final vault status: Initialized true / Sealed false / Storage Type mysql / HA Enabled false (HA false correct for single-unit vault-on-mysql -- R3). Vault is now initialized AND unsealed. v1 policy: MANUAL unseal is the v1 standard -- re-run 3-of-5 at the hidden prompt after any vault-unit reboot (auto-unseal via transit/KMS not configured in v1; D-011.6 re-confirms in phase-08).
Short-lived child token (10m TTL) minted in vault/0 via hidden read -s root token + vault token create -ttl=10m -field=token (NOT the root token -- juju op-log persists action params; DOCFIX-011 param=token). juju run vault/leader authorize-charm token=... then juju run vault/leader generate-root-ca (REQUIRED -- DOCFIX-014) both completed; child token entered via hidden read -s on the jumphost too (narrows, does not eliminate, op-log exposure). Root CA PEM emitted ("Vault Root Certificate Authority (charm-pki-local)") and copied OFF cloud.
Result (juju status vault): vault 1.8.8 active "Unit is ready (active: true, mlock: disabled)"; vault/0 active/idle on 3/lxd/5 (= machine 3 = openstack2; container 10.12.12.106); vault-mysql-router/0 active. The "Missing CA cert" block cleared STRAIGHT to active -- validates DOCFIX-014. mlock: disabled is expected/benign for container vault (no IPC_LOCK).
ss on magnum/0 (juju exec): 9501 -> :9501 (all-interfaces; NOT 127.0.0.1); 9511 -> 0.0.0.0:9511 + :9511. NOT loopback -> escalation condition NOT met; the phase-01 9501 line was the expected pre-vault posture, NOT a defect. Settle also confirmed at principal level via deploy-watch.sh: active=29 / blocked=1 (octavia) / unknown=1 (glance-ss-sync) = 31 principals -- reconciles with the handoff's original 31.
PHASE-02 EXIT GATE CLOSED.
PHASE-02 COMPLETE -- discrete vault mutations done. Cascade-settle + the post-init sweep are the opening activities of phase-03 (runbooks/phase-03-core-verify.md).
Design decision: D-056. Doc fix: DOCFIX-044. DOCFIX-042 = phase-02 Step 2.1 split (2.1a verify / 2.1b init) + invoke preflight script. DOCFIX-043 = document bundle-symbolic vs live-real machine-ID remap + MAAS tag-allocation host swap; resolve by unit/hostname/CIDR, never by machine ID.
Session: 2026-06-27 (continues). Next free numbers at section open: D-056; DOCFIX-044.
scripts/phase-02-vault-preflight.sh (committed 1a103f5) computed the agent/hook-error count as select(."agent-status".current=="error"). In juju status --format json a UNIT carries workload-status + juju-status (the agent state: idle/executing/error); there is NO agent-status key on units. Confirmed against two authoritative consumers: deploy-watch.sh:43 (.value."juju-status".current=="error" for units) and the phase-03 do-doc acceptance walk (u.get('juju-status')). So the ae (hook-failed) half of the E gate was INERT -- it read a nonexistent key and always returned 0; a real hook failure would NOT have been caught.
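A sketch of a hook-error count keyed on juju-status, including the recursion into subordinates. count_hook_errors is a hypothetical function, not the committed selector, and it assumes jq is on PATH. Intended use would be along the lines of: juju status -m openstack --format json | count_hook_errors.

```shell
#!/usr/bin/env bash
# Sketch only: count_hook_errors is hypothetical; requires jq on PATH.
set -euo pipefail

# Reads `juju status --format json` on stdin and prints the number of units
# (principals and their subordinates) whose juju-status.current is "error".
count_hook_errors() {
  jq '[ .applications[]
        | (.units // {})[]
        | ., ((.subordinates // {})[])
        | select(."juju-status".current == "error") ] | length'
}
```

A fixture for this function should be derived from real juju status output rather than from the script under test, so that a wrong key shows up as a disagreement instead of a matching zero.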
juju status JSON schema; a fixture that agrees with the script's bug hides the bug. The phase-03 harness surfaced this because its fixtures (built from the do-doc's juju-status walk) disagreed with the bad key.
ae -> select(."juju-status".current=="error") + an anti-regression header note. Harness corrected (fixtures + shim now juju-status); the FAIL-E case now sets juju-status.current=error and only passes because the key is right. RE-COMMIT REQUIRED over 1a103f5. Re-running on the (healthy) cloud still yields PROCEED; the fix matters for catching FUTURE hook failures.
3.1a acceptance walk: 2 non-active/idle, BOTH expected -- glance-simplestreams-sync/0 (unknown, image-sync state) + octavia/0 (blocked "Awaiting configure-resources", D-021). No TLS consumer stuck.
3.1b haproxy backend-health sweep (D-045/DOCFIX-031): ZERO DOWN across all principal units -- the plaintext-vs-SSL backend failure did NOT recur this cycle (cert cascade + haproxy reload state healthy). No remediation needed.
Read-only Step 3.1 gate packaging 3.1a (acceptance walk) + 3.1b (haproxy sweep). HARDENED beyond the do-doc's bare count gate: phase03_accept_walk.py gates on IDENTITY -- only octavia (blocked/configure-resources) and glance-simplestreams-sync (unknown/waiting) may be non-active/idle; a different app blocked also yields count==2 yet correctly FAILS. The do-doc's inline python-in-bash acceptance walk is moved to its own tested .py (convention); the haproxy sweep's unit list comes from jq on the captured snapshot (no second juju call, no inline python). Mutations stay gated: a DOWN backend's haproxy -c + systemctl reload is a per-unit human step; Step 3.2 (admin-openrc) and 3.3 (Horizon) too. tests/phase-03/: unit-tests the .py (pass/unexpected-blocked) + behavior-tests the .sh with juju+jq shims (settled / unexpected-unit / injected haproxy-DOWN). ALL PASS, offline, no real jq. Real-jq/real-data: 3.1a+3.1b already ran by hand this session and PASSED; the script reproduces them.
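The identity-versus-count distinction can be illustrated in bash. This is a sketch only: the committed gate is phase03_accept_walk.py and may differ, accept_walk here is hypothetical, and keystone is used purely as a synthetic example of an unexpected blocked app.

```shell
#!/usr/bin/env bash
# Sketch only: a bash rendering of the identity gate; the committed gate is
# phase03_accept_walk.py and may differ. accept_walk is hypothetical.
set -euo pipefail

# Apps allowed to be non-active, with the workload status each may show.
declare -A EXPECTED=(
  [octavia]="blocked"
  [glance-simplestreams-sync]="unknown waiting"
)

# Args: "app:status" pairs for every non-active/idle unit.
# Returns 1 if any app is not allow-listed or shows an unlisted status.
accept_walk() {
  local pair app status rc=0
  for pair in "$@"; do
    app=${pair%%:*}
    status=${pair#*:}
    if [[ -z "${EXPECTED[$app]:-}" || " ${EXPECTED[$app]} " != *" $status "* ]]; then
      echo "FAIL  unexpected non-active unit: $app ($status)"
      rc=1
    fi
  done
  return "$rc"
}

accept_walk octavia:blocked glance-simplestreams-sync:unknown && echo "PASS  settled"
accept_walk octavia:blocked keystone:blocked || echo "HOLD  count is 2 but identity differs"
```

Both calls pass two non-active units, so a bare count gate could not tell them apart; only the allow-list does.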
Design decision: D-056. Doc fix: DOCFIX-045. DOCFIX-044 = phase-02 preflight hook-error key agent-status -> juju-status (+ harness fix).
ae ran live and reported agent-error(hook)=0 via juju-status (post-settle census: units=63, workload-error=0, agent-error=0, blocked=1 [octavia], waiting=0, active=61, unknown=1).
Vault root CA pulled via get-root-ca --format json + jq (DOCFIX-021 path): CN=Vault Root Certificate Authority (charm-pki-local), valid 2026-06-27 -> 2036-06-24. Admin password from get-admin-password --format json; admin project DISCOVERED via the scope-test loop (DOCFIX-022; value recorded in ~/admin-openrc OS_PROJECT_NAME, not captured this turn). ~/admin-openrc written (chmod 600); openstack endpoint list authenticated and returned the full catalog -> confirms a SCOPED token (the gate).
Endpoints IP-only on the three D-052 planes:
  public   -> provider VIP    10.12.4.5x
  internal -> metal-internal  10.12.12.5x
  admin    -> metal-admin     10.12.8.5x  (keystone admin on :35357)
VIP octets match bundle: keystone .50, barbican .51, cinderv3 .52, glance .53, magnum .54, neutron .55, nova .56, octavia .57, placement .59, radosgw/s3/swift .60:443.
The 3.2 GATE text reads "internal+admin on the metal VIP .8.5x" -- predates D-052's dedicated metal-internal plane. LIVE (correct) shows INTERNAL on metal-internal 10.12.12.5x and ADMIN on metal-admin 10.12.8.5x (bundle triple-VIP "10.12.4.5x 10.12.8.5x 10.12.12.5x" + D-052 internal binding). Amend the 3.2 gate to: public provider .4.5x; internal metal-internal .12.5x; admin metal-admin .8.5x; keystone admin :35357. ALSO (value drift, non-blocking): gss image-stream endpoint is HTTP on metal 10.12.8.226 this deploy (do-doc note said .172) -- the simplestreams image-stream IP is per-deploy; note as dynamic, do not hardcode. s3/swift on radosgw VIP .60:443 -- re-check vs radosgw :80 listener during any Swift/S3 smoke (carried-forward do-doc note).
v1 Horizon = PLAIN-HTTP reverse-proxy leg per D-044 (authoritative, adopted 2026-06-17). NO nginx edit was needed: the existing /etc/nginx/sites-available/openstack vhost on the nginx host (10.12.4.7) already proxies listen 81 -> proxy_pass http://10.12.4.58:80 at the CURRENT dashboard provider VIP (.58 confirmed vs bundle), with proxy_set_header Host $http_host (B5 ALLOWED_HOSTS) + X-Forwarded-*. No proxy_ssl_* applied (that is the Roosevelt root-fix, not v1). The vhost's "Main LXD UI" comment is a stale mislabel (it is the Horizon proxy) -- cosmetic, flag for consolidation cleanup; left untouched to avoid mutating a working MAAS-fronting host.
Live scheme probes (decisive, verify-before-mutate, from both jumphost and nginx host):
  jumphost->.58  https rc=000 FAIL(35) | http rc=200
  nginx->.58     https rc=000 FAIL(35) | http rc=200
  s_client .58:443 -> CONNECTED but "no peer certificate available" (certless :443 listener)
  => dashboard serves Horizon over HTTP :80; :443 is an unused, certless haproxy frontend.
The certless :443 is EXPECTED under D-044 (v1 does not use dashboard HTTPS). The bundle's openstack-dashboard:certificates<->vault:certificates relation provisions a cert, but the v1 plain-HTTP leg never serves it. NOT a v1 defect; the Roosevelt DNS + FQDN-cert workstream is the end-to-end HTTPS root-fix. The earlier dashboard-SAN probe was therefore moot (proxy_ssl_name is a Roosevelt concern, not v1).
Steps executed:
  A (nginx host, read-only): curl -sI http://127.0.0.1:81/horizon/ -> HTTP/1.1 302 Found (login redirect). GATE A met.
  B (jumphost, the one v1 mutation, PER-REBUILD, verbatim do-doc): juju ssh openstack-dashboard/leader wrote _99_internal_http_cookies.py (CSRF_COOKIE_SECURE=False + SESSION_COOKIE_SECURE=False, ASCII-only) + systemctl reload apache2. Clean.
  C (jumphost, verify adapted https->http per DOCFIX-046): csrftoken Set-Cookie present, no Secure attribute -> "OK: csrftoken not Secure". GATE C met.
  D: external browser login over http://10.17.11.246:81/horizon/ SUCCEEDED -- Horizon Overview renders as admin_domain/admin, fresh-cloud quotas 0-of-N. "Not secure" address bar = expected (plain-HTTP client leg, D-044). GATE D met.
PHASE-03 EXIT GATE MET: 3.1 PASS (accept walk 2-expected + haproxy ZERO DOWN across 31 principals), 3.2 PASS (admin-openrc + scoped catalog), 3.3 PASS (Horizon reachable + login).
The phase-03 do-doc Step 3.3 body contains BOTH (a) an HTTPS-upstream edit set -- proxy_pass https://10.12.4.58:443, proxy_ssl_verify on, proxy_ssl_trusted_certificate, proxy_ssl_name + a dashboard-cert SAN discovery -- AND (b) the real "the upstream stays PLAIN HTTP (as-built)" line. These contradict. D-044 (authoritative) resolves it: v1 is the plain-HTTP leg; the proxy_ssl_name / HTTPS-upstream handling is the ROOSEVELT root-fix, not v1. The (a) block, if applied on the testcloud, would repoint nginx at the certless :443 and BREAK Horizon (curl 35) -- exactly what the live probes confirmed would happen.
Also: the do-doc's D-044 VERIFY command uses curl --cacert ... https://10.12.4.58/... -- same HTTPS assumption; it fails (rc=000/35) against the v1 HTTP dashboard. Adapt to curl ... http://10.12.4.58/horizon/auth/login/.
FIX (for completion consolidation): rewrite 3.3 to the v1 plain-HTTP path (verify the existing vhost points at the current dashboard VIP over http:80; no proxy_ssl_*; apply the cookie override; verify over http); move the proxy_ssl_*/SAN block verbatim into a clearly-marked "Roosevelt root-fix (DNS+FQDN certs)" subsection so a future operator does not apply it on the testcloud; fix the verify command https->http. Cross-ref D-044. Also fix the stale "Main LXD UI" vhost comment.
New read-only deliverable staged ahead of running phase-04 (network carve):
scripts/phase-04-network-verify.sh -- verify-before-mutate + EXIT-GATE check for the Neutron external provider network.
  PRE gate: discovers the MAAS provider subnet BY CIDR (10.12.4.0/22) -- lib-net PATTERN-1, never a hardcoded subnet id -- asserts its gateway == pinned PLANE_GW (10.12.4.1) and that the FIP pool 10.12.5.0-10.12.7.254 is a RESERVED iprange on it (KI-P3-001).
  POST gate (auto-detected if provider-ext exists): external/flat/physnet1/NOT-shared + subnet cidr/gateway/no-dhcp/FIP-pool.
  Preconditions: sources lib-net.sh + need_jq; requires admin-openrc sourced + the 'admin' MAAS profile; never calls 'maas list' (DOCFIX-016).
  Exit codes: 0 PROCEED|PASS / 1 HOLD|FAIL / 2 precondition. Mutates nothing.
tests/phase-04/ -- offline regression (real jq + fake maas/openstack data shims; no live infra). 7/7 green:
  PRE PROCEED (net absent);
  POST PASS for BOTH allocation_pools shapes (list-of-objects AND list-of-strings -- tolerance proven, not assumed, so the live client's shape cannot silently break the gate);
  four failure variants (FIP pool not reserved; wrong gateway; provider subnet absent-by-CIDR; provider-ext shared=true).
bash -n clean; shellcheck 0.9.0 clean (no warnings) on script + harness + shims; ASCII + 0 CR on all five.
NOTE: fixtures put the provider subnet at id=7 (NOT 1) on purpose, to prove CIDR discovery is id-independent (the exact failure mode DOCFIX-047 guards against).
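Illustrative sketch of the allocation_pools shape tolerance, not the code in phase-04-network-verify.sh. normalize_pools and fip_pool_matches are hypothetical names. The {start,end} object shape is the one seen live; the "start-end" form used here for the list-of-strings shape is an ASSUMPTION about what that shape looks like. Requires jq.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the verify script's implementation.
# Reads a subnet JSON object on stdin and prints one "start-end" line per pool,
# whichever shape the pool entries arrive in.
normalize_pools() {
  jq -r '.allocation_pools[]
         | if type == "object" then "\(.start)-\(.end)" else tostring end'
}

# fip_pool_matches "start-end": 0 on match, 1 on mismatch, 2 if jq cannot
# parse the input (mirrors the 0/1/2 exit convention in spirit only).
fip_pool_matches() {
  local want="$1" got
  got="$(normalize_pools)" || return 2
  [ "$got" = "$want" ]
}
```

Both shapes normalize to the same string, so one comparison covers either client output.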
DOCFIX-047 -- phase-04 do-doc hardcodes the provider MAAS subnet id (violates PATTERN-1).
runbooks/phase-04-network-carve.md reads the provider gateway via maas admin subnet read 1 and its CHECK prose says "subnet id 1 (provider)" / "subnet id 2 (metal)" -- the PRE-D-052 two-plane numbering. lib-net.sh:9 records that the D-052 cutover renumbered subnets (metal-internal moved id 6 -> 10), so a hardcoded read 1 may now read the WRONG subnet.
FIX (for completion consolidation): replace subnet read 1 / the "subnet id N" prose with CIDR-based discovery (select(.cidr=="10.12.4.0/22")), exactly as scripts/phase-04-network-verify.sh does; cross-ref the verify script from the do-doc's CHECK block. Not yet applied to the do-doc.
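Illustrative sketch of the CIDR-based discovery this fix calls for. It is not the lib-net.sh implementation: subnet_field_by_cidr is a hypothetical helper, and the field names (cidr, id, gateway_ip) are assumed to match the MAAS subnet JSON. Input is a JSON array of subnets on stdin. Requires jq.

```shell
#!/usr/bin/env bash
# Sketch only: subnet_field_by_cidr is a hypothetical helper, not the lib-net API.
# Selects the subnet by CIDR (never by id) and prints one field of it.
# Fails (non-zero) unless exactly one subnet carries that CIDR.
subnet_field_by_cidr() {
  local cidr="$1" field="$2"
  jq -r --arg cidr "$cidr" --arg f "$field" '
    [.[] | select(.cidr == $cidr)]
    | if length == 1 then .[0][$f]
      else error("expected exactly 1 subnet for \($cidr), got \(length)")
      end'
}
```

Possible use in place of the hardcoded read (assumes the 'admin' profile): GW="$(maas admin subnets read | subnet_field_by_cidr 10.12.4.0/22 gateway_ip)". The existing [ GW = 10.12.4.1 ] gate can then stay as a second check.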
Step 4.1 create block ran clean (do-doc idempotent ( set -e ), with the DOCFIX-047 CIDR-discovery correction for the gateway; the [ GW = 10.12.4.1 ] gate retained as belt+braces):
  network provider-ext     = bb386c86-d646-4c71-b6b7-550f5c691bfb (created + tagged role=provider)
  subnet  provider-ext-fip = 544afa6a-b0cf-486b-89be-2b8e36983072 (created + tagged)
  (object IDs regenerate per deploy; the do-doc's As-built IDs are dead post-teardown, not a discrepancy.)
CONFIRM:
  provider-ext      external=true type=flat physnet=physnet1 shared=false
  provider-ext-fip  cidr=10.12.4.0/22 gateway=10.12.4.1 enable_dhcp=false allocation_pools=[{start:10.12.5.0,end:10.12.7.254}] tags=[role=provider, netbox-iprange=10.12.5.0-10.12.7.254]
phase-04-network-verify.sh POST gate: PASS -- EXIT GATE met (all network+subnet assertions green; fip-pool-match=true). Live allocation_pools came back as the list-of-OBJECTS shape -- the real client emits {start,end} objects; the harness string-shape case is confirmed safety-margin only. PRE re-run also PASS (provider subnet by CIDR id=1 this deploy; gateway pinned; FIP reserved).
PHASE-04 EXIT GATE MET. FIP allocation + tenant router gateways now possible (needed by phase-06 mgmt-VM FIP; phase-08 cluster FIPs + LB validation).
DOCFIX-047 CONFIRMED LIVE: provider resolved to subnet id=1 THIS deploy, so the do-doc's subnet read 1 would have worked by luck -- but CIDR discovery is the correct id-independent pattern (lib-net.sh:9: cutover moved metal-internal 6->10) and ran clean. Do-doc fix still pending at consolidation.
DOCFIX-048 -- phase-04 do-doc IPAM reference VIP-reserve width drift.
The do-doc "IPAM carve reference" lists the provider VIP reserve as 10.12.4.2-10.12.4.63 (front-loaded /26). LIVE MAAS shows the WIDER reserve 10.12.4.2-10.12.4.100 (comment "supersedes .224-.236") -- the D-052 "VIP reserve ceilings" correction. Both sit entirely in .4.x, OUTSIDE the FIP pool (10.12.5.0-10.12.7.254) -> no conflict; provider-ext created cleanly. The live mgmt-plane reserve 10.12.4.101-10.12.4.110 is also present (already in the do-doc As-built note).
FIX (consolidation): update the do-doc IPAM reference VIP-reserve from .2-.63 to .2-.100 to match live + D-052. Non-blocking.
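Illustrative sketch of the "no conflict" reasoning above as arithmetic: convert each dotted quad to an integer and test the inclusive ranges for intersection. ip_to_int and ranges_overlap are hypothetical helpers, not repo code; pure bash, IPv4 only, no input validation.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not repo code.
# ip_to_int A.B.C.D -> prints the address as a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<<"$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# ranges_overlap "start-end" "start-end"
# Exit 0 if the two inclusive ranges share at least one address, else 1.
ranges_overlap() {
  local s1 e1 s2 e2
  s1="$(ip_to_int "${1%-*}")"; e1="$(ip_to_int "${1#*-}")"
  s2="$(ip_to_int "${2%-*}")"; e2="$(ip_to_int "${2#*-}")"
  (( s1 <= e2 && s2 <= e1 ))
}
```

With the values in this entry, ranges_overlap 10.12.4.2-10.12.4.100 10.12.5.0-10.12.7.254 exits 1 (disjoint), and the same holds for the VIP reserve against the mgmt-plane reserve 10.12.4.101-10.12.4.110.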
NOTE (repo hygiene, operator decision pending): all scripts on origin are committed mode 100644 (the Windows/GitHub-Desktop path strips +x), so the jumphost must invoke them as bash scripts/X.sh (./scripts/X.sh -> Permission denied). Two durable fixes offered: (a) standardize do-docs on bash scripts/...; or (b) one-time git update-index --chmod=+x scripts/*.sh tests/*/run-tests.sh tests/*/fakebin/* from Git Bash + commit (writes 755 into the tree). Not yet actioned.
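Illustrative sketch for auditing the mode problem before choosing (a) or (b): it lists *.sh paths committed as 100644. list_nonexec_scripts is a hypothetical helper, not repo code. It reads git ls-files -s output on stdin and only looks at the .sh suffix, so extensionless shims such as tests/*/fakebin/* would need a separate check.

```shell
#!/usr/bin/env bash
# Sketch only: list_nonexec_scripts is a hypothetical helper, not repo code.
# Input lines look like: "<mode> <object> <stage><TAB><path>".
list_nonexec_scripts() {
  local meta path mode
  while IFS=$'\t' read -r meta path; do
    mode="${meta%% *}"            # first space-separated field is the mode
    case "$path" in
      *.sh)
        if [ "$mode" = "100644" ]; then
          printf '%s\n' "$path"
        fi
        ;;
    esac
  done
  return 0
}
```

Possible use from the repo root: git ls-files -s scripts tests | list_nonexec_scripts. Empty output would mean every tracked .sh already carries the executable bit in the tree.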
Design decision: D-056. Doc fix: DOCFIX-049.