diff --git a/bundle.yaml b/bundle.yaml index eccbfd1..93c676b 100644 --- a/bundle.yaml +++ b/bundle.yaml @@ -92,8 +92,8 @@ mysql-innodb-cluster: charm: mysql-innodb-cluster channel: 8.0/stable - num_units: 1 # SINGLE-MYSQL seed; ramp to 3 in phase 2 via add-unit (D-009) - to: [lxd:8] # seed on machine 8 LXD; 9/10 reserved for ramp + num_units: 3 + to: [lxd:8, lxd:9, lxd:10] bindings: '': metal-admin certificates: metal-internal diff --git a/docs/design-decisions.md b/docs/design-decisions.md index 88f8308..691bf75 100644 --- a/docs/design-decisions.md +++ b/docs/design-decisions.md @@ -974,3 +974,82 @@ **Related:** D-052 / D-053 (plane scheme, retained), D-003B (API + FIP L2 co-location, restored), D-059 (NIC budget retained; its inventory's "enp1s0 = provider-public + provider-vip" row is now just provider-public -- the physical five-NIC count is unchanged). + +## D-061: Machine-preserving teardown via `remove-machine --keep-instance` (supersedes D-055) + +**Status:** ADOPTED 2026-06-30. Supersedes D-055. Documented behavior; the `--keep-instance` +path is UNVALIDATED on this virsh-pod MAAS at adoption (validate via the canary, below). + +**Problem (observed 3x in one session):** `juju destroy-model` DECOMPOSES the MAAS pod-composed +openstack0-3 machines -- they are deleted from MAAS and the carve is lost, forcing a full +reenroll + re-carve. This happened with `--destroy-storage --force`, and AGAIN with +`--force --no-wait` and NO storage flag at all. So D-055's diagnosis ("omit --destroy-storage +and the machines release to Ready") is WRONG/insufficient: the storage flag is not the trigger; +`destroy-model` itself releases the instances, and a MAAS pod auto-decomposes a released +pod-composed VM. + +**Root cause / mechanism:** machine retention is NOT a `destroy-model` option -- it is a +`remove-machine` option. 
Juju 3.6 `remove-machine` reference: a machine can be removed from the +Juju model without affecting the corresponding cloud instance via `--keep-instance`. +`destroy-model` has no machine-preserving flag. The openstack0-3 VMs are MAAS pod-composed +(VM host `lxd`), so any *release* auto-decomposes them; `--keep-instance` is the only documented +way to detach-from-model without release. + +**Decision:** two teardown scripts, by intent: +- `scripts/phase-00-teardown-release.sh` -- KEEP machines. Per host: + `juju remove-machine <machine-id> -m openstack --keep-instance --force --no-wait` FIRST (detach, + preserve the MAAS instance + carve), HARD-VERIFY each host survived in MAAS (FAIL LOUD if any + decomposed), THEN `juju destroy-model openstack --release-storage --force --no-wait --no-prompt`. + Result: hosts stay Deployed + carved, reusable with no reenroll/recarve. +- `scripts/phase-00-teardown-destroy.sh` -- DECOMPOSE machines (the from-scratch path): + `juju destroy-model openstack --destroy-storage --force --no-wait --no-prompt`; reenroll+carve after. +Both default to dry-run and require typing the model name at a /dev/tty gate before mutating. + +**First-run safety (UNVALIDATED caveat):** `--keep-instance` is documented but had not been +exercised on this virsh-pod MAAS when adopted. `phase-00-teardown-release.sh --apply --canary` +removes ONLY openstack0, verifies it survived in MAAS, and STOPS -- run that before trusting the +all-four path. The script also hard-verifies survival after every remove and exits 1 if any host +decomposed (so a wrong assumption fails loud, not silently). + +**Roosevelt:** bare-metal nodes are enlisted, not pod-composed, so release does not auto-decompose +-- the keep path is simpler there. If a MAAS pod is ever used at Roosevelt, configure it to not +decompose-on-release (or keep erase-on-release OFF) and carry the --keep-instance discipline.
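The keep-path ordering in the Decision above can be sketched as a dry-run plan printer. This is an illustration under stated assumptions, not the committed script: the function name `plan_release_teardown` and the machine ids `0 1 2 3` are hypothetical, and the block only echoes commands (flags copied from this section) instead of calling juju or MAAS.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the D-061 keep-path commands in the order
# they must run. Nothing here calls juju or MAAS; it only echoes the plan.
set -euo pipefail

plan_release_teardown() {
  local model="$1"; shift
  local mid
  # Step 1: detach every machine from the model, leaving the instance in place.
  for mid in "$@"; do
    echo "juju remove-machine $mid -m $model --keep-instance --force --no-wait"
  done
  # Step 2: survival must be checked in MAAS before the model is destroyed.
  echo "# verify each host is still present in MAAS; stop if any is missing"
  # Step 3: only then destroy the model, which no longer owns the machines.
  echo "juju destroy-model $model --release-storage --force --no-wait --no-prompt"
}

# Machine ids are placeholders; the real script resolves them live.
plan_release_teardown openstack 0 1 2 3
```

The point of the ordering is that `destroy-model` runs last, after the machines have already been detached, so it has no instance left to release.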
+ +**Related:** D-055 (superseded), D-018 (MAAS-release-direct teardown intent), DOCFIX-057 +(old phase-00-teardown.sh deprecated -- its "releases to Ready" premise was wrong). + +## D-062: mysql-innodb-cluster deploys at target unit count; no single-unit seed-then-scale + +**Status:** ADOPTED 2026-06-30. Refines the D-009 unit-count guidance for this charm. + +**Context:** to avoid the 3-way Group-Replication formation race that twice parked a unit `blocked` +with a metadata half-join, a single-mysql seed (`num_units: 1`, ramp to 3 later) was tried. It +FAILED: the unit parked `blocked` `'cluster' missing, Instance not yet configured for clustering`. + +**Root cause:** the OpenStack reactive `mysql-innodb-cluster` charm bootstraps the InnoDB cluster +(`create_cluster`) from inside its `cluster` PEER-relation handler. With `num_units: 1` the peer +relation never receives a `-relation-changed` carrying a peer, so the bootstrap handler never runs. +Live flags confirmed: `certificates.available`, `local.cluster.user-created`, leader passwords set, +but NO `cluster.available`/bootstrapped flag; the log showed only update-status/db-router hooks +cycling -- no `create_cluster` attempt. This is NOT the Canonical "Charmed MySQL" charm (different +interface) which DOES support single-then-scale; the OpenStack charm's README only documents +`-n 3` deploy. + +**Decision:** deploy `mysql-innodb-cluster` at its target unit count (3) from cold. Peer-driven +bootstrap is the charm's happy path: with 3 units brought up together, the cluster forms and (this +deploy) reported clean -- `status: OK`, 1 R/W + 2 R/O, `instanceErrors=[]` on all members. The +single-unit seed-then-ramp idea is RETIRED for this charm. + +**Half-join distinction (carried):** the metadata half-join (`instanceErrors: ["...not managed by +InnoDB cluster. Use cluster.rescan()..."]`) is an artifact of MID-LIFE instance addition, not cold +bootstrap. 
If it recurs during a scale operation, the documented fix is +`cluster.rescan({addInstances:"auto"})` via mysqlsh as `clusteruser` (NOT the charm's bare +`cluster-rescan` action, which reports-but-does-not-adopt). Cold 3-unit deploy did not hit it. + +**As-executed this deploy:** the running model was (mistakenly) deployed single-mysql, then +corrected in place: `juju remove-unit mysql-innodb-cluster/1` (a stuck mis-placed unit), then +`juju add-unit mysql-innodb-cluster -n 2 --to lxd:1,lxd:2` (real live machine ids 0-3 this cycle). +Cluster formed clean. The committed bundle is reverted to `num_units: 3` / `to: [lxd:8,9,10]` +(matching its own `machines:` 8-11 declaration; the live 0-3 numbering was a deploy-time artifact). + +**Related:** D-009 (3 units on Roosevelt), BUNDLEFIX (bundle reverted to 3-unit). diff --git a/docs/v1-redeploy-changelog.md b/docs/v1-redeploy-changelog.md index cd8b655..a86b2ab 100644 --- a/docs/v1-redeploy-changelog.md +++ b/docs/v1-redeploy-changelog.md @@ -1021,3 +1021,72 @@ ### Next-free numbers Design decision: D-060. Doc fix: DOCFIX-056. + +### 2026-06-30 -- Pattern A redeploy executed: phases 00->03 PASS (mysql, Vault, core-verify, Horizon) + +Full Pattern-A (D-060) redeploy executed end to end through phase-03. Net: a clean, +core-verified cloud -- mysql clustered, Vault PKI up, cert cascade settled, 31/31 haproxy +backends serving, admin-openrc built, Horizon reachable. Phases 04-08 remain. + +PHASE-00 (teardown/recarve/redeploy prep): +- CARVEFIX-001 (committed adf2890, prior) VALIDATED LIVE: every host shows `enp1s0 -> VLAN + 5001` before `create br-ex`; br-ex lands fabric 0 (1_provider) with 10.12.4.4N. The explicit + VLAN move IS required (MAAS link-subnet does not auto-move VLAN). +- carve x4 clean (octets .40-.43), fabric-prune clean (auto-fabrics self-reclaim; fabric-4 kept + = LXD substrate), standup OK (consistent D-052/D-053). 
+- OSD wipe: vdb images carried 50-63 MiB LVM/ceph-volume residue post-decompose (NOT data, but + enough signature to warrant the wipe). qemu-img map (read-only) confirmed before the gated + rm+recreate -> four blank 512G (root:root 600). ceph-osd accepted the wiped disks cleanly. + +THREE MACHINE-DECOMPOSE INCIDENTS (Claude-caused; root-caused + fixed): +- destroy-model decomposed openstack0-3 three times (twice via Claude-advised destroy, incl. one + run with NO storage flag). Root cause + fix = D-061: machine retention is `remove-machine + --keep-instance`, NOT a destroy-model storage flag. Two replacement scripts written + (phase-00-teardown-release.sh / -destroy.sh); old phase-00-teardown.sh deprecated (DOCFIX-057). + Each incident cost a reenroll+carve cycle (libvirt domains survived every time -- the saving grace). + +PHASE-01 (deploy): +- BUNDLEFIX (this session): a single-mysql seed edit (num_units 1) was committed (bf0f3fa) then + found WRONG -- see D-062. Reverted to num_units 3 / to:[lxd:8,9,10]. +- Deployed; converged to the Vault gate (vault blocked "needs initialization"; ovn/neutron/barbican + waiting on certs -- all expected). openstack0 = control-only confirmed (no nova-compute). + +PHASE-01b (mysql clustering -- D-062): +- single-unit mysql parked blocked `'cluster' missing, Instance not yet configured for clustering`. + Root-caused: the reactive charm bootstraps from the `cluster` PEER-relation handler, which never + fires at num_units 1. NOT the Canonical Charmed MySQL charm. Fix: deploy at 3. +- Corrected in place: remove-unit the stuck mis-placed /1, then add-unit -n 2 --to lxd:1,lxd:2. + cluster-status: OK, 1 R/W + 2 R/O, instanceErrors=[] on all three. Clean cold bootstrap. 
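The D-062 rule (deploy `mysql-innodb-cluster` at its target count, never a single-unit seed) could be enforced with a small pre-deploy check on the bundle. The sketch below is an assumption-laden illustration, not part of this change: `bundle_units` and `check_mysql_units` are hypothetical helpers, the sample file assumes the application sits under a top-level `applications:` key, and the awk parser only handles the simple `key: value` layout shown in the bundle hunk.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy guard for D-062: refuse a bundle that seeds
# mysql-innodb-cluster below its target unit count.
set -euo pipefail

# Print the num_units value declared inside the named application block.
# Assumes plain "key: value" YAML lines indented with spaces.
bundle_units() {
  local app="$1" bundle="$2"
  awk -v app="$app" '
    function indent(s,   t) { t = s; sub(/^ +/, "", t); return length(s) - length(t) }
    $1 == app ":"                       { in_app = 1; depth = indent($0); next }
    in_app && NF && indent($0) <= depth { exit }              # left the application block
    in_app && $1 == "num_units:"        { print $2; exit }    # value inside the block
  ' "$bundle"
}

check_mysql_units() {
  local bundle="$1" want="${2:-3}" got
  got="$(bundle_units mysql-innodb-cluster "$bundle")"
  if [ "$got" = "$want" ]; then
    echo "OK: mysql-innodb-cluster num_units=$got"
  else
    echo "FAIL: mysql-innodb-cluster num_units=${got:-unset}, want $want (D-062)" >&2
    return 1
  fi
}

# Demo against a throwaway sample (layout assumed, values from the bundle hunk).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
applications:
  mysql-innodb-cluster:
    charm: mysql-innodb-cluster
    channel: 8.0/stable
    num_units: 3
    to: [lxd:8, lxd:9, lxd:10]
EOF
check_mysql_units "$sample"
rm -f "$sample"
```

A check like this would have rejected the `num_units: 1` seed edit before deploy; it does not validate placement or anything else in the bundle.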
+ +PHASE-02 (Vault) -- PASS, no DOCFIX, proven path held: +- init (2>&1|tee, 5 shares/threshold 3) -> keys saved off-host; unseal 3-of-5 -> Sealed false, + Storage Type mysql, HA Enabled false (correct); authorize-charm (10m child token) + generate-root-ca + -> vault active/idle, root CA valid to 2036. Cert cascade settled fast: ovn-central x3, ovn-chassis + x3, ovn-chassis-octavia, neutron-api-plugin-ovn, barbican-vault all -> active. magnum self-resolved + to active at TLS cutover (FINDING-2). Tally 29 active / 1 unknown (gss) / 1 blocked (octavia) / 0 error. +- REFINEMENT: keystone landed `active` ("PO (broken): Unit is ready" at unit level) pre-policy-attach, + not the PO-broken-app the runbook anticipated -- the empty policyd-override is a clean no-op on this + charm rev. Not a regression; D-051 zip attaches in phase-08. + +PHASE-03 (core verify) -- PASS: +- 3.1 phase-03-core-verify.sh: acceptance walk PASS (only gss + octavia non-active, both expected); + haproxy backend sweep PASS (all backends UP across 31 principal units, zero DOWN). +- 3.2 admin-openrc built: scoped token OK; endpoint catalog confirms the D-052 three-plane VIP scheme + (public .4.5x / internal .12.5x / admin .8.5x; keystone admin :35357). admin project = `admin`. + NOTE: gss image-stream live VIP is 10.12.8.166 (do-doc 3.2 text says .8.172 -- stale reference, + DOCFIX-class, no action). +- 3.3 Horizon: the dashboard VIP 10.12.4.58 serves PLAIN HTTP (haproxy pass-through, backend :433/:70, + no crt) -- the do-doc's proxy_ssl_name/DNS-SAN upstream-TLS machinery is INAPPLICABLE here. The live + nginx vhost (proxy host 10.12.4.7, listen 81) was ALREADY correct HTTP-upstream + (proxy_pass http://10.12.4.58:80, Host preserved, no proxy_ssl_*); no edit needed. D-044 cookie + override reapplied (per-rebuild). Browser login page renders end-to-end via 10.17.11.246:81/horizon + with the Domain field (multidomain enabled). 
DOCFIX-058: phase-03 do-doc 3.3 must document the + HTTP-upstream reality (the as-built proxy is right; the runbook prose is stale -- DOCFIX-046 adjacent). + +CORRECTION to an earlier sweep note: the bundle `to: [lxd:8,9,10]` is NOT a bug -- the bundle declares +its own `machines:` 8-11, so 8/9/10 are internally consistent. The live 0-3 numbering was a deploy-time +artifact; the bundle correctly uses 8-11. + +### Next-free numbers +Design decision: D-063. Doc fix: DOCFIX-059. (D-061 teardown, D-062 mysql; DOCFIX-057 old-teardown +deprecation, DOCFIX-058 phase-03 3.3 HTTP-upstream both recorded above.) diff --git a/scripts/phase-00-teardown-destroy.sh b/scripts/phase-00-teardown-destroy.sh new file mode 100644 index 0000000..961cbae --- /dev/null +++ b/scripts/phase-00-teardown-destroy.sh @@ -0,0 +1,135 @@ +#!/usr/bin/env bash +# scripts/phase-00-teardown-destroy.sh [--apply] [--no-prompt] +# +# FULL-DESTROY teardown (the from-scratch path). Destroys the `openstack` model AND +# lets the MAAS pod-composed openstack0-3 machines DECOMPOSE (returned to the libvirt +# pod, removed from MAAS). After this you MUST reenroll (scripts/reenroll-hosts.sh) and +# re-carve (scripts/carve-host-interfaces.sh) before the next deploy. +# +# WHEN TO USE: when you deliberately want a clean MAAS re-enrollment (e.g. validating +# the reenroll+carve path itself, or recovering from a corrupted machine state). For a +# reuse-in-place teardown that KEEPS the machines, use phase-00-teardown-release.sh. +# +# D-061 (honest behavior note): on this virsh-pod MAAS, `juju destroy-model` decomposes +# the pod-composed machines REGARDLESS of the storage flag. This script EMBRACES that +# (it is the destroy path) rather than fighting it. --destroy-storage is used so OSD +# data is discarded too (a fresh deploy re-wipes vdb anyway). This is NOT the bug that +# bit the project 3x -- the bug was using THIS behavior when machine RETENTION was +# wanted. 
If you want retention, you are on the wrong script (use -release.sh). +# +# Roster (resolved live): +# PROTECTED (never touched): juju, lxd, tailscale +# HOSTS (decomposed): openstack0-3 -> reenroll+carve after +# ORPHAN (deleted if present): capi-mgmt +# +# DEFAULT = DRY-RUN. --apply executes; typed-approval gate (model name) first. +# --no-prompt skips the gate (tested automation only). +# +# Exit: 0 ok | 1 fatal/unsafe | 2 aborted. ASCII + LF. +set -euo pipefail +shopt -s inherit_errexit 2>/dev/null || true + +MAAS_PROFILE="${MAAS_PROFILE:-admin}" +MODEL="${OPENSTACK_MODEL:-openstack}" +HOSTS=(openstack0 openstack1 openstack2 openstack3) +ORPHANS=(capi-mgmt) +PROTECTED=(juju lxd tailscale) + +MODE="dryrun"; PROMPT=1 +for a in "$@"; do + case "$a" in + --apply) MODE="apply" ;; + --no-prompt) PROMPT=0 ;; + *) echo "unknown arg: $a" >&2; exit 1 ;; + esac +done +FATAL=0 +hdr() { echo; echo "=== $* ==="; } +note() { echo " - $*"; } +fail() { echo "FAIL: $*" >&2; FATAL=$((FATAL+1)); } +command -v jq >/dev/null || { echo "FATAL: jq required" >&2; exit 1; } +command -v juju >/dev/null || { echo "FATAL: juju not on PATH" >&2; exit 1; } + +maas_json() { local o; o="$(maas "$MAAS_PROFILE" "$@" 2>/dev/null || true)"; printf '%s' "$o" | jq empty 2>/dev/null && printf '%s' "$o" || printf '[]'; } +MACHINES_JSON="$(maas_json machines read)" +sid_of() { printf '%s' "$MACHINES_JSON" | jq -r --arg h "$1" '.[]|select(.hostname==$h)|.system_id' | head -1; } +status_of() { printf '%s' "$MACHINES_JSON" | jq -r --arg h "$1" '.[]|select(.hostname==$h)|.status_name' | head -1; } + +hdr "destroy-teardown audit mode=$MODE model=$MODEL" + +declare -A PROT_SID +hdr "PROTECTED substrate (never touched)" +for p in "${PROTECTED[@]}"; do + s="$(sid_of "$p")" + if [ -z "$s" ]; then note "$p: not in MAAS -- nothing to protect"; continue; fi + PROT_SID["$s"]="$p"; note "$p = $s (status $(status_of "$p")) -- PROTECTED" +done + +hdr "HOSTS (will DECOMPOSE -> reenroll+carve after)" +for h in 
"${HOSTS[@]}"; do + s="$(sid_of "$h")" + if [ -z "$s" ]; then note "$h: already absent from MAAS"; continue; fi + if [ -n "${PROT_SID[$s]:-}" ]; then fail "$h resolves to PROTECTED sid $s -- ABORT"; continue; fi + note "$h = $s (status $(status_of "$h"))" +done + +declare -A OSID +hdr "ORPHANS (deleted)" +for o in "${ORPHANS[@]}"; do + s="$(sid_of "$o")" + if [ -z "$s" ]; then note "$o: absent -- SKIP"; continue; fi + if [ -n "${PROT_SID[$s]:-}" ]; then fail "$o resolves to PROTECTED sid $s -- ABORT"; continue; fi + OSID["$s"]="$o"; note "$o = $s -- DELETE" +done + +MODEL_PRESENT=0 +if juju models --format=json 2>/dev/null | jq -e --arg m "$MODEL" '.models[]?|select(.name==$m or (.name|endswith("/"+$m)))' >/dev/null 2>&1; then + MODEL_PRESENT=1; note "juju model '$MODEL' PRESENT -- will destroy" +else + note "juju model '$MODEL' not present -- destroy skipped" +fi + +[ "$FATAL" -eq 0 ] || { echo; echo "ABORT: $FATAL safety failure(s) -- nothing changed"; exit 1; } + +hdr "PLAN" +echo " 1) juju destroy-model $MODEL --destroy-storage --force --no-wait --no-prompt" +echo " (decomposes openstack0-3; discards OSD storage)" +echo " 2) delete orphan MAAS machine(s): ${ORPHANS[*]}" +echo " 3) verify hosts gone + substrate intact" +echo " AFTER: reenroll-hosts.sh -> carve-host-interfaces.sh (x4) -> maas-fabric-prune.sh -> maas-standup.sh" +echo " PROTECTED: ${PROTECTED[*]}" + +if [ "$MODE" = dryrun ]; then + echo; echo " re-run with --apply to execute (typed model-name gate)." 
+ echo "OK (dryrun)"; exit 0 +fi + +if [ "$PROMPT" -eq 1 ] && [ "$MODEL_PRESENT" = 1 ]; then + printf 'Type the model name "%s" to confirm FULL DESTROY (machines WILL decompose): ' "$MODEL" > /dev/tty + read -r ans < /dev/tty + [ "$ans" = "$MODEL" ] || { echo "aborted (got '$ans') -- nothing changed"; exit 2; } +fi + +hdr "MUTATE 1: destroy model (machines decompose)" +if [ "$MODEL_PRESENT" = 1 ]; then + echo " DO: juju destroy-model $MODEL --destroy-storage --force --no-wait --no-prompt" + juju destroy-model "$MODEL" --destroy-storage --force --no-wait --no-prompt 2>&1 || fail "destroy-model returned error" +else note "model absent -- skip"; fi +[ "$FATAL" -eq 0 ] || { echo; echo "STOP: destroy-model failed -- not deleting orphans."; exit 1; } + +hdr "MUTATE 2: delete orphan machines" +for s in "${!OSID[@]}"; do + echo " DO: delete orphan ${OSID[$s]} ($s)" + maas "$MAAS_PROFILE" machine delete "$s" >/dev/null 2>&1 || note "orphan ${OSID[$s]} delete failed (may already be gone)" +done + +hdr "VERIFY (read-only): hosts gone + substrate intact" +MACHINES_JSON="$(maas_json machines read)" +for h in "${HOSTS[@]}"; do + st="$(status_of "$h")" + if [ -z "$st" ]; then note "$h -> decomposed/absent (expected)"; else note "$h -> $st (still present; destroy may be in progress -- re-check)"; fi +done +for p in "${PROTECTED[@]}"; do note "PROTECTED $p -> $(status_of "$p") (unchanged)"; done + +echo; echo "next: reenroll-hosts.sh -> carve-host-interfaces.sh x4 -> maas-fabric-prune.sh -> phase-00-maas-standup.sh" +echo "OK (apply)" diff --git a/scripts/phase-00-teardown-release.sh b/scripts/phase-00-teardown-release.sh new file mode 100644 index 0000000..2976e5f --- /dev/null +++ b/scripts/phase-00-teardown-release.sh @@ -0,0 +1,201 @@ +#!/usr/bin/env bash +# scripts/phase-00-teardown-release.sh [--apply] [--canary] [--no-prompt] +# +# MACHINE-PRESERVING teardown (D-061). 
Removes the `openstack` model WITHOUT +# decomposing the MAAS pod-composed openstack0-3 machines, so they stay enlisted + +# carved and the next deploy needs NO reenroll/recarve. +# +# WHY THIS EXISTS (D-061, supersedes D-055): on this virsh-pod MAAS, `juju +# destroy-model` decomposes the pod-composed machines REGARDLESS of --destroy-storage +# (observed 3x in one session, incl. a run with neither --destroy-storage nor a release +# flag). D-055's "omit --destroy-storage" diagnosis did NOT hold. The documented +# machine-retention primitive is on `remove-machine`, NOT `destroy-model`: +# Juju 3.6 remove-machine ref: "It is possible to remove a machine from Juju model +# without affecting the corresponding cloud instance by using the --keep-instance +# option." destroy-model has no such option. +# So: `juju remove-machine --keep-instance --force --no-wait` per host FIRST +# (detaches from the model, leaves the MAAS instance intact), THEN destroy-model with +# only the controller relationship + LXD containers left to clean up. +# +# !!! UNVALIDATED ON THIS VIRSH-POD MAAS at first authorship. --keep-instance is +# DOCUMENTED but had not been exercised here when this script was written. Run +# --canary FIRST (removes ONLY openstack0, hard-verifies it survived in MAAS, then +# STOPS) before trusting the all-four path. The script HARD-VERIFIES survival after +# every remove and FAILS LOUD (exit 1) if any host decomposed. +# +# Roster (resolved live from `maas admin machines read`): +# PROTECTED (never touched): juju, lxd, tailscale -- management substrate +# HOSTS (kept in MAAS): openstack0-3 +# ORPHAN (deleted if present): capi-mgmt +# +# DEFAULT = DRY-RUN AUDIT (resolves sids, prints plan, changes nothing). +# --apply execute. Typed-approval gate (model name) before any mutation. +# --canary with --apply: do ONLY openstack0 (remove --keep-instance + verify), STOP. +# --no-prompt skip the typed gate (tested automation only). 
+# +# Exit: 0 ok | 1 fatal/unsafe (target intersects substrate; host decomposed; destroy +# failed) | 2 aborted by operator. ASCII + LF. +set -euo pipefail +shopt -s inherit_errexit 2>/dev/null || true + +MAAS_PROFILE="${MAAS_PROFILE:-admin}" +MODEL="${OPENSTACK_MODEL:-openstack}" +HOSTS=(openstack0 openstack1 openstack2 openstack3) +ORPHANS=(capi-mgmt) +PROTECTED=(juju lxd tailscale) + +MODE="dryrun"; PROMPT=1; CANARY=0 +for a in "$@"; do + case "$a" in + --apply) MODE="apply" ;; + --canary) CANARY=1 ;; + --no-prompt) PROMPT=0 ;; + *) echo "unknown arg: $a" >&2; exit 1 ;; + esac +done +FATAL=0 +hdr() { echo; echo "=== $* ==="; } +note() { echo " - $*"; } +fail() { echo "FAIL: $*" >&2; FATAL=$((FATAL+1)); } +command -v jq >/dev/null || { echo "FATAL: jq required" >&2; exit 1; } +command -v juju >/dev/null || { echo "FATAL: juju not on PATH" >&2; exit 1; } + +maas_json() { local o; o="$(maas "$MAAS_PROFILE" "$@" 2>/dev/null || true)"; printf '%s' "$o" | jq empty 2>/dev/null && printf '%s' "$o" || printf '[]'; } +MACHINES_JSON="$(maas_json machines read)" +sid_of() { printf '%s' "$MACHINES_JSON" | jq -r --arg h "$1" '.[]|select(.hostname==$h)|.system_id' | head -1; } +status_of() { printf '%s' "$MACHINES_JSON" | jq -r --arg h "$1" '.[]|select(.hostname==$h)|.status_name' | head -1; } + +# resolve the juju machine-id for a given MAAS hostname (juju inst-id == MAAS hostname here) +juju_mid_of() { + juju machines -m "$MODEL" --format=json 2>/dev/null \ + | jq -r --arg h "$1" '.machines | to_entries[] | select(.value."instance-id"==$h) | .key' | head -1 +} + +hdr "release-teardown audit mode=$MODE canary=$CANARY model=$MODEL" + +# --- protected substrate (must never be a target) --- +declare -A PROT_SID +hdr "PROTECTED substrate (never touched)" +for p in "${PROTECTED[@]}"; do + s="$(sid_of "$p")" + if [ -z "$s" ]; then note "$p: not in MAAS -- nothing to protect"; continue; fi + PROT_SID["$s"]="$p"; note "$p = $s (status $(status_of "$p")) -- PROTECTED" +done + +# --- 
hosts (kept) --- +hdr "HOSTS (kept in MAAS via --keep-instance)" +declare -A HMID +HOSTS_EFFECTIVE=("${HOSTS[@]}") +[ "$CANARY" -eq 1 ] && HOSTS_EFFECTIVE=(openstack0) +for h in "${HOSTS_EFFECTIVE[@]}"; do + s="$(sid_of "$h")" + if [ -z "$s" ]; then fail "$h: not in MAAS -- roster mismatch"; continue; fi + if [ -n "${PROT_SID[$s]:-}" ]; then fail "$h resolves to PROTECTED sid $s -- ABORT"; continue; fi + # `|| true`: with pipefail + errexit, a failing `juju machines` (e.g. model absent) would otherwise kill the audit here + mid="$(juju_mid_of "$h" || true)" + if [ -z "$mid" ]; then note "$h = $s (status $(status_of "$h")) -- NOT in juju model (already detached?); will skip remove"; HMID["$h"]=""; continue; fi + HMID["$h"]="$mid"; note "$h = $s juju-machine $mid (status $(status_of "$h"))" +done + +# --- orphans (deleted; absent ok) --- +declare -A OSID +hdr "ORPHANS (deleted from MAAS)" +for o in "${ORPHANS[@]}"; do + s="$(sid_of "$o")" + if [ -z "$s" ]; then note "$o: absent -- SKIP"; continue; fi + if [ -n "${PROT_SID[$s]:-}" ]; then fail "$o resolves to PROTECTED sid $s -- ABORT"; continue; fi + OSID["$s"]="$o"; note "$o = $s -- DELETE" +done + +# --- model presence --- +MODEL_PRESENT=0 +if juju models --format=json 2>/dev/null | jq -e --arg m "$MODEL" '.models[]?|select(.name==$m or (.name|endswith("/"+$m)))' >/dev/null 2>&1; then + MODEL_PRESENT=1; note "juju model '$MODEL' PRESENT" +else + note "juju model '$MODEL' not present -- remove/destroy skipped" +fi + +[ "$FATAL" -eq 0 ] || { echo; echo "ABORT: $FATAL safety failure(s) -- nothing changed"; exit 1; } + +hdr "PLAN" +echo " 1) per host: juju remove-machine <id> -m $MODEL --keep-instance --force --no-wait" +echo " (detaches from model; MAAS instance + carve PRESERVED)" +echo " 2) HARD-VERIFY each host still present in MAAS (FAIL LOUD if decomposed)" +[ "$CANARY" -eq 1 ] && echo " -- CANARY: openstack0 ONLY, then STOP (no destroy-model, no orphan delete) --" +if [ "$CANARY" -eq 0 ]; then + echo " 3) juju destroy-model $MODEL --release-storage --force --no-wait --no-prompt (machines already detached)" + echo " 4) delete orphan MAAS machine(s): ${ORPHANS[*]}" +fi
+echo " PROTECTED: ${PROTECTED[*]}" + +if [ "$MODE" = dryrun ]; then + echo; echo " re-run with --apply (and --canary for the single-host first run)." + echo "OK (dryrun)"; exit 0 +fi + +# ---- typed-approval gate ---- +if [ "$PROMPT" -eq 1 ] && [ "$MODEL_PRESENT" = 1 ]; then + printf 'Type the model name "%s" to confirm machine-preserving teardown: ' "$MODEL" > /dev/tty + read -r ans < /dev/tty + [ "$ans" = "$MODEL" ] || { echo "aborted (got '$ans') -- nothing changed"; exit 2; } +fi + +# ---- MUTATE 1: remove-machine --keep-instance per host, verify survival ---- +hdr "MUTATE 1: remove-machine --keep-instance (per host) + survival verify" +if [ "$MODEL_PRESENT" = 1 ]; then + for h in "${HOSTS_EFFECTIVE[@]}"; do + mid="${HMID[$h]:-}" + if [ -z "$mid" ]; then note "$h: no juju machine-id -- skip remove"; continue; fi + echo " DO: juju remove-machine $mid --keep-instance --force --no-wait ($h)" + if ! juju remove-machine "$mid" -m "$MODEL" --keep-instance --force --no-wait 2>&1; then + fail "remove-machine $mid ($h) returned error"; continue + fi + done + echo " ...waiting 30s for MAAS to settle, then verifying survival" + sleep 30 + MACHINES_JSON="$(maas_json machines read)" + for h in "${HOSTS_EFFECTIVE[@]}"; do + s="$(sid_of "$h")" + if [ -z "$s" ]; then + fail "$h DECOMPOSED -- gone from MAAS after remove-machine --keep-instance (the documented behavior did NOT hold on this MAAS; STOP and investigate before continuing)" + else + note "$h survived in MAAS = $s (status $(status_of "$h")) -- GOOD" + fi + done +else note "model absent -- skip remove"; fi + +[ "$FATAL" -eq 0 ] || { echo; echo "STOP: a host decomposed or remove failed -- NOT destroying model / deleting orphans. Investigate."; exit 1; } + +if [ "$CANARY" -eq 1 ]; then + echo; echo "CANARY OK: openstack0 detached from model and SURVIVED in MAAS." + echo " --keep-instance is validated on this MAAS. Re-run WITHOUT --canary for all four + destroy-model." 
+ echo "OK (canary)"; exit 0 +fi + +# ---- MUTATE 2: destroy model (machines already detached -> nothing to decompose) ---- +hdr "MUTATE 2: destroy model (machines detached; --release-storage)" +if [ "$MODEL_PRESENT" = 1 ]; then + echo " DO: juju destroy-model $MODEL --release-storage --force --no-wait --no-prompt" + if ! juju destroy-model "$MODEL" --release-storage --force --no-wait --no-prompt 2>&1; then + fail "destroy-model returned error" + fi +else note "model absent -- skip"; fi + +# ---- MUTATE 3: delete orphans ---- +hdr "MUTATE 3: delete orphan machines" +for s in "${!OSID[@]}"; do + echo " DO: delete orphan ${OSID[$s]} ($s)" + maas "$MAAS_PROFILE" machine delete "$s" >/dev/null 2>&1 || note "orphan ${OSID[$s]} delete failed (may already be gone)" +done + +# ---- VERIFY ---- +hdr "VERIFY (read-only): hosts present + substrate intact" +MACHINES_JSON="$(maas_json machines read)" +for h in "${HOSTS[@]}"; do + st="$(status_of "$h")" + if [ -n "$st" ]; then note "$h -> $st (kept)"; else note "$h -> ABSENT (unexpected for the keep path)"; fi +done +for p in "${PROTECTED[@]}"; do note "PROTECTED $p -> $(status_of "$p") (unchanged)"; done + +echo; echo "next: hosts are KEPT (Deployed) -- a redeploy reuses them in place (no reenroll/recarve)." +echo " if you intend a fresh deploy onto these same carved hosts: juju add-model $MODEL -> deploy." +echo "OK (apply)" diff --git a/scripts/phase-00-teardown.sh b/scripts/phase-00-teardown.sh index e52b3fa..2b6064e 100644 --- a/scripts/phase-00-teardown.sh +++ b/scripts/phase-00-teardown.sh @@ -1,6 +1,14 @@ #!/usr/bin/env bash # scripts/phase-00-teardown.sh [--apply] [--no-prompt] # +# !!! DEPRECATED (DOCFIX-057 / D-061). DO NOT USE. 
Superseded by: +# scripts/phase-00-teardown-destroy.sh -- full destroy + decompose (reenroll path) +# scripts/phase-00-teardown-release.sh -- machine-preserving (remove-machine --keep-instance) +# This script's premise is WRONG: the header + PLAN below claim "destroy-model releases +# openstack0-3 to Ready". On this virsh-pod MAAS, destroy-model DECOMPOSES the pod-composed +# machines regardless of --destroy-storage (D-061, observed 3x). It does NOT release to Ready. +# Retained only for history; the two scripts above replace it. See docs/design-decisions.md D-061. +# # Gated teardown for the Pattern A revert (D-060). Destroys the `openstack` Juju model and # deletes the orphaned `capi-mgmt` MAAS machine, so the hosts release to Ready for the # Pattern A re-carve/standup-verify/rebuild. HARD-EXCLUDES the management substrate (juju, lxd, tailscale):