openstack-caracal-ipv4 / runbooks / phase-00-teardown-maas-reset.md

Phase 00 -- Teardown + Pattern A Reset (D-060 / D-061)

Remove the openstack Juju model, delete the orphaned capi-mgmt MAAS machine, and bring the four hosts (openstack0-3) back to a deploy-ready state on the EXISTING D-052/D-053 plane scheme. This is the rebuild-prep window -- it runs BEFORE phase-01.

The deterministic, repeatable work is owned by SCRIPTS (each resolves every id live and is dry-run by default); the destructive juju + libvirt steps stay human-gated here. Run from jumphost vopenstack-jesse (user jessea123, sudo; also the libvirt hypervisor); MAAS CLI logged in as profile admin. Invoke every script as bash scripts/<name>.sh -- the repo does not carry exec bits (DOCFIX-069: the Windows commit path strips them; a bare invocation fails "Permission denied" on a fresh clone).
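Because the repo carries no exec bits, it can help to route every call through one small wrapper so the `bash scripts/<name>.sh` form is never skipped by accident. The sketch below is illustrative only: `run_script` is a hypothetical helper name, not something shipped in the repo, and it assumes the current directory is the repo root.

```shell
# Hypothetical helper (NOT part of the repo): always run repo scripts through
# bash so the missing exec bit (DOCFIX-069) never matters.
run_script() {
  name="$1"; shift
  script="scripts/${name}.sh"
  if [ ! -f "$script" ]; then
    echo "run_script: $script not found (run from the repo root)" >&2
    return 2
  fi
  # Extra arguments (e.g. --apply) are passed through unchanged.
  bash "$script" "$@"
}

# Usage (dry-run is each script's default):
#   run_script phase-00-teardown-destroy
#   run_script phase-00-teardown-destroy --apply
```

The wrapper adds no gating of its own; the dry-run default and the typed gates stay inside the scripts.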

Decisions: D-061 (teardown fork: machine-preserving vs full-destroy; supersedes D-055), D-060 (Pattern A revert; supersedes D-057/D-058), D-018 (teardown intent), D-017 (full rebuild every cycle, nothing preserved). The live MAAS is already on D-052/D-053, so there is NO re-CIDR -- the carve re-establishes Pattern A interfaces on the existing subnets.

!!! DESTRUCTIVE. The teardown (Step 2) and the OSD wipe (Step 4) are irreversible. There is NO model-state rollback (D-017): the repo runbooks ARE the tested restore path. Each destructive step is DISCRETE and individually gated -- do not batch.

!!! DO NOT USE the old phase-00-teardown.sh. It is DEPRECATED (DOCFIX-057 / D-061): its "destroy-model releases hosts to Ready" premise is WRONG on this virsh-pod MAAS -- destroy-model DECOMPOSES the pod-composed machines (observed 3x). Use the D-061 pair below. (DOCFIX-066 replaced the old spine of this runbook, which still drove it.)

CAPI-MGMT: the orphaned capi-mgmt MAAS machine (retired D-033 out-of-cloud node) is DELETED by BOTH D-061 teardown scripts. The in-cloud capi-mgmt-v2 tenant VM (phase-06) dies with the model.


Step 2 has two paths -- choose deliberately (D-061)

| | Path B -- DESTROY (default redeploy spine) | Path A -- RELEASE (machine-preserving) |
|---|---|---|
| Script | bash scripts/phase-00-teardown-destroy.sh | bash scripts/phase-00-teardown-release.sh |
| Machines after | DECOMPOSED (gone from MAAS; libvirt domains survive) | kept in MAAS, Deployed, carve intact |
| Then | reenroll (Step 5) + wipe + carve + standup -> phase-01 | model gone; hosts untouched |
| Status | VALIDATED end-to-end (this is the rehearsed full-redeploy path) | UNVALIDATED on this virsh-pod MAAS -- canary first |
| Use when | standard full redeploy (fresh Ceph/MySQL/identity) | model-only teardown / surgery where hosts must survive |

Path A honesty note (D-061 amendment): the release script's canary validates machine SURVIVAL (host not decomposed after remove-machine --keep-instance). It does NOT yet validate RE-ACQUISITION -- juju's MAAS provider allocates Ready machines, and a kept host stays Deployed, so how a NEXT model re-adopts it is the open half. Until both halves pass, Path A is NOT a route into phase-01; the standard redeploy is Path B. First Path A use: bash scripts/phase-00-teardown-release.sh --apply --canary (openstack0 only, verify, STOP).

Sequence (this phase, Path B)

1.  Pre-flight            (read-only; baseline)
2.  Teardown (destroy)    bash scripts/phase-00-teardown-destroy.sh --apply  [DESTRUCTIVE: model + machines + capi-mgmt]
    -- machines decomposed from MAAS; libvirt domains remain, shut off --
3.  8_lbaas net removal   one-off jumphost op                                [domains off]
4.  OSD secondary wipe    vdb -> blank 512G                                  [DESTRUCTIVE; domains off]
5.  Reenroll              bash scripts/reenroll-hosts.sh                     [creates MAAS objects; auto-commission -> Ready]
6.  Pattern A re-carve    bash scripts/carve-host-interfaces.sh <host> --apply   [hosts Ready]
7.  Standup + bundle gate bash scripts/phase-00-maas-standup.sh ; provider-bundle-check.py  [read-only]
    -> EXIT GATE -> phase-01 deploy

Steps 3-4 operate on libvirt only (valid while the hosts are absent from MAAS). Step 5 requires the libvirt domains to EXIST (reenroll re-creates MAAS machine objects, not VMs) and leaves all four Ready. Step 6 requires Ready (link-subnet is REJECTED on a Deployed machine). The phase-01 deploy powers the hosts on and applies the carved netplan.

Step 1 -- Pre-flight (READ-ONLY)

CHECK (read-only) -- jumphost

( {
  echo "=== six D-052/D-053 spaces (hard blocker if absent) ==="
  # expect: provider-public 10.12.4.0/22 | metal-admin 10.12.8.0/22 | metal-internal 10.12.12.0/22
  #       | data-tenant 10.12.16.0/22 | storage 10.12.32.0/22 | replication 10.12.36.0/22
  maas admin spaces read | jq -r '.[] | "\(.name)\t\([.subnets[]?.cidr] | join(", "))"' | sort

  echo "=== hosts + capi-mgmt status (baseline) ==="
  maas admin machines read | jq -r '.[]|select(.hostname|test("^(openstack[0-3]|capi-mgmt)$"))|"\(.hostname)\t\(.status_name)"' | sort

  echo "=== OSD vdb baseline (pre-teardown: running, libvirt-qemu:kvm) ==="
  for host in openstack0 openstack1 openstack2 openstack3; do
    f="/var/lib/libvirt/images/${host}-1.qcow2"
    printf '  %-46s state=%s owner=%s mode=%s\n' "$f" \
      "$(sudo virsh -c qemu:///system domstate "$host" 2>/dev/null)" \
      "$(sudo stat -c '%U:%G' "$f" 2>/dev/null)" "$(sudo stat -c '%a' "$f" 2>/dev/null)"
  done
} )
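The six-space check above is a hard blocker, so it may be worth asserting rather than eyeballing. A minimal sketch, assuming the input is exactly the `name<TAB>cidr[, cidr]` listing that the jq filter above prints; `check_spaces` is a hypothetical helper name, and the expected pairs are copied from the comment in the pre-flight block.

```shell
# Sketch: read "name<TAB>cidr[, cidr...]" lines on stdin and report any of the
# six expected D-052/D-053 space/CIDR pairs that is missing.
check_spaces() {
  listing="$(cat)"
  missing=0
  for pair in \
    "provider-public=10.12.4.0/22" "metal-admin=10.12.8.0/22" \
    "metal-internal=10.12.12.0/22" "data-tenant=10.12.16.0/22" \
    "storage=10.12.32.0/22" "replication=10.12.36.0/22"
  do
    name="${pair%%=*}"; cidr="${pair#*=}"
    # Match the space name in column 1 and the CIDR anywhere in column 2.
    if ! printf '%s\n' "$listing" | awk -F '\t' -v n="$name" -v c="$cidr" '
        $1 == n { k = split($2, a, ", "); for (i = 1; i <= k; i++) if (a[i] == c) found = 1 }
        END { exit(found ? 0 : 1) }'
    then
      echo "MISSING: $name $cidr" >&2
      missing=$((missing + 1))
    fi
  done
  # Exit status 0 only when all six pairs were found.
  [ "$missing" -eq 0 ]
}

# Usage:
#   maas admin spaces read \
#     | jq -r '.[] | "\(.name)\t\([.subnets[]?.cidr] | join(", "))"' \
#     | check_spaces || echo "HARD BLOCKER: fix the spaces before teardown"
```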

Step 2 -- Teardown, Path B: DESTROY (D-061) DISCRETE / DESTRUCTIVE

scripts/phase-00-teardown-destroy.sh is the authority: it resolves the four host system_ids live (no hardcoded ids), HARD-EXCLUDES the management substrate (juju, lxd, tailscale), destroys the openstack model (machines decompose -- the destroy path EMBRACES the pod behavior that D-061 documents), and deletes the orphaned capi-mgmt machine. A pre-destroy juju export/status capture runs first (reference only; NOT a restore path).

CHECK (read-only) -- jumphost -- dry-run first (default; changes nothing)

bash scripts/phase-00-teardown-destroy.sh

Expect: the four openstack hosts listed as decompose targets + capi-mgmt as the delete target; PROTECTED juju / lxd / tailscale shown as excluded. Confirm the resolved system_ids look right before applying.

CAUTION: destroys the entire openstack Juju model, DECOMPOSES openstack0-3 out of MAAS, and DELETES the capi-mgmt MAAS machine -- irreversible. Confirm you are on the test cloud, not Roosevelt. The script requires the model name typed at a gate.

RUN -- jumphost

bash scripts/phase-00-teardown-destroy.sh --apply

GATE: juju models shows no openstack; maas admin machines read shows NO openstack0-3 and NO capi-mgmt; sudo virsh list --all still shows the four domains (shut off). If the model is still destroying after ~10 min: juju remove-machine -m openstack --force <id> for each lingering id, then re-run --apply.
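The libvirt third of this gate (four domains present and shut off) can be checked mechanically. The sketch below parses the table printed by `virsh list --all`; `domains_shut_off` is a hypothetical helper name, and it assumes the usual `Id  Name  State` column layout. The juju and MAAS parts of the gate are still read by eye.

```shell
# Sketch: read `virsh list --all` output on stdin; succeed only if
# openstack0..openstack3 are all listed with state "shut off".
domains_shut_off() {
  awk '
    $2 ~ /^openstack[0-3]$/ {
      # The state may be two words ("shut off"), so rejoin fields 3..NF.
      state = $3; for (i = 4; i <= NF; i++) state = state " " $i
      seen[$2] = state
    }
    END {
      bad = 0
      for (n = 0; n <= 3; n++) {
        h = "openstack" n
        if (!(h in seen))               { print "MISSING domain: " h; bad = 1 }
        else if (seen[h] != "shut off") { print h " is " seen[h] " (expected shut off)"; bad = 1 }
      }
      exit bad
    }'
}

# Usage:
#   sudo virsh -c qemu:///system list --all | domains_shut_off
```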

(Path A -- RELEASE -- is NOT part of this sequence. If a machine-preserving teardown is what you need, stop here and run bash scripts/phase-00-teardown-release.sh per its header: canary first, and note the re-acquisition caveat above.)

Step 3 -- Remove the idle 8_lbaas libvirt network (domains off) one-off

Each host still carries an idle virtio NIC on the isolated 8_lbaas libvirt network (bridge virbr6, no L3, ex-lbaas). MAAS has no lbaas space; the NIC is unused. Remove it now while the domains are shut off. This is a one-off jumphost op (Roosevelt is bare metal, no libvirt nets) -- it is NOT part of any phase-00 script; log it to the as-executed log.

CAUTION: detaches a NIC from each host's persistent domain config and undefines a libvirt network. Reversible (XML backed up first); the detach uses --config only (no live change).

Use the two gated blocks from the as-executed log / session notes -- do NOT improvise an irreversible libvirt op:

  • Block 1: back up the network XML to ~/8_lbaas-net.xml.bak; pre-check that every host domstate = shut off (REFUSE otherwise); detach the idle NIC per host (virsh detach-interface <dom> network --mac <mac> --config); verify no domain still references 8_lbaas.
  • Block 2: virsh net-destroy 8_lbaas; virsh net-undefine 8_lbaas; confirm gone.

GATE: sudo virsh net-list --all shows no 8_lbaas; no host domain references it.
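The "no host domain references it" half of the gate is read-only and can be scripted without touching the gated blocks. A minimal sketch, assuming the `virsh domiflist` table keeps Source as its third column; `refs_lbaas` is a hypothetical helper name. It only inspects output and does not replace Block 1 or Block 2.

```shell
# Sketch: read `virsh domiflist <dom>` output on stdin.
# Exit 0 when some interface row still has Source == 8_lbaas, else exit 1.
refs_lbaas() {
  awk '$3 == "8_lbaas" { hit = 1 } END { exit(hit ? 0 : 1) }'
}

# Usage (read-only):
#   for h in openstack0 openstack1 openstack2 openstack3; do
#     sudo virsh -c qemu:///system domiflist "$h" | refs_lbaas \
#       && echo "STILL REFERENCED by $h"
#   done
```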

Step 4 -- OSD secondary-disk wipe (clean-slate Ceph) DISCRETE / DESTRUCTIVE

Only after the Step 2 GATE is green (model gone, domains shut off) AND an explicit go has been given. vda (the OS disk) is NOT touched -- MAAS reinstalls it on deploy; only vdb (the OSD target) is recreated blank.

CAUTION: deletes and recreates each host's vdb OSD disk (512G blank) -- destroys all Ceph OSD data. vda is untouched. Domains must be shut off. (R7: sudo for qemu-img.)

RUN -- jumphost

OWNER="root:root"; MODE="600"
for host in openstack0 openstack1 openstack2 openstack3; do
  f="/var/lib/libvirt/images/${host}-1.qcow2"
  echo "=== Wiping $f ==="
  sudo rm -f "$f"
  sudo qemu-img create -f qcow2 "$f" 512G
  sudo chown "$OWNER" "$f"; sudo chmod "$MODE" "$f"
done
# verify
for host in openstack0 openstack1 openstack2 openstack3; do
  sudo qemu-img info "/var/lib/libvirt/images/${host}-1.qcow2" | grep -E 'virtual size|disk size'
done

GATE: 4 files, ~200 KiB actual / 512 GiB virtual, root:root mode 600.
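The ownership and mode part of this gate can be asserted rather than read off. A sketch, assuming the input is one `path owner:group mode` line per file as produced by `stat -c '%n %U:%G %a'`; `check_osd_perms` is a hypothetical helper name. It does not look at sizes, so the `qemu-img info` output above is still checked by eye.

```shell
# Sketch: read "path owner:group mode" lines on stdin; succeed only if there
# are exactly four lines and every one is root:root with mode 600.
check_osd_perms() {
  awk '
    { n++ }
    $2 != "root:root" || $3 != "600" { print "MISMATCH: " $0; bad = 1 }
    END {
      if (n != 4) { print "expected 4 files, saw " (n + 0); bad = 1 }
      exit (bad + 0)
    }'
}

# Usage:
#   for h in openstack0 openstack1 openstack2 openstack3; do
#     sudo stat -c '%n %U:%G %a' "/var/lib/libvirt/images/${h}-1.qcow2"
#   done | check_osd_perms
```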

Step 5 -- Reenroll the hosts into MAAS (post-decompose)

scripts/reenroll-hosts.sh re-creates the MAAS machine objects for the four still-existing libvirt domains (it does NOT create VMs) and polls them through auto-commissioning to Ready. It is discover-assert-pin idempotent (only creates the missing) and reads the libvirt SSH password interactively -- never as an argument. NOTE the script header's standing item: the libvirt SSH credential was exposed by power-parameters on 2026-06-26 -- rotate it after the rebuild.
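To make "only creates the missing" concrete: the idea is a set difference between the four expected hostnames and whatever MAAS already lists. The sketch below illustrates that idea only; it is not taken from reenroll-hosts.sh, and `missing_hosts` is a hypothetical name.

```shell
# Illustration of the idempotency idea (NOT the script's actual code):
# read the hostnames MAAS already knows on stdin, print the expected ones
# that are absent. An empty result means there is nothing to create.
missing_hosts() {
  present="$(cat)"
  for h in openstack0 openstack1 openstack2 openstack3; do
    printf '%s\n' "$present" | grep -qxF "$h" || echo "$h"
  done
}

# Usage:
#   maas admin machines read | jq -r '.[].hostname' | missing_hosts
```

A second run after a successful enrollment prints nothing, which is what makes a re-run safe.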

CHECK (read-only) -- jumphost

bash scripts/reenroll-hosts.sh --check

RUN -- jumphost

bash scripts/reenroll-hosts.sh

GATE: all four openstack0-3 report Ready (script exit 0). System_ids are NEWLY MINTED this enrollment (DOCFIX-040) -- never reuse ids from a previous cycle or a document.
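The script's exit code is the primary signal, but the Ready state can be re-confirmed independently with the same `hostname<TAB>status` listing used in the Step 1 pre-flight. A sketch under that assumption; `all_ready` is a hypothetical helper name.

```shell
# Sketch: read "hostname<TAB>status_name" lines on stdin; succeed only if
# openstack0..openstack3 are all present and all Ready.
all_ready() {
  awk -F '\t' '
    $1 ~ /^openstack[0-3]$/ { st[$1] = $2 }
    END {
      bad = 0
      for (n = 0; n <= 3; n++) {
        h = "openstack" n
        if (!(h in st))            { print h ": not in MAAS"; bad = 1 }
        else if (st[h] != "Ready") { print h ": " st[h] " (expected Ready)"; bad = 1 }
      }
      exit bad
    }'
}

# Usage:
#   maas admin machines read \
#     | jq -r '.[] | "\(.hostname)\t\(.status_name)"' | all_ready
```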

Step 6 -- Pattern A interface re-carve (per host; machines Ready)

scripts/carve-host-interfaces.sh rebuilds each host's interface tree to Pattern A on the EXISTING D-052/D-053 subnets:

  • enp1s0 -> OVS br-ex + STATIC 10.12.4.N (provider-public) -- MAAS builds the OVS bridge; ovn-chassis consumes it (bridge-interface-mappings + physnet1:br-ex), API containers attach.
  • enp7s0 -> br-metal (STATIC 10.12.8.N) -> br-metal.103 -> br-internal (STATIC 10.12.12.N).
  • enp8s0 / enp9s0 / enp10s0 raw + STATIC on data 10.12.16.N / storage 10.12.32.N / replication 10.12.36.N.

It resolves every id live, is idempotent, and requires Ready (interface edits are rejected on Deployed).
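The Pattern A layout above can be written out per host as a small lookup, which is handy when comparing a dry-run plan against the intended result. This is a sketch only: `pattern_a_plan` is a hypothetical helper, the runbook does not state how N is assigned to each host, so N must be supplied by the operator, and the pairing of enp8s0/enp9s0/enp10s0 with data/storage/replication assumes the bullet lists them in matching order.

```shell
# Sketch: print "interface<TAB>space<TAB>static address" for one host.
# N is the operator-supplied final octet; its per-host value is NOT defined here.
pattern_a_plan() {
  n="$1"
  case "$n" in
    ''|*[!0-9]*) echo "usage: pattern_a_plan <N>" >&2; return 2 ;;
  esac
  # printf reuses the format for each group of three arguments.
  printf '%s\t%s\t%s\n' \
    br-ex       provider-public "10.12.4.$n" \
    br-metal    metal-admin     "10.12.8.$n" \
    br-internal metal-internal  "10.12.12.$n" \
    enp8s0      data-tenant     "10.12.16.$n" \
    enp9s0      storage         "10.12.32.$n" \
    enp10s0     replication     "10.12.36.$n"
}

# Usage: pattern_a_plan <N>   (prints six rows, one per plane)
```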

CHECK (read-only) -- jumphost -- dry-run each host first (default)

for h in openstack0 openstack1 openstack2 openstack3; do bash scripts/carve-host-interfaces.sh "$h"; done

Expect: each plan ends Summary: 0 fatal; the provider plane shows create br-ex (OVS) parent=enp1s0 and br-ex -> STATIC 10.12.4.N; metal / internal / data / storage / replication statics as above. No br-prov-api, no enp1s0.104, no provider-vip.

CAUTION: mutates MAAS interface definitions on each host. Re-runnable (idempotent), but apply ONE host at a time and re-read the resulting tree.

RUN -- jumphost (per host)

bash scripts/carve-host-interfaces.sh openstack0 --apply
# then openstack1, openstack2, openstack3 (one at a time)

GATE: each host shows br-ex (type ovs) STATIC 10.12.4.N; br-metal 10.12.8.N; br-internal 10.12.12.N; enp8s0/enp9s0/enp10s0 STATIC on 10.12.16/32/36.N.
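Each plane is a /22, so a static address can sit in any of four consecutive third-octet values and still belong to the right subnet; checking membership arithmetically avoids misreading that. A minimal sketch in plain shell arithmetic; `ip_to_int` and `ip_in_cidr` are hypothetical helper names, and the addresses in the usage comment are examples, not values read from MAAS.

```shell
# Sketch: pack a dotted-quad IPv4 address into one integer.
ip_to_int() {
  old_ifs="$IFS"; IFS=.
  set -- $1            # split a.b.c.d into $1..$4
  IFS="$old_ifs"
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeeds when address $1 lies inside CIDR $2 (e.g. 10.12.4.0/22).
ip_in_cidr() {
  ip="$1"; net="${2%/*}"; bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
}

# Usage:
#   ip_in_cidr 10.12.4.10 10.12.4.0/22 && echo "on provider-public"
#   ip_in_cidr 10.12.8.10 10.12.4.0/22 || echo "NOT on provider-public"
```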

Step 7 -- Standup + bundle gate (READ-ONLY; before deploy)

CHECK (read-only) -- jumphost -- MAAS topology

bash scripts/phase-00-maas-standup.sh

Expect: no drift and OK (dryrun) -- topology consistent with D-052/D-053. Any DRIFT line is a hard stop (do not deploy onto a mis-bound plane).
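The hard stop can be enforced by filtering the standup output instead of relying on a visual scan. The sketch below assumes, from the wording above, that drift lines carry the upper-case literal DRIFT while the healthy summary says "no drift" in lower case; the exact output format of phase-00-maas-standup.sh is not confirmed here, and `drift_gate` is a hypothetical helper name.

```shell
# Sketch: read the standup output on stdin; fail if any line contains the
# upper-case token DRIFT. The match is case-sensitive on purpose so the
# healthy "no drift" summary does not trip it.
drift_gate() {
  out="$(cat)"
  if printf '%s\n' "$out" | grep -q 'DRIFT'; then
    printf '%s\n' "$out" | grep 'DRIFT' >&2
    echo "HARD STOP: drift reported; do not deploy" >&2
    return 1
  fi
  echo "no DRIFT lines seen"
}

# Usage:
#   bash scripts/phase-00-maas-standup.sh | drift_gate
```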

CHECK (read-only) -- jumphost -- bundle invariants

python3 scripts/provider-bundle-check.py bundle.yaml

Expect: PASS -- 11 charms public->provider-public, .4/.8/.12 VIP triples, all 4 chassis MACs present (incl openstack0), relations well-formed, mysql at 3 units (D-062), keystone policyd resource wired (DOCFIX-071).


EXIT GATE (phase-00 complete)

  • juju models shows no openstack; openstack0-3 all Ready (fresh system_ids); capi-mgmt DELETED.
  • 8_lbaas libvirt network gone; no host domain references it.
  • OSD vdb files 512 GiB blank (root:root, 600) on all four hosts.
  • Pattern A interfaces on all four: br-ex (OVS) STATIC .4.N; br-metal .8.N; br-internal .12.N; data / storage / replication .16/.32/.36.N.
  • phase-00-maas-standup.sh reports no drift; provider-bundle-check.py PASSes.
  • Clean slate ready for phase-01. The deploy uses ONE overlay (octavia-pki) -- NOT the vr0-dc0-testcloud overlay (its intent is folded into the hardened base bundle).

Next

phase-01 -- bundle deploy.