diff --git a/runbooks/README.md b/runbooks/README.md index 7e595e6..d0a2cf4 100644 --- a/runbooks/README.md +++ b/runbooks/README.md @@ -1,5 +1,14 @@ # v1 Deploy Runbook -- VR0 DC0 Omega Cloud (Caracal 2024.1, IPv4) +## Command-label convention +Every command block below is bracketed by bold labels, so a command line is never mistaken +for surrounding prose (these render in GitBucket and read clearly in a raw editor): +- **RUN -- LOC** -- the block CHANGES state; run it at LOC (e.g. `jumphost`, `vault/0`, `jumphost -> magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. + The deploy is a gated sequence: run `phase-00` through `phase-08` in order. Each phase ends in a hard gate (an explicit pass/fail check); do not start the next phase until the current gate passes. The two appendices are reference, not steps. diff --git a/runbooks/phase-00-teardown-maas-reset.md b/runbooks/phase-00-teardown-maas-reset.md index 433376e..b724724 100644 --- a/runbooks/phase-00-teardown-maas-reset.md +++ b/runbooks/phase-00-teardown-maas-reset.md @@ -46,8 +46,18 @@ --- +## Command-label convention +Every command block below is bracketed by bold labels, so a command line is never mistaken +for surrounding prose (these render in GitBucket and read clearly in a raw editor): +- **RUN -- LOC** -- the block CHANGES state; run it at LOC (e.g. `jumphost`, `vault/0`, `jumphost -> magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. 
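A minimal sketch of how the label convention above could be checked mechanically. It assumes each command block opens with a three-backtick fence at the start of a line and that its label is the nearest non-blank line before that fence. The function name `unlabeled_fences` is hypothetical and not part of the runbooks.

```bash
# Hypothetical lint helper (illustrative only; not shipped with the runbooks).
# Prints "file:line: unlabeled block" for every opening fence whose nearest
# preceding non-blank line is not a RUN or CHECK label.
unlabeled_fences() {
  awk '
    # Build the three-backtick marker at runtime so this snippet never
    # contains a literal fence of its own.
    BEGIN { fence = sprintf("%c%c%c", 96, 96, 96) }
    index($0, fence) == 1 {
      if (in_fence) { in_fence = 0; prev = ""; next }   # closing fence: forget the old label
      in_fence = 1                                       # opening fence
      if (prev !~ /^\*\*(RUN|CHECK \(read-only\)) -- /)
        printf "%s:%d: unlabeled block\n", FILENAME, FNR
      next
    }
    !in_fence && NF { prev = $0 }                        # remember the last non-blank prose line
  ' "$1"
}
```

Called as `unlabeled_fences runbooks/phase-00-teardown-maas-reset.md`, it prints one entry per block that lacks a label and nothing when every block is labeled. It only checks that a label is present; whether a block is really read-only remains a reviewer's call.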
+ ## Phase 0 -- Pre-flight (READ-ONLY; run before teardown) -`# RUN: jumphost` + +**RUN -- jumphost** ```bash ( { echo "=== 0a. five network spaces (hard blocker if absent) ===" @@ -69,6 +79,8 @@ done # enp8s0(data) is the one KNOWN unlinked + a HARD deploy prereq; enp9s0/enp10s0 usually already linked } ) ``` + +**RUN -- jumphost** ```bash # 0d. OSD-wipe pre-flight gate -- post-teardown these are "shut off"; vdb is root:root / 600. (R7: sudo) for host in openstack0 openstack1 openstack2 openstack3; do @@ -81,7 +93,8 @@ ``` ## Phase 1 -- Teardown (D-018) DISCRETE / DESTRUCTIVE -`# RUN: jumphost` + +**RUN -- jumphost** ```bash # A. pre-destroy capture (reference only; NOT for restore) TS=$(date -u +%Y%m%dT%H%M%SZ); BACKUP_DIR=$HOME/backups/pre-caracal-destroy-$TS; mkdir -p "$BACKUP_DIR" @@ -90,16 +103,28 @@ for f in "$BACKUP_DIR"/*.yaml; do [ -s "$f" ] || echo "WARNING: $f empty"; done echo "$BACKUP_DIR" > "$HOME/.last-pre-caracal-destroy-backup"; ls -la "$BACKUP_DIR" ``` + +> CAUTION: destroys the entire `openstack` Juju model -- irreversible. The controller is +> untouched, but every app/unit/relation is reaped. Confirm you are on the testcloud, not Roosevelt. + +**RUN -- jumphost** ```bash # B. destroy the openstack model (returns ~1-2 min; reaping ~5-10 min background). Controller untouched. juju destroy-model openstack --force --no-wait --destroy-storage --no-prompt ``` + +> CAUTION: releases the four openstack hosts back to MAAS (erase + power off). Hardcoded +> system_ids (DOCFIX-017) -- does NOT touch the capi-mgmt host. + +**RUN -- jumphost** ```bash # C. release the FOUR openstack hosts by system_id (DOCFIX-017: hardcoded ids, no whoami). NOT capi-mgmt. for SID in 4na83t qdbqd6 h8frng tmsafc; do echo "Releasing $SID..."; maas admin machine release "$SID" comment="Caracal rebuild teardown $TS" done ``` + +**CHECK (read-only) -- jumphost** ```bash # D. 
verify juju models # expect: no 'openstack' (allow a few min) @@ -107,16 +132,21 @@ | jq -r '.[] | select(.hostname|test("^openstack[0-3]$")) | "\(.hostname)\t\(.status_name)"' | sort # expect four lines, each ending "Ready" ``` -GATE: `juju models` shows no `openstack`; openstack0-3 all Ready. (`link-subnet` is +**GATE:** `juju models` shows no `openstack`; openstack0-3 all Ready. (`link-subnet` is REJECTED on a Deployed machine -- Phases 2-3 REQUIRE Ready.) If the model is still `destroying` after ~10 min: `juju machines -m openstack --format=yaml`, then `juju remove-machine -m openstack --force ` for each lingering id, then re-run the destroy-model in B. ## Phase 2 -- OSD secondary-disk wipe (clean-slate Ceph) DISCRETE / DESTRUCTIVE -`# RUN: jumphost (libvirt host; R7 sudo)` Only after Phase 0d is GREEN (all "shut +Only after Phase 0d is GREEN (all "shut off") AND explicit go. vda (the OS disk) is NOT touched -- MAAS reinstalls it on deploy; only vdb (the OSD target) is recreated blank. + +> CAUTION: deletes and recreates each host's vdb OSD disk (512G blank) -- destroys all Ceph +> OSD data. vda (OS disk) is untouched. Run only after Phase 0d is GREEN and on explicit go. + +**RUN -- jumphost** ```bash OWNER="root:root"; MODE="600" for host in openstack0 openstack1 openstack2 openstack3; do @@ -132,13 +162,15 @@ sudo qemu-img info "/var/lib/libvirt/images/${host}-1.qcow2" | grep -E 'virtual size|disk size' done ``` -GATE: 4 files, ~200 KiB actual / 512 GiB virtual, root:root mode 600. +**GATE:** 4 files, ~200 KiB actual / 512 GiB virtual, root:root mode 600. ## Phase 3 -- Storage-class NIC links (idempotent; machines Ready) -`# RUN: jumphost` Links every storage-class NIC to its space's subnet. enp8s0 (data) +Links every storage-class NIC to its space's subnet. enp8s0 (data) is the one KNOWN unlinked and a HARD deploy prereq (nova-compute:neutron-plugin->data, octavia:ovsdb-cms->data, chassis data bindings). 
enp9s0/enp10s0 back the C2 Ceph public/cluster bindings; this links them too only if not already linked. + +**RUN -- jumphost** ```bash declare -A NIC_CIDR=( [enp8s0]=10.12.12.0/22 [enp9s0]=10.12.16.0/22 [enp10s0]=10.12.20.0/22 ) declare -A HOST_OCTET=( [4na83t]=40 [qdbqd6]=41 [h8frng]=42 [tmsafc]=43 ) @@ -167,20 +199,24 @@ | jq -r '.[] | select(.name|test("^enp(8|9|10)s0$")) | " \(.name)\t\([.links[]?|{(.subnet.cidr):.ip_address}])"' done ``` -GATE: each host's enp8s0/enp9s0/enp10s0 shows a 10.12.{12,16,20}.4N STATIC link. +**GATE:** each host's enp8s0/enp9s0/enp10s0 shows a 10.12.{12,16,20}.4N STATIC link. ## Phase 4 -- MAAS VIP/FIP address carve (mutation; confirm-first) -`# RUN: jumphost` The bundle's VIPs live in the front-loaded /26 blocks; the FIP +The bundle's VIPs live in the front-loaded /26 blocks; the FIP pool (phase-04) lives at 10.12.5.0-10.12.7.254. These MAAS reservations persist across teardown, so on a repeat rebuild they usually already exist -- verify, create only if absent, and delete the stale old-scheme reservation. (KI-P3-001: a reserved range stops MAAS auto-static landing a primary on a configured VIP.) + +**CHECK (read-only) -- jumphost** ```bash # 4a. verify current state maas admin ipranges read | jq -r '.[] | "id=\(.id)\t\(.type)\t\(.start_ip)-\(.end_ip)\tsubnet=\(.subnet.cidr // "?")\t\(.comment // "")"' | sort # want present: provider .4.2-.63 (subnet 1), metal .8.2-.63 (subnet 2), provider FIP .5.0-.7.254. # want absent : metal .8.224-.254 (stale). ``` + +**RUN -- jumphost** ```bash # 4b. create the front-loaded VIP reservations ONLY if absent (idempotent; carve doc section 8) ( { @@ -202,16 +238,19 @@ fi } ) ``` + +**RUN -- jumphost** ```bash # 4c. 
delete the stale .224-.254 metal reservation -- CONFIRM the id from 4a first (this arc: id=2) # maas admin iprange delete ``` -GATE: `ipranges read` shows provider FIP + provider VIPs .4.2-.63 + metal VIPs +**GATE:** `ipranges read` shows provider FIP + provider VIPs .4.2-.63 + metal VIPs .8.2-.63; the metal .8.224-.254 reservation is gone; the metal DHCP dynamic (10.12.9.0-10.12.11.254) is unchanged. ## Phase 5 -- Post-prep verification (READ-ONLY gate before deploy) -`# RUN: jumphost` + +**CHECK (read-only) -- jumphost** ```bash ( { maas admin spaces read | jq -r '.[] | "\(.name)\t\([.subnets[]?.cidr] | join(", "))"' # DOCFIX-026: 5 spaces (juju spaces FAILS here -- model gone post-teardown) diff --git a/runbooks/phase-01-bundle-deploy.md b/runbooks/phase-01-bundle-deploy.md index 8b745ba..b323278 100644 --- a/runbooks/phase-01-bundle-deploy.md +++ b/runbooks/phase-01-bundle-deploy.md @@ -30,6 +30,16 @@ --- +## Command-label convention +Every command block below is bracketed by bold labels, so a command line is never mistaken +for surrounding prose (these render in GitBucket and read clearly in a raw editor): +- **RUN -- LOC** -- the block CHANGES state; run it at LOC (e.g. `jumphost`, `vault/0`, `jumphost -> magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. + + ## Step 1.0 -- Octavia PKI overlay (secret-handling prereq) DISCRETE `overlays/octavia-pki.yaml` carries the 5 lb-mgmt-* PKI keys (controller CA/cert, issuing CA key+passphrase+cert). It is the ONLY overlay in the deploy command and is @@ -37,6 +47,8 @@ 10y, so it survives rebuilds). REGENERATION path (fresh CAs): run the discrete secret procedure inlined as "Step 1.0-GEN" at the end of this phase. 
Either way, confirm the overlay parses and contains exactly the 5 keys (sanity block below) before deploying. + +**CHECK (read-only) -- jumphost** ```bash # RUN: jumphost -- sanity only (does NOT print key material) [ -f overlays/octavia-pki.yaml ] && grep -cE 'lb-mgmt-' overlays/octavia-pki.yaml # expect 5 keys @@ -44,8 +56,10 @@ ``` ## Step 1.1 -- Pre-deploy verify (read-only; 4 checks) -`# RUN: jumphost` One consolidated read-only block. NO `set -e` (a guarded count of +One consolidated read-only block. NO `set -e` (a guarded count of 0 is a valid answer, not a failure -- appendix-A: L1); count greps are `|| true`. + +**CHECK (read-only) -- jumphost** ```bash ( { echo "=== CHECK 1: bundle VIPs (quote-tolerant, octet-anchored) ===" @@ -73,6 +87,8 @@ | "\(.hostname) \(.status_name) power=\(.power_state)"' } ) ``` + +**CHECK (read-only) -- jumphost** ```bash # CHECK 4b: OSD /dev/vdb blank (DOCFIX-027 -- LOCAL libvirt-host loop, NOT ssh: the four # hosts are Released/powered-off entering phase-01, and /var/lib/libvirt/images is a @@ -82,10 +98,12 @@ sudo qemu-img info "/var/lib/libvirt/images/${h}-1.qcow2" | grep -E 'virtual size|disk size' done # expect virtual 512 GiB, disk ~200 KiB (sparse/blank) ``` -GATE: VIPs 11/11/0; enp8s0 linked on all 4; subnet DNS as above; 4 nodes Ready; OSD blank. +**GATE:** VIPs 11/11/0; enp8s0 linked on all 4; subnet DNS as above; 4 nodes Ready; OSD blank. ## Step 1.2 -- Dry-run (guarded) -`# RUN: jumphost` Refuse to add a model if `openstack` already exists; require the overlay. +Refuse to add a model if `openstack` already exists; require the overlay. 
+ +**RUN -- jumphost** ```bash ( { juju models 2>&1 | tee /tmp/jmodels.txt @@ -99,7 +117,7 @@ fi } ) ``` -GATE (from the plan): 50 apps, 97 relations, 4 machines (8/9/10/11 -> 0/1/2/3), 24 LXD; +**GATE:** (from the plan): 50 apps, 97 relations, 4 machines (8/9/10/11 -> 0/1/2/3), 24 LXD; ceph-osd/0-3 one per node; nova-compute/0-2 on machines 1/2/3 ONLY (machine 0 = OSD+LXD host, no compute); channels match the matrix; relations include `octavia:certificates - vault:certificates`, `vault:shared-db - vault-mysql-router`, @@ -107,8 +125,10 @@ (D-019). Only the two benign R11 warnings (L34 `name`, L55 `variables`). ## Step 1.3 -- Deploy (VIP-guarded) -`# RUN: jumphost` Re-run the VIP guard inline (the dry-run never echoes vip values), +Re-run the VIP guard inline (the dry-run never echoes vip values), then deploy only if 11/11/0. + +**RUN -- jumphost** ```bash ( { TOT=$(grep -cE '^[[:space:]]+vip:[[:space:]]*"?10\.12\.4\.' bundle.yaml || true) @@ -131,13 +151,15 @@ descends into subordinates; neither replaces the phase gates. ## Step 1.4 -- DNS gate during deploy (as machines come up) -`# RUN: jumphost` Run when machine 0 reaches `started`, then per LXD unit as they +Run when machine 0 reaches `started`, then per LXD unit as they appear (flag BEFORE the target; logic inside the remote quotes; no outer 2>/dev/null): + +**CHECK (read-only) -- jumphost** ```bash juju ssh -m openstack 0 -- 'resolvectl status | grep -i "DNS Server"; getent hosts api.snapcraft.io && echo OK || echo FAIL' # repeat for ceph-mon/0, mysql-innodb-cluster/0 as they appear ``` -GATE: each returns OK (api.snapcraft.io resolves -> the snap install storm proceeds +**GATE:** each returns OK (api.snapcraft.io resolves -> the snap install storm proceeds clean). 
FINDING (non-blocking, R15): the unreachable region resolver `10.12.8.10` (MAAS region/rack controller, advertised on the metal VLAN independent of the subnet field) may still appear in a node's resolver list -- resolution succeeds because @@ -194,6 +216,10 @@ --- ## Step 1.0-GEN -- Octavia management-PKI generation (regeneration path) DISCRETE / SECRET + +> CAUTION: SECRET step -- generates Octavia CA private keys + passphrases. Do NOT echo or log key +> material (only cert dates/subjects + verify-OK are printed). The overlay it writes is gitignored. + Run ONLY if you are not reusing an existing `overlays/octavia-pki.yaml`. Produces the two-tier EC PKI for Charmed Octavia's amphora trust domain and writes the overlay. Decisions (Workstream 3a, 2026-05-22): fresh generation; EC P-384 CAs (SHA-384, 10y); @@ -209,6 +235,8 @@ - `lb-mgmt-controller-cert` = base64(controller cert + key, concatenated) ### 1.0-GEN.0 -- workspace (openssl 3.x; $HOME only -- snap home-confinement, never /tmp) + +**RUN -- jumphost** ```bash # RUN: jumphost WORKDIR="$HOME/octavia-pki" @@ -218,6 +246,8 @@ ``` ### 1.0-GEN.a -- Issuing CA (EC P-384, AES-256 encrypted key, self-signed 10y) + +**RUN -- jumphost** ```bash ( { WORKDIR="$HOME/octavia-pki"; cd "$WORKDIR/issuing-ca" || exit 1 # dir from 1.0-GEN.a @@ -238,6 +268,8 @@ ### 1.0-GEN.b -- Controller CA (EC P-384, AES-256 encrypted key, self-signed 10y; own passphrase) The controller CA key is encrypted (its own passphrase) for future controller-cert rotation -- Octavia never receives this key, only the controller CA cert. + +**RUN -- jumphost** ```bash ( { WORKDIR="$HOME/octavia-pki"; cd "$WORKDIR/controller-ca" || exit 1 # dir from 1.0-GEN.a @@ -258,6 +290,8 @@ ### 1.0-GEN.c -- Controller cert (EC P-256 UNENCRYPTED, SAN, signed by Controller CA, 2y) The P-256 key is unencrypted -- Octavia reads it at startup. SAN carries the controller FQDN, the octavia API FQDN, and the Octavia API VIP 10.12.4.233. 
+ +**RUN -- jumphost** ```bash ( { WORKDIR="$HOME/octavia-pki"; cd "$WORKDIR/controller" || exit 1 # dir from 1.0-GEN.a @@ -302,6 +336,8 @@ ### 1.0-GEN.d -- Write overlays/octavia-pki.yaml (base64 blobs + plaintext passphrase) Four values are base64(PEM); the issuing-CA passphrase is a PLAIN string. The file is gitignored. Set `$REPO` to the jumphost clone (the dir holding bundle.yaml + overlays/). + +**RUN -- jumphost** ```bash ( { WORKDIR="$HOME/octavia-pki"; cd "$WORKDIR" || exit 1 # dir from 1.0-GEN.a diff --git a/runbooks/phase-02-vault-bringup.md b/runbooks/phase-02-vault-bringup.md index 74582e4..1cc6b62 100644 --- a/runbooks/phase-02-vault-bringup.md +++ b/runbooks/phase-02-vault-bringup.md @@ -37,16 +37,33 @@ --- +## Command-label convention +Every command block below is bracketed by bold labels, so a command line is never mistaken +for surrounding prose (these render in GitBucket and read clearly in a raw editor): +- **RUN -- LOC** -- the block CHANGES state; run it at LOC (e.g. `jumphost`, `vault/0`, `jumphost -> magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. + + ## Step 2.1 -- Vault init [IRREVERSIBLE ONE-SHOT -- run verbatim] DISCRETE -`# RUN: on vault/0` Open the session, set the loopback addr, pre-check fresh, then +Open the session, set the loopback addr, pre-check fresh, then init with the `2>&1 | tee` capture (NOT `>`). Save `~/vault-init/init.txt` off-host the moment the gate passes. 
+ +**RUN -- jumphost (opens the vault/0 session)** ```bash # RUN: jumphost -- open the interactive session ONLY (paste this line alone; DOCFIX-029) juju ssh -m openstack vault/0 ``` WAIT for the remote prompt (`ubuntu@juju-...`) before pasting the next block -- a combined paste buffers the in-session lines and feeds them to the session on connect. + +> CAUTION: `vault operator init` is an IRREVERSIBLE one-shot. The moment the GATE passes, +> save the 5 unseal shares + root token off-host -- they cannot be recovered if lost. + +**RUN -- vault/0** ```bash # --- inside the vault/0 session: --- export VAULT_ADDR=http://127.0.0.1:8200 ; umask 077 ; mkdir -p ~/vault-init @@ -55,15 +72,17 @@ grep -c '^Unseal Key' ~/vault-init/init.txt # GATE: MUST print 5 grep -q '^Initial Root Token:' ~/vault-init/init.txt && echo TOKEN_OK || echo MISSING ``` -GATE: `5` unseal keys AND `TOKEN_OK`. If the count is not 5 or the token is MISSING, +**GATE:** `5` unseal keys AND `TOKEN_OK`. If the count is not 5 or the token is MISSING, STOP -- do not proceed (the empty-file case is the DOCFIX-006 catch). Now SAVE the 5 shares + root token off-host (operator secret store) before continuing. Do NOT batch this with unseal. ## Step 2.2 -- Vault unseal (3 of 5) DISCRETE (re-runnable) -`# RUN: on vault/0` Use Vault's OWN hidden prompt -- the key is never on the command +Use Vault's OWN hidden prompt -- the key is never on the command line, in a var, or in scrollback (appendix-A: L4). Do NOT use `vault operator unseal $K` (that puts the key in `ps`/argv). + +**RUN -- vault/0** ```bash # --- inside the vault/0 session: --- export VAULT_ADDR=http://127.0.0.1:8200 @@ -72,7 +91,7 @@ vault operator unseal # prompts hidden; paste share 3 -> 3/3 vault status 2>&1 | grep -E 'Sealed|Initialized|Storage Type|HA Enabled' ``` -GATE: progress 1/3 -> 2/3 -> 3/3, then `Sealed false`. Expected final: Initialized +**GATE:** progress 1/3 -> 2/3 -> 3/3, then `Sealed false`. 
Expected final: Initialized true / Sealed false / Storage Type mysql / **HA Enabled false** (CORRECT for single-unit vault-on-mysql -- appendix-A: R3; any "HA true / etcd" reference is stale). @@ -86,10 +105,14 @@ token (not the root token -- `juju run` persists action params in the operation log, so a minutes-lived token self-limits), then generate the root CA (DOCFIX-014 -- without it vault stays blocked "Missing CA cert"). + +**CHECK (read-only) -- jumphost** ```bash # RUN: jumphost -- schema (read-only): authorize-charm requires `token` (direct-token path) juju actions vault --schema --format yaml -m openstack | sed -n '/authorize-charm:/,/^[a-z]/p' ``` + +**RUN -- jumphost (opens the vault/0 session)** ```bash # RUN: jumphost -- open the interactive session ONLY (paste this line alone; DOCFIX-029) juju ssh -m openstack vault/0 @@ -98,6 +121,8 @@ `read -s` -- a combined paste would let read swallow the next buffered line as the secret. NO trailing `exit`: exit MANUALLY after copying the child token (a paste-ahead `exit` could self-terminate the session and mask the swallow). + +**RUN -- vault/0** ```bash # --- inside the session: mint a short-lived child token (root entered hidden, never on argv/history) --- export VAULT_ADDR=http://127.0.0.1:8200 @@ -106,6 +131,8 @@ unset VAULT_TOKEN # (exit manually after you have copied the child token) ``` + +**RUN -- jumphost** ```bash # RUN: jumphost -- authorize + root CA + status (each juju run blocks to completion) # ENHANCEMENT-2: enter the child token via hidden read (keeps it out of jumphost shell @@ -117,7 +144,7 @@ juju run vault/leader generate-root-ca -m openstack juju status vault -m openstack ``` -GATE: authorize-charm completes; generate-root-ca returns the root CA PEM ("Vault Root +**GATE:** authorize-charm completes; generate-root-ca returns the root CA PEM ("Vault Root Certificate Authority (charm-pki-local)"); vault/0 -> active/idle "Unit is ready". 
The "Missing CA cert" block clears straight to active (validates DOCFIX-014). (`mlock: disabled` is expected/benign for snap/container vault without IPC_LOCK.) diff --git a/runbooks/phase-03-core-verify.md b/runbooks/phase-03-core-verify.md index 1fdbead..be02c2e 100644 --- a/runbooks/phase-03-core-verify.md +++ b/runbooks/phase-03-core-verify.md @@ -30,15 +30,29 @@ --- +## Command-label convention +Every command block below is bracketed by bold labels, so a command line is never mistaken +for surrounding prose (these render in GitBucket and read clearly in a raw editor): +- **RUN -- LOC** -- the block CHANGES state; run it at LOC (e.g. `jumphost`, `vault/0`, `jumphost -> magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. + + ## Step 3.1 -- Settle the cert cascade + acceptance walk -`# RUN: jumphost` The cascade here is NARROW (mysql bootstrapped before vault init, +The cascade here is NARROW (mysql bootstrapped before vault init, so only the Vault consumers clear: ovn-central x3, ovn-chassis x3, ovn-chassis-octavia, neutron-api-plugin-ovn, barbican-vault). Watch, then walk units AND subordinates. 
+ +**CHECK (read-only) -- jumphost** ```bash juju status --color --watch 30s -m openstack # Ctrl-C once settled ``` Acceptance walk (counts non-active/idle across units + subordinates): + +**CHECK (read-only) -- jumphost** ```bash juju status -m openstack --format=yaml | python3 -c " import yaml,sys @@ -55,22 +69,26 @@ for b in bad: print(' '+b) " ``` -GATE: expected non-active/idle = **1** (octavia/0 BLOCKED "Awaiting configure-resources", +**GATE:** expected non-active/idle = **1** (octavia/0 BLOCKED "Awaiting configure-resources", the D-021 next step) or briefly **2** (+ glance-simplestreams-sync, normal pre-run). Any TLS consumer (the five above) persisting waiting/error past ~15 min is the concern -- STOP and read its log + relations (do NOT assume TLS; a prior stall was a MySQL 1045 desync): + +**CHECK (read-only) -- jumphost** ```bash juju status --relations -m openstack ovn-central ovn-chassis ovn-chassis-octavia neutron-api-plugin-ovn barbican-vault # juju ssh -m openstack -- 'sudo tail -120 /var/log/juju/unit-.log' plaintext checks vs the SSL backend). Probe haproxy's own verdict on every unit: + +**CHECK (read-only) -- jumphost** ```bash ( { echo "=== POST-TLS GATE: haproxy backend health sweep across all units ===" @@ -80,8 +98,10 @@ echo "=== sweep complete -- no DOWN lines above means every haproxy backend is UP ===" } ) ``` -GATE: zero `[unit] DOWN:` lines. On a DOWN line (check token L7STS/400 == plaintext-vs-SSL), +**GATE:** zero `[unit] DOWN:` lines. 
On a DOWN line (check token L7STS/400 == plaintext-vs-SSL), remediate the flagged unit (set U, then validate-and-reload): + +**CHECK (read-only) -- jumphost** ```bash U=nova-cloud-controller/0 juju ssh -m openstack "$U" -- 'sudo haproxy -c -f /etc/haproxy/haproxy.cfg' $OS_AUTH_URL project=$OS_PROJECT_NAME"; openstack token issue 2>&1 | head -6 ) ( source "$RC"; openstack endpoint list -f value -c "Service Name" -c Interface -c URL 2>&1 | sort ) ``` -GATE: `token issue` returns a SCOPED token; `endpoint list` is IP-only across all +**GATE:** `token issue` returns a SCOPED token; `endpoint list` is IP-only across all services (public on the provider VIP `.5x`, internal+admin on the metal VIP `.8.5x`, keystone admin on `:35357`). Two non-blocking notes for later: s3/swift is registered on the radosgw VIP `.60:443` (re-check vs the radosgw `:80` listener during any Swift/S3 smoke); the gss image-stream is HTTP on metal `10.12.8.172`. ## Step 3.3 -- Horizon access via the external nginx reverse proxy -`# RUN: operator (outside the Juju model) + jumphost` Horizon is fronted by an +Horizon is fronted by an operator-managed nginx reverse proxy. On each rebuild / VIP relocation: (1) repoint the upstream to the CURRENT dashboard provider VIP (now `https://10.12.4.58`, was `.234` pre-R14), and (2) reapply the Horizon Secure-cookie override (DOCFIX-030 / D-044, @@ -188,10 +209,14 @@ As-executed change set (gate every edit -- `sed -i` exits 0 on zero matches, so grep-assert the expected line after any mutation): + +**RUN -- jumphost** ```bash # RUN: jumphost -- ship the vault root CA to the proxy scp ~/vault-init/vault-ca-root.pem jessea123@10.12.4.7:/tmp/ ``` + +**RUN -- operator ON 10.12.4.7** ```bash # RUN: operator ON 10.12.4.7 -- install CA, back up + edit the Horizon vhost, validate, restart. 
sudo install -o root -g root -m 644 /tmp/vault-ca-root.pem /etc/nginx/vault-ca-root.pem && rm -f /tmp/vault-ca-root.pem @@ -205,18 +230,22 @@ sudo nginx -t # GATE: configuration ok sudo systemctl restart nginx # prefer restart over reload for a definitive cutover (a curl ~2s after `reload` can be served by a draining old worker; ~2s blip incl. the co-hosted MAAS proxy) ``` -GATE (on the proxy): `curl -sI http://127.0.0.1:81/horizon/` -> 302 to .../auth/login; no TLS errors in error.log. +**GATE:** (on the proxy): `curl -sI http://127.0.0.1:81/horizon/` -> 302 to .../auth/login; no TLS errors in error.log. ### DOCFIX-030 -- Horizon Secure-cookie override (D-044; PER-REBUILD) The charm renders `CSRF_COOKIE_SECURE`/`SESSION_COOKIE_SECURE = True` (vault:certificates). On the plain-HTTP client leg the browser drops the Secure csrftoken and login fails with "CSRF cookie not set" -- so a clean follow of 3.3 otherwise stalls at the browser login. Drop an ASCII-only post-load override on the dashboard unit, then graceful-reload apache2: + +**RUN -- jumphost** ```bash # RUN: jumphost -- D-044 cookie override on the dashboard unit (ASCII-only; PER-REBUILD) juju ssh -m openstack openstack-dashboard/leader -- "printf 'CSRF_COOKIE_SECURE = False\nSESSION_COOKIE_SECURE = False\n' | sudo tee /usr/share/openstack-dashboard/openstack_dashboard/local/local_settings.d/_99_internal_http_cookies.py >/dev/null && sudo systemctl reload apache2" magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. 
+ + ## Step 4.1 -- Create the external provider network (B29; idempotent) -`# RUN: jumphost` `--external` but NOT `--share` (usable as router gateway + FIP +`--external` but NOT `--share` (usable as router gateway + FIP source, but tenants cannot attach instance ports to the provider segment -- Option B isolation). `--no-dhcp` (MAAS owns DHCP on this segment; FIPs are NAT'd). The subnet is the FULL provider /22 with the FIP pool as the allocation_pool; the VIP block and primaries are MAAS-reserved so neutron never allocates them. Read-only pre-check first (verify the FIP pool is MAAS-reserved so neutron can own it): + +**CHECK (read-only) -- jumphost** ```bash # RUN: jumphost (MAAS profile is 'admin'; never run 'maas list' -- it prints the API key, DOCFIX-016) maas admin ipranges read | jq -r '.[] | select(.type=="reserved") | "\(.start_ip)-\(.end_ip) subnet=\(.subnet.id) [\(.comment)]"' @@ -72,6 +84,8 @@ ``` Create (idempotent `( set -e )`; dynamic gateway; tags applied via `set`, not an inline `--tag` flag): + +**RUN -- jumphost** ```bash source ~/admin-openrc ( set -e @@ -102,7 +116,7 @@ openstack subnet show "$EXT_SUBNET" -f json | jq -c '{name, cidr, gateway_ip, enable_dhcp, allocation_pools, tags}' ) ``` -GATE: `provider-ext` external=true, type=flat, physnet=physnet1, shared=false; +**GATE:** `provider-ext` external=true, type=flat, physnet=physnet1, shared=false; `provider-ext-fip` cidr=10.12.4.0/22, gateway 10.12.4.1, enable_dhcp=false, allocation_pools=[10.12.5.0-10.12.7.254]. 
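The gate above depends on the FIP allocation pool (10.12.5.0-10.12.7.254) staying disjoint from the MAAS-reserved VIP block. A minimal sketch of a range check that could be used to spot-check an address against that pool; the helper names `ip_to_int` and `in_fip_pool` are hypothetical, and the pool bounds are copied from the gate text rather than read from neutron.

```bash
# Hypothetical helpers (illustrative only; not part of the runbook).

# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak to the caller.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed (exit 0) only if the address lies inside the FIP allocation pool
# stated in the 4.1 gate: 10.12.5.0 through 10.12.7.254 inclusive.
in_fip_pool() (
  ip=$(ip_to_int "$1")
  [ "$ip" -ge "$(ip_to_int 10.12.5.0)" ] && [ "$ip" -le "$(ip_to_int 10.12.7.254)" ]
)
```

For example, `in_fip_pool 10.12.4.58 && echo POOL || echo NOT-POOL` reports whether the dashboard provider VIP falls inside the pool. The authoritative values remain those printed by `openstack subnet show` in the block above.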
diff --git a/runbooks/phase-05-octavia-enablement.md b/runbooks/phase-05-octavia-enablement.md index 61d0293..31446e7 100644 --- a/runbooks/phase-05-octavia-enablement.md +++ b/runbooks/phase-05-octavia-enablement.md @@ -40,10 +40,21 @@ --- +## Command-label convention +Every command block below is bracketed by bold labels, so a command line is never mistaken +for surrounding prose (these render in GitBucket and read clearly in a raw editor): +- **RUN -- LOC** -- the block CHANGES state; run it at LOC (e.g. `jumphost`, `vault/0`, `jumphost -> magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. + + ## Step 5.1 -- configure-resources (D-021 Phase 1; control plane + lb-mgmt overlay) -`# RUN: jumphost` Read-only pre-check, then the argument-free action with a bound +Read-only pre-check, then the argument-free action with a bound wait, then authoritative completion via show-operation (NOT the streamed log). +**RUN -- jumphost** ```bash ( { source ~/admin-openrc @@ -59,10 +70,14 @@ Run the action (long-running; juju's default wait may time out but the hook keeps going -- use a bound `--wait` and tee; do NOT re-fire on a wait-timeout -- appendix-A: octavia-configure-resources): + +**RUN -- jumphost** ```bash juju run octavia/leader configure-resources -m openstack --wait=20m 2>&1 | tee ~/octavia-configure-resources.out ``` Authoritative completion + A/B/C verify: + +**CHECK (read-only) -- jumphost** ```bash ( { source ~/admin-openrc @@ -78,13 +93,13 @@ juju exec --unit octavia/0 -m openstack -- 'ip -br addr show o-hm0; sudo ovs-vsctl get Interface o-hm0 external_ids' idempotent seed (base staged in `$HOME`, NOT /tmp -- the openstack snap cannot read /tmp, appendix-A: L7) -> retrofit build -> confirm. 
Fully idempotent (amphora present -> skip to confirm; base present -> retrofit only; @@ -101,6 +116,7 @@ qcow2 base and emits the raw `octavia-amphora` OUTPUT (the config gate's image-format=raw is on the retrofit OUTPUT, not the base). +**RUN -- jumphost** ```bash # Tunables (operator-confirm the first two for your environment): BASE_IMG_URL="https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img" @@ -169,7 +185,7 @@ echo "[OK] amphora present + active + tagged $OTAG (matches octavia amp-image-tag) -- D-021 complete" ) ``` -GATE: an ACTIVE image tagged `octavia-amphora` whose tag matches `octavia amp-image-tag`. +**GATE:** an ACTIVE image tagged `octavia-amphora` whose tag matches `octavia amp-image-tag`. --- diff --git a/runbooks/phase-06-incloud-mgmt-cluster.md b/runbooks/phase-06-incloud-mgmt-cluster.md index 6c5b87c..9563ad7 100644 --- a/runbooks/phase-06-incloud-mgmt-cluster.md +++ b/runbooks/phase-06-incloud-mgmt-cluster.md @@ -49,8 +49,18 @@ --- +## Command-label convention +Every command block below is bracketed by bold labels, so a command line is never mistaken +for surrounding prose (these render in GitBucket and read clearly in a raw editor): +- **RUN -- LOC** -- the block CHANGES state; run it at LOC (e.g. `jumphost`, `vault/0`, `jumphost -> magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. 
+ + ## Step 6.0-BOOT -- Fresh-deploy tenant bootstrap (project, role, flavors, mgmt image) -`# RUN: jumphost` REQUIRED on a fresh deploy: post-teardown the cloud has no +REQUIRED on a fresh deploy: post-teardown the cloud has no tenant projects, NO flavors, and NO images -- this is the substance of the retired do-doc-06 tenant setup, restored after the phase-NN consolidation dropped it (found in the 2026-06-10 pre-redeploy review). Everything is verify-or-create, so @@ -86,6 +96,7 @@ cloud credential (`clusterctl init` takes none; per-cluster creds are magnum-minted at create time per D-039). +**RUN -- jumphost** ```bash ( { set -u @@ -169,14 +180,15 @@ done } ) ``` -GATE: project + role + all five flavors present; `ubuntu-24.04-noble` `active` +**GATE:** project + role + all five flavors present; `ubuntu-24.04-noble` `active` (disk_format `raw` expected with image-conversion on). Do not proceed to 6.0 until this passes. ## Step 6.0 -- Keypair + security group (capi-mgmt project) -`# RUN: jumphost` Safe/idempotent setup -- consolidated. (LIVE-REVIEW: exact +Safe/idempotent setup -- consolidated. (LIVE-REVIEW: exact SG rule syntax is standard openstack-client; confirm on the redeploy test.) +**RUN -- jumphost** ```bash ( { set -u @@ -200,9 +212,10 @@ Expect: `capi-mgmt-key` present; `capi-mgmt-sg` with tcp/22 and tcp/6443 ingress. ## Step 6.1 -- Network, subnet, router (capi-mgmt project) -`# RUN: jumphost` Idempotent network plumbing -- consolidated. DNS nameservers +Idempotent network plumbing -- consolidated. DNS nameservers 1.1.1.1/1.0.0.1 (D-019: public resolvers; image pulls need internet egress). +**RUN -- jumphost** ```bash ( { set -u @@ -228,9 +241,10 @@ Expect: subnet `10.20.0.0/24`; router `ACTIVE` with an external gateway on provider-ext. ## Step 6.2 -- VM + floating IP (MUTATION; not batched with the gate) -`# RUN: jumphost` Creates the VM and pins the management FIP. The FIP is the +Creates the VM and pins the management FIP. 
The FIP is the stable apiserver endpoint for the jumphost AND the Magnum conductor. +**RUN -- jumphost** ```bash ( { set -u @@ -262,8 +276,9 @@ and phase-07 (conductor kubeconfig) uses the same FIP. Do not hardcode either value. ## Step 6.3 -- GATE 1: OS-level egress (before any k8s investment) -`# RUN: mgmt VM` This is the premise of D-035. PROCEED ONLY IF VIP-OK. +This is the premise of D-035. PROCEED ONLY IF VIP-OK. +**RUN -- jumphost -> mgmt VM** ```bash source ~/capi-mgmt-net.env # MGMT_FIP, MGMT_TENANT_IP (written by 6.2) ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ @@ -275,18 +290,19 @@ timeout 6 bash -c 'exec 3<>/dev/tcp/1.1.1.1/443' && echo NET-OK || echo NET-FAIL REOF ``` -GATE: require `VIP-OK`. `NET-FAIL` means sort provider-ext internet egress (or a +**GATE:** require `VIP-OK`. `NET-FAIL` means sort provider-ext internet egress (or a registry mirror) before 6.6. Do NOT build k8s on a VM that fails VIP-OK. (appendix-A: D-035 -- single-NIC removes the dual-homed reverse-path bug.) ## Step 6.4 -- k8s-snap install + bootstrap (MUTATION; secret-free) -`# RUN: mgmt VM` Channel is `1.32-classic/stable` (NOT `1.32/stable` -- that is +Channel is `1.32-classic/stable` (NOT `1.32/stable` -- that is the charm-era track and does not exist for the snap). The bootstrap config MUST carry an explicit `cluster-config` block (appendix-A: DOCFIX-024 -- a config without it disables network+dns and the node never goes Ready). Every `sudo` gets ` mgmt VM** ```bash source ~/capi-mgmt-net.env # MGMT_FIP, MGMT_TENANT_IP (written by 6.2) ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ @@ -327,6 +343,7 @@ The agnhost pod-egress probe is the exact test the dual-homed D-033 node and the old k3s node FAILED. On this single-NIC VM it must `Completed`. +**RUN -- jumphost -> mgmt VM** ```bash # RUN: jumphost (ssh to the mgmt VM; the kubeconfig lands on the jumphost). 
server = the FIP, not tenant IP source ~/capi-mgmt-net.env # MGMT_FIP @@ -337,6 +354,7 @@ wc -l ~/capi-mgmt.kubeconfig ; head -1 ~/capi-mgmt.kubeconfig # expect >0 lines, "apiVersion: v1" ``` +**RUN -- jumphost** ```bash # RUN: jumphost -- node check + the hard gate ( { @@ -351,17 +369,18 @@ kubectl get pod egress-test -o jsonpath='{.status.phase} {.status.containerStatuses[0].state}{"\n"}' } ) ``` -GATE: require the probe pod `Completed` / `exitCode 0` (empty logs = clean TCP +**GATE:** require the probe pod `Completed` / `exitCode 0` (empty logs = clean TCP connect). That proves pod -> Cilium -> ens3 -> OVN -> router SNAT egress works. Then clean up the throwaway pod: +**RUN -- jumphost** ```bash # RUN: jumphost KUBECONFIG="$HOME/capi-mgmt.kubeconfig" kubectl delete pod egress-test --now ``` ## Step 6.6 -- CAPI provider stack (pinned to dependencies.json; D-034) -`# RUN: mgmt VM` Run VM-side as root with `KUBECONFIG=/root/kubeconfig` (local +Run VM-side as root with `KUBECONFIG=/root/kubeconfig` (local apiserver = the VM's tenant IP:6443) so the matched 1.32.13 kubectl is used -- avoids the jumphost kubectl's +3-minor skew. Versions are READ from the tag's dependencies.json, never hardcoded (D-034). The as-built pins are in the @@ -375,10 +394,12 @@ runbook corrects the order.) ### 6.6a -- tooling + pins (install helm/clusterctl/kubectl VM-side; read dependencies.json @ 0.25.1) -`# RUN: jumphost` Installs the CAPI tooling on the mgmt VM at the dependencies.json +Installs the CAPI tooling on the mgmt VM at the dependencies.json pins and writes `~/capi-pins.env` (sourced by 6.6b-6.6f). kubectl is pinned to the cluster's 1.32.13 (no apiserver skew). The `SSH_OPTS`/`MGMT_VM` vars set here are reused by 6.6b-6.6f (same jumphost shell). 
+ +**RUN -- jumphost -> mgmt VM** ```bash # define the mgmt-VM connection once (reused by 6.6b-6.6f) source ~/capi-mgmt-net.env # MGMT_FIP, MGMT_TENANT_IP (written by 6.2) @@ -424,7 +445,8 @@ ``` ### 6.6b -- cert-manager (DOCFIX-025a: crds.enabled=true, NOT installCRDs) -`# RUN: jumphost` + +**RUN -- jumphost -> mgmt VM** ```bash ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF' set -euo pipefail @@ -440,8 +462,10 @@ ``` ### 6.6c -- ORC (BEFORE clusterctl init; CAPO hard-depends on the ORC Image CRD) -`# RUN: jumphost` server-side apply (large CRDs). Manifest is the k-orc release +server-side apply (large CRDs). Manifest is the k-orc release `install.yaml` (D-034). + +**RUN -- jumphost -> mgmt VM** ```bash ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF' set -euo pipefail @@ -454,7 +478,9 @@ ``` ### 6.6d -- clusterctl init (core + kubeadm bootstrap/control-plane + CAPO) -`# RUN: jumphost` cert-manager already present -> clusterctl detects and skips it. +cert-manager already present -> clusterctl detects and skips it. 
+ +**RUN -- jumphost -> mgmt VM** ```bash ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF' set -euo pipefail @@ -471,7 +497,8 @@ ``` ### 6.6e -- CAAPH + janitor (azimuth helm charts; chart names from each repo Chart.yaml) -`# RUN: jumphost` + +**RUN -- jumphost -> mgmt VM** ```bash ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF' set -euo pipefail @@ -489,7 +516,8 @@ ``` ### 6.6f -- verify the stack -`# RUN: jumphost` + +**RUN -- jumphost -> mgmt VM** ```bash ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF' set -euo pipefail diff --git a/runbooks/phase-07-conductor-graft.md b/runbooks/phase-07-conductor-graft.md index e9e8355..124c452 100644 --- a/runbooks/phase-07-conductor-graft.md +++ b/runbooks/phase-07-conductor-graft.md @@ -50,8 +50,18 @@ --- +## Command-label convention +Every command block below is bracketed by bold labels, so a command line is never mistaken +for surrounding prose (these render in GitBucket and read clearly in a raw editor): +- **RUN -- LOC** -- the block CHANGES state; run it at LOC (e.g. `jumphost`, `vault/0`, `jumphost -> magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. + + ## Step 7.0 -- Magnum trustee domain-setup (D-046; REQUIRED on every (re)deploy) -`# RUN: jumphost` The magnum charm action `domain-setup` is MANUAL and idempotent; magnum +The magnum charm action `domain-setup` is MANUAL and idempotent; magnum reports active/"Unit is ready" REGARDLESS of whether the trustee domain exists. If the keystone domain `magnum` + user `magnum_domain_admin` (referenced by magnum.conf `[trust]`) are absent, `magnum/common/policy.py` 401s on EVERY policy-enforced request -> every `coe` op 403s (the @@ -61,11 +71,15 @@ trustee_domain_id is recomputed per request). 
Step A -- create the trustee domain (charm-native; idempotent; takes no parameters): + +**RUN -- jumphost** ```bash juju run magnum/leader domain-setup mgmt apiserver reachability: + +**CHECK (read-only) -- jumphost -> magnum/0** ```bash # RUN: jumphost -> magnum/0 (FIP from phase-06's ~/capi-mgmt-net.env -- never hardcode; DOCFIX-038) source ~/capi-mgmt-net.env # MGMT_FIP juju ssh -m openstack magnum/0 \ "timeout 6 bash -c 'exec 3<>/dev/tcp/$MGMT_FIP/6443' && echo TCP-OK || echo TCP-FAIL" magnum/0` The source `~/capi-mgmt.kubeconfig` already has its +The source `~/capi-mgmt.kubeconfig` already has its server rewritten to the FIP (phase-06 6.5). Transfer base64-piped straight into a root-written 0600 file owned by the conductor user -- never stage the admin kubeconfig in /tmp (appendix-A: L-P6-4). +**RUN -- jumphost -> magnum/0** ```bash # discover the conductor service user (expect: magnum) juju ssh -m openstack magnum/0 'systemctl show magnum-conductor -p User --value' magnum/0** ```bash juju ssh -m openstack magnum/0 \ 'sudo -u magnum env HOME=/tmp helm --kubeconfig /etc/magnum/kubeconfig list -A' ,{}).get("api_version", )`, @@ -164,6 +186,8 @@ override needed. Sanity-confirm v1beta1 is served per group before installing: + +**RUN -- jumphost** ```bash ( { export KUBECONFIG="$HOME/capi-mgmt.kubeconfig" @@ -179,9 +203,10 @@ ``` ## Step 7.4 -- Install the driver (1.4.0) + helm in the conductor container -`# RUN: jumphost -> magnum/0` `--no-deps` preserves the deb-managed oslo stack (no +`--no-deps` preserves the deb-managed oslo stack (no PEP668 issue on the 22.04 container). +**RUN -- jumphost -> magnum/0** ```bash # egress pre-check juju ssh -m openstack magnum/0 \ @@ -240,13 +265,14 @@ `api_resources = {"Cluster": {"api_version": "cluster.x-k8s.io/v1beta2"}}`. ## Step 7.6 -- Stage the [capi_helm] conf.d drop-in (D-037) -`# RUN: jumphost -> magnum/0` 0644 root, NO secrets (it points at the 0600 +0644 root, NO secrets (it points at the 0600 kubeconfig). 
The `default_helm_chart_version = 0.25.1` line is LOAD-BEARING (driver built-in default is `0.10.1`, the retired v1alpha6-era chart). `api_resources` is set to an explicit empty map `{}` (Step 7.5 -- the driver's code falls back to v1beta1 for every CAPI kind, which this cluster serves; explicit `{}` avoids the dict-default `json.loads` question). ASCII only. +**RUN -- jumphost -> magnum/0** ```bash juju ssh -m openstack magnum/0 "sudo tee /etc/magnum/magnum.conf.d/00-capi-helm.conf >/dev/null <<'CONF' [capi_helm] @@ -263,18 +289,21 @@ # api_resources = {"Cluster": {"api_version": "cluster.x-k8s.io/v1beta2"}, ...} ``` Re-check ASCII cleanliness: + +**CHECK (read-only) -- jumphost -> magnum/0** ```bash juju ssh -m openstack magnum/0 \ 'LC_ALL=C grep -nP "[^\x00-\x7F]" /etc/magnum/magnum.conf.d/00-capi-helm.conf && echo NON-ASCII || echo "ASCII clean"' magnum/0` These OpenStack debs run the daemon through an LSB +These OpenStack debs run the daemon through an LSB init script wrapped by systemd `systemd-start`; a systemd `ExecStart` drop-in is INERT (appendix-A: D-037, L-P6-1/L-P6-2). The sanctioned extension point is `/etc/default/magnum-conductor`, sourced inside the init script AFTER the base `--config-file` is assembled. The charm does not manage that file. +**RUN -- jumphost -> magnum/0** ```bash # confirm the daemon currently has NO --config-dir (the problem we are fixing) juju ssh -m openstack magnum/0 'ps -ww -C magnum-conductor -o args=' magnum/0` The charm renders `auth_version = v2.0` in magnum.conf +The charm renders `auth_version = v2.0` in magnum.conf `[keystone_authtoken]`/`[keystone_auth]` (a template type-compare bug; Caracal keystone does not serve v2.0). 
On THIS deploy it is COSMETIC -- magnum's domain_admin_auth rewrites v2.0->v3 and token validation worked throughout -- but v2.0 is the provably wrong value, so override it @@ -302,6 +331,8 @@ Step 7.7 wired `--config-dir` only for the conductor, and oslo.config reads `--config-dir` AFTER `--config-file`, so the drop-in wins. v3 URLs are DERIVED from the live `[keystone_authtoken]` (no hardcoded VIPs). No restart here -- Step 7.8 restarts both services. + +**RUN -- jumphost -> magnum/0** ```bash juju ssh -m openstack magnum/0 sudo bash -s <<'REOF' set -e @@ -320,14 +351,15 @@ echo "[OK] 50-keystone-v3-override.conf:"; cat /etc/magnum/magnum.conf.d/50-keystone-v3-override.conf REOF ``` -GATE: the drop-in lists `auth_version = v3` + `/v3` URLs in BOTH sections, and +**GATE:** the drop-in lists `auth_version = v3` + `/v3` URLs in BOTH sections, and `grep -- --config-dir /etc/default/magnum-api` returns the line. The effective value is proven in Step 7.8 by the magnum-api launched cmdline carrying `--config-dir` (L-P6-1/2: gate on the assembled cmdline, not the file text). Restart happens in Step 7.8. ## Step 7.8 -- Restart conductor + verify driver + HEALTHY (P6e + D-042 Stage 6) -`# RUN: jumphost -> magnum/0`, then jumphost health poll. +Restart on magnum/0, then a jumphost-side health poll. +**RUN -- jumphost -> magnum/0** ```bash juju ssh -m openstack magnum/0 \ 'sudo systemctl restart magnum-conductor magnum-api && sleep 3 && \ @@ -348,6 +380,8 @@ (`capi-test-1` reaching `health_status = HEALTHY`). The poll below applies when grafting onto a cloud that already has a CAPI-driver cluster: substitute that cluster's name and the current `ENV(project)` id (both are run-specific). 
+ +**RUN -- jumphost** ```bash ( { source ~/admin-openrc @@ -361,18 +395,20 @@ done } ) ``` -GATE (existing-cluster graft only): `health_status -> HEALTHY`, with the +**GATE:** (existing-cluster graft only): `health_status -> HEALTHY`, with the `infrastructure` sub-check now `Ready` (it was the only failing axis under 1.3.0). On a FRESH DEPLOY this gate is deferred to phase-08 step 8.2 -- do not block here. If it does not clear on an existing-cluster graft, go to Rollback. ## Step 7.9 -- Regression check (confirm create/manage path intact) -`# RUN: jumphost` (capi-mgmt scope). Prove the upgraded driver still creates+deletes. +(capi-mgmt scope). Prove the upgraded driver still creates+deletes. FRESH DEPLOY ROUTING: SKIP this step -- the `capi-k8s-v1-34` template does not exist yet (phase-08 step 8.0 creates it), and phase-08 itself (create `capi-test-1` to CREATE_COMPLETE, full acceptance, then 8.5 delete) is a superset of this check. Run 7.9 as written only when grafting onto an existing cloud where the template is present. + +**RUN -- jumphost** ```bash openstack coe cluster create capi-fix-check --cluster-template capi-k8s-v1-34 \ --keypair capi-mgmt-key --master-count 1 --node-count 1 @@ -381,11 +417,13 @@ ``` ## Rollback (TEMPORARY holding state only -- if 7.8 health does not clear or 7.9 regresses) -`# RUN: jumphost -> magnum/0` Reverts to the as-first-built functional +Reverts to the as-first-built functional (cosmetic-UNHEALTHY) state on 1.3.0 -- a TEMPORARY holding state to keep the conductor serving while the 1.4.0 issue is diagnosed, NOT a v1 end state. v1 is NOT complete until `magnum-capi-helm==1.4.0` is installed and `health_status = HEALTHY` (D-011). Re-attempt 7.3-7.9 after diagnosis. 
+ +**RUN -- jumphost -> magnum/0** ```bash juju ssh -m openstack magnum/0 'sudo python3 -m pip install --no-deps --force-reinstall "magnum-capi-helm==1.3.0"' admin (the admin-openrc project) @@ -68,12 +72,23 @@ --- +## Command-label convention +Every command block below is bracketed by bold labels, so a command line is never mistaken +for surrounding prose (these render in GitBucket and read clearly in a raw editor): +- **RUN -- LOC** -- the block CHANGES state; run it at LOC (e.g. `jumphost`, `vault/0`, `jumphost -> magnum/0`). +- **CHECK (read-only) -- LOC** -- a read-only verification; safe to re-run. +- **GATE:** -- a hard stop; do NOT proceed past the block unless the stated condition holds. +- **Expect:** -- what a passing result looks like. +- `> CAUTION:` -- marks a destructive, secret-handling, or irreversible step. + + ## Step 8.0 -- Verify prerequisites; create the template if absent -`# RUN: jumphost` (capi-mgmt scope). Read-only checks consolidated; template create +(capi-mgmt scope). Read-only checks consolidated; template create gated separately. (NOTE: template + image are tenant-setup artifacts; on a fully fresh build they may be produced by the magnum-setup step -- this phase verifies/creates the template for self-containment.) +**RUN -- jumphost** ```bash ( { set -u @@ -97,6 +112,8 @@ HAS `image create --import` = glance-direct and image-conversion lands it `raw`; it does NOT have standalone `image stage`/`image import` subcommands, and the standalone `glance` client is not assumed present): + +**RUN -- jumphost** ```bash ( { set -u @@ -135,12 +152,14 @@ done } ) ``` -GATE: image `active` and the 8.0 property check above passes (kube_version v1.34.8 / +**GATE:** image `active` and the 8.0 property check above passes (kube_version v1.34.8 / os_distro ubuntu). Then create the template only if absent. DOCFIX-032: pin `--network-driver calico` EXPLICITLY. 
Under the 1.4.0 driver `--network-driver` maps to the chart `network_driver`, and chart 0.25.1 ships ONLY Calico (flannel is not packaged) -- an explicit `calico` documents intent and removes reliance on the default staying Calico. Do NOT set `flannel`: it is unsupported by chart 0.25.1 and would fail to converge. + +**RUN -- jumphost** ```bash openstack coe cluster template create capi-k8s-v1-34 \ --coe kubernetes --server-type vm \ @@ -155,10 +174,11 @@ ``` ## Step 8.1 -- Create the workload cluster (MUTATION) -`# RUN: jumphost` (capi-mgmt scope). 1 control-plane + 2 workers, matching the +(capi-mgmt scope). 1 control-plane + 2 workers, matching the as-built capi-test-1. The driver auto-mints the app-cred (D-039) and always provisions an Octavia LB (+FIP) for the API. +**RUN -- jumphost** ```bash openstack coe cluster create capi-test-1 \ --cluster-template capi-k8s-v1-34 \ @@ -168,7 +188,9 @@ ``` ## Step 8.2 -- Watch to CREATE_COMPLETE; capture the LB/FIP -`# RUN: jumphost` (capi-mgmt scope). Poll; capture run-specific LB id + FIP. +(capi-mgmt scope). Poll; capture run-specific LB id + FIP. + +**CHECK (read-only) -- jumphost** ```bash ( { for i in $(seq 1 40); do @@ -181,12 +203,14 @@ openstack coe cluster show capi-test-1 -f value -c api_address -c master_count -c node_count -c health_status } ) ``` -GATE: `status = CREATE_COMPLETE`. Record `api_address` (the FIP endpoint, e.g. +**GATE:** `status = CREATE_COMPLETE`. Record `api_address` (the FIP endpoint, e.g. https://10.12.7.180:6443) for 8.3. If `CREATE_FAILED`, see appendix-A (stuck-delete / app-cred 403 / OOM). With phase-07's driver, `health_status` should read HEALTHY. ## Step 8.3 -- Retrieve the workload kubeconfig; verify nodes / CNI / addons -`# RUN: jumphost`. Pull the cluster's kubeconfig via Magnum, then inspect. +Pull the cluster's kubeconfig via Magnum, then inspect.
+ +**RUN -- jumphost** ```bash # capi-mgmt scope mkdir -p ~/capi-test-1 # DOCFIX-037: `coe cluster config --dir` does NOT create the dir @@ -209,7 +233,7 @@ kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded } ) ``` -GATE: 3 nodes Ready; Calico pods Running; CCM Running (NOT crash-looping -- this is +**GATE:** 3 nodes Ready; Calico pods Running; CCM Running (NOT crash-looping -- this is D-011 item 5); Cinder CSI + CoreDNS Running; no stuck pods. ================================================================================ @@ -243,6 +267,8 @@ a THROWAWAY Kubernetes `Service type=LoadBalancer` on the workload cluster: the OpenStack CCM provisions an Octavia LB + pool + members for it automatically (the Roosevelt-real path -- tenant workloads get LBs exactly this way), then tear it down. `# RUN: jumphost, KUBECONFIG=~/capi-test-1/config` + +**RUN -- jumphost** ```bash export KUBECONFIG=~/capi-test-1/config kubectl create deploy rr --image=registry.k8s.io/e2e-test-images/agnhost:2.40 --replicas=2 -- /agnhost netexec --http-port=8080 @@ -280,13 +306,17 @@ returns (v2). NOT required for v1 acceptance. ## Step 8.5 -- (Optional) Clean delete verification -`# RUN: jumphost` (capi-mgmt scope). Confirms the manage/teardown path. +(capi-mgmt scope). Confirms the manage/teardown path. + +**RUN -- jumphost** ```bash openstack coe cluster delete capi-test-1 # watch coe cluster list to gone ``` If a delete WEDGES (DELETE_IN_PROGRESS, CRs stuck Deleting on an Octavia 403 from a frozen app-cred): clear the OpenStackCluster finalizer (the Cluster auto-follows), then manual neutron cleanup in dependency order -- appendix-A: stuck-delete. + +**RUN -- jumphost** ```bash # NS=magnum-$(openstack project show capi-mgmt --domain capi -f value -c id) # resolve; never hardcode # KUBECONFIG=~/capi-mgmt.kubeconfig kubectl -n "$NS" patch openstackcluster - \