diff --git a/README.md b/README.md index 465cb4e..9578877 100644 --- a/README.md +++ b/README.md @@ -78,7 +78,7 @@ | phase-04 | Network carve (provider external network + IPAM reference) | | phase-05 | Octavia enablement | | phase-06 | In-cloud CAPI management cluster (D-035) | -| phase-07 | Magnum conductor graft (magnum-capi-helm driver; D-031/D-037/D-042) | +| phase-07 | Magnum conductor graft (magnum-capi-helm driver; trustee domain-setup; D-031/D-037/D-042/D-046/D-047) | | phase-08 | Workload-cluster acceptance (D-011) | NetBox imports run separately, gated on external NetBox-engineer review diff --git a/bundle.yaml b/bundle.yaml index bdeb77b..a253cf9 100644 --- a/bundle.yaml +++ b/bundle.yaml @@ -21,7 +21,7 @@ # (10.12.16.0/22), ceph-osd cluster->replication (10.12.20.0/22). Bindings, NOT # ceph-*-network config, so the LXD-contained mon actually gets a storage NIC. # Clients bind ceph->storage; container principals carry it too (subset rule). (C2) -# Magnum: Layer A only -- CAPI driver graft is Layer B (runbooks/04a + 05) +# Magnum: Layer A only -- CAPI driver graft is Layer B (runbooks/phase-06..08) # Octavia: lb-mgmt PKI options supplied via overlays/octavia-pki.yaml (gitignored). # Amphora-pipeline options baked (use-internal-endpoints etc.). (B4) # OVN tunnels: geneve overlay on the DATA space (10.12.12.0/22) -- ovn-chassis + ovn-chassis-octavia @@ -466,12 +466,14 @@ # Kubernetes-as-a-Service: Magnum (Layer A -- CAPI graft is Layer B) # ===================================================================== # NOTE: After bundle deploys, magnum/0 will show active/idle but CANNOT create K8s clusters. - # Layer B (post-deploy) brings it to life: - # 1. capi-mgmt VM with k3s + CAPI operators (runbook 04a) - # 2. pip install magnum-capi-helm==1.1.0 into magnum venv (runbook 05) - # 3. /etc/magnum/magnum.conf.d/99-capi.conf with enabled_drivers - # 4. Install kubeconfig at /etc/magnum/kubeconfig - # 5. 
Create Keystone capi-mgmt project + capo user + app credential + # Layer B (post-deploy) brings it to life -- see runbooks/phase-06..08: + # 1. In-cloud single-homed mgmt VM (capi-mgmt-v2) with k8s-snap + CAPI/CAPO (phase-06; D-035) + # 2. magnum-capi-helm==1.4.0 grafted onto the conductor (phase-07; D-037/D-042) + # 3. /etc/magnum/magnum.conf.d/00-capi-helm.conf (driver) + 50-keystone-v3-override.conf, + # both read via --config-dir wired into /etc/default/magnum-{conductor,api} (D-037/D-047) + # 4. kubeconfig at /etc/magnum/kubeconfig (server = the mgmt FIP) (phase-07) + # 5. magnum trustee domain-setup (REQUIRED; D-046); per-cluster app-creds are + # minted by magnum at cluster-create -- NO static capo user/app-cred (D-039) magnum: charm: magnum @@ -647,7 +649,7 @@ - [barbican-vault:secrets-storage, vault:secrets] - [barbican:ha, barbican-hacluster:ha] - # ---- Magnum (Layer A only; CAPI graft is Layer B/runbook 05) + # ---- Magnum (Layer A only; CAPI graft is Layer B/runbooks phase-06..08) - [magnum-mysql-router:db-router, mysql-innodb-cluster:db-router] - [magnum:shared-db, magnum-mysql-router:shared-db] - [magnum:identity-service, keystone:identity-service] diff --git a/docs/design-decisions.md b/docs/design-decisions.md index 5cef1c9..ad98255 100644 --- a/docs/design-decisions.md +++ b/docs/design-decisions.md @@ -121,7 +121,7 @@ --- -## D-007: Magnum inclusion +## D-007: Magnum inclusion (Layer A current; Layer B mechanism/topology superseded -- see D-035 / D-037 / D-042) **Decision:** Magnum in bundle from day one. Two-layer install. @@ -146,7 +146,7 @@ **CAPI mgmt plane:** Post-pivot, the workload cluster IS the CAPI management plane (per **runbook 04a §17**, `clusterctl move` pivots cluster state from the `capi-mgmt.maas` bootstrap k3s into the workload cluster, which becomes self-managing). Per **D-017**, both the bootstrap k3s and the workload cluster are rebuilt from scratch every deployment cycle — there is no preserved-across-rebuild artifact. 
The bootstrap install + pivot procedure lives in `runbooks/04a-capi-bootstrap-cluster.md` and runs **before** this runbook. This pattern transfers to Roosevelt unchanged. -**Superseded portions:** The "preserved across rebuild" stance in earlier drafts of this decision is **superseded by D-017**. See D-017 for rationale. The earlier `stackhpc/magnum-capi-helm` v0.13.0 driver pin is superseded by the `openstack/magnum-capi-helm` 1.1.0 pin above (driver source repo moved + archived). +**Superseded portions:** The "preserved across rebuild" stance in earlier drafts of this decision is **superseded by D-017**. See D-017 for rationale. The earlier `stackhpc/magnum-capi-helm` v0.13.0 driver pin is superseded by the `openstack/magnum-capi-helm` 1.1.0 pin above (driver source repo moved + archived). The Layer B *mechanism and topology* are now further superseded: the CAPI management plane is an in-cloud single-homed VM with NO `clusterctl move` -- the kubeconfig points at that mgmt cluster, not a post-pivot workload cluster (**D-035**); the conductor graft is `/etc/default/magnum-conductor` `DAEMON_ARGS --config-dir` + `00-capi-helm.conf`, NOT a systemd-ExecStart override + `99-capi.conf` (**D-037**); the driver pin is **1.4.0** for CAPI-core contract coherence (**D-042**); and the per-cluster app-cred replaces any static `capo` credential (**D-039**). Layer A (the bundle) is current. Live deploy steps are runbooks/phase-06 (mgmt cluster) and phase-07 (conductor graft + `domain-setup` D-046 + keystone-v3 drop-in D-047); `runbooks/05-magnum-capi-driver.md` is historical. --- @@ -312,7 +312,7 @@ --- -## D-017: CAPI bootstrap cluster lifecycle +## D-017: CAPI bootstrap cluster lifecycle (bootstrap mechanism superseded by D-035; full-rebuild principle retained) **Decision:** L3 full teardown and rebuild every deployment cycle. 
The `capi-mgmt.maas` MAAS VM is released back to Ready state on teardown; on rebuild, it is re-deployed from scratch with Ubuntu 24.04, k3s, CAPI controllers, and ORC. **Nothing is preserved across cycles.** @@ -327,6 +327,8 @@ **Supersedes:** the "preserved across rebuild" stance in earlier drafts of D-007 and D-013. +**Superseded by:** **D-035** retired the bootstrap MECHANISM -- there is no `capi-mgmt.maas` MAAS VM, no k3s, and no `clusterctl move`/pivot; the management plane is an in-cloud single-homed tenant VM (`capi-mgmt-v2`, k8s-snap) built in runbooks/phase-06. The full-teardown-and-rebuild-every-cycle PRINCIPLE stated here is RETAINED (now realized by phase-00 teardown + the phase-06 rebuild). `runbooks/04a-capi-bootstrap-cluster.md` and `05-magnum-capi-driver.md` are historical (folded into phase-06/07). + **Alternatives considered:** - L1: Wipe just the cluster CRs, keep k3s + controllers. Rejected: skips the install rehearsal that's the whole point. @@ -473,11 +475,11 @@ ## D-035: Management-cluster placement -- in-cloud single-homed tenant VM -**Decision:** run the CAPI management cluster as a single-homed in-cloud tenant VM (`capi-mgmt-v2`): one NIC on the management tenant subnet (10.20.0.0/24), reached via a floating IP (10.12.7.40); k8s-snap (channel `1.32-classic/stable`), Cilium CNI; not CAPI-self-managed (no `clusterctl move`). +**Decision:** run the CAPI management cluster as a single-homed in-cloud tenant VM (`capi-mgmt-v2`): one NIC on the management tenant subnet (10.20.0.0/24), reached via a floating IP (per-rebuild -- this rebuild 10.12.5.103; the original 10.12.7.40 is dead -- DOCFIX-038); k8s-snap (channel `1.32-classic/stable`), Cilium CNI; not CAPI-self-managed (no `clusterctl move`). 
**Rationale:** D-033's out-of-cloud node was necessarily dual-homed and its pod egress to the OpenStack API VIPs failed -- the Cilium reverse-NAT reply was emitted back out the second NIC instead of being redirected into the pod via `cilium_host` (a multi-NIC reverse-path fault; the `k8s` charm exposes too few Cilium annotations to repair it). A single-homed VM removes the second NIC and the fault entirely. The single-NIC pod-egress premise was then proven by the Phase 4 hard gate (an agnhost pod TCP probe to the Keystone VIP 10.12.4.50:5000 returning exitCode 0). -**Status:** Adopted 2026-06-08; pod-egress premise validated. **Supersedes:** D-033 (revisits D-030 in simpler form). **Unaffected:** D-031, D-034. +**Status:** Adopted 2026-06-08; pod-egress premise validated. **Supersedes:** D-033 (revisits D-030 in simpler form); also retires the Layer B topology of **D-007** and the bootstrap mechanism of **D-017** (k3s-on-MAAS + `clusterctl move` -> in-cloud single-homed VM, no pivot). **Unaffected:** D-031, D-034. **Trade-off:** a single-node management cluster is a SPOF with no self-heal -- see D-041 (manual-start policy) and D-040 (the OOM that surfaced it). @@ -505,6 +507,20 @@ --- +## D-039: Magnum mints per-cluster app-creds carrying the trustor's roles (grant load-balancer_member) + +**Status:** ACCEPTED 2026-06-09 (applied in phase-06; asserted in phase-08 prereqs). Cited by phase-06 (DOCFIX-036 grant), phase-08, and appendix-A; recorded here to close the dangling reference. + +**Context:** the magnum-capi-helm service path uses NO static, pre-provisioned application credential. At cluster-create magnum mints a per-cluster Keystone application credential from a trust, and that app-cred carries the TRUSTOR's roles FROZEN at mint time, delegated unfiltered. The trustor is the identity that creates the cluster in the capi-mgmt project (admin@admin_domain in v1). 
+ +**Decision:** the trustor must hold `load-balancer_member` (plus `member` and `reader`) on the capi-mgmt project BEFORE any cluster is created, so every minted app-cred carries Octavia authority. A trustor holding only `member` mints a frozen app-cred that 403s when CAPO queries Octavia to confirm LB state -- the workload cluster then wedges at API-LB provisioning, and a stuck-delete 403s the same way (appendix-A). The grant is idempotent (member + load-balancer_member + reader) and applied in phase-06. + +**Roosevelt implication:** whichever identity creates Magnum/CAPI clusters must carry `load-balancer_member` + `reader` on the cluster project; `member`-only is a latent 403. There is no static `capo` user/app-cred to provision -- that pattern was retired with the per-cluster mint. + +**Related:** D-031 / D-036 (driver/engine surface), D-046 (the trustee domain the trust resolves against), appendix-A (D-039 + stuck-delete recovery). + +--- + ## D-040: Raise nova-compute reserved-host-memory on the hyperconverged hosts **Decision:** set `nova-compute reserved-host-memory` to 8192 MB (from the default 512) so Nova placement accounts for the non-Nova memory co-located on each hyperconverged host. Charm config -> survives redeploy. @@ -573,6 +589,8 @@ | 2026-06-08 | D-034 (CAPI constellation pinned to dependencies.json; supersedes D-022), D-035 (in-cloud single-homed mgmt VM; supersedes D-033), D-036 (driver/chart/CAPO coherence resolved), D-037 ([capi_helm] via /etc/default DAEMON_ARGS) added. | In-cloud mgmt pivot | | 2026-06-09 | D-040 (reserved-host-memory 8192), D-041 (non-HA manual-start policy), D-042 (driver<->core contract coherence; 1.4.0 pin) added. | OOM incident + driver fix | | 2026-06-09 | D-019..D-042 consolidated into this document (15 decisions). Existing D-001..D-018 left intact (em-dash style preserved); the new entries are ASCII. 
| Repo sanitation / doc refresh | +| 2026-06-17 | D-044 (Horizon secure-cookie override) + D-045 (haproxy confirmed-LOADED) folded from the changes-doc; D-046 (magnum trustee-domain setup) + D-047 (keystone v2.0 render bug / v3 drop-in) merged and renumbered from the 06-17 addendum (its "D-044/D-045"); D-048 (stage-and-verify canonical image seed, supersedes web-download) + D-049 (D1: kube v1.34.8 / capi-k8s-v1-34) added; D-042 amended (FINDING-5: UNHEALTHY is a <=1.3.0 false-negative, HEALTHY on 1.4.0); D-050 (PROPOSED: keystone policyd-override) recorded. | End-of-deploy runbook sweep | +| 2026-06-18 | D-039 (Magnum per-cluster app-cred roles; grant load-balancer_member) recorded to close a dangling reference (previously cited only in phase-06/08, bundle, appendix-A). D-007 + D-017 annotated as superseded by D-035 / D-037 / D-042 (in-cloud mgmt VM; /etc/default config-dir graft; 1.4.0 driver) -- historical bodies retained. | Pre-commit audit (runbook sweep) | @@ -626,3 +644,95 @@ Note: the restart procedure's failure-mode table already references the config key for SHUTOFF guests; whichever option is chosen, align that table, this decision, and the bundle/runbook with each other. + + + +--- + +## D-044: Horizon Secure-cookie override for internal-HTTP dashboard access (DOCFIX-030) + +**Status:** Adopted 2026-06-17 (PER-REBUILD; phase-03 Step 3.3). Resolves the mislabeled "D-043" tag used for this item in earlier phase-03/changes-doc drafts -- D-043 is the tenant-VM auto-resume decision. + +**Decision:** the jumphost reaches Horizon over a plain-HTTP reverse-proxy leg, but the dashboard sets `SESSION_COOKIE_SECURE`/`CSRF_COOKIE_SECURE=True`, so the browser drops the session/CSRF cookies over HTTP and login fails. Apply a Django settings override on the openstack-dashboard leader (`_99_internal_http_cookies.py`, setting `SESSION_COOKIE_SECURE=False` + `CSRF_COOKIE_SECURE=False`) to allow cookie flow over the internal HTTP leg. 
A TESTCLOUD accommodation of the no-DNS / no-FQDN-cert posture. + +**Trade-off / Roosevelt:** disabling Secure cookies is acceptable only because the proxy leg is internal and the cloud has no public DNS / FQDN-valid cert. The Roosevelt root-fix is cloud DNS + FQDN-valid certs end-to-end (which also fixes gss and the nginx proxy_ssl_name handling); then this override is removed. Self-signed-client-TLS approaches are NOT part of v1. + +**Related:** DOCFIX-030 (phase-03 Step 3.3), D-043 (distinct -- auto-resume). + +## D-045: Charm-rendered haproxy config must be confirmed LOADED, not just rendered (DOCFIX-031) + +**Status:** Adopted + APPLIED 2026-06-11; re-verified 2026-06-16 post phase-05. + +**Decision:** after the vault/TLS cert cascade settles, confirm every unit's haproxy is actually checking its backends over the freshly-rendered SSL config -- by a functional probe of haproxy's OWN backend state (admin-socket `show stat`, grep `,DOWN,`), NOT by `juju status`. Reload (not restart) any unit whose running haproxy predates the `check-ssl` re-render. + +**Root cause (confirmed, not refined to a check-config defect):** nova-cloud-controller haproxy was not reloaded after the cert cascade, so its health checks ran plaintext against the now-SSL backend port and marked the nova-api backends DOWN -- while juju stayed active/idle (juju is BLIND to per-backend haproxy state). An 8s wire capture showed the checks switch to TLS after reload; both backends returned UP/L7OK/200. The reload is a real fix, not a band-aid. + +**Status note:** the surfaced symptom was nova-api EOF / 503 behind a green juju. phase-03 Step 3.1 now gates a zero-DOWN sweep cloud-wide. + +**Related:** DOCFIX-031 (phase-03 Step 3.1; appendix-A), D-046 / D-047 (the separate magnum-keystone incident). + +## D-046: Magnum trustee-domain setup is a REQUIRED, asserted post-deploy step + +**Status:** ACCEPTED 2026-06-17. 
Recorded in design-decisions-addendum-20260617.md as "D-044"; renumbered to D-046 here per the 06-17 reconciliation (D-044/D-045 were taken by the Horizon/haproxy decisions). Matches the rootcause doc + phase-08 handoff prompt. + +**Context:** all `openstack coe ...` ops returned 403 ("Keystone client authentication failed"); magnum-api.log showed keystoneauth1 401 on every request since 2026-06-16. Root cause: the keystone domain `magnum` and user `magnum_domain_admin` that magnum.conf `[trust]` references did not exist. `magnum/common/policy.py:130` resolves `trustee_domain_id` on EVERY policy-enforced request (driver-agnostic), authenticating as the trustee domain admin; with the domain/user absent that is a hard 401 -> every coe op 403. + +**Cause:** the magnum charm action `domain-setup` is MANUAL, not automatic; magnum reports active / "Unit is ready" regardless of whether it has run. The 2026-06-11 teardown/redeploy rebuilt keystone with fresh domains but the runbook did not re-run `domain-setup`. + +**Decision:** `domain-setup` is a REQUIRED, ASSERTED post-deploy step on every (re)deploy, after the magnum + identity-service relation is up and BEFORE magnum is declared functional / before phase-08: (1) `juju run magnum/leader domain-setup`; (2) assert `openstack domain show magnum` and `openstack user show magnum_domain_admin --domain magnum` both succeed; (3) gate `openstack coe service list` (must return the conductor row, no 403). magnum's active/ready status MUST NOT be treated as evidence the trustee domain exists. + +**Roosevelt:** carry as an explicit runbook step + assertion; consider upstreaming a charm change so domain-setup runs automatically, or so the charm surfaces "trustee domain not set up" instead of reporting ready. + +**Related:** D-047 (the v2.0 render bug found in the same incident, but NOT the cause), D-031 / D-037 (magnum surface). 
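The three D-046 steps (run `domain-setup`, assert the domain and user, gate on `openstack coe service list`) could be wrapped in one gate function, sketched below. This is a minimal illustration, not the runbook's actual script: the function name `magnum_trustee_gate` and the messages are hypothetical, it assumes admin credentials are already loaded in the environment, and it assumes the conductor row shows up as the string `magnum-conductor` in the table output.

```shell
# Sketch of the D-046 gate: run domain-setup, then ASSERT, never trust juju status.
# magnum_trustee_gate is a hypothetical name; the commands are the ones D-046 lists.
magnum_trustee_gate() {
    # (1) The charm action is manual, so run it explicitly on every (re)deploy.
    juju run magnum/leader domain-setup >/dev/null \
        || { echo "FAIL: domain-setup action" >&2; return 1; }

    # (2) Both the trustee domain and its admin user must exist.
    openstack domain show magnum >/dev/null \
        || { echo "FAIL: keystone domain 'magnum' missing" >&2; return 1; }
    openstack user show magnum_domain_admin --domain magnum >/dev/null \
        || { echo "FAIL: user 'magnum_domain_admin' missing" >&2; return 1; }

    # (3) Functional gate: the listing must contain the conductor row.
    # Assumption: the row is identifiable by the literal string "magnum-conductor".
    openstack coe service list | grep -q 'magnum-conductor' \
        || { echo "FAIL: no conductor row from coe service list" >&2; return 1; }

    echo "OK: magnum trustee domain asserted"
}
```

Calling `magnum_trustee_gate` from the phase-07 flow and stopping on a non-zero return would keep "active / Unit is ready" from being read as proof that the trustee domain exists.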
+ +## D-047: keystone auth_version v2.0 charm-render bug -- keep the v3 drop-in + +**Status:** ACCEPTED (keep) 2026-06-17. addendum "D-045" -> D-047 per the 06-17 reconciliation. + +**Context:** the magnum charm template renders `auth_version = v2.0` due to a type bug (the keystone interface delivers `api_version` JSON-decoded as int 3; the template does a strict string compare `3 == "3"` -> False -> v2.0). Full analysis in incident-magnum-keystone-v2-rootcause-20260617.md. + +**Finding:** the v2.0 render was NOT the cause of the coe 403 (that was D-046). On this deployment v2.0 is cosmetic -- magnum's `domain_admin_auth` rewrites v2.0 -> v3, v3 is discovered from the unversioned `auth_url`, and incoming token validation worked throughout. + +**Decision (Jesse, 2026-06-17):** KEEP the magnum.conf.d v3 drop-in. v2.0 is the provably wrong value for Caracal (which does not serve v2.0); the drop-in forces v3 via the same config-dir mechanism as the D-037 conductor graft (no charm-file drift, survives re-render). Architectural correctness over minimize-delta, even though the drop-in did not unblock coe. + +**As-built:** `/etc/magnum/magnum.conf.d/50-keystone-v3-override.conf` (auth_version=v3 + www_authenticate_uri/auth_url v3 in `[keystone_authtoken]` and `[keystone_auth]`); `/etc/default/magnum-api` DAEMON_ARGS adds `--config-dir /etc/magnum/magnum.conf.d` (mirrors D-037 for the standalone magnum-api). + +**Roosevelt:** carry the v3 fix as a drop-in/overlay; upstream the template type bug (and the separate identity-service departed-hook IndexError crash documented in the rootcause doc). + +**Related:** D-046, D-037. + +## D-048: Stage-and-verify is the canonical image-seed method (supersedes web-download) + +**Status:** Adopted 2026-06-17 (operator-approved). Supersedes the 2026-06-16 "web-download canonical" ruling. 
+ +**Decision:** seed ALL glance images (octavia amphora base, the noble mgmt image, the workload kube image) by STAGE-AND-VERIFY: download to `$HOME` (snap-readable; NOT /tmp), verify the file against the published checksum (azimuth-images manifest sha512 for kube images; ubuntu cloud-images SHA256SUMS for noble), then `openstack image create --file [--import]`. Web-download is retained as a TESTED ALTERNATIVE only (appendix-A). + +**Rationale:** (1) FINDING-3 -- glance's web-download plugin fetches with urllib (UA `Python-urllib/3.x`) and the azimuth CDN 403s that UA, so web-download is UNUSABLE for kube images (202-accept, then stuck `queued`); (2) web-download cannot checksum-verify the fetched file (the CDN redirect strips the digest) -- weaker provenance; (3) stage-and-verify is one provenance-verified path cloud-wide -- less Roosevelt delta. CORRECTION-1: a plain `--file` PUT stores qcow2 (boots fine); `--import` runs glance image-conversion -> raw (Ceph fast-clone alignment). + +**Roosevelt:** unify on stage-and-verify; the longer-term target remains gss-from-a-controlled-mirror once cloud DNS + FQDN certs land. + +**Related:** D-021 (amphora pipeline), FINDING-3 (appendix-A image-seeding), phase-05 / 06 / 08. + +## D-049: Workload kube image bumped v1.32.13 (EOL) -> v1.34.8 (D1) + +**Status:** Adopted 2026-06-17. Procedure target; re-validation on v1.34.8 follows the stage-and-verify seed. + +**Decision:** the workload-cluster kube image moves from the EOL ubuntu-jammy-kube-v1.32.13 to ubuntu-jammy-kube-v1.34.8 (azimuth-images 0.28.0, build 260518-1604; sha512 7efde485...760bdb3), and the cluster template is renamed `capi-k8s-v1-32` -> `capi-k8s-v1-34`. v1.34.8 is mature with good runway and within CAPI v1.13.2 support. The management cluster's OWN k8s stays at v1.32.13 (k8s-snap 1.32-classic) -- this bump is the workload image only. 
+ +**Note:** the 2026-06-09 D-011 acceptance ran on v1.32.13; the v1.34.8 image is seeded via D-048 stage-and-verify, and D-011 re-validation on v1.34.8 is the pending acceptance item. The template now pins `--network-driver calico` (DOCFIX-032). + +**Related:** D-031 / D-034 (CAPI surface), D-048 (seed), DOCFIX-032 (CNI pin), phase-08. + +## D-042 -- AMENDMENT (2026-06-17): FINDING-5 -- rescope to "<= 1.3.0" (HEALTHY on 1.4.0) + +The D-042 cosmetic `health_status = UNHEALTHY` false-negative is a property of driver builds <= 1.3.0 (the v1beta2 contract-ref mismatch: those builds read `apiVersion` off the infrastructureRef, which CAPI v1.13's v1beta2 contract no longer carries). The RELEASED 1.4.0 driver carries the `api_resources` override and reports `health_status = HEALTHY` against the CAPI v1.13.2 / CAPO v0.14.4 stack (confirmed this rebuild). FINDING-5: D-042 is therefore CLOSED for v1 -- the UNHEALTHY caveat applies only to the historical <=1.3.0 holding state, NOT to the as-built 1.4.0 pin. Auto-heal is still NOT wired to health_status (CAPI MachineHealthCheck heals independently). + +## D-050: PROPOSED / OPEN -- keystone `use-policyd-override=true` with no policy zip (FINDING-1) + +**Status:** PROPOSED / OPEN (recorded 2026-06-17; no action taken). + +**Question:** keystone is configured with `use-policyd-override=true` but no policy override zip is supplied. This is currently a no-op (no custom policy applied), but the flag advertises an override capability that does not exist -- a latent footgun (a future operator may assume policy is being enforced, or a stray zip could silently change authz). + +**Options (unresolved):** (a) set `use-policyd-override=false` for v1 (the override is unused) and revisit when a real policy is needed; (b) keep true and supply an explicit, reviewed policy zip; (c) leave as-is and document the no-op. No decision made -- recorded as an open point to rule on (cf. D-043, also pending). 
+ +**Related:** D-029 (Keystone SSO deferral), FINDING-1. diff --git a/docs/netbox-vip-queue.md b/docs/netbox-vip-queue.md index 04e512f..9ee1c4e 100644 --- a/docs/netbox-vip-queue.md +++ b/docs/netbox-vip-queue.md Binary files differ diff --git a/runbooks/README.md b/runbooks/README.md index 89f4cd9..7e595e6 100644 --- a/runbooks/README.md +++ b/runbooks/README.md @@ -30,7 +30,7 @@ | 04 | phase-04-network-carve.md | Provider external network + IPAM reference | | | 05 | phase-05-octavia-enablement.md | Enable Octavia (amphora) | D-021 | | 06 | phase-06-incloud-mgmt-cluster.md | In-cloud single-homed CAPI management cluster | D-035 | -| 07 | phase-07-conductor-graft.md | Graft the magnum-capi-helm driver onto the conductor | D-031 / D-037 / D-042 | +| 07 | phase-07-conductor-graft.md | Trustee domain-setup + graft the magnum-capi-helm driver | D-031 / D-037 / D-042 / D-046 / D-047 | | 08 | phase-08-workload-cluster-acceptance.md | End-to-end tenant cluster + acceptance bar | D-011 (amended D-019) | ## Appendices diff --git a/runbooks/appendix-A-troubleshooting.md b/runbooks/appendix-A-troubleshooting.md index 6638d6c..2d7ef3a 100644 --- a/runbooks/appendix-A-troubleshooting.md +++ b/runbooks/appendix-A-troubleshooting.md @@ -114,12 +114,25 @@ "string present in the unit file" as "the daemon received the flag." Gate on the assembled/launched cmdline (`show-args`, then `ps` on the live process). +### DOCFIX-035 -- helm not on the conductor's PATH (phase-07) +- Symptom: the magnum-capi-helm driver fails shelling out to `helm` (cluster create errors on a + helm invocation), yet `command -v helm` in an interactive `juju ssh magnum/0` shell finds it. +- Cause: the conductor runs via an LSB init script (systemd `systemd-start`) with the restricted + init PATH (e.g. `/usr/sbin:/usr/bin:/sbin:/bin`), which EXCLUDES `/usr/local/bin` -- where a + get.helm.sh tarball install lands. 
An interactive login shell has `/usr/local/bin` on PATH, so + it masks the problem (the classic green-in-the-shell, broken-in-the-daemon trap). +- Fix: install the binary to `/usr/local/bin/helm` AND symlink `/usr/bin/helm -> it` (`/usr/bin` + IS on the restricted PATH). Checksum-verify the tarball (sha256 vs get.helm.sh `.sha256sum`) + before install. VERIFY against the restricted PATH, not a login shell: + `env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm && helm version --short'` + must print `/usr/bin/helm` (phase-07 7.4). + ### L-P6-3 -- k8s version comes from the IMAGE, not a template label (phase-08) - Symptom: cluster create fails in the driver before provisioning. - Cause: the magnum-capi-helm driver reads `kube_version` from the Glance image properties and routes on `os_distro`; it does NOT take k8s version from a template label. -- Fix: the workload image (e.g. `ubuntu-jammy-kube-v1.32.13`) MUST carry +- Fix: the workload image (e.g. `ubuntu-jammy-kube-v1.34.8`) MUST carry `kube_version` (e.g. v1.32.13) and `os_distro=ubuntu`. Verify before create (phase-08 8.0). ================================================================================ @@ -164,6 +177,11 @@ so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron resources in dependency order: router remove subnet -> router unset external-gateway -> router delete -> subnet delete -> network delete -> security group delete. +- Name-guard (FINDING-4): NEVER patch/delete a CR by an inferred name. The OpenStackCluster is + named `-` where the suffix is random per create (NOT the Magnum cluster + name). LIST first -- `kubectl -n get openstackcluster` -- and operate on the EXACT + name returned. The magnum-ns is `magnum-` (resolve the project id; never hardcode). + A wrong-name patch silently no-ops and the delete stays wedged. 
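The name-guard rule (list first, then act only on the exact name returned) can be enforced mechanically. The sketch below is one possible shape, not the runbook's own tooling: `exact_name_listed` is a hypothetical helper, and `NS` and `NAME` in the commented usage are placeholders the operator fills in from the live listing.

```shell
# Name-guard sketch (FINDING-4): refuse to act unless the name was actually LISTED.
# exact_name_listed is a hypothetical helper. It reads one resource name per line
# on stdin and succeeds only if the candidate equals a whole line exactly.
exact_name_listed() {
    # -F fixed string (no regex), -x whole-line match, -q quiet;
    # "--" keeps a name starting with "-" from being parsed as an option.
    grep -Fxq -- "$1"
}

# Illustrative usage (NS and NAME are operator-supplied, never inferred):
#   kubectl -n "$NS" get openstackcluster --no-headers \
#       -o custom-columns=NAME:.metadata.name \
#     | exact_name_listed "$NAME" \
#     || { echo "refusing: $NAME not in live listing" >&2; exit 1; }
```

A whole-line fixed-string match is used so that a prefix such as the Magnum cluster name alone never passes for the suffixed CR name.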
### LB-failover -- LB stuck provisioning_status=ERROR after a host event (phase-08) - Symptom: the kube-api Octavia LB shows `operating_status ONLINE` but @@ -181,13 +199,15 @@ cluster. If the mgmt cluster is down (see D-041), the taint persists. - Fix: restore the mgmt cluster API; CAPI then removes the taint and addons schedule. -### CNI-label -- network_driver vs the chart-default Calico (1.4.0) (phase-08) -- Note: under the as-FIRST-built driver 1.3.0 the legacy Magnum `network_driver` label - was IGNORED and the capi-helm `openstack-cluster` chart's default CNI (Calico) always - ran. Under the RELEASED 1.4.0 driver the `network_driver` template option IS honored - (it maps through to the chart). To keep the as-built CNI (Calico), the `capi-k8s-v1-32` - template OMITS `--network-driver` (phase-08); set `flannel` there only to intentionally - switch the CNI. (Mgmt cluster CNI is separately Cilium, via k8s-snap.) +### CNI-label / DOCFIX-032 -- network_driver under driver 1.4.0; pin calico explicitly (phase-08) +- Note: under the as-FIRST-built driver 1.3.0 the legacy Magnum `network_driver` label was + IGNORED and the capi-helm `openstack-cluster` chart's default CNI (Calico) always ran. Under + the RELEASED 1.4.0 driver the `network_driver` template option IS honored (it maps through to + the chart `network_driver`). +- DOCFIX-032: pin `--network-driver calico` EXPLICITLY on the `capi-k8s-v1-34` template + (phase-08) rather than relying on the default staying Calico. Chart 0.25.1 ships ONLY Calico + (flannel is not packaged), so `flannel` there would fail to converge -- do not set it. (Mgmt + cluster CNI is separately Cilium, via k8s-snap.) ================================================================================ ## Hyperconverged host / mgmt-VM resilience @@ -249,6 +269,16 @@ allocation_pool (phase-00 Phase 4). A reserved range stops future auto-assign onto a configured VIP. Negative test post-deploy: no service vip == any unit primary. 
+### DEVIATION-2 -- raise a KVM host's RAM, then MAAS-recommission to Ready (phase-00) +- Context (2026-06-11): the openstack0-3 KVM guests were bumped 16384 -> 32768 MiB on the 196 GB + hypervisor to relieve memory pressure. Pattern: with the guest SHUT OFF (and after the OSD + wipe), `virsh setmaxmem 32G --config` then `virsh setmem 32G --config`; boot; then + MAAS RECOMMISSION the node so MAAS re-reads hardware and lands it back at Ready at the new size + (4x Ready at 32768 in ~3 min). Do the maxmem change while shut off -- a live setmaxmem is rejected. +- D-040 `reserved-host-memory 8192` is RETAINED (correctness floor, independent of host size). + Re-measure the per-host container/service footprint against the 32 GiB envelope before the + Roosevelt node-role split -- 16 GiB-era pressure numbers do not map 1:1. + ================================================================================ ## Deploy-time (phase-01) ================================================================================ @@ -330,6 +360,39 @@ revs instead of re-introducing the 401-by-hardcode. ================================================================================ +## Core services: HAProxy + reverse-proxy (phase-03) +================================================================================ + +### D-045 / DOCFIX-031 -- juju "active/idle" but an haproxy backend is DOWN (phase-03) +- Symptom: `juju status` is all active/idle, yet a service VIP intermittently 503s or a unit's + API is unreachable. juju health is BLIND to per-backend haproxy state. +- Cause: a charm-rendered haproxy backend can be silently DOWN without the charm going non-idle + -- e.g. (D-045) haproxy was NOT reloaded after the TLS/cert cascade, so its health checks ran + plaintext against an SSL backend and marked it DOWN. juju-green is necessary, not sufficient. +- Fix: sweep haproxy's OWN verdict on every unit via its admin socket, then remediate+reload. 
+ Per unit, read `/var/run/haproxy/admin.sock` (`show stat`) and `grep ',DOWN,'` (excluding the + FRONTEND/BACKEND summary rows). For any flagged unit: `sudo haproxy -c -f + /etc/haproxy/haproxy.cfg` (must say valid) then `sudo systemctl reload haproxy` (graceful + master-worker; reload, not restart). Phase-03 3.1 gates on a zero-DOWN sweep cloud-wide -- + it closes the juju-green-but-backend-DOWN hole. + +### nginx-reverse-proxy -- jumphost -> internal-VIP proxy gotchas (phase-03) +- Context: the jumphost reaches internal-only dashboards/APIs via an nginx reverse proxy + (phase-03 3.3). Four traps, each with the as-built fix: +- reload race: a `systemctl reload nginx` right after editing the vhost can be served by a + still-draining old worker (a curl ~2s later hits stale behavior; the co-hosted MAAS proxy + blips too). `nginx -t` FIRST; prefer `restart` for a definitive cutover when the listen/upstream + set changed, reload only for content-equivalent edits. +- proxy_ssl_name / SNI: the upstream presents a DNS-SAN cert (a juju-internal name, e.g. + `juju-ffe3b8-2-lxd-2`); set `proxy_ssl_name` to that SAN, `proxy_ssl_verify on`, and the vault + CA in `proxy_ssl_trusted_certificate`, or verification fails on the IP-only connect. +- sed no-op: a `sed -i` that does not match silently changes nothing and the proxy keeps the old + behavior -- assert the post-edit content, do not trust sed's exit code. +- scheme-mismatch redirect loop: the backend issues `https://` Location headers while the proxy + listens `http`; without `proxy_redirect https:// http://` (or a matching listen scheme) the + browser loops. Match the scheme end-to-end or rewrite the redirect. + +================================================================================ ## Octavia enablement (phase-05) ================================================================================ @@ -357,6 +420,36 @@ The amphora pipeline gate asserts the two are equal before building (phase-05 5.2). 
================================================================================ +## Image seeding (phase-05/06/08) +================================================================================ + +### FINDING-3 -- azimuth CDN 403s glance web-download; stage-and-verify is canonical (phase-06, phase-08) +- Symptom: a glance web-download import (`--import-method web-download`) 202-accepts, then the + image hangs in `queued` forever and never reaches `active`. +- Cause: glance's web-download plugin fetches with urllib (User-Agent `Python-urllib/3.x`); the + azimuth-images CDN (`azimuth-images.stackhpc.cloud`) returns HTTP 403 to that UA. A curl/HEAD + probe with a different UA passes -- which is why an earlier probe false-passed while the real + import failed. +- Fix (canonical): STAGE-AND-VERIFY. curl the qcow2 to `$HOME` (snap-readable, NOT /tmp -- L7; + curl's UA is not blocked), verify the checksum against the published manifest (azimuth-images + manifest.json -- sha512 for kube images; the ubuntu cloud-images SHA256SUMS for noble), then + `openstack image create --file <file> --import` (the openstack snap's `--import` == glance-direct; + image-conversion lands it `raw`). CORRECTION-1: a plain `--file` PUT (no `--import`) stores + qcow2 -- fine for boot, but `--import` gives the raw Ceph fast-clone alignment. +- Clear a stuck record before retry: gated `openstack image delete <id>` on the `queued` remnant + (verify the EXACT id first -- FINDING-4 name-guard discipline). +- Roosevelt: unify ALL image seeding (amphora base, noble mgmt, kube) on stage-and-verify for one + provenance-verified path cloud-wide. + +### web-download -- tested ALTERNATIVE to stage-and-verify (phase-05/06/08) +- Web-download (`openstack image create --import --import-method web-download --uri <url>`) is + retained as a tested ALTERNATIVE, not the canonical path (superseded 2026-06-17; see + design-decisions).
Caveats: (1) it cannot checksum-verify the fetched file against a published + digest (the CDN redirect strips it) -- weaker provenance; (2) it 403s on the azimuth CDN + (FINDING-3), so it is unusable for kube images; (3) for ubuntu cloud-images it works on the + hardened bundle (the 2026-06-08 403 was transient/pre-hardening). Use only as an expedient. + +================================================================================ ## Notes ================================================================================ - This index covers phases 00-08. It grows the same way for any future phase: keyed by diff --git a/runbooks/appendix-B-asbuilt-version-lock.md b/runbooks/appendix-B-asbuilt-version-lock.md index add0350..a434fad 100644 --- a/runbooks/appendix-B-asbuilt-version-lock.md +++ b/runbooks/appendix-B-asbuilt-version-lock.md @@ -1,7 +1,8 @@ # Appendix B -- As-Built Version / Channel / Revision Lock Source: `juju export-bundle` (model `openstack`) + the in-cloud mgmt-cluster -captures, 2026-06-09. ASCII-only. +captures, 2026-06-09; B.2/B.3 workload-image, template, driver, and helm facts refreshed +in the 2026-06-17 sweep (D1 + DOCFIX-032 + DOCFIX-035). ASCII-only. POLICY (D-002 + consolidation prompt): the bundle PINS CHANNELS, not revisions. This appendix records the as-built REVISIONS as the known-good baseline. 
A fresh @@ -80,7 +81,7 @@ ## B.2 In-cloud management cluster + CAPI constellation (D-034 / D-035 / D-037) -Node `capi-mgmt-v2` (FIP 10.12.7.40, internal 10.20.0.45), single-node, non-CAPI-managed: +Node `capi-mgmt-v2` (FIP + internal IP are per-rebuild -- this rebuild FIP 10.12.5.103 / internal 10.20.0.107; 2026-06-09: 10.12.7.40 / 10.20.0.45), single-node, non-CAPI-managed: - k8s-snap: channel `1.32-classic/stable`, rev 5326, k8s v1.32.13 (classic confinement) - CAPI core + kubeadm-bootstrap + kubeadm-control-plane: v1.13.2 - CAPO (infra provider): v0.14.4 @@ -89,7 +90,10 @@ - CAAPH (cluster-api-addon-provider): chart 0.12.0 (`helm --version`, from dependencies.json; deploys image 62f7c00) - cluster-api-janitor-openstack: chart 0.11.0 (`helm --version`, from dependencies.json; deploys image d527847) - cluster-autoscaler (per-workload): v1.30.4 -- Mgmt CNI: Cilium 1.17.12-ck0. Workload-cluster CNI: Calico (chart default). +- Mgmt CNI: Cilium 1.17.12-ck0. Workload-cluster CNI: Calico (DOCFIX-032: pinned explicitly, not relied-on default). +- helm: v3.17.3 -- mgmt-VM tooling (phase-06 6.6a) AND the magnum conductor (phase-07 7.4), + installed to /usr/local/bin + a /usr/bin/helm symlink so the conductor's restricted init PATH + resolves it (DOCFIX-035). VERSION-SOURCE RULE (D-034): every provider ref above is read live from the chosen `capi-helm-charts` release tag's `dependencies.json` via `jq`. DO NOT hardcode @@ -97,8 +101,8 @@ ## B.3 Magnum driver + chart (Layer B -- outside Juju channels, manually pinned) -- magnum-capi-helm driver: 1.3.0 was the AS-FIRST-BUILT pin; the v1 TARGET is the - RELEASED `magnum-capi-helm==1.4.0` (D-042). 1.3.0 is contract-INCOHERENT with the +- magnum-capi-helm driver: 1.3.0 was the AS-FIRST-BUILT pin; the v1 AS-BUILT pin is the + RELEASED `magnum-capi-helm==1.4.0` (D-042; installed, health HEALTHY). 
1.3.0 is contract-INCOHERENT with the Layer-A core -- it reads `apiVersion` off the infrastructureRef, which CAPI v1.13 (v1beta2 contract) no longer carries, so the driver's `infrastructure` health GET returns "not found" (cosmetic only -- the create path is unaffected; the chart @@ -120,10 +124,15 @@ - chart repo: https://azimuth-cloud.github.io/capi-helm-charts - chart name: openstack-cluster ; default_helm_chart_version: 0.25.1 - conf.d drop-in: /etc/magnum/magnum.conf.d/00-capi-helm.conf (D-037) -- note (CNI): the `capi-k8s-v1-32` template OMITS the Magnum `network_driver` field, so - the workload cluster gets the chart-default Calico (the as-built CNI). Whether 1.4.0 - honors `network_driver` is unverified and not relied on -- omitting the field is what - guarantees Calico (appendix-A: CNI-label; phase-08). +- workload kube image (D1): ubuntu-jammy-kube-v1.34.8 (azimuth-images 0.28.0, build 260518-1604); + kube_version v1.34.8, os_distro ubuntu; sha512 7efde4857c9f9da045a98d71def30e229b3d7fffd8a5680e8aee0c5a8b13ba73fca3cf758a927230a1fbe3c451d8d21cfaeded96091e2a4f313c6a404760bdb3 + (manifest.json). Seeded by STAGE-AND-VERIFY from the azimuth CDN (FINDING-3 -- glance + web-download 403s the urllib UA). Bumped from EOL v1.32.13 (within CAPI v1.13.2 support). +- workload template (D1): capi-k8s-v1-34 (was capi-k8s-v1-32), --network-driver calico pinned (DOCFIX-032). +- note (CNI, DOCFIX-032): the `capi-k8s-v1-34` template PINS `--network-driver calico` + explicitly. Under driver 1.4.0 `network_driver` IS honored (maps to the chart); chart 0.25.1 + ships only Calico (flannel not packaged), so the explicit pin documents intent and does not + rely on the default staying Calico (appendix-A: CNI-label / DOCFIX-032; phase-08). - v1 END STATE: 1.4.0 installed and `health_status = HEALTHY` (D-011). 1.3.0 is only a TEMPORARY rollback/holding state (phase-07 Rollback), never a v1 completion. 
Either way, do NOT wire magnum auto-heal to health_status (CAPI MachineHealthCheck handles diff --git a/runbooks/ops-capi-recovery.md b/runbooks/ops-capi-recovery.md index 01e6886..06eb862 100644 --- a/runbooks/ops-capi-recovery.md +++ b/runbooks/ops-capi-recovery.md @@ -11,8 +11,12 @@ Magnum health). Everything upstream stays red until the layer below is green. Scope-hygiene preambles are the canonical ones from the 2026-06-09 as-executed -log. ENV literals: project capi-mgmt 674171fd28d446d3a37073b6a761e910; mgmt FIP -10.12.7.40; kube-api LB 0f968008-...; regenerate per site on rebuild. +log. ENV values are PER-REBUILD and resolved at run time in the blocks below: +project capi-mgmt id via `openstack project show capi-mgmt --domain capi`; mgmt FIP +via `~/capi-mgmt-net.env` as `$MGMT_FIP` (the phase-06 single source); the +magnum-<project-id> driver namespace via `kubectl get ns`; the kube-api LB by id at failover. +Never reuse a prior rebuild's literals (2026-06-09 example, do NOT paste: project +674171fd..., FIP 10.12.7.40, LB 0f968008-...).
--- @@ -38,7 +42,8 @@ # capi-mgmt scope source ~/admin-openrc unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID -export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 +CAPI_PID=$(openstack project show capi-mgmt --domain capi -f value -c id) # per-rebuild; resolve, never hardcode +export OS_PROJECT_ID="$CAPI_PID" unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID openstack server stop capi-mgmt-v2 # NOTE: Nova ACPI stop does NOT produce a clean guest shutdown on this VM @@ -61,7 +66,8 @@ ( { source ~/admin-openrc unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID - export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 + CAPI_PID=$(openstack project show capi-mgmt --domain capi -f value -c id) # per-rebuild; resolve, never hardcode + export OS_PROJECT_ID="$CAPI_PID" unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID openstack server start capi-mgmt-v2 for i in $(seq 1 20); do @@ -70,9 +76,10 @@ [ "$ST" = ACTIVE ] && break sleep 10 done + source ~/capi-mgmt-net.env # MGMT_FIP (per-rebuild single source from phase-06; never hardcode) echo "=== TCP probe loop: FIP :22 (sshd lags ACTIVE by ~3 min) ===" for i in $(seq 1 18); do - timeout 5 bash -c 'exec 3<>/dev/tcp/10.12.7.40/22' 2>/dev/null \ + timeout 5 bash -c "exec 3<>/dev/tcp/$MGMT_FIP/22" 2>/dev/null \ && { echo "[$i] SSH-PORT-OK"; break; } || echo "[$i] not yet" sleep 10 done @@ -90,10 +97,11 @@ BEGIN runbook block: mgmt k8s readiness poll (cold-start aware) ------------------------------------------------------------------------ ( { + source ~/capi-mgmt-net.env # MGMT_FIP (per-rebuild single source from phase-06; never hardcode) for i in $(seq 1 15); do echo "--- [$i] $(date -u +%H:%M:%S) ---" ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ - -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \ + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@"$MGMT_FIP" \ 'uptime; sudo k8s 
status 2>&1 ~/capi-mgmt.kubeconfig +# -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@"$MGMT_FIP" \ +# "sudo k8s config server=https://$MGMT_FIP:6443 ~/capi-mgmt.kubeconfig ( { source ~/admin-openrc unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID - export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 + CAPI_PID=$(openstack project show capi-mgmt --domain capi -f value -c id) # per-rebuild; resolve, never hardcode + export OS_PROJECT_ID="$CAPI_PID" unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID openstack loadbalancer list -f yaml } ) @@ -222,11 +231,12 @@ unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME openstack loadbalancer amphora list -f yaml # all ALLOCATED export KUBECONFIG="$HOME/capi-mgmt.kubeconfig" - NS=magnum-674171fd28d446d3a37073b6a761e910 + NS=$(kubectl get ns -o name | cut -d/ -f2 | grep "^magnum-" | head -1) # capi-mgmt driver ns; resolve, never hardcode kubectl -n "$NS" get cluster,openstackcluster # Available=True (allow ~10 min post-failover for CAPO resync) source ~/admin-openrc unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID - export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 + CAPI_PID=$(openstack project show capi-mgmt --domain capi -f value -c id) # per-rebuild; resolve, never hardcode + export OS_PROJECT_ID="$CAPI_PID" unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID openstack coe cluster show capi-test-1 -f value -c health_status openstack coe cluster show capi-test-1 -f value -c health_status_reason diff --git a/runbooks/phase-00-teardown-maas-reset.md b/runbooks/phase-00-teardown-maas-reset.md index 94bf9ba..433376e 100644 --- a/runbooks/phase-00-teardown-maas-reset.md +++ b/runbooks/phase-00-teardown-maas-reset.md @@ -13,8 +13,9 @@ libvirt/qemu-img), KI-P3-001. !!! DESTRUCTIVE. Phase 1 (destroy-model + release) and Phase 2 (OSD wipe) are - irreversible short of the KVM snapshots (the D-017 safety net). 
Each destructive - step is DISCRETE and individually gated -- do not batch. + irreversible. There is NO model-state rollback (DEVIATION-1): a KVM snapshot revert + cannot restore the destroyed Juju model -- the repo runbooks ARE the tested restore + path (D-017). Each destructive step is DISCRETE and individually gated -- do not batch. CAPI-MGMT NOTE: this teardown releases the FOUR openstack hosts only. The MAAS `capi-mgmt` VM is the RETIRED D-033 out-of-cloud node; the in-cloud `capi-mgmt-v2` @@ -25,8 +26,11 @@ --- ## Prerequisites -- KVM snapshots of openstack0-3 exist (safety net). Authenticated juju session - (`juju whoami`). MAAS CLI logged in as profile `admin`. +- (OPTIONAL) KVM snapshots of openstack0-3. NOTE (DEVIATION-1): snapshots do NOT give + model-state rollback -- destroy-model erases the Juju controller DB, so a disk revert + resurrects machines with no managing model + a stale MAAS view. The repo runbooks are + the restore path (D-017); snapshots are not required for this cycle. +- Authenticated juju session (`juju whoami`). MAAS CLI logged in as profile `admin`. - Run from jumphost `vopenstack-jesse` (user `jessea123`, sudo; also the libvirt hypervisor). ## Constants and env-literals @@ -47,7 +51,10 @@ ```bash ( { echo "=== 0a. five network spaces (hard blocker if absent) ===" - juju spaces # expect metal 10.12.8.0/22 | provider 10.12.4.0/22 | data 10.12.12.0/22 | storage 10.12.16.0/22 | replication 10.12.20.0/22 + # DOCFIX-026: MAAS is authoritative for spaces (Juju imports them at add-model); use the + # model-independent query (same as Phase 5). Expect: metal 10.12.8.0/22 | provider 10.12.4.0/22 + # | data 10.12.12.0/22 | storage 10.12.16.0/22 | replication 10.12.20.0/22 (lbaas + undefined also appear). + maas admin spaces read | jq -r '.[] | "\(.name)\t\([.subnets[]?.cidr] | join(", "))"' echo "=== 0b. 
VIP ipranges (note the front-loaded ones to KEEP + the stale .224-.254 to remove) ===" maas admin ipranges read \ @@ -69,7 +76,8 @@ printf '%-46s state=%s owner=%s mode=%s\n' "$f" \ "$(sudo virsh -c qemu:///system domstate "$host" 2>/dev/null)" \ "$(sudo stat -c '%U:%G' "$f" 2>/dev/null)" "$(sudo stat -c '%a' "$f" 2>/dev/null)" -done # expect (AFTER Phase 1 release): 4 lines, state=shut off, owner=root:root, mode=600 +done # expect (AFTER Phase 1 release): 4 lines, state=shut off, owner=root:root, mode=600. + # (Run PRE-teardown as a baseline: state=running, owner=libvirt-qemu:kvm -- correct live state.) ``` ## Phase 1 -- Teardown (D-018) DISCRETE / DESTRUCTIVE @@ -206,7 +214,7 @@ `# RUN: jumphost` ```bash ( { - juju spaces # 5 spaces present + maas admin spaces read | jq -r '.[] | "\(.name)\t\([.subnets[]?.cidr] | join(", "))"' # DOCFIX-026: 5 spaces (juju spaces FAILS here -- model gone post-teardown) maas admin machines read | jq -r '.[]|select(.hostname|test("^openstack[0-3]$"))|"\(.hostname)\t\(.status_name)"' | sort # all Ready for SID in 4na83t qdbqd6 h8frng tmsafc; do echo "-- $SID --" maas admin interfaces read "$SID" | jq -r '.[]|select(.name|test("^enp(8|9|10)s0$"))|" \(.name)\t\([.links[]?|{(.subnet.cidr):.ip_address}])"' @@ -238,6 +246,17 @@ openstack0 resolved dynamically (the block does not depend on these). - MAAS carve: front-loaded .2-.63 reservations created earlier and persistent; stale metal .224-.254 was iprange id=2 (deleted after confirmation). +- DEVIATION-2 (2026-06-11): hypervisor 196 GB; openstack0-3 each 16384 -> 32768 MiB + (virsh setmaxmem/setmem --config while shut off, post-OSD-wipe), then MAAS recommission + with `skip_networking=1 skip_storage=1 testing_scripts=none` -- refreshes hardware + inventory WITHOUT losing interface links/storage layout (all 12 storage links preserved; + 4x Ready at 32768 in ~3 min). D-040 reserved-host-memory 8192 retained (correctness floor, + not a function of total RAM). 
Per-host footprint for Roosevelt rebalancing is measured at + the 32 GiB envelope (16 GiB-era pressure numbers do not map 1:1). [recommission pattern -> appendix-A] +- DEVIATION-3 (2026-06-11): the destroy-model released Juju machine 4 (the retired D-033 + out-of-cloud capi-mgmt MAAS node) as a side effect; MAAS shows capi-mgmt = Ready (landed + Ready, not re-released by the Phase 1C loop, which targeted only the four system_ids). + The separate "Phase 7 teardown of old MAAS capi-mgmt node" queue item is thereby closed. ## Next phase-01 -- bundle deploy. diff --git a/runbooks/phase-01-bundle-deploy.md b/runbooks/phase-01-bundle-deploy.md index a6a28b9..8b745ba 100644 --- a/runbooks/phase-01-bundle-deploy.md +++ b/runbooks/phase-01-bundle-deploy.md @@ -74,11 +74,13 @@ } ) ``` ```bash -# CHECK 4b: OSD /dev/vdb blank (run on each host; sudo required -- appendix-A: R7) +# CHECK 4b: OSD /dev/vdb blank (DOCFIX-027 -- LOCAL libvirt-host loop, NOT ssh: the four +# hosts are Released/powered-off entering phase-01, and /var/lib/libvirt/images is a +# hypervisor (jumphost) path that does not exist on the hosts. RUN: jumphost (libvirt host; sudo). for h in openstack0 openstack1 openstack2 openstack3; do echo "== $h ==" - ssh jessea123@$h "sudo qemu-img info /var/lib/libvirt/images/${h}-1.qcow2 | grep -E 'virtual size|disk size'" /dev/null): @@ -144,6 +154,12 @@ * Waiting on vault certs (expected pre-init): ovn-central x3, ovn-chassis x3 (incl nova-compute subordinates), ovn-chassis-octavia, neutron-api-plugin-ovn, barbican-vault. * octavia BLOCKED "Awaiting configure-resources" (D-021); gss unknown (pre-run). + * magnum/0 BLOCKED "Ports which should be open, but are not: 9501" -- pre-vault posture: + magnum-api is loopback-bound ([api] host not yet templated) and haproxy backends target + unit IPs. EXPECTED phase-01 end-state; self-resolves at the phase-02 cert rollout (apache2 + takes *:9501). 
Confirmed self-resolving 2026-06-12 (FINDING-2); verify in the phase-02 post-init sweep. + * keystone/0 "PO (broken): Unit is ready" -- expected while use-policyd-override=true with + no policy zip attached (FINDING-1); keystone runs the DEFAULT policy. No mutation this arc. - Section-G NIC payoff confirmed (no subset/binding errors): ceph-mon -> storage 10.12.16.x; octavia -> data 10.12.12.1; nova-compute -> data 10.12.12.4x; vault -> metal 10.12.8.x. - Proceed to phase-02 (vault init). @@ -154,6 +170,24 @@ - Pre-deploy verify: VIPs 11/11/0; enp8s0 -> 10.12.12.40-43 (all 4); subnet DNS as above; nodes Ready; OSD blank. - Settled: zero errors; mysql /0 R/W (10.12.8.173), /1 (.179) /2 (.185) R/O; vault blocked needs-init. +## Balance / stability observations (Roosevelt rebalancing inputs -- post-deploy item 6) +- Quorum triads (mysql-innodb-cluster, ovn-central, ceph-mon) all on machines 0/1/2: correct + anti-affinity; machine 3 loss breaks no quorum; any single loss of 0/1/2 leaves a 2-of-3 majority. +- Machine 0 = no-compute control host, largest container count: prefigures the Roosevelt role split. +- FLAG: machine 3 concentrates six singletons (vault, glance, nova-cloud-controller, octavia, + placement, barbican) + compute + OSD. Acceptable on testcloud; Roosevelt answer is role split + HA, + informed by measured footprints at the 32 GiB envelope (DEVIATION-2 caveat). +- rabbitmq-server single unit: messaging SPOF, as designed for v1. + +## PATTERN-1 (standing convention) -- dynamic lookup vs. pinned identifiers +READ/VERIFY ops discover values at runtime (never hardcode what resolves: hostname->system_id via +`maas admin machines read | jq`; subnet id by CIDR). DESTRUCTIVE/IRREVERSIBLE ops discover +dynamically, ASSERT against a pinned EXPECTED set, ABORT on mismatch, then operate on the pinned +values (a filter bug or an unexpected new machine must not become collateral damage). 
Retrofit +candidates (apply as fixture-tested gated blocks, NOT bulk edits): phase-00 release/host loops -> +discover-assert-pin; subnet ids -> resolve by CIDR; octet maps -> derive from hostname index. +Canonical statement in runbooks/README.md. + ## Next phase-02 -- vault bring-up. diff --git a/runbooks/phase-02-vault-bringup.md b/runbooks/phase-02-vault-bringup.md index fc03cf4..74582e4 100644 --- a/runbooks/phase-02-vault-bringup.md +++ b/runbooks/phase-02-vault-bringup.md @@ -42,7 +42,12 @@ init with the `2>&1 | tee` capture (NOT `>`). Save `~/vault-init/init.txt` off-host the moment the gate passes. ```bash +# RUN: jumphost -- open the interactive session ONLY (paste this line alone; DOCFIX-029) juju ssh -m openstack vault/0 +``` +WAIT for the remote prompt (`ubuntu@juju-...`) before pasting the next block -- a combined +paste buffers the in-session lines and feeds them to the session on connect. +```bash # --- inside the vault/0 session: --- export VAULT_ADDR=http://127.0.0.1:8200 ; umask 077 ; mkdir -p ~/vault-init vault status 2>&1 | grep -E 'Initialized|Sealed|Storage Type|HA Enabled' || true # pre-check: Initialized false (fresh) @@ -86,18 +91,29 @@ juju actions vault --schema --format yaml -m openstack | sed -n '/authorize-charm:/,/^[a-z]/p' ``` ```bash -# RUN: on vault/0 -- mint a short-lived child token (root entered hidden, never on argv/history) +# RUN: jumphost -- open the interactive session ONLY (paste this line alone; DOCFIX-029) juju ssh -m openstack vault/0 -# --- inside the session: --- +``` +WAIT for the remote prompt (`ubuntu@juju-...`). This in-session block contains a hidden +`read -s` -- a combined paste would let read swallow the next buffered line as the secret. +NO trailing `exit`: exit MANUALLY after copying the child token (a paste-ahead `exit` could +self-terminate the session and mask the swallow). 
+```bash +# --- inside the session: mint a short-lived child token (root entered hidden, never on argv/history) --- export VAULT_ADDR=http://127.0.0.1:8200 read -s -p "root token: " VAULT_TOKEN; echo ; export VAULT_TOKEN vault token create -ttl=10m -field=token # prints ONLY the child token -- copy it unset VAULT_TOKEN -exit +# (exit manually after you have copied the child token) ``` ```bash # RUN: jumphost -- authorize + root CA + status (each juju run blocks to completion) -juju run vault/leader authorize-charm token= -m openstack +# ENHANCEMENT-2: enter the child token via hidden read (keeps it out of jumphost shell +# history). The token still transits the Juju operation log (inherent to the action; +# mitigated by the 10m TTL) -- this narrows exposure, it does not eliminate it. +read -s -p "child token: " TOK; echo +juju run vault/leader authorize-charm token="$TOK" -m openstack +unset TOK juju run vault/leader generate-root-ca -m openstack juju status vault -m openstack ``` @@ -114,6 +130,14 @@ - The narrow cert cascade to the Vault consumers (ovn-central x3, ovn-chassis x3, ovn-chassis-octavia, neutron-api-plugin-ovn, barbican-vault) now proceeds -- it is watched and accepted in phase-03. +- POST-INIT SWEEP (FINDING-2 / DOCFIX-028 cross-check) -- after the cert cascade settles: + * magnum/0 -> active "Unit is ready"; magnum-api is now served by apache2 on *:9501 (all + interfaces; haproxy backends reachable; [api] port moved to the wsgi backend). The + phase-01 pre-vault 9501 BLOCK was the expected loopback-bound posture and self-resolves + here at the TLS cutover (confirmed 2026-06-12). If it is STILL loopback-bound after certs + settle, escalate to charm diagnosis BEFORE phase-03 (then the phase-01 line is a defect). + * keystone/0 PO state UNCHANGED ("PO (broken): Unit is ready") -- still default policy + (FINDING-1: use-policyd-override=true with no zip). Not a regression; no mutation. 
## As-built reference (2026-06-03 run -- audit trail) - init: 5 shares / threshold 3, "Vault initialized with 5 key shares and a key diff --git a/runbooks/phase-03-core-verify.md b/runbooks/phase-03-core-verify.md index 38d2ddb..1fdbead 100644 --- a/runbooks/phase-03-core-verify.md +++ b/runbooks/phase-03-core-verify.md @@ -5,9 +5,12 @@ API reachability, and repoint the external Horizon reverse proxy. Decisions: B5 (IP-only endpoints; no FQDN), D-021 (octavia stays BLOCKED awaiting -configure-resources -- expected, cleared in phase-05). Troubleshooting: appendix-A -- -DOCFIX-021 (action human-output corrupts captured artifacts), DOCFIX-018 (IP-only -OS_AUTH_URL), DOCFIX-022 (admin project discovered, not hardcoded). +configure-resources -- expected, cleared in phase-05), D-044 (Horizon Secure-cookie +override on the plain-HTTP proxy leg; Step 3.3, PER-REBUILD), D-045 / DOCFIX-031 (haproxy +backends confirmed LOADED via a functional sweep, NOT juju status; Step 3.1). Troubleshooting: +appendix-A -- DOCFIX-021 (action human-output corrupts captured artifacts), DOCFIX-018 (IP-only +OS_AUTH_URL), DOCFIX-022 (admin project discovered, not hardcoded), D-045/DOCFIX-031 (haproxy +plaintext-check-vs-SSL backend DOWN), nginx reverse-proxy lessons. --- @@ -62,6 +65,31 @@ # juju ssh -m openstack -- 'sudo tail -120 /var/log/juju/unit-.log' plaintext checks vs the SSL backend). 
+Probe haproxy's own verdict on every unit: +```bash +( { + echo "=== POST-TLS GATE: haproxy backend health sweep across all units ===" + for unit in $(juju status -m openstack --format=json | python3 -c 'import json,sys; d=json.load(sys.stdin); [print(u) for a in d.get("applications",{}).values() for u in (a.get("units") or {})]'); do + juju ssh -m openstack "$unit" -- "test -S /var/run/haproxy/admin.sock || exit 0; sudo python3 -c 'import socket;s=socket.socket(socket.AF_UNIX);s.connect(\"/var/run/haproxy/admin.sock\");s.sendall(b\"show stat\n\");print(s.makefile().read())' | grep -vE 'FRONTEND|BACKEND' | grep ',DOWN,'" </dev/null 2>/dev/null | sed "s|^|[$unit] DOWN: |" + done + echo "=== sweep complete -- no DOWN lines above means every haproxy backend is UP ===" +} ) +``` +GATE: zero `[unit] DOWN:` lines. On a DOWN line (check token L7STS/400 == plaintext-vs-SSL), +remediate the flagged unit (set U, then validate-and-reload): +```bash +U=nova-cloud-controller/0 +juju ssh -m openstack "$U" -- 'sudo haproxy -c -f /etc/haproxy/haproxy.cfg' 10.12.4.10:5240). Horizon vhost `/etc/nginx/sites-available/openstack` + (symlinked into sites-enabled), listen 81; corporate clients reach it via 10.17.11.246:81. + +As-executed change set (gate every edit -- `sed -i` exits 0 on zero matches, so grep-assert +the expected line after any mutation): +```bash +# RUN: jumphost -- ship the vault root CA to the proxy +scp ~/vault-init/vault-ca-root.pem jessea123@10.12.4.7:/tmp/ +``` +```bash +# RUN: operator ON 10.12.4.7 -- install CA, back up + edit the Horizon vhost, validate, restart.
+sudo install -o root -g root -m 644 /tmp/vault-ca-root.pem /etc/nginx/vault-ca-root.pem && rm -f /tmp/vault-ca-root.pem +sudo cp -a /etc/nginx/sites-available/openstack "/etc/nginx/sites-available/openstack.bak-$(date -u +%Y%m%dT%H%M%SZ)" +# Set in the Horizon server block (then `grep` to confirm each landed): +# proxy_pass https://10.12.4.58:443; +# proxy_ssl_trusted_certificate /etc/nginx/vault-ca-root.pem; +# proxy_ssl_verify on; +# proxy_ssl_name juju-ffe3b8-2-lxd-2; # the dashboard cert's DNS SAN -- per site (discover: openssl s_client -connect 10.12.4.58:443 </dev/null 2>/dev/null | openssl x509 -noout -ext subjectAltName) +# proxy_redirect https://$http_host/ http://$http_host/; # unwind the scheme-mismatch redirect loop (Horizon emits absolute https:// on the client Host -> browser then speaks TLS to the :81 plaintext listener) +sudo nginx -t # GATE: configuration ok +sudo systemctl restart nginx # prefer restart over reload for a definitive cutover (a curl ~2s after `reload` can be served by a draining old worker; ~2s blip incl. the co-hosted MAAS proxy) +``` +GATE (on the proxy): `curl -sI http://127.0.0.1:81/horizon/` -> 302 to .../auth/login; no TLS errors in error.log. + +### DOCFIX-030 -- Horizon Secure-cookie override (D-044; PER-REBUILD) +The charm renders `CSRF_COOKIE_SECURE`/`SESSION_COOKIE_SECURE = True` (vault:certificates). +On the plain-HTTP client leg the browser drops the Secure csrftoken and login fails with +"CSRF cookie not set" -- so a clean follow of 3.3 otherwise stalls at the browser login.
+Drop an ASCII-only post-load override on the dashboard unit, then graceful-reload apache2: +```bash +# RUN: jumphost -- D-044 cookie override on the dashboard unit (ASCII-only; PER-REBUILD) +juju ssh -m openstack openstack-dashboard/leader -- "printf 'CSRF_COOKIE_SECURE = False\nSESSION_COOKIE_SECURE = False\n' | sudo tee /usr/share/openstack-dashboard/openstack_dashboard/local/local_settings.d/_99_internal_http_cookies.py >/dev/null && sudo systemctl reload apache2" keystone VIP TLS verify rc 0; haproxy backend sweep zero DOWN cloud-wide. ## Next phase-04 -- network carve (external provider network). diff --git a/runbooks/phase-04-network-carve.md b/runbooks/phase-04-network-carve.md index 107da29..9bce883 100644 --- a/runbooks/phase-04-network-carve.md +++ b/runbooks/phase-04-network-carve.md @@ -114,11 +114,18 @@ - FIP allocation + tenant router gateways are now possible (needed by phase-06 mgmt VM FIP, phase-08 cluster FIPs + LB validation). -## As-built reference (2026-06-03 run -- audit trail) -- network provider-ext = 70b34bb2-3afb-4b43-96d3-f520dbcbf9a8 (external, flat, physnet1, shared=false, role=provider) -- subnet provider-ext-fip = e3afcbae-ec34-4125-9007-2bfa51851422 +## As-built reference (object IDs regenerate per deploy -- old IDs are dead post-teardown, not a discrepancy) +- network provider-ext = 0d00ddc1-d2bf-4849-a087-14c07d77f167 (06-03 snapshot: 70b34bb2-...) + (external, flat, physnet1, shared=false, role=provider) +- subnet provider-ext-fip = d27f196c-a2d9-4bb9-99f3-bcb8caea3165 (06-03 snapshot: e3afcbae-...) (cidr 10.12.4.0/22, gateway 10.12.4.1, enable_dhcp=false, alloc 10.12.5.0-10.12.7.254, tags role=provider + netbox-iprange=10.12.5.0-10.12.7.254) +- Live MAAS reservations the IPAM draft + D-003 do NOT yet list (the DRAFT is incomplete, not + the cloud -- draft <- live): 10.12.4.101-10.12.4.110 (subnet 1, provider) + + 10.12.8.101-10.12.8.110 (subnet 2, metal), both "mgmt-plane reserved" (10 IPs each). 
Both sit + OUTSIDE the FIP pool (10.12.5.0-10.12.7.254) and the VIP /26 blocks -> no conflict with + provider-ext-fip. FOLD into docs/netbox-vip-queue.md + D-003 in the docs sub-pass (purpose + annotation pending operator confirmation; do NOT mutate NetBox until IPAM design is confirmed -- D-010). - Transitional note: MAAS already carried the front-loaded VIP reservations (.2-.63 provider + .8.2-.63 metal; old D-020 .8.224-.254 gone) ahead of the bundle's interim .50-.60 VIPs -- harmless (a reserved range blocks future auto-assign, does not evict diff --git a/runbooks/phase-05-octavia-enablement.md b/runbooks/phase-05-octavia-enablement.md index d89fb27..61d0293 100644 --- a/runbooks/phase-05-octavia-enablement.md +++ b/runbooks/phase-05-octavia-enablement.md @@ -91,6 +91,16 @@ fresh -> download+checksum+upload+retrofit). For a FIRST live run in a new environment you may stop after the seed to eyeball before the multi-minute build. +SEED METHOD (canonical): stage-and-verify (download + sha256-vs-published-SHA256SUMS + +`openstack image create --file`) is CANONICAL here -- it carries provenance verification, +works for any source, and unifies with the phase-08 kube-image seed (FINDING-3). This +SUPERSEDES the 2026-06-16 "web-download canonical" ruling: web-download cannot checksum-verify +the fetched file and is infeasible for the azimuth CDN (urllib UA 403). Web-download is retained +as a TESTED ALTERNATIVE in appendix-A. Note the staged base lands QCOW2 (legacy `--file` does +NOT run glance's import conversion -- CORRECTION-1); that is fine, the retrofit consumes the +qcow2 base and emits the raw `octavia-amphora` OUTPUT (the config gate's image-format=raw is on +the retrofit OUTPUT, not the base). 
+ ```bash # Tunables (operator-confirm the first two for your environment): BASE_IMG_URL="https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img" @@ -171,20 +181,26 @@ admin-scope failover) is D-011 criterion 4 -- run in phase-08 (needs tenant scaffolding + the external provider network from phase-04). -## As-built reference (2026-06-03 run -- audit trail) +## As-built reference (current rebuild 2026-06-16; per-deploy values regenerate -- old IDs are not discrepancies) - octavia/0: octavia 14.0.0, charm rev 441 2024.1/stable, on 3/lxd/3, data leg 10.12.12.1; multi-homed (reaches provider VIPs over eth1). -- configure-resources op 15 / task 16 completed (--wait=20m). Created lb-mgmt-net - (d1ee4bca-...), lb-mgmt-subnetv6 (1c1f50df-..., IPv6 geneve), lb-mgmt-sec-grp (acbacb21-...). - o-hm0 fc00:9c49:5b4e:cf23:f816:3eff:fead:56df/64, br-int port. -- amphora: retrofit is metal-only (10.12.8.172) -> internal glance VIP 10.12.8.53. - base jammy-amphora-base uploaded (f8b48cdb-...); retrofit op 19/task 20 built - amphora-haproxy-x86_64-ubuntu-22.04-20260603 (4e4a94ac-...), ACTIVE, tag octavia-amphora - (matches octavia amp-image-tag). image-format raw. +- configure-resources op 9 / task 10 completed (--wait=20m; 06-03 snapshot: op 15/task 16). + Created lb-mgmt-net / lb-mgmt-subnetv6 (IPv6 geneve) / lb-mgmt-sec-grp; o-hm0 UP, IPv6-ULA + fc00:3f8c:7162:d105:f816:3eff:feea:7e45/64 (06-03: fc00:9c49:...:56df; the ULA regenerates per deploy). +- amphora: retrofit is metal-only -> internal glance VIP 10.12.8.53. base jammy-amphora-base + = da757cb1-... (untagged; 06-03: f8b48cdb-...); retrofit op 13/task 14 (06-03: op 19/task 20) + built amphora-haproxy-x86_64-ubuntu-22.04-20260616 = ca5552a5-... ACTIVE, tag octavia-amphora + (matches octavia amp-image-tag), image-format raw, ~6.2 GB, owned by the services project + (06-03 OUTPUT: 4e4a94ac-...). +- mgmt VM image pre-staged for phase-06: ubuntu-24.04-noble = 899b4b5c-... (public, os props). 
+- SEED METHOD this rebuild vs canonical: the base + noble were seeded via WEB-DOWNLOAD this + rebuild (the 06-16 expedient). Canonical going forward is STAGE-AND-VERIFY (Step 5.2 header); + web-download is a tested alternative (appendix-A). The web-downloaded base landed raw (import + conversion ran); a staged --file base lands qcow2 (CORRECTION-1) and is equally fine for the retrofit. - Charm gap (parked): glance-simplestreams-sync is metal-only and cannot reach glance on a no-DNS deploy (use-internal-endpoints steers keystone auth but not the - glance/swift client) -> gss does NOT seed the base. The base is seeded manually - (above) and the amphora BUILD stays charm-native via the retrofit over internal + glance/swift client) -> gss does NOT seed the base. The base is seeded per Step 5.2 + and the amphora BUILD stays charm-native via the retrofit over internal endpoints. Roosevelt root-fix: cloud DNS + FQDN-valid certs (also fixes gss). ## Next diff --git a/runbooks/phase-06-incloud-mgmt-cluster.md b/runbooks/phase-06-incloud-mgmt-cluster.md index da84bd5..6c5b87c 100644 --- a/runbooks/phase-06-incloud-mgmt-cluster.md +++ b/runbooks/phase-06-incloud-mgmt-cluster.md @@ -30,13 +30,13 @@ ## Constants and env-literals (TAG: regenerate/confirm per site on rebuild) Literals below are tagged `ENV(...)` so the later generalization pass is mechanical. Discover everything else dynamically at run time. 
-- `ENV(project)` capi-mgmt (id 674171fd28d446d3a37073b6a761e910) -- `ENV(ext-net)` provider-ext (id 70b34bb2-3afb-4b43-96d3-f520dbcbf9a8) -- `ENV(image)` ubuntu-24.04-noble (id c66342ce-f402-4e6e-a324-ae27032396d7) +- `ENV(project)` capi-mgmt (resolve by name; this rebuild id d5bc125c7c1841d389b76cd0a7b0a915, domain capi) +- `ENV(ext-net)` provider-ext (resolve by name; this rebuild id 0d00ddc1-d2bf-4849-a087-14c07d77f167) +- `ENV(image)` ubuntu-24.04-noble (resolve by name; this rebuild id 899b4b5c-d8f6-4df4-860b-a9210d0eefe8) - `ENV(flavor)` gp.large (16384 MB / 4 vCPU / 80 GB) - `ENV(mgmt-cidr)` 10.20.0.0/24 (capi-mgmt-subnet; overlay, non-IPAM) - `ENV(keystone-vip)` 10.12.4.50:5000 (the gate target -- the deployed VIP) -- `ENV(mgmt-fip)` 10.12.7.40 (assigned in 6.2; apiserver SAN) +- `ENV(mgmt-fip)` assigned in 6.2 (apiserver SAN; resolve dynamically. This rebuild capi-mgmt-v2 = 10.12.5.103, tenant 10.20.0.107; the old 10.12.7.40 / 10.20.0.45 was the pre-teardown mgmt VM -- DOCFIX-038) - `ENV(pod-cidr)` 10.1.0.0/16 `ENV(svc-cidr)` 10.152.183.0/24 (snap defaults; non-colliding) - `ENV(capi-tag)` 0.25.1 (capi-helm-charts release; dependencies.json source) @@ -44,7 +44,8 @@ - `# RUN: jumphost` -- on vopenstack-jesse as jessea123, admin-openrc sourced. - `# RUN: mgmt VM` -- shipped to the VM over SSH via the FIP (heredoc below). - VM SSH form (used verbatim throughout; DOCFIX-021 `/dev/null \ && echo "[OK] project capi-mgmt (domain $PROJ_DOMAIN)"; } - echo "=== role: $OS_USERNAME gets MEMBER on capi-mgmt (as-built grant; OS_PROJECT_ID blocks in 6.x/7.8/8.x) ===" - openstack role assignment list --user "$OS_USERNAME" --user-domain "$OS_USER_DOMAIN_NAME" \ - --project capi-mgmt --project-domain "$PROJ_DOMAIN" -f value 2>/dev/null | grep -q . 
\ - && echo "[SKIP] role assignment present" \ - || { openstack role add --user "$OS_USERNAME" --user-domain "$OS_USER_DOMAIN_NAME" \ - --project capi-mgmt --project-domain "$PROJ_DOMAIN" member \ - && echo "[OK] member role on capi-mgmt"; } + echo "=== roles: $OS_USERNAME gets member + load-balancer_member + reader on capi-mgmt (DOCFIX-036 / D-039) ===" + # D-039 ROOT CAUSE: magnum mints the per-cluster app-cred carrying the TRUSTOR's roles, + # FROZEN at mint, and delegates ALL trustor roles unfiltered. If admin@admin_domain holds + # only `member` here, CAPO's app-cred 403s on Octavia (needs load-balancer_member) and the + # workload cluster wedges at API-LB provisioning. Grant all three so future mints carry LB + # authority. (load-balancer_member + reader are keystone/Octavia default roles.) + for ROLE in member load-balancer_member reader; do + if openstack role assignment list --user "$OS_USERNAME" --user-domain "$OS_USER_DOMAIN_NAME" \ + --project capi-mgmt --project-domain "$PROJ_DOMAIN" --role "$ROLE" -f value 2>/dev/null | grep -q .; then + echo "[SKIP] $ROLE already on capi-mgmt" + else + openstack role add --user "$OS_USERNAME" --user-domain "$OS_USER_DOMAIN_NAME" \ + --project capi-mgmt --project-domain "$PROJ_DOMAIN" "$ROLE" \ + && echo "[OK] $ROLE on capi-mgmt" + fi + done echo "=== flavors (as-built specs; public -- verified live 2026-06-10 pre-teardown) ===" for spec in "gp.large 4 16384 80" "gp.mid 2 8192 40" "capi.node 2 4096 40" \ @@ -114,19 +131,32 @@ && echo "[OK] $1 ($2 vcpu / $3 MB / $4 GB)"; } done - echo "=== mgmt VM image ubuntu-24.04-noble (verify-or-import; glance-direct; HOME-staged, L7) ===" + echo "=== mgmt VM image ubuntu-24.04-noble (verify-or-seed; STAGE-AND-VERIFY canonical; HOME-staged, L7) ===" if openstack image show ubuntu-24.04-noble >/dev/null 2>&1; then echo "[SKIP] image ubuntu-24.04-noble exists" else - SRC="$HOME/noble-server-cloudimg-amd64.img" - [ -f "$SRC" ] || { echo "ABORT: $SRC missing (re-fetch: 
cloud-images.ubuntu.com/noble/current/)"; exit 1; } - glance image-create-via-import \ - --import-method glance-direct \ - --file "$SRC" \ - --container-format bare --disk-format qcow2 \ - --visibility public \ - --property os_distro=ubuntu --property os_version=24.04 \ - --name ubuntu-24.04-noble + # Stage-and-verify (FINDING-3): download to $HOME (snap-readable; NOT /tmp -- L7) if missing/ + # checksum-stale, verify sha256 vs the published SHA256SUMS, then client-safe import via the + # openstack snap (--import == glance-direct; image-conversion lands it raw). NOT the standalone + # `glance` client (unconfirmed on this jumphost). + IMG_URL="https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img" + SUM_URL="https://cloud-images.ubuntu.com/noble/current/SHA256SUMS" + IMG_FILE="noble-server-cloudimg-amd64.img"; SRC="$HOME/$IMG_FILE" + EXP=$(curl -fsSL "$SUM_URL" | awk -v f="$IMG_FILE" '$2=="*"f || $2==f {print $1}') + [ -n "$EXP" ] || { echo "GATE FAIL: no published checksum for $IMG_FILE"; exit 1; } + if [ -f "$SRC" ] && [ "$(sha256sum "$SRC" | awk '{print $1}')" = "$EXP" ]; then + echo "[OK] staged noble present + checksum-valid; skipping download" + else + echo "[..] downloading noble to $SRC (snap-readable; NOT /tmp)" + wget -q -O "$SRC" "$IMG_URL" + GOT=$(sha256sum "$SRC" | awk '{print $1}') + [ "$EXP" = "$GOT" ] || { echo "GATE FAIL: checksum mismatch exp='$EXP' got='$GOT'"; exit 1; } + echo "[OK] checksum verified ($GOT)" + fi + openstack image create ubuntu-24.04-noble \ + --file "$SRC" --import \ + --container-format bare --disk-format qcow2 --public \ + --property os_distro=ubuntu --property os_version=24.04 fi # as-built (verified live 2026-06-10): visibility=public, os_distro=ubuntu, os_version=24.04, # stored raw in Ceph via the bundle's glance image-conversion=true. 
@@ -215,21 +245,29 @@ openstack server show capi-mgmt-v2 -f value -c status -c addresses echo "=== floating ip on provider-ext, associate to the VM ===" FIP=$(openstack floating ip create "$EXT" -f value -c floating_ip_address) - echo "allocated FIP=$FIP # expect this to be 10.12.7.40 on a clean run -- ENV(mgmt-fip)" openstack server add floating ip capi-mgmt-v2 "$FIP" + # tenant (fixed) IP = the server address that is NOT the FIP (single-NIC VM has exactly the two) + TENANT_IP=$(openstack server show capi-mgmt-v2 -f json \ + | FIP="$FIP" python3 -c "import os,json,sys; a=json.load(sys.stdin).get('addresses',{}) or {}; ips=[ip for net in a.values() for ip in net]; print(next((ip for ip in ips if ip!=os.environ['FIP']), ''))") + [ -n "$TENANT_IP" ] || { echo "ABORT: could not resolve tenant IP"; exit 1; } + # PERSIST both (single source for 6.3-6.6 -- PATTERN-1; the FIP is pool-allocated + the tenant + # IP DHCP-assigned, so NEITHER is deterministic per rebuild -- never hardcode them) + printf 'MGMT_FIP=%s\nMGMT_TENANT_IP=%s\n' "$FIP" "$TENANT_IP" | tee ~/capi-mgmt-net.env openstack server show capi-mgmt-v2 -f value -c addresses } ) ``` -Note: the tenant IP lands on `10.20.0.45` and the FIP on `10.12.7.40` on the -as-built run. If the FIP differs on rebuild, carry the new value into 6.4 -(`extra-sans`) and 6.5 (kubeconfig server) and phase-07 (conductor kubeconfig). +Note (DOCFIX-038): the FIP is pool-allocated and the tenant IP is DHCP-assigned -- NEITHER is +deterministic (this rebuild: FIP 10.12.5.103, tenant 10.20.0.107; the pre-teardown VM was +10.12.7.40 / 10.20.0.45). Step 6.2 persists both to `~/capi-mgmt-net.env`; 6.3-6.6a source it, +and phase-07 (conductor kubeconfig) uses the same FIP. Do not hardcode either value. ## Step 6.3 -- GATE 1: OS-level egress (before any k8s investment) `# RUN: mgmt VM` This is the premise of D-035. PROCEED ONLY IF VIP-OK. 
```bash +source ~/capi-mgmt-net.env # MGMT_FIP, MGMT_TENANT_IP (written by 6.2) ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ - -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 bash -s <<'REOF' + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@"$MGMT_FIP" bash -s <<'REOF' set -u echo "=== VM -> Keystone VIP 10.12.4.50:5000 ===" # ENV(keystone-vip) timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.4.50/5000' && echo VIP-OK || echo VIP-FAIL @@ -250,15 +288,18 @@ from stdin). ```bash +source ~/capi-mgmt-net.env # MGMT_FIP, MGMT_TENANT_IP (written by 6.2) ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ - -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 bash -s <<'REOF' + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@"$MGMT_FIP" \ + bash -s "$MGMT_FIP" "$MGMT_TENANT_IP" <<'REOF' set -euo pipefail +MGMT_FIP="$1"; MGMT_TENANT_IP="$2" # passed from the jumphost (extra-sans must be the real FIP + tenant IP) echo "=== install k8s snap 1.32-classic/stable ===" sudo snap install k8s --classic --channel=1.32-classic/stable /dev/null <<'CFG' +sudo tee /root/bootstrap-config.yaml >/dev/null < ~/capi-mgmt.kubeconfig + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@"$MGMT_FIP" \ + "sudo k8s config server=https://$MGMT_FIP:6443 ~/capi-mgmt.kubeconfig # [SENSITIVE] ~/capi-mgmt.kubeconfig contains a cluster-admin credential. wc -l ~/capi-mgmt.kubeconfig ; head -1 ~/capi-mgmt.kubeconfig # expect >0 lines, "apiVersion: v1" ``` @@ -320,7 +362,7 @@ ## Step 6.6 -- CAPI provider stack (pinned to dependencies.json; D-034) `# RUN: mgmt VM` Run VM-side as root with `KUBECONFIG=/root/kubeconfig` (local -apiserver 10.20.0.45:6443) so the matched 1.32.13 kubectl is used -- avoids the +apiserver = the VM's tenant IP:6443) so the matched 1.32.13 kubectl is used -- avoids the jumphost kubectl's +3-minor skew. 
Versions are READ from the tag's dependencies.json, never hardcoded (D-034). The as-built pins are in the reference block below as a known-good cross-check only. @@ -339,14 +381,15 @@ by 6.6b-6.6f (same jumphost shell). ```bash # define the mgmt-VM connection once (reused by 6.6b-6.6f) -MGMT_VM=10.12.7.40 +source ~/capi-mgmt-net.env # MGMT_FIP, MGMT_TENANT_IP (written by 6.2) +MGMT_VM="$MGMT_FIP" SSH_OPTS="-i $HOME/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10" ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF' set -euo pipefail sudo apt-get update -qq helm/clusterctl/kubectl need no sudo +# kubeconfig for the local apiserver (the VM's own tenant IP:6443), readable by ubuntu -> helm/clusterctl/kubectl need no sudo mkdir -p "$HOME/.kube"; sudo k8s config "$HOME/.kube/config"; chmod 600 "$HOME/.kube/config" # egress pre-check (the VM pulls charts/binaries/manifests from these) @@ -470,7 +513,9 @@ - Proceed to phase-07 (conductor graft). ## As-built reference (2026-06-08/09 run -- audit trail; values are run-specific) -- VM `capi-mgmt-v2`: gp.large, ubuntu-24.04-noble; tenant IP 10.20.0.45 (ens3); FIP 10.12.7.40. +- VM `capi-mgmt-v2`: gp.large, ubuntu-24.04-noble; tenant IP + FIP are per-rebuild (this rebuild + 10.20.0.107 ens3 / FIP 10.12.5.103; 2026-06-08/09: 10.20.0.45 / 10.12.7.40). 6.2 persists both + to ~/capi-mgmt-net.env. - Net `capi-mgmt-net` / subnet `capi-mgmt-subnet` 10.20.0.0/24; router `capi-mgmt-router`. - k8s-snap: 1.32-classic/stable, rev 5326, v1.32.13 (classic confinement); CNI Cilium 1.17.12-ck0. - pod CIDR 10.1.0.0/16; svc CIDR 10.152.183.0/24; cluster DNS 10.152.183.31. 
diff --git a/runbooks/phase-07-conductor-graft.md b/runbooks/phase-07-conductor-graft.md index d36a162..e9e8355 100644 --- a/runbooks/phase-07-conductor-graft.md +++ b/runbooks/phase-07-conductor-graft.md @@ -10,9 +10,10 @@ Decisions: D-031 (driver/engine/surface), D-037 (conf.d drop-in + config-dir via /etc/default, NOT a systemd ExecStart drop-in), D-042 (driver must be -contract-coherent with the Layer-A core; amends D-034). D-036 (driver/engine/ -chart coherence). Troubleshooting: appendix-A DOCFIX-021, D-037, D-042, and -lessons L-P6-1..4. +contract-coherent with the Layer-A core; amends D-034), D-036 (driver/engine/ +chart coherence), D-046 (magnum trustee domain-setup; REQUIRED manual step -- Step 7.0), +D-047 (keystone v3 drop-in for magnum-api -- Step 7.7b). Troubleshooting: appendix-A +DOCFIX-021, D-037, D-042, and lessons L-P6-1..4. --- @@ -20,19 +21,22 @@ - phase-06 EXIT GATE passed: `capi-mgmt-v2` Ready, CAPI stack up (ORC `Image` CRD present, no crash-looping CAPO), `~/capi-mgmt.kubeconfig` (server = FIP) works from the jumphost. -- Magnum charm live (`magnum/0`); the Keystone trustee domain is auto-configured by the - magnum charm via its keystone (identity-credentials) relation -- verify [trust] - (trustee_domain_id / trustee_domain_admin_id / trustee_domain_admin_password) is - populated in magnum.conf; no manual step. +- Magnum charm live (`magnum/0`) and related to keystone. The charm RENDERS magnum.conf + `[trust]` (trustee_domain_name=magnum, trustee_domain_admin_name=magnum_domain_admin, + password) from the identity-credentials relation, but it does NOT create the keystone + domain/user those names reference -- that is the MANUAL `domain-setup` action (Step 7.0, + D-046). `[trust]` being populated is NOT sufficient; magnum reports "Unit is ready" + whether or not the domain exists, and the omission 403s every `coe` op. Step 7.0 creates + AND asserts the domain/user. - `admin-openrc` on the jumphost; `juju` (model openstack); `jq`. 
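The trustee prerequisite above can be checked from the jumphost before any `coe` call. The following is a minimal sketch, not the charm's or Step 7.0's own assertion: the function name `trustee_ready` is hypothetical, the domain and user names are the ones the prerequisite cites from magnum.conf `[trust]`, and it assumes `admin-openrc` is already sourced.

```bash
# Sketch (hypothetical helper): assert the keystone objects that magnum.conf [trust]
# references by NAME actually exist. A populated [trust] block does not prove this.
# Assumes admin-openrc is sourced; defaults are the names given in the prerequisite.
trustee_ready() {
  local domain="${1:-magnum}" user="${2:-magnum_domain_admin}"
  # the trustee domain must exist
  openstack domain show "$domain" -f value -c id >/dev/null 2>&1 \
    || { echo "GATE FAIL: keystone domain '$domain' absent -- run Step 7.0"; return 1; }
  # the domain admin user must exist inside that domain
  openstack user show --domain "$domain" "$user" -f value -c id >/dev/null 2>&1 \
    || { echo "GATE FAIL: user '$user' absent in domain '$domain' -- run Step 7.0"; return 1; }
  echo "[OK] trustee domain '$domain' + user '$user' exist"
}
```

Usage: `trustee_ready || exit 1` after sourcing `admin-openrc`. It is read-only, so it is safe to re-run at any point in the phase.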
## Constants and env-literals (TAG: confirm per site on rebuild) -- `ENV(conductor-unit)` magnum/0 (LXD 1/lxd/2 on openstack1; addr 10.12.4.76) -- `ENV(conductor-src)` 10.12.4.76/32 (the conductor's provider IP; SG source) -- `ENV(mgmt-fip)` 10.12.7.40 (mgmt apiserver; kubeconfig server) +- `ENV(conductor-unit)` magnum/0 (LXD 1/lxd/2 on openstack1; addr 10.12.4.76 -- confirm per site) +- `ENV(conductor-src)` 10.12.4.76/32 (the conductor's provider IP; SG source -- confirm per site) +- `ENV(mgmt-fip)` per-rebuild (mgmt apiserver / kubeconfig server; source ~/capi-mgmt-net.env from phase-06 -- this rebuild 10.12.5.103; the old 10.12.7.40 is dead -- DOCFIX-038) - `ENV(mgmt-sg)` capi-mgmt-sg (in the capi-mgmt project) -- `ENV(project)` capi-mgmt (id 674171fd28d446d3a37073b6a761e910) -- `ENV(magnum-ns)` magnum-674171fd28d446d3a37073b6a761e910 (driver namespace per project) +- `ENV(project)` capi-mgmt (resolve by name; this rebuild id d5bc125c7c1841d389b76cd0a7b0a915, domain capi) +- `ENV(magnum-ns)` magnum- (driver namespace per project; this rebuild magnum-d5bc125c7c1841d389b76cd0a7b0a915) - `ENV(chart-ver)` 0.25.1 (capi-helm-charts; load-bearing -- driver default is 0.10.1) - `ENV(helm-ver)` v3.17.3 @@ -46,6 +50,38 @@ --- +## Step 7.0 -- Magnum trustee domain-setup (D-046; REQUIRED on every (re)deploy) +`# RUN: jumphost` The magnum charm action `domain-setup` is MANUAL and idempotent; magnum +reports active/"Unit is ready" REGARDLESS of whether the trustee domain exists. If the keystone +domain `magnum` + user `magnum_domain_admin` (referenced by magnum.conf `[trust]`) are absent, +`magnum/common/policy.py` 401s on EVERY policy-enforced request -> every `coe` op 403s (the +2026-06-17 incident; the 2026-06-11 redeploy omitted this and it stayed latent until the first +coe call). Run here, AFTER magnum + identity-service are related, and BEFORE any coe call +(Step 7.9 / phase-08). 
No magnum restart needed (domain_admin_auth resolves by NAME; +trustee_domain_id is recomputed per request). + +Step A -- create the trustee domain (charm-native; idempotent; takes no parameters): +```bash +juju run magnum/leader domain-setup mgmt apiserver reachability: ```bash -# RUN: jumphost -> magnum/0 +# RUN: jumphost -> magnum/0 (FIP from phase-06's ~/capi-mgmt-net.env -- never hardcode; DOCFIX-038) +source ~/capi-mgmt-net.env # MGMT_FIP juju ssh -m openstack magnum/0 \ - "timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.7.40/6443' && echo TCP-OK || echo TCP-FAIL" /dev/tcp/$MGMT_FIP/6443' && echo TCP-OK || echo TCP-FAIL" it (/usr/bin IS on the restricted init PATH). Checksum-verified. +juju ssh -m openstack magnum/0 'set -e + WANT=v3.17.3 + if [ -x /usr/bin/helm ] && /usr/bin/helm version --short 2>/dev/null | grep -q "$WANT"; then + echo "[SKIP] /usr/bin/helm already $WANT" + else + T=helm-$WANT-linux-amd64.tar.gz + D=$(mktemp -d); cd "$D" + curl -fsSLO "https://get.helm.sh/$T" + EXP=$(curl -fsSL "https://get.helm.sh/$T.sha256sum" | cut -d" " -f1) + GOT=$(sha256sum "$T" | cut -d" " -f1) + [ -n "$EXP" ] && [ "$EXP" = "$GOT" ] || { echo "GATE FAIL: helm checksum exp=$EXP got=$GOT"; exit 1; } + tar xzf "$T" + sudo install -o root -g root -m 0755 linux-amd64/helm /usr/local/bin/helm + sudo ln -sfn /usr/local/bin/helm /usr/bin/helm + cd /; rm -rf "$D" + echo "[OK] installed $(/usr/bin/helm version --short)" + fi' magnum/0` The charm renders `auth_version = v2.0` in magnum.conf +`[keystone_authtoken]`/`[keystone_auth]` (a template type-compare bug; Caracal keystone does +not serve v2.0). On THIS deploy it is COSMETIC -- magnum's domain_admin_auth rewrites v2.0->v3 +and token validation worked throughout -- but v2.0 is the provably wrong value, so override it +with a drop-in (D-047). 
Same config-dir mechanism as Step 7.7, but for the magnum-API service: +Step 7.7 wired `--config-dir` only for the conductor, and oslo.config reads `--config-dir` AFTER +`--config-file`, so the drop-in wins. v3 URLs are DERIVED from the live `[keystone_authtoken]` +(no hardcoded VIPs). No restart here -- Step 7.8 restarts both services. +```bash +juju ssh -m openstack magnum/0 sudo bash -s <<'REOF' +set -e +# (1) wire --config-dir into magnum-api (mirror Step 7.7's conductor wiring; idempotent) +grep -q -- '--config-dir /etc/magnum/magnum.conf.d' /etc/default/magnum-api 2>/dev/null \ + || echo 'DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"' >> /etc/default/magnum-api +chmod 0644 /etc/default/magnum-api +# (2) derive v3 URLs from the live [keystone_authtoken] block; write the override drop-in +WWW=$(awk -F'= ' '/^\[keystone_authtoken\]/{s=1} s&&/^www_authenticate_uri/{print $2; exit}' /etc/magnum/magnum.conf) +AURL=$(awk -F'= ' '/^\[keystone_authtoken\]/{s=1} s&&/^auth_url/{print $2; exit}' /etc/magnum/magnum.conf) +WWW3=${WWW/\/v2.0//v3}; case "$WWW3" in */v3) ;; *) WWW3="${WWW3%/}/v3";; esac +AURL3=${AURL/\/v2.0//v3}; case "$AURL3" in */v3) ;; *) AURL3="${AURL3%/}/v3";; esac +printf '[keystone_authtoken]\nauth_version = v3\nwww_authenticate_uri = %s\nauth_url = %s\n[keystone_auth]\nauth_version = v3\nwww_authenticate_uri = %s\nauth_url = %s\n' \ + "$WWW3" "$AURL3" "$WWW3" "$AURL3" > /etc/magnum/magnum.conf.d/50-keystone-v3-override.conf +chmod 0644 /etc/magnum/magnum.conf.d/50-keystone-v3-override.conf +echo "[OK] 50-keystone-v3-override.conf:"; cat /etc/magnum/magnum.conf.d/50-keystone-v3-override.conf +REOF +``` +GATE: the drop-in lists `auth_version = v3` + `/v3` URLs in BOTH sections, and +`grep -- --config-dir /etc/default/magnum-api` returns the line. The effective value is +proven in Step 7.8 by the magnum-api launched cmdline carrying `--config-dir` (L-P6-1/2: +gate on the assembled cmdline, not the file text). 
Restart happens in Step 7.8. + ## Step 7.8 -- Restart conductor + verify driver + HEALTHY (P6e + D-042 Stage 6) `# RUN: jumphost -> magnum/0`, then jumphost health poll. ```bash juju ssh -m openstack magnum/0 \ - 'sudo systemctl restart magnum-conductor && sleep 3 && systemctl is-active magnum-conductor && \ - ps -ww -C magnum-conductor -o args=' /dev/null | grep capi || \ echo "driver list (full):"; sudo magnum-driver-manage list-drivers' /dev/null)" echo " reason=$(openstack coe cluster show capi-test-1 -f value -c health_status_reason 2>/dev/null)" @@ -269,12 +369,12 @@ ## Step 7.9 -- Regression check (confirm create/manage path intact) `# RUN: jumphost` (capi-mgmt scope). Prove the upgraded driver still creates+deletes. -FRESH DEPLOY ROUTING: SKIP this step -- the `capi-k8s-v1-32` template does not exist +FRESH DEPLOY ROUTING: SKIP this step -- the `capi-k8s-v1-34` template does not exist yet (phase-08 step 8.0 creates it), and phase-08 itself (create `capi-test-1` to CREATE_COMPLETE, full acceptance, then 8.5 delete) is a superset of this check. Run 7.9 as written only when grafting onto an existing cloud where the template is present. ```bash -openstack coe cluster create capi-fix-check --cluster-template capi-k8s-v1-32 \ +openstack coe cluster create capi-fix-check --cluster-template capi-k8s-v1-34 \ --keypair capi-mgmt-key --master-count 1 --node-count 1 # watch to CREATE_COMPLETE, then: openstack coe cluster delete capi-fix-check # watch to gone @@ -310,13 +410,13 @@ DEB magnum 18.0.1, python3.10, container ubuntu 22.04; conductor user `magnum`. - As-FIRST-built driver: 1.3.0 (pip --no-deps) -> read the version-less v1beta2 ref -> health UNHEALTHY (D-042). PHASE-07 BASELINE supersedes this with the RELEASED magnum-capi-helm==1.4.0 (api_resources; default v1beta1). -- kubeconfig: /etc/magnum/kubeconfig, -rw------- magnum, ~5657 bytes, server = FIP 10.12.7.40:6443. 
+- kubeconfig: /etc/magnum/kubeconfig, -rw------- magnum, ~5657 bytes, server = the mgmt FIP:6443 (per-rebuild; this rebuild 10.12.5.103, old 10.12.7.40 dead). - conf.d drop-in /etc/magnum/magnum.conf.d/00-capi-helm.conf: kubeconfig_file, helm_chart_repo (azimuth), helm_chart_name openstack-cluster, default_helm_chart_version 0.25.1 (api_resources left default -- v1beta1 served by CAPI v1.13.2 / CAPO v0.14.4). - config-dir injection: /etc/default/magnum-conductor `DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"`; verified live via `ps` and the init script `show-args`. -- helm v3.17.3 at /usr/local/bin/helm. +- helm v3.17.3 at /usr/local/bin/helm + /usr/bin/helm symlink (DOCFIX-035: on the conductor's restricted init PATH). - Driver internals (reference, from installed source): routes on (server_type vm, os ubuntu, coe kubernetes); k8s version comes from the IMAGE `kube_version` property (NOT a template label), os_distro=ubuntu; flavor floor 2048 MB / 2 vCPU; auto-mints an app credential (workload nodes use @@ -324,5 +424,5 @@ ## Next phase-08 -- workload-cluster acceptance: create a tenant cluster from template -`capi-k8s-v1-32`, confirm CREATE_COMPLETE + Ready nodes + Calico + LB, and run the +`capi-k8s-v1-34`, confirm CREATE_COMPLETE + Ready nodes + Calico + LB, and run the D-011 (amended per D-019) acceptance criteria. diff --git a/runbooks/phase-08-workload-cluster-acceptance.md b/runbooks/phase-08-workload-cluster-acceptance.md index 5ac8a0b..5976ce4 100644 --- a/runbooks/phase-08-workload-cluster-acceptance.md +++ b/runbooks/phase-08-workload-cluster-acceptance.md @@ -1,7 +1,7 @@ # Phase 08 -- Workload-Cluster Acceptance (D-011) Prove tenant self-service Kubernetes end to end: create a workload cluster from -the `capi-k8s-v1-32` template, confirm it converges (Ready nodes, CNI, CCM/CSI, +the `capi-k8s-v1-34` template, confirm it converges (Ready nodes, CNI, CCM/CSI, API LB), then run the D-011 acceptance bar. 
Passing D-011 is the gate that unlocks the project-completion tasks. @@ -25,12 +25,12 @@ (8.2 health gate; 8.1-8.5 create path). On an existing-cluster graft, `health_status` already reports HEALTHY (if the phase-07 1.4.0 upgrade was skipped, expect the COSMETIC UNHEALTHY of D-042 -- functional, but not an acceptance pass). -- Image `ubuntu-jammy-kube-v1.32.13` present AND carrying Glance properties - (8.0 below verifies, and on a fresh deploy imports it from the jumphost-staged qcow2) - `kube_version` (e.g. v1.32.13) and `os_distro=ubuntu`. The driver reads the k8s +- Image `ubuntu-jammy-kube-v1.34.8` present AND carrying Glance properties + (8.0 below verifies, and on a fresh deploy stage-and-verifies it from the azimuth CDN -- + FINDING-3) `kube_version` (v1.34.8) and `os_distro=ubuntu`. The driver reads the k8s version from the IMAGE, not a template label (P6-CONTRACT / L-P6-3); a missing - property fails create. -- Cluster template `capi-k8s-v1-32` present (8.0 verifies/creates it). + property fails create. (D1: bumped from EOL v1.32.13 to v1.34.8, within CAPI v1.13.2 support.) +- Cluster template `capi-k8s-v1-34` present (8.0 verifies/creates it). - D-039: the Magnum service path mints app-creds carrying `load-balancer_member` (+ member, reader). A frozen pre-D-039 app-cred 403s on the Octavia LB step and wedges create/delete (appendix-A: stuck-delete). @@ -39,29 +39,31 @@ hyperconverged hosts and OOM-kills guests. 
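The D-039 prerequisite above (the trustor must hold `member`, `load-balancer_member` and `reader` on capi-mgmt BEFORE the app-cred is minted) can be pre-checked read-only. This is a hedged sketch: the function name `missing_trustor_roles` is hypothetical, the `role assignment list` flags mirror the phase-06 grant loop, and it assumes an admin-scoped shell.

```bash
# Sketch (hypothetical helper): print the D-039 roles the trustor is MISSING on a project.
# Empty output means all three are present, so a freshly minted app-cred will carry LB authority.
# Args: <user> <user-domain> <project> <project-domain>
missing_trustor_roles() {
  local user="$1" user_domain="$2" project="$3" project_domain="$4" role missing=""
  for role in member load-balancer_member reader; do
    # a non-empty listing means the assignment exists
    openstack role assignment list --user "$user" --user-domain "$user_domain" \
      --project "$project" --project-domain "$project_domain" --role "$role" -f value 2>/dev/null \
      | grep -q . || missing="$missing $role"
  done
  echo "${missing# }"
}
```

Usage: `M=$(missing_trustor_roles "$OS_USERNAME" "$OS_USER_DOMAIN_NAME" capi-mgmt capi); [ -z "$M" ] || echo "GATE FAIL: missing roles: $M"`. This only inspects the trustor's current roles; an app-cred minted before the grant stays frozen with its old role set.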
## Constants and env-literals (TAG: confirm per site / run on rebuild) -- `ENV(project)` capi-mgmt (id 674171fd28d446d3a37073b6a761e910) +- `ENV(project)` capi-mgmt (resolve by name; this rebuild id d5bc125c7c1841d389b76cd0a7b0a915, domain capi) - `ENV(admin-project)` admin (id 65ce73e6798e4d1e8dd066609b7033ef) -- `ENV(template)` capi-k8s-v1-32 (uuid e2549d8b-4b89-4947-8b9a-0f4fdbe87d59) -- `ENV(image)` ubuntu-jammy-kube-v1.32.13 (id de69c243-bd1f-4182-8e9e-33933e926857) -- `ENV(ext-net)` provider-ext (id 70b34bb2-3afb-4b43-96d3-f520dbcbf9a8) +- `ENV(template)` capi-k8s-v1-34 (D1; uuid regenerates per rebuild -- resolve by name) +- `ENV(image)` ubuntu-jammy-kube-v1.34.8 (D1; kube_version v1.34.8; id regenerates -- resolve by name) +- `ENV(ext-net)` provider-ext (resolve by name; this rebuild id 0d00ddc1-d2bf-4849-a087-14c07d77f167) - `ENV(keypair)` capi-mgmt-key - `ENV(cluster)` capi-test-1 - `ENV(workload-cidr)` 10.20.16.0/24 - `ENV(flavors)` master gp.mid (8192/2) ; worker capi.node (4096/2) - run-specific (do NOT hardcode -- capture at run): API LB id, LB VIP (10.20.16.x), - workload API FIP (10.12.7.180 on the as-built run). + workload API FIP (10.12.7.180 on the 2026-06-09 as-built run; per-rebuild). ## Scope-hygiene preambles (the project-scope leak guard) -Capi-mgmt-scoped (cluster CRUD, show, config): +Capi-mgmt-scoped (cluster CRUD, show, config). 
DOCFIX-034: resolve the capi-mgmt project id +dynamically while admin-scoped, THEN narrow to it -- never hardcode (it regenerates per rebuild): ```bash source ~/admin-openrc +CAPI_PID=$(openstack project show capi-mgmt --domain capi -f value -c id) # ENV(project) unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID OS_PROJECT_DOMAIN_ID OS_PROJECT_DOMAIN_NAME -export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 # ENV(project) +export OS_PROJECT_ID="$CAPI_PID" ``` Admin-scoped (LB amphora/failover -- these 403 under tenant member scope): ```bash source ~/admin-openrc -unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME # token -> admin 65ce73e6... +unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME # token -> admin (the admin-openrc project) ``` --- @@ -76,60 +78,77 @@ ( { set -u echo "=== image present + carries kube_version / os_distro ===" - openstack image show ubuntu-jammy-kube-v1.32.13 -f json \ + openstack image show ubuntu-jammy-kube-v1.34.8 -f json \ | python3 -c 'import json,sys;d=json.load(sys.stdin);p=d.get("properties",d);print("kube_version=",d.get("kube_version") or p.get("kube_version"));print("os_distro=",d.get("os_distro") or p.get("os_distro"))' echo "=== reserved-host-memory (D-040) on a compute unit ===" juju ssh nova-compute/0 'sudo grep -i reserved_host_memory /etc/nova/nova.conf' /dev/null \ + openstack coe cluster template show capi-k8s-v1-34 -f value -c uuid 2>/dev/null \ && echo "template OK" || echo "template ABSENT -- create it below" } ) ``` -If the image is ABSENT (fresh deploy -- nothing survives teardown), import it from -the jumphost-staged qcow2. The command is the VERBATIM 2026-06-08 as-executed path -(glance-direct; plain web-download 403s on this cloud). 
With the hardened bundle's -glance `image-conversion: true` the stored disk_format lands `raw` on the redeploy -(expected -- the as-built run stored qcow2 because conversion was off then): +If the image is ABSENT (fresh deploy -- nothing survives teardown), seed it by +STAGE-AND-VERIFY (FINDING-3 -- REQUIRED, not merely preferred, for azimuth kube images): +glance's web-download plugin fetches with urllib (User-Agent `Python-urllib/3.x`) and the +azimuth CDN returns HTTP 403 to that UA, so a web-download import 202-accepts then hangs in +`queued` forever. curl sends a different UA and is NOT blocked. So curl the qcow2 to the +jumphost ($HOME -- snap-readable, NOT /tmp, L7), verify sha512 against the azimuth-images +0.28.0 manifest, then `openstack image create --file --import` (client-safe: the openstack snap +HAS `image create --import` = glance-direct and image-conversion lands it `raw`; it does NOT +have standalone `image stage`/`image import` subcommands, and the standalone `glance` client is +not assumed present): ```bash ( { set -u source ~/admin-openrc - if openstack image show ubuntu-jammy-kube-v1.32.13 >/dev/null 2>&1; then - echo "[SKIP] image ubuntu-jammy-kube-v1.32.13 present" + IMG_NAME=ubuntu-jammy-kube-v1.34.8 # ENV(image) + KUBE_VER=v1.34.8 # driver reads this from the image, not a label + if openstack image show "$IMG_NAME" >/dev/null 2>&1; then + echo "[SKIP] image $IMG_NAME present" else - SRC="$HOME/ubuntu-jammy-kube-v1.32.13-260401-2014.qcow2" - [ -f "$SRC" ] || { echo "ABORT: $SRC missing on the jumphost (azimuth-images source; see appendix-B)"; exit 1; } - glance image-create-via-import \ - --import-method glance-direct \ - --file "$SRC" \ + # azimuth-images 0.28.0 manifest (build 260518-1604) -- re-confirm vs manifest.json on any bump: + URL="https://azimuth-images.stackhpc.cloud/ubuntu-jammy-kube-v1.34.8-260518-1604.qcow2" + 
SHA512_EXP="7efde4857c9f9da045a98d71def30e229b3d7fffd8a5680e8aee0c5a8b13ba73fca3cf758a927230a1fbe3c451d8d21cfaeded96091e2a4f313c6a404760bdb3" + SRC="$HOME/ubuntu-jammy-kube-v1.34.8-260518-1604.qcow2" + if [ -f "$SRC" ] && [ "$(sha512sum "$SRC" | cut -d' ' -f1)" = "$SHA512_EXP" ]; then + echo "[OK] staged image present + sha512-valid; skipping download" + else + echo "[..] curl the qcow2 to $SRC (curl UA passes the CDN; glance urllib UA 403s -- FINDING-3)" + curl -fSL -o "$SRC" "$URL" + GOT=$(sha512sum "$SRC" | cut -d' ' -f1) + [ "$SHA512_EXP" = "$GOT" ] || { echo "GATE FAIL: sha512 mismatch exp=$SHA512_EXP got=$GOT"; exit 1; } + echo "[OK] sha512 verified against the azimuth-images 0.28.0 manifest" + fi + # CORRECTION-1: a plain --file (no --import) PUT stores qcow2 (boots fine); --import runs + # glance-direct + image-conversion -> raw (Ceph fast-clone alignment), so use --import here. + openstack image create "$IMG_NAME" \ + --file "$SRC" --import \ --container-format bare --disk-format qcow2 \ - --property os_distro=ubuntu --property kube_version=v1.32.13 \ - --name ubuntu-jammy-kube-v1.32.13 + --property os_distro=ubuntu --property kube_version="$KUBE_VER" fi - echo "=== poll to active (3.7G stage + conversion; allow ~10 min) ===" + echo "=== poll to active (multi-GB stage + conversion; allow ~10 min) ===" for i in $(seq 1 40); do - ST=$(openstack image show ubuntu-jammy-kube-v1.32.13 -f value -c status 2>/dev/null || echo '?') + ST=$(openstack image show "$IMG_NAME" -f value -c status 2>/dev/null || echo '?') echo "[$i] status=$ST" [ "$ST" = active ] && break sleep 15 done } ) ``` -GATE: image `active` and the 8.0 property check above passes (kube_version -v1.32.13 / os_distro ubuntu). Then create the template only if absent (spec from -the as-built capture; the two labels -are intentionally the whole config -- chart 0.25.1 + the conf.d drop-in govern the -rest). 
`--network-driver` is OMITTED deliberately: under the 1.4.0 driver the option -IS honored (it maps to the chart `network_driver`), so to keep the as-built chart -default (Calico) we leave it unset. Setting `flannel` here would now switch the CNI -- -do that only if Calico is being intentionally replaced (appendix-A: CNI-label / 1.4.0). +GATE: image `active` and the 8.0 property check above passes (kube_version v1.34.8 / +os_distro ubuntu). Then create the template only if absent. DOCFIX-032: pin +`--network-driver calico` EXPLICITLY. Under the 1.4.0 driver `--network-driver` maps to the +chart `network_driver`, and chart 0.25.1 ships ONLY Calico (flannel is not packaged) -- an +explicit `calico` documents intent and removes reliance on the default staying Calico. Do NOT +set `flannel`: it is unsupported by chart 0.25.1 and would fail to converge. ```bash -openstack coe cluster template create capi-k8s-v1-32 \ +openstack coe cluster template create capi-k8s-v1-34 \ --coe kubernetes --server-type vm \ - --image ubuntu-jammy-kube-v1.32.13 \ + --image ubuntu-jammy-kube-v1.34.8 \ --external-network provider-ext \ --master-flavor gp.mid --flavor capi.node \ --master-lb-enabled --floating-ip-enabled \ + --network-driver calico \ --dns-nameserver 8.8.8.8 \ --docker-storage-driver overlay2 \ --labels fixed_subnet_cidr=10.20.16.0/24,octavia_provider=amphora @@ -142,7 +161,7 @@ ```bash openstack coe cluster create capi-test-1 \ - --cluster-template capi-k8s-v1-32 \ + --cluster-template capi-k8s-v1-34 \ --keypair capi-mgmt-key \ --master-count 1 --node-count 2 openstack coe cluster show capi-test-1 -f value -c uuid -c status @@ -170,17 +189,19 @@ `# RUN: jumphost`. Pull the cluster's kubeconfig via Magnum, then inspect. 
```bash # capi-mgmt scope +mkdir -p ~/capi-test-1 # DOCFIX-037: `coe cluster config --dir` does NOT create the dir openstack coe cluster config capi-test-1 --dir ~/capi-test-1 --force export KUBECONFIG=~/capi-test-1/config -# LIVE-REVIEW: confirm `coe cluster config` returns a usable kubeconfig under the -# capi-helm driver; alternative is the CAPI kubeconfig secret on the mgmt cluster: -# KUBECONFIG=~/capi-mgmt.kubeconfig clusterctl -n get kubeconfig +# confirmed: `coe cluster config` returns a usable kubeconfig under the capi-helm driver. +# Alternative (CAPI kubeconfig secret on the mgmt cluster), magnum-ns resolved dynamically: +# NS=magnum-$(openstack project show capi-mgmt --domain capi -f value -c id) +# KUBECONFIG=~/capi-mgmt.kubeconfig clusterctl -n "$NS" get kubeconfig ( { export KUBECONFIG=~/capi-test-1/config - echo "=== nodes (expect 3 Ready, v1.32.13: 1 control-plane + 2 workers) ===" + echo "=== nodes (expect 3 Ready, v1.34.8: 1 control-plane + 2 workers) ===" kubectl get nodes -o wide - echo "=== CNI = Calico (chart default; --network-driver omitted) ===" + echo "=== CNI = Calico (DOCFIX-032: --network-driver calico pinned on the template) ===" kubectl -n kube-system get pods | grep -Ei 'calico|tigera' || kubectl get pods -A | grep -Ei 'calico|tigera' echo "=== CCM (OpenStack cloud-controller-manager) + Cinder CSI + CoreDNS Running ===" kubectl get pods -A | grep -Ei 'cloud-controller|openstack-cloud|cinder-csi|coredns' @@ -201,26 +222,42 @@ `juju status --format=short | grep -vE 'active|idle' || echo "all active/idle"` Pass: nothing but active/idle (phase-03 re-confirmed here). -- **D-011.2 -- API reachability from the jumphost (all public VIPs).** `# RUN: jumphost` - IP-only: hit each service VIP, e.g. Keystone: +- **D-011.2 -- API reachability from the jumphost (CORE service VIPs).** `# RUN: jumphost` + IP-only: hit each CORE service VIP, e.g. Keystone: `curl -sk https://10.12.4.50:5000/v3 -o /dev/null -w '%{http_code}\n'` (expect 200/300). 
- Repeat per public VIP (.50-.60 block). Pass: all respond. + Repeat per core public VIP (.50-.60 block: keystone .50, barbican .51, cinder .52, glance .53, + magnum .54, neutron .55, nova .56, octavia .57, horizon .58/.60, placement .59). DOCFIX-039: + product-streams / glance-simplestreams (gss) is NOT a core API VIP -- it registers a unit-IP + HTTP endpoint (this rebuild 10.12.8.196) with NO jumphost route to the container space, so it is + EXPECTED unreachable from the jumphost and is OUT OF SCOPE for D-011.2. Pass: all core VIPs respond. -- **D-011.3 -- API reachability from a tenant VM (Option B).** `# RUN: mgmt VM` - The generalized phase-06 GATE 1: a tenant VM reaches the provider VIP. - `ssh ... ubuntu@10.12.7.40 "timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.4.50/5000' && echo VIP-OK || echo VIP-FAIL" mgmt VM` + The generalized phase-06 GATE 1: a tenant VM reaches the provider VIP. DOCFIX-038: the mgmt + FIP is per-rebuild -- source it (never hardcode the dead 10.12.7.40): + `source ~/capi-mgmt-net.env` + `ssh ... ubuntu@"$MGMT_FIP" "timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.4.50/5000' && echo VIP-OK || echo VIP-FAIL" ` -> watch - ERROR/PENDING_UPDATE -> ACTIVE (~100s; single STANDALONE amphora -> brief blip; - operating_status holds ONLINE). (appendix-A: LB-failover; amphora ops are - admin-scope only.) Pass: round-robin distributes; failover returns to ACTIVE. - TODO (before sign-off): this runbook does NOT yet contain the build steps for the - standalone 2-member round-robin pool (LB + listener + pool + 2 backend members + - health monitor). Add them here, or fold the round-robin check into the - workload-cluster API LB the driver already builds, before D-011.4 is marked complete. + DOCFIX-040 -- do NOT hand-build a standalone LB/listener/pool/members. 
Exercise round-robin via + a THROWAWAY Kubernetes `Service type=LoadBalancer` on the workload cluster: the OpenStack CCM + provisions an Octavia LB + pool + members for it automatically (the Roosevelt-real path -- tenant + workloads get LBs exactly this way), then tear it down. `# RUN: jumphost, KUBECONFIG=~/capi-test-1/config` + ```bash + export KUBECONFIG=~/capi-test-1/config + kubectl create deploy rr --image=registry.k8s.io/e2e-test-images/agnhost:2.40 --replicas=2 -- /agnhost netexec --http-port=8080 + kubectl expose deploy rr --port=80 --target-port=8080 --type=LoadBalancer + kubectl get svc rr -w # Ctrl-C once EXTERNAL-IP is assigned (CCM builds the Octavia LB + FIP) + EXT=$(kubectl get svc rr -o jsonpath='{.status.loadBalancer.ingress[0].ip}') + for i in $(seq 1 10); do curl -s "http://$EXT/hostname"; echo; done # expect BOTH pod names (round-robin) + kubectl delete svc rr; kubectl delete deploy rr # tears down the Octavia LB + ``` + Failover/recovery (admin scope -- against the workload-cluster API LB): `openstack loadbalancer + failover ` -> watch ERROR/PENDING_UPDATE -> ACTIVE (~100s; single STANDALONE amphora + -> brief blip; operating_status holds ONLINE). STANDALONE failover needs N+1 amphora placement + headroom (it builds the replacement BEFORE reaping the old -- a cloud at its scheduler ceiling + cannot self-heal its LBs; Roosevelt sizing implication). (appendix-A: LB-failover; amphora ops + are admin-scope only.) Pass: round-robin distributes across both members; failover returns to ACTIVE. - **D-011.5 -- End-to-end Magnum CAPI cluster create, CCM not crash-looping.** Satisfied by 8.1-8.3 (CREATE_COMPLETE + CCM Running). Pass = that gate. @@ -251,8 +288,8 @@ frozen app-cred): clear the OpenStackCluster finalizer (the Cluster auto-follows), then manual neutron cleanup in dependency order -- appendix-A: stuck-delete. 
```bash -# NS=magnum-674171fd28d446d3a37073b6a761e910 -# KUBECONFIG=~/capi-mgmt.kubeconfig kubectl -n $NS patch openstackcluster - \ +# NS=magnum-$(openstack project show capi-mgmt --domain capi -f value -c id) # resolve; never hardcode +# KUBECONFIG=~/capi-mgmt.kubeconfig kubectl -n "$NS" patch openstackcluster - \ # --type=merge -p '{"metadata":{"finalizers":[]}}' # then: openstack router remove subnet / router unset external-gateway / router delete / # subnet delete / network delete / security group delete (dependency order) @@ -262,13 +299,21 @@ ## EXIT GATE (phase-08 / v1 acceptance) - 8.1-8.3 passed: capi-test-1 CREATE_COMPLETE, 3 Ready nodes, Calico, CCM/CSI/CoreDNS, API LB ACTIVE/ONLINE. -- D-011 items 1-7 PASS; item 8 deferred (D-019). -- health_status HEALTHY (phase-07 driver). -- => v1 deployment is ACCEPTED. Project-completion tasks unlocked: - consolidate the do-doc runbooks into docs/v1-deploy-runbook.md; revert the - GitBucket repo OpenStack/openstack-caracal-ipv4 to PRIVATE. +- D-011 items 1-6 PASS; item 7 (KVM snapshot baseline) OUTSTANDING -- it is the last gate before + the accept-gate formally closes (D-012; dedicated pass); item 8 deferred (D-019). +- health_status HEALTHY (phase-07 1.4.0 driver clears the D-042 cosmetic UNHEALTHY). +- ACCEPTANCE SUMMARY (this rebuild): .1 charms PASS; .2 core VIPs PASS; .3 tenant->VIP PASS; + .4 Octavia round-robin + admin-scope failover PASS; .5 E2E CAPI create PASS; .6 vault manual + unseal PASS; .7 snapshot DEFERRED (operator); .8 Designate DEFERRED (D-019). => v1 is + FUNCTIONALLY ACCEPTED; the .7 snapshot baseline is the only item left to formally close the gate. +- => Project-completion tasks unlocked: consolidate the per-phase runbooks into + docs/v1-deploy-runbook.md; revert the GitBucket repo OpenStack/openstack-caracal-ipv4 to PRIVATE. 
-## As-built reference (capi-test-1, suffix kgwwe7c4qj6a, 2026-06-09) +## As-built reference (capi-test-1, suffix kgwwe7c4qj6a, 2026-06-09 -- PRE-D1 v1.32.13 capture) +- D1 NOTE: the procedure above now targets capi-k8s-v1-34 / ubuntu-jammy-kube-v1.34.8. This + capture is the 2026-06-09 v1.32.13 run (the D-011 acceptance ran on v1.32.13); re-validation on + v1.34.8 follows the stage-and-verify seed (8.0). A later D-039-era recreate carried CAPI suffix + qmyxu2xcsghz (CREATE_COMPLETE, HEALTHY). - create: `--master-count 1 --node-count 2`; uuid 6de15cf4-8805-4ac2-b413-8de2c48d92cf. - nodes: control-plane (xsc62) + 2 workers; v1.32.13; Calico CNI. - API LB id 0f968008-8429-4ac3-8b82-452e126982cf, VIP 10.20.16.144, FIP 10.12.7.180, diff --git a/runbooks/v1-ops-capi-recovery-procedure-20260610.md b/runbooks/v1-ops-capi-recovery-procedure-20260610.md deleted file mode 100644 index 01e6886..0000000 --- a/runbooks/v1-ops-capi-recovery-procedure-20260610.md +++ /dev/null @@ -1,241 +0,0 @@ -# v1 ops -- CAPI/Magnum stack recovery procedure (parking, restart, LB repair) - -Status: blocks below are AS-EXECUTED-VERIFIED 2026-06-10 (this is their first -formal consolidation). Destination: runbooks/ as an ops companion to the -phase-NN deploy runbook, cross-referenced from appendix-A and from -OpenStack_Test_Deployment-restart-procedure.md. - -Applies when: capi-mgmt-v2 has been stopped (parking, host event, OOM) and the -CAPI/Magnum stack must be returned to service. ORDER MATTERS: repair from the -bottom up (VM -> k8s -> CAPI controllers -> Octavia LB -> CAPO conditions -> -Magnum health). Everything upstream stays red until the layer below is green. - -Scope-hygiene preambles are the canonical ones from the 2026-06-09 as-executed -log. ENV literals: project capi-mgmt 674171fd28d446d3a37073b6a761e910; mgmt FIP -10.12.7.40; kube-api LB 0f968008-...; regenerate per site on rebuild. - ---- - -## 0. 
Expectations table (read FIRST; saves an hour of false alarms) - -| Observation | Meaning | -|---|---| -| Magnum UNHEALTHY, reason EMPTY | Conductor cannot reach the mgmt API (VM down / booting). Not D-042. | -| Magnum UNHEALTHY, reason populated, all components 'Ready', infrastructure 'Infrastructure resource not found.' | D-042 cosmetic false-negative. Known good. | -| Horizon Container Infra 504 right after mgmt VM start | Conductor stalled mid-reconnect; nginx proxy timeout. Retry after Step 3. | -| k8sd control.socket deadline / apiserver TLS handshake timeout / mount failures during first ~20 min after boot | Cold-start convergence noise on gp.mid (2 vCPU). Judge by load trend + `k8s status`, not by these. | -| Cluster Available=False with InfrastructureReady LB-timeout message after a cold start | CAPO reconcile raced the storm. Check the LB (Step 4) BEFORE blaming CAPI. | -| LB provisioning ERROR, operating ONLINE | Control-plane op failed; dataplane fine. Needs admin failover (Step 5). No urgency. | -| openstack server list empty in Horizon/CLI | Wrong project scope. CAPI VMs live in capi-mgmt. | -| juju ssh: "cannot get discharge ... EOF" | Stale macaroon + `/dev/null) - echo "[$i] status=$ST" - [ "$ST" = ACTIVE ] && break - sleep 10 - done - echo "=== TCP probe loop: FIP :22 (sshd lags ACTIVE by ~3 min) ===" - for i in $(seq 1 18); do - timeout 5 bash -c 'exec 3<>/dev/tcp/10.12.7.40/22' 2>/dev/null \ - && { echo "[$i] SSH-PORT-OK"; break; } || echo "[$i] not yet" - sleep 10 - done -} ) ------------------------------------------------------------------------- -END runbook block ------------------------------------------------------------------------- -``` -GATE: SSH-PORT-OK. Timing (verified, gp.mid): ACTIVE ~20 s; sshd ~3.5 min. - -## 3. 
k8s-snap readiness (PATIENCE GATE) - -``` ------------------------------------------------------------------------- -BEGIN runbook block: mgmt k8s readiness poll (cold-start aware) ------------------------------------------------------------------------- -( { - for i in $(seq 1 15); do - echo "--- [$i] $(date -u +%H:%M:%S) ---" - ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ - -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \ - 'uptime; sudo k8s status 2>&1 100 on 2 vCPUs. Do NOT restart services or re-bootstrap inside -this window; the Section-0 noise is expected. (On the phase-06-spec gp.large, -expect substantially faster.) - -## 4. CAPI stack + LB verification (read-only; decides Step 5) - -``` ------------------------------------------------------------------------- -BEGIN runbook block: post-start CAPI + LB verify ------------------------------------------------------------------------- -( { - export KUBECONFIG="$HOME/capi-mgmt.kubeconfig" - kubectl get nodes -o wide - kubectl get pods -A | egrep 'capi-|capo-|cert-manager|orc-system|janitor|addon' - NS=magnum-674171fd28d446d3a37073b6a761e910 - kubectl -n "$NS" get cluster,openstackcluster,machines -} ) -# kubeconfig missing? 
Re-emit (phase-06 Step 6.5, verbatim): -# ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ -# -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \ -# "sudo k8s config server=https://10.12.7.40:6443 ~/capi-mgmt.kubeconfig -( { - source ~/admin-openrc - unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID - export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 - unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID - openstack loadbalancer list -f yaml -} ) ------------------------------------------------------------------------- -END runbook block ------------------------------------------------------------------------- -``` -DECISION: controllers Running + Machines Running + every LB provisioning=ACTIVE --> skip to Step 6. Any LB provisioning=ERROR (operating ONLINE is typical) --> Step 5. Cluster Available=False with an LB-timeout message -> the LB is the -cause; fix it first, the condition clears itself afterward. - -## 5. LB repair: zombie sweep, headroom, sequential failover - -5a. ZOMBIE/ORPHAN SWEEP (admin scope). Confirmed pattern, twice in one day: -failed failovers leave amphora servers with no Octavia DB row. Two variants: -ERROR server (failed spawn) and ACTIVE heartbeating zombie (health-manager logs -"missing from the DB ... An operator must manually delete it" every 10 s). 
- -``` ------------------------------------------------------------------------- -BEGIN runbook block: amphora orphan/zombie sweep (admin scope; verify-then-delete) ------------------------------------------------------------------------- -( { - source ~/admin-openrc - unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME - echo "=== octavia's amphora inventory (the DB truth) ===" - openstack loadbalancer amphora list -f yaml - echo "=== nova's amphora servers (compare; extras are orphans) ===" - openstack server list --all-projects --long -f yaml \ - | grep -B6 -A4 'amphora-haproxy' | grep -E '^(- | (ID|Name|Status)):' -} ) -# For each server whose amphora-NAME-uuid is ABSENT from the amphora list: -# 1) re-grep the amphora list for the uuid (ABORT if present) -# 2) openstack server delete # by UUID; name lookup is project-scoped -# Each deletion frees one amphora slot (charm-octavia: 1024 MB / 1 vCPU / 8 GB). ------------------------------------------------------------------------- -END runbook block ------------------------------------------------------------------------- -``` - -5b. HEADROOM CHECK. Failover transiently needs +1 amphora placement (replacement -is built BEFORE the old one is reaped). Scheduler ceiling per host = -physical_MB * ram_allocation_ratio(1.5) - reserved_host_memory(8192, D-040). -Verify at least one host clears Used + 1024 <= ceiling: -`openstack hypervisor list --long -f yaml | grep -E 'Hostname|Memory MB'`. -If no host clears: free 1024+ MB first (zombie sweep usually suffices; else -power off a disposable VM, e.g. a backend-* test instance). DO NOT retry -failover against NoValidHost -- each attempt mints another zombie. - -5c. FAILOVER, STRICTLY SEQUENTIAL (one slot of headroom = one failover at a -time; completion of each reaps its old amphora and re-frees the slot). 
- -``` ------------------------------------------------------------------------- -BEGIN runbook block: LB failover + poll (admin scope; v4 Arc D pattern) ------------------------------------------------------------------------- -( { - source ~/admin-openrc - unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME - LB= - openstack loadbalancer failover "$LB" - sleep 2 - for i in $(seq 1 60); do - prov=$(openstack loadbalancer show "$LB" -f value -c provisioning_status 2>/dev/null) - op=$( openstack loadbalancer show "$LB" -f value -c operating_status 2>/dev/null) - printf '%s prov=%s op=%s\n' "$(date +%T)" "${prov:-?}" "${op:-?}" - case "$prov" in - ACTIVE) echo "failover succeeded"; break ;; - ERROR) echo "failover FAILED -- read octavia-worker.log; do NOT retry blind"; break ;; - esac - sleep 10 - done -} ) ------------------------------------------------------------------------- -END runbook block ------------------------------------------------------------------------- -``` -Verified timing: ~108 s to ACTIVE; op holds ONLINE; VIP+FIP preserved (VIP port -is Octavia-owned). A 10-20 s fast-fail to ERROR = early-flow failure (usually -NoValidHost; see 5b). STANDALONE amphora = brief kube-api endpoint blip -mid-failover; nodes/pods unaffected. - -## 6. 
Top-of-stack verification - -``` ------------------------------------------------------------------------- -BEGIN runbook block: final verify (amphorae, CAPO condition, magnum health) ------------------------------------------------------------------------- -( { - source ~/admin-openrc - unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME - openstack loadbalancer amphora list -f yaml # all ALLOCATED - export KUBECONFIG="$HOME/capi-mgmt.kubeconfig" - NS=magnum-674171fd28d446d3a37073b6a761e910 - kubectl -n "$NS" get cluster,openstackcluster # Available=True (allow ~10 min post-failover for CAPO resync) - source ~/admin-openrc - unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID - export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 - unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID - openstack coe cluster show capi-test-1 -f value -c health_status - openstack coe cluster show capi-test-1 -f value -c health_status_reason -} ) ------------------------------------------------------------------------- -END runbook block ------------------------------------------------------------------------- -``` -SUCCESS = amphorae ALLOCATED; Cluster Available=True; Magnum reason POPULATED -with the D-042 cosmetic signature (or HEALTHY post-D-042-fix). Reload Horizon -Container Infra last. Workload check if desired: -`KUBECONFIG=~/capi-test-1-kc/config kubectl get nodes -o wide`.
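The Step 5b headroom rule (ceiling = physical_MB * 1.5 - 8192; a host qualifies when Used + 1024 <= ceiling) can be sketched as a small shell helper. This is a minimal sketch, not part of the runbook: the function name `amphora_headroom_ok` is hypothetical, and it assumes the two integer MB values are read by hand from the `openstack hypervisor list --long` output for each host.

```shell
# Hypothetical helper (NOT in the runbook): Step 5b scheduler-ceiling check.
# Constants are the ones the runbook states: ram_allocation_ratio 1.5,
# reserved_host_memory 8192 MB (D-040), one amphora = 1024 MB.
amphora_headroom_ok() {
  local physical_mb=$1
  local used_mb=$2
  local amphora_mb=1024
  local reserved_mb=8192
  # Shell arithmetic is integer-only, so the 1.5 ratio is written as *3/2.
  local ceiling=$(( physical_mb * 3 / 2 - reserved_mb ))
  if [ $(( used_mb + amphora_mb )) -le "$ceiling" ]; then
    echo "OK ceiling=${ceiling} free_after=$(( ceiling - used_mb - amphora_mb ))"
    return 0
  fi
  echo "NO-HEADROOM ceiling=${ceiling} short_by=$(( used_mb + amphora_mb - ceiling ))"
  return 1
}

# Example with made-up figures: a 65536 MB host with 80000 MB already placed.
amphora_headroom_ok 65536 80000
```

If every host reports NO-HEADROOM, the runbook's guidance applies unchanged: run the 5a zombie sweep (or power off a disposable VM) before attempting a failover, since each retry against NoValidHost mints another zombie.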