diff --git a/.gitignore b/.gitignore index e699e05..bc74216 100644 --- a/.gitignore +++ b/.gitignore @@ -42,3 +42,13 @@ passphrase.txt # Bundle backups (timestamped) bundle.yaml.bak-* + +# --- repo-sanitation sweep additions --- +/remove/ +*.kubeconfig +kubeconfig +vault-init*.txt +init.txt +*.bak +*.tmp +.DS_Store diff --git a/README.md b/README.md index ec91812..465cb4e 100644 --- a/README.md +++ b/README.md @@ -1,120 +1,111 @@ -# openstack-caracal-ipv4 — VR0 DC0 Omega Cloud (v1) +# openstack-caracal-ipv4 -- VR0 DC0 Omega Cloud (v1) -**Scope:** Charmed OpenStack Caracal (2024.1) IPv4-only testcloud deployment -on the 4-VM KVM lab, modeled in NetBox as **VR0 DC0 Omega Cloud**. +**Scope:** Charmed OpenStack Caracal (2024.1), IPv4-only, on the 4-VM KVM lab and +modeled in NetBox as **VR0 DC0 Omega Cloud**. This repository is the deployment +method: bundle, overlay, gated runbook, and validation scripts together describe +everything required to bring the cloud up from a clean MAAS-managed Juju model. It is +a rehearsal for the future bare-metal **Roosevelt** deployment; design choices favour +the transferable answer over the quick fix so the testcloud surfaces real production +requirements. -## v1 vs. v2 — read this first +## v1 vs. v2 -- read this first -This repository is the **v1 deliverable** — IPv4-only Caracal Charmed -OpenStack on the existing MAAS-provisioned network layout. v1 ships first -because the upstream router infrastructure is not yet IPv6-ready; deferring -on IPv6 lets v1 prove the bundle, Option B binding fix, Magnum CAPI graft, -Designate-from-day-one, and the hacluster relation pattern at testcloud scale -without waiting on network-side IPv6 readiness. - -**v2** adds IPv6 / dual-stack per the address-family matrix retained as -v2-scope decisions in `docs/design-decisions.md` (D-004, D-004a). 
v2 will -ship either as a sibling overlay in this repository (`overlays/v2-dualstack.yaml` -on a `v2` branch) or as a separate repository — TBD when v2 work begins. - -The IPv6 prefixes already imported into NetBox under VR0 DC0 remain in -NetBox as **Reservation status** to document the v2 intent without -implying they are active. See `netbox/ipv6-mark-reserved.py`. - -## Repository purpose - -This repository is the deployment method. Bundle, overlays, runbooks, and -validation scripts together describe everything required to bring up the -cloud from a clean MAAS-managed Juju model. Anyone with NetBox read access, -MAAS access, and the Juju controller can clone this repository and reproduce -the cloud. +v1 is IPv4-only Caracal on the existing MAAS-provisioned network layout; it ships first +because the upstream router infrastructure is not yet IPv6-ready. v2 adds IPv6 / +dual-stack (decisions D-004 / D-004a, retained as v2-scope in +`docs/design-decisions.md`) and will ship either as a sibling overlay on a `v2` branch +or as a separate repository (TBD when v2 begins). IPv6 prefixes already imported into +NetBox under VR0 DC0 remain at **Reservation** status to document v2 intent without +implying they are active (`netbox/ipv6-mark-reserved.py`). ## Source of truth -**NetBox is authoritative for IPAM.** Any IP, prefix, or VLAN value -referenced in this repository traces back to NetBox. The exception is -tenant per-project subnets, which under the v1 hybrid model (D-016) are -Neutron-managed within a NetBox-modeled upstream tenant pool — i.e., the -pool has NetBox standing, individual tenant subnets do not. +**NetBox is authoritative for IPAM.** Every IP, prefix, and VLAN referenced in this +repository traces back to NetBox. The exception is per-project tenant subnets, which +under the v1 hybrid model (D-016) are Neutron-managed inside a NetBox-modeled upstream +pool -- the pool has NetBox standing; individual tenant subnets do not. 
## Repository layout ``` openstack-caracal-ipv4/ -├── README.md # this file -├── bundle.yaml # canonical Charmed OpenStack bundle (IPv4) -├── overlays/ -│ └── vr0-dc0-testcloud.yaml # 4-VM lab specifics; num_units=1 + hacluster -├── runbooks/ -│ # (deprecated; see runbooks/deprecated/ - superseded by D-017 + D-018 + v1-do-doc-NN set) -│ ├── 01-destroy-model.md # destroy openstack model + verify -│ ├── 02-deploy.md # juju deploy + settle wait -│ ├── 03-vault-init.md # vault unseal + cert auth -│ ├── 04-magnum-domain.md # domain-setup action + keystone wiring -│ ├── 04a-capi-bootstrap-cluster.md # capi-mgmt VM deploy + k3s + CAPI + ORC (D-017) -│ ├── 05-magnum-capi-driver.md # pip install driver + kubeconfig + systemd -│ ├── 06-tenant-setup.md # project, user, openrc, app credentials -│ ├── 07-dns-zones.md # Designate zones + API VIP A records (v1) -│ └── 08-validate.md # Roosevelt-rehearsal validation criteria -├── scripts/ -│ ├── pre-flight-checks.sh # pre-deploy sanity checks -│ └── validate.sh # end-to-end validation runner -├── netbox/ -│ ├── README.md # what's here vs. 
what's deferred to v2 -│ ├── ipv4-prefixes-import.py # adds IPv4 prefixes + IPv4 tenant pool -│ └── ipv6-mark-reserved.py # marks IPv6 entries as Reservation (Q3) -└── docs/ - ├── design-decisions.md # architectural record (D-001 through D-019) - └── netbox-vip-queue.md # post-deploy NetBox imports (workstream 2) +|-- README.md this file +|-- bundle.yaml canonical Charmed OpenStack bundle (IPv4); the +| testcloud num_units / VIPs / hacluster are baked in +|-- overlays/ empty in git -- the only overlay (octavia-pki.yaml) +| is generated at deploy (phase-01) and is secret- +| bearing, so it is never committed +|-- runbooks/ the gated deploy runbook (phase-NN) + appendices +| |-- README.md runbook index, order, and conventions +| |-- phase-00-teardown-maas-reset.md +| |-- phase-01-bundle-deploy.md +| |-- phase-02-vault-bringup.md +| |-- phase-03-core-verify.md +| |-- phase-04-network-carve.md +| |-- phase-05-octavia-enablement.md +| |-- phase-06-incloud-mgmt-cluster.md +| |-- phase-07-conductor-graft.md +| |-- phase-08-workload-cluster-acceptance.md +| |-- appendix-A-troubleshooting.md +| \-- appendix-B-asbuilt-version-lock.md +|-- scripts/ +| |-- pre-flight-checks.sh pre-deploy sanity checks +| |-- validate.sh end-to-end validation runner +| \-- review-bundle.py bundle lint / review +|-- netbox/ +| |-- README.md what is imported vs. deferred to v2 +| |-- ipv4-prefixes-import.py IPv4 prefixes + IPv4 tenant pool +| \-- ipv6-mark-reserved.py marks IPv6 entries Reservation (v2 intent) +\-- docs/ + |-- design-decisions.md architectural record (D-NNN) + |-- netbox-vip-queue.md post-deploy NetBox imports + \-- v1-pre-deploy-fixes.md completed pre-deploy repo-hardening change list (D-019 series) ``` -## v1 deployment order +## Deploy order -The deploy is executed via the `runbooks/v1-do-doc-NN-*.md` execution documents in numeric order: +Run the `runbooks/phase-NN-*.md` documents in numeric order. 
Each phase ends in a hard +gate (an explicit pass/fail check); do not begin the next phase until the current gate +passes. The two appendices are reference, not steps. See `runbooks/README.md` for the +per-phase summary and the RUN-location conventions. -| Doc | Purpose | -|---|---| -| `v1-do-doc-01-prep.md` | Pre-flight state check (repo, openrc, MAAS state of 5 VMs) | -| `v1-do-doc-02-pki.md` | Octavia PKI overlay generation | -| `v1-do-doc-03-destroy.md` | Conditional model + MAAS teardown (clean state for rebuild) | -| `v1-do-doc-04-deploy.md` | `juju deploy` + settle wait + on-disk PKI verification | -| `v1-do-doc-05-vault-init.md` | Vault initialization + cert cascade + admin-openrc regeneration | -| `v1-do-doc-06-magnum-domain.md` | Magnum Keystone domain setup | -| `v1-do-doc-07-capi-bootstrap.md` | CAPI bootstrap cluster + workload pivot | -| `v1-do-doc-08-magnum-driver.md` | Magnum CAPI Helm driver graft | -| `v1-do-doc-09-tenant.md` | Tenant project/user/openrc + Snapshot 2 | -| `v1-do-doc-10-validate.md` | D-011 acceptance criteria + Snapshot 3 | +| Phase | Purpose | +| -------- | ---------------------------------------------------------------- | +| phase-00 | Teardown + MAAS reset (clean state for rebuild) | +| phase-01 | Bundle deploy (incl. Octavia PKI overlay generation) + settle | +| phase-02 | Vault bring-up (PKI root; cert cascade) | +| phase-03 | Core verify (settle, admin-openrc regeneration, Horizon) | +| phase-04 | Network carve (provider external network + IPAM reference) | +| phase-05 | Octavia enablement | +| phase-06 | In-cloud CAPI management cluster (D-035) | +| phase-07 | Magnum conductor graft (magnum-capi-helm driver; D-031/D-037/D-042) | +| phase-08 | Workload-cluster acceptance (D-011) | -NetBox imports are run separately (gated on external NetBox engineer review; see `netbox/README.md`). +NetBox imports run separately, gated on external NetBox-engineer review +(`netbox/README.md`). 
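The gated ordering described above can be sketched as follows. This is a minimal illustration, not part of the repository: the phase filenames are taken from the `runbooks/` layout, while `run_phases` and the `gate` callback are hypothetical names standing in for whatever pass/fail check each phase document defines.

```python
import re

# Phase documents as listed in the runbooks/ layout (appendices are
# reference material, not steps, so they are left out).
PHASE_DOCS = [
    "phase-02-vault-bringup.md",
    "phase-00-teardown-maas-reset.md",
    "phase-01-bundle-deploy.md",
    "phase-03-core-verify.md",
]


def phase_number(name):
    """Extract NN from a 'phase-NN-*.md' filename."""
    match = re.match(r"phase-(\d+)-", name)
    if match is None:
        raise ValueError("not a phase document: " + name)
    return int(match.group(1))


def run_phases(docs, gate):
    """Walk phases in numeric order and stop at the first failing gate.

    `gate` is a hypothetical callable returning True when the hard gate
    for the given phase document passes. Returns (completed, blocked),
    where `blocked` is None if every gate passed.
    """
    completed = []
    for doc in sorted(docs, key=phase_number):
        if not gate(doc):
            return completed, doc
        completed.append(doc)
    return completed, None
```

Calling `run_phases(PHASE_DOCS, gate)` with a gate that fails on phase-03 returns the first three phases as completed and names `phase-03-core-verify.md` as the blocking phase, which mirrors the "do not begin the next phase until the current gate passes" rule.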
-## v1-specific design decisions (summary; see docs/design-decisions.md for full record) +## Key v1 scope (full record in docs/design-decisions.md) -- **D-015 v1/v2 fork** — IPv4-only v1; IPv6/dual-stack v2 deferred -- **D-016 IPv4 tenant pool hybrid model** — NetBox owns upstream `/16` pool; - Neutron owns per-project subnets within it -- **D-003 Option B network architecture** — Provider `/22` carries both - ext_net FIPs (`10.12.4.10–.223`) and OpenStack public API VIPs - (`10.12.4.224–.254`) on the same L2 segment; fixes the tenant→API - unreachability that caused Magnum OCCM crashloop on Bobcat testcloud -- **D-005 Ceph Squid** — matches Caracal default; rehearses Roosevelt -- **D-006 Vault HA backend = etcd + easyrsa** -- **D-007 Magnum from day one** — charm in bundle + CAPI Helm driver graft -- **D-019 (supersedes D-008) DNS scope reduction for v1** — Designate deferred - to v2 alongside corporate DNS / NS-delegation work. Tenant subnets use public - DNS (`1.1.1.1` / `1.0.0.1`) directly via `--dns-nameserver`. - `*.cloud.neumatrix.local` FQDN tree remains internal-only, resolved via static - `/etc/hosts` on bootstrap-relevant hosts. -- **D-009 Hacluster relations included at num_units=1** — decorative on - testcloud; documents the relation pattern for Roosevelt scale-up -- **No OVN pinning on testcloud** — Roosevelt bare-metal will pin via `ovn-source` +- **D-015 v1/v2 fork** -- IPv4-only v1; IPv6 / dual-stack deferred to v2. +- **D-016 IPv4 tenant-pool hybrid** -- NetBox owns the upstream pool; Neutron owns + per-project subnets within it. +- **D-019 (supersedes D-008) DNS scope reduction** -- Designate deferred to v2; tenant + subnets use public DNS (1.1.1.1 / 1.0.0.1) via `--dns-nameserver`; the internal + `*.cloud.neumatrix.local` tree is resolved by static `/etc/hosts` on + bootstrap-relevant hosts. 
+- **D-020 dual provider+metal HA VIPs** -- API charms carry a VIP on both the provider + and metal spaces (front-loaded; exact values live in `bundle.yaml` / NetBox). +- **D-035 in-cloud management cluster** -- the CAPI / Magnum management cluster is a + single-homed in-cloud tenant VM (no out-of-cloud node, no clusterctl pivot). +- **D-031 / D-037 / D-042 Magnum KaaS** -- tenant self-service Kubernetes via Magnum + the + magnum-capi-helm driver + the azimuth capi-helm-charts engine; the driver pin must be + contract-coherent with the CAPI core (see Appendix B). +- **D-009 hacluster at num_units=1** -- decorative on testcloud; documents the relation + pattern for Roosevelt scale-up. -## v2-scope decisions (deferred — read but do not action in v1) +v2-scope decisions (D-004 dual-stack/IPv6 matrix; D-004a host-management-to-metal) are +recorded but NOT actioned in v1. -- **D-004 Dual-stack/IPv6-only matrix** — applies in v2 only -- **D-004a Host management → Metal (Option A)** — applies in v2 only; - v1 keeps openstack0-3 host management IPs on the storage fabric -- **VLAN modeling in NetBox** (Q2) — the VR0 DC0-VLANs group remains with - only VID 240 (OS-Provider) imported during prior session work; remaining - VLAN entries are deferred to v2 when actual VLAN tagging is in play. - Currently MAAS uses untagged-per-fabric, so the additional VLAN entries - would be misleading documentation +> `docs/design-decisions.md` is the authoritative decision record. If it lags the +> bundle/runbook (for example the D-020 VIP scheme or the D-028..D-042 series), +> reconcile it there. 
diff --git a/bundle.yaml b/bundle.yaml index b17ff09..bdeb77b 100644 --- a/bundle.yaml +++ b/bundle.yaml @@ -126,7 +126,8 @@ num_units: 1 # 3 on Roosevelt (D-009) to: [lxd:8] options: - vip: "10.12.4.10 10.12.8.10" # B1 front-loaded VIP; IS the catalog endpoint (B5, no os-public-hostname) + vip: "10.12.4.50 10.12.8.50" # B1 front-loaded VIP; IS the catalog endpoint (B5, no os-public-hostname) + use-policyd-override: true # as-built reconcile 2026-06-09 (origin untraced -- Review-later) bindings: *api-bindings constraints: arch=amd64 @@ -145,7 +146,8 @@ num_units: 1 to: [lxd:11] options: - vip: "10.12.4.13 10.12.8.13" # B1 + vip: "10.12.4.53 10.12.8.53" # B1 + image-conversion: true # as-built; image conversion enabled (raw on Ceph-backed glance) bindings: # api-bindings + ceph->storage (C2; glance is a Ceph client) "": metal public: provider @@ -180,7 +182,7 @@ options: console-access-protocol: novnc network-manager: Neutron - vip: "10.12.4.16 10.12.8.16" # B1 + vip: "10.12.4.56 10.12.8.56" # B1 bindings: *api-bindings constraints: arch=amd64 @@ -197,6 +199,7 @@ migration-auth-type: ssh resume-guests-state-on-host-boot: true virt-type: qemu # Testcloud nested-KVM; Roosevelt will use 'kvm' + reserved-host-memory: 8192 # ENV(testcloud 16GiB hosts) D-040 OOM fix; charm default 512 -- DO NOT drop bindings: # C2 ceph/ceph-access -> storage. OVN-on-data: neutron-plugin -> data "": metal # puts 'data' in this principal's binding set so ovn-chassis' data ceph: storage # binding is a valid SUBSET (subordinate subset rule). 
nova-compute is @@ -215,7 +218,7 @@ num_units: 1 to: [lxd:11] options: - vip: "10.12.4.19 10.12.8.19" # B1 + vip: "10.12.4.59 10.12.8.59" # B1 bindings: *api-bindings constraints: arch=amd64 @@ -237,7 +240,7 @@ enable-ml2-port-security: true flat-network-providers: physnet1 neutron-security-groups: true - vip: "10.12.4.15 10.12.8.15" # B1 + vip: "10.12.4.55 10.12.8.55" # B1 bindings: *api-bindings constraints: arch=amd64 @@ -298,7 +301,7 @@ options: block-device: None glance-api-version: 2 - vip: "10.12.4.12 10.12.8.12" # B1 + vip: "10.12.4.52 10.12.8.52" # B1 bindings: # api-bindings + ceph -> storage. cinder's container needs a storage NIC "": metal # for Ceph; binding the regular 'ceph' endpoint provisions it AND puts public: provider # 'storage' in cinder's binding set, so cinder-ceph's ceph->storage is a @@ -360,7 +363,7 @@ to: [lxd:8] options: source: *ceph-source - vip: "10.12.4.20 10.12.8.20" # B1 -- radosgw HA un-deferred for Roosevelt fidelity (decorative HA on testcloud) + vip: "10.12.4.60 10.12.8.60" # B1 -- radosgw HA un-deferred for Roosevelt fidelity (decorative HA on testcloud) bindings: # api-bindings + mon->storage (C2). radosgw IS externally-facing (S3/Swift API). "": metal public: provider @@ -378,7 +381,7 @@ to: [lxd:10] options: debug: "false" - vip: "10.12.4.18 10.12.8.18" # B1 -- browse HTTPS by IP (B5); ALLOWED_HOSTS must permit the VIP IP (verify at deploy) + vip: "10.12.4.58 10.12.8.58" # B1 -- browse HTTPS by IP (B5); ALLOWED_HOSTS must permit the VIP IP (verify at deploy) bindings: *api-bindings constraints: arch=amd64 @@ -409,7 +412,7 @@ # juju deploy ./bundle.yaml \ # --overlay overlays/vr0-dc0-testcloud.yaml \ # --overlay overlays/octavia-pki.yaml - vip: "10.12.4.17 10.12.8.17" # B1 + vip: "10.12.4.57 10.12.8.57" # B1 bindings: # api-bindings + ovsdb-cms -> data. 
octavia's CONTAINER needs a data NIC so "": metal # ovn-chassis-octavia can geneve-encap on the overlay; ovsdb-cms is a public: provider # REGULAR (octavia<->ovn-central) endpoint -- unused in the amphora-driver @@ -446,7 +449,7 @@ to: [lxd:11] options: openstack-origin: *openstack-origin - vip: "10.12.4.11 10.12.8.11" # B1 + vip: "10.12.4.51 10.12.8.51" # B1 bindings: *api-bindings constraints: arch=amd64 @@ -478,7 +481,7 @@ options: openstack-origin: *openstack-origin region: RegionOne - vip: "10.12.4.14 10.12.8.14" # B1 + vip: "10.12.4.54 10.12.8.54" # B1 bindings: *api-bindings constraints: arch=amd64 diff --git a/docs/design-decisions.md b/docs/design-decisions.md index 3e3aab3..995e473 100644 --- a/docs/design-decisions.md +++ b/docs/design-decisions.md @@ -32,7 +32,7 @@ | Charm group | Channel | | ----------------------------------------------------------------------------------------------------------------------- | -------------------------- | -| OpenStack core (keystone, glance, nova-\*, neutron-api, cinder, placement, octavia, barbican, designate, magnum) | `2024.1/stable` | +| OpenStack core (keystone, glance, nova-\*, neutron-api, cinder, placement, octavia, barbican, designate, magnum, vault) | `2024.1/stable` | | OVN (ovn-central, ovn-chassis, ovn-dedicated-chassis-octavia) | `24.03/stable` | | Ceph (ceph-mon, ceph-osd, ceph-radosgw if used) | `squid/stable` (see D-005) | | MySQL (mysql-innodb-cluster, mysql-router subordinates) | `8.0/stable` | @@ -139,12 +139,12 @@ - `juju run magnum/leader domain-setup --wait=10m` - pip install `magnum-capi-helm==1.1.0` from PyPI into the magnum charm venv with `--break-system-packages` (stackhpc/magnum-capi-helm fork archived Dec 2024; canonical project moved to `openstack/magnum-capi-helm` on opendev/PyPI; 1.1.0 is the last Caracal-cycle release. Upstream tests against Magnum 2023.1+, so backward-compatible through Caracal 2024.1.) 
-- Deploy `/etc/magnum/kubeconfig` pointing at `capi-mgmt.maas` bootstrap k3s +- Deploy `/etc/magnum/kubeconfig` pointing at the **workload cluster** (the post-pivot home of CAPI controllers per **runbook 04a §17** `clusterctl move`). Staged on jumphost at `$HOME/magnum-capi/capi-mgmt-cluster.kubeconfig` by runbook 04a §19, transferred to the magnum unit by runbook 05 §6. Bobcat had this pointing at bootstrap k3s because the pivot was never executed; workstream 3b (2026-05-22) made the pivot mandatory. - Systemd override replacing init.d ExecStart to load `--config-dir` - `/etc/magnum/magnum.conf.d/99-capi.conf` setting `enabled_drivers=k8s_capi_helm_v1` and `[capi_helm] kubeconfig_file=/etc/magnum/kubeconfig` (ASCII-only; non-ASCII characters in conf.d cause silent daemon failures) -**CAPI mgmt plane:** `capi-mgmt.maas` bootstrap k3s. Per **D-017**, this cluster is rebuilt from scratch every deployment cycle — there is no preserved-across-rebuild artifact. The install procedure for the bootstrap cluster lives in `runbooks/04a-capi-bootstrap-cluster.md` and runs **before** this runbook. This pattern transfers to Roosevelt unchanged. +**CAPI mgmt plane:** Post-pivot, the workload cluster IS the CAPI management plane (per **runbook 04a §17**, `clusterctl move` pivots cluster state from the `capi-mgmt.maas` bootstrap k3s into the workload cluster, which becomes self-managing). Per **D-017**, both the bootstrap k3s and the workload cluster are rebuilt from scratch every deployment cycle — there is no preserved-across-rebuild artifact. The bootstrap install + pivot procedure lives in `runbooks/04a-capi-bootstrap-cluster.md` and runs **before** this runbook. This pattern transfers to Roosevelt unchanged. **Superseded portions:** The "preserved across rebuild" stance in earlier drafts of this decision is **superseded by D-017**. See D-017 for rationale. 
The earlier `stackhpc/magnum-capi-helm` v0.13.0 driver pin is superseded by the `openstack/magnum-capi-helm` 1.1.0 pin above (driver source repo moved + archived). @@ -153,9 +153,7 @@ ## D-008: DNS architecture -**Status:** Superseded by D-019 (2026-05-27). v2-scope. Original decision text preserved below for audit. - -**Decision (original; superseded):** Layered — static /etc/hosts for bootstrap + Designate (in bundle from day one) for tenant-level resolution. +**Decision:** Layered — static /etc/hosts for bootstrap + Designate (in bundle from day one) for tenant-level resolution. **Naming convention:** @@ -222,12 +220,11 @@ 5. End-to-end Magnum CAPI cluster creation succeeds, including OCCM not crash-looping 6. Vault unseal + auto-unseal-after-reboot pattern verified 7. KVM snapshot baseline taken (Phase 5) +8. Designate zones populated and tenant VMs resolve API hostnames Validation script: `scripts/validate.sh` (TBD). -**Amendment (2026-05-27):** Per D-019, the "Designate resolves" criterion (former item 8) is removed for v1. Designate is deferred to v2; tenant subnets resolve via public DNS. v2 will reinstate a DNS-resolution validation criterion calibrated to whatever DNS mechanism is in place (NS delegation from corporate DNS, or otherwise). - --- @@ -261,11 +258,7 @@ **Decision:** Self-hosted GitBucket at `git.baldurkeep.com`. -**Repo path:** `OpenStack/openstack-caracal-ipv4` (v1; IPv4-only). - -- Web: `https://git.baldurkeep.com/OpenStack/openstack-caracal-ipv4` -- Clone: `https://git.baldurkeep.com/git/OpenStack/openstack-caracal-ipv4.git` -- Moved from `jesse.austin/openstack-caracal-ipv4` to the `OpenStack` group on 2026-05-27. GitBucket does not redirect from the old path. +**Repo path:** `jesse.austin/openstack-caracal-ipv4` (v1; IPv4-only). **v2 repository:** TBD when v2 work begins. Two viable paths: sibling repo `openstack-caracal-ipv6` or `openstack-caracal-dualstack`, OR `v2` branch in this repo with an `overlays/v2-dualstack.yaml`. 
The single-repo-with-branch approach preserves history of what changed v1→v2 together; the sibling-repo approach keeps v1 frozen as a reference once v2 is in motion. @@ -372,46 +365,179 @@ --- -## D-019: DNS scope reduction for v1 — Designate deferred to v2 +## D-019: Cloud DNS (Designate) deferred to v2 / Roosevelt -**Decision (2026-05-27):** Designate is removed from the v1 testcloud bundle and deferred to v2 alongside corporate DNS / NS delegation work. v1 tenant subnets resolve via public DNS (`1.1.1.1`, `1.0.0.1`) directly via the `--dns-nameserver` option at subnet-create time. +**Decision:** v1 ships with NO cloud-internal DNS; Designate is not deployed. Public service endpoints use FQDNs (`os-public-hostname`) that resolve to the provider VIPs via external/corporate DNS; internal and admin endpoints stay IP-based on the metal VIPs. Tenant instances use upstream resolvers (1.1.1.1 / 1.0.0.1). The D-011 acceptance bar is amended to drop the cloud-DNS criterion, and the planned `v1-do-doc-10-dns` runbook is dropped. -**Supersedes:** D-008 (DNS architecture). +**Consequence (documented, not a blocker):** metal-only charm units that make catalog-based client calls pull the PUBLIC (FQDN) endpoint and cannot resolve or route it (the internal-endpoint certs carry no FQDN SAN). This is the root of the gss/retrofit amphora-pipeline constraint recorded in D-021. The proper fix (cloud-internal DNS + FQDN-valid certs, or charms consuming internal endpoints) is a Roosevelt item. -**Amends:** D-011 (validation bar — removes "Designate resolves" criterion). +**Status:** Decided (v1). Reconstructed into this doc from the deploy record (no standalone D-019 file). -### Rationale +**Related:** D-008 (DNS architecture), D-021 (amphora-pipeline consequence), D-011 (acceptance bar amended). -Three findings from the 2026-05-27 testcloud topology investigation: +--- -1. **Outside-in DNS** (corporate clients resolving `*.cloud.neumatrix.local`) is not needed for v1. 
Corporate access to the cloud already flows through the existing `openstack.baldurkeep.com → 10.17.4.20 → 10.12.x` HTTPS proxy chain (handled by the edge nginx at `10.17.8.7`), which does not depend on corporate-side resolution of cloud-internal FQDNs. +## D-020: Dual provider + metal API VIPs on clustered charms -2. **The edge nginx cannot route to `10.12.x` directly.** Inspection confirmed the edge has only `10.17.8.7/22` plus a tailscale interface; reaching `10.12.4.x` requires the libvirt-host NAT path. Adding DNS to the testcloud would require parallel UDP/53 NAT/proxy plumbing across three hosts (edge nginx, libvirt host, internal nginx) for a feature that has no v1 consumer. +**Decision:** Every clustered OpenStack API application (keystone, glance, nova-cloud-controller, neutron-api, cinder, placement, barbican, octavia, openstack-dashboard, magnum, vault) is configured with BOTH a provider VIP and a metal VIP, as a space-separated pair: `vip: "10.12.4.X 10.12.8.X"` (Option B). -3. **Inside-out DNS** (tenant VMs resolving external names) is satisfied by tenant subnets pointing `--dns-nameserver` at public DNS (`1.1.1.1`, `1.0.0.1`). Designate is not needed in the inside-out path either, since: - - Tenant VMs do not need to resolve cloud-internal FQDNs (their API access goes through documented IPs / `--cloud` configs in cloud.conf) - - Cross-tenant DNS visibility is not a v1 requirement +**Rationale:** with a provider-only VIP, `charms_openstack/ip.py:resolve_address(INTERNAL)` returns `None` and raises `ValueError`, breaking `identity-service-relation-joined` (and the analogous internal-endpoint registration on every clustered API charm). Supplying a metal-network VIP alongside the provider VIP gives `resolve_address` an internal address to return, and keeps east-west service traffic on the metal network rather than the provider network. 
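A small review-time check for the dual-VIP rule might look like the sketch below. It assumes the provider network is `10.12.4.0/22` (the provider `/22` described in the earlier README text) and, purely as an assumption, that the metal network is `10.12.8.0/22`; the metal prefix length is not stated in this record. `split_vip` is a hypothetical helper name, not something in `scripts/review-bundle.py`.

```python
import ipaddress

# Provider /22 as described in the earlier README text.
PROVIDER_NET = ipaddress.ip_network("10.12.4.0/22")
# ASSUMPTION: the metal prefix length is not given in this record.
METAL_NET = ipaddress.ip_network("10.12.8.0/22")


def split_vip(vip):
    """Split a space-separated vip option into (provider, metal) addresses.

    Raises ValueError unless exactly one address falls in each network,
    which is the shape D-020 requires for clustered API charms.
    """
    addrs = [ipaddress.ip_address(part) for part in vip.split()]
    provider = [a for a in addrs if a in PROVIDER_NET]
    metal = [a for a in addrs if a in METAL_NET]
    if len(addrs) != 2 or len(provider) != 1 or len(metal) != 1:
        raise ValueError("expected one provider and one metal VIP: " + repr(vip))
    return provider[0], metal[0]
```

A provider-only value such as `"10.12.4.50"` is rejected, which is the configuration the rationale identifies as leaving no internal address to resolve.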
-The remaining v1 use case for Designate (FIP DNS auto-registration via the `neutron-api ↔ designate` integration) is informational only — nothing in v1 consumes those records. +**Status:** Decided (v1). Reconstructed into this doc from the deploy record (no standalone D-020 file). -### v1 implementation +**Related:** D-003 (network architecture), D-002 (channels). -- Tenant subnets created with `--dns-nameserver 1.1.1.1 --dns-nameserver 1.0.0.1` (or via the openrc `OS_DNS_NAMESERVERS` env) -- CAPI workload cluster template variable `OPENSTACK_DNS_NAMESERVERS` set to `1.1.1.1,1.0.0.1` (per `v1-do-doc-07-capi-bootstrap.md` §13) -- Cloud-internal `*.cloud.neumatrix.local` FQDN tree resolved via static `/etc/hosts` on bootstrap-relevant hosts (jumphost, openstack0-3, LXD containers per charm bootstrap, capi-mgmt — staged in `v1-do-doc-05-vault-init.md` §11 and `v1-do-doc-07-capi-bootstrap.md` §6) -- Charms continue to use FQDN-based `os-public-hostname` (cert SANs depend on it) — internal resolution via `/etc/hosts` is sufficient +--- -### v2 plan +## D-021: Octavia amphora image pipeline on the no-DNS dual-endpoint deploy -- Re-introduce Designate (charm + designate-bind + relations + hacluster sub) -- NS delegation from corporate DNS to designate-bind on a real (non-NAT) network VIP -- Tenant subnets transitioning to use Designate VIP as their resolver (after corporate DNS delegation lands) -- Designate v2 deploy on a real-network Roosevelt or v2-testcloud topology where the bridging-host complexity from v1 testcloud does not apply -- D-011 validation re-introduces a calibrated DNS-resolution criterion (mechanism TBD: NS delegation working end-to-end vs static A records at corporate DNS) +**Decision:** build the amphora image with the charm-native `octavia-diskimage-retrofit` set `use-internal-endpoints: true`, seeded by a manually uploaded stock Ubuntu base image carrying the five Glance properties the retrofit reads (architecture, os_distro, os_version, 
version_name, product_name). Park `glance-simplestreams-sync` for the amphora pipeline. The amphora image is `image-format: raw`, tagged `octavia-amphora` to match octavia's `amp-image-tag`. -### v2-residency note +**Root cause:** on the dual-endpoint, no-DNS topology (D-019), metal-net catalog-callers (gss + its retrofit subordinate) cannot reach Glance: the public Glance FQDN does not resolve/route from the metal net, and the internal-endpoint cert carries no FQDN SAN (so an `/etc/hosts` FQDN->metal-VIP mapping fails TLS). gss `use-internal-endpoints` steers only its Keystone auth to internal; its glance/swift clients still use the public FQDN and there is no further charm-native lever -- a charm gap on the no-DNS topology. The retrofit's `use-internal-endpoints` lever DOES cover its build path, so it is the charm-native amphora builder here. -The IPv6 prefixes already imported into NetBox (and marked Reservation status) include allocations that would be appropriate for Designate's VIPs in a v2 design — these stay in NetBox as Reservation until v2 work begins. +**Status:** Decided + validated end-to-end (v1): the retrofit, over internal endpoints, reads the seeded base and writes the amphora; gss parked; octavia + subordinates active/idle. + +**Roosevelt:** cloud-internal DNS + FQDN-valid certs removes the manual seed and fixes gss end to end. + +**Related:** D-007 (Octavia inclusion), D-019 (no-DNS root cause). + +--- + +## D-028: Defer the CAPI v1beta2-contract cutover (deploy the single-contract v1beta1 stack) + +**Decision:** defer adopting the CAPI v1beta2-CONTRACT generation until upstream ships it correctly for this path; deploy the clean single-contract v1beta1 stack now. + +**Context:** while grounding the (then-current) Canonical CK8s workload chart, the chart referenced control-plane/bootstrap kinds at apiVersion v1beta1 while the pinned provider served them only at v1beta2 (DOCFIX-022). 
The broader question -- is the v1beta2-contract generation available and correct for long-term support on this path -- resolved to "not yet." + +**Status:** Decided (v1). The CK8s-chart-specific particulars were subsequently retired when D-031 replaced the direct-CAPI CK8s path with Magnum + the azimuth kubeadm charts; the single-contract principle carries forward, and D-042 later made the driver-side contract axis concrete. + +**Builds on:** D-022 / D-023 (do-07-era CAPI/CRD work). **Related:** D-031, D-042. + +--- + +## D-029: Defer Keystone SSO (k8s-keystone-auth) to Roosevelt + +**Decision:** Keystone SSO for the workload clusters (the chart's `k8s-keystone-auth` addon) is deferred to the next deployment and folded into the Roosevelt cloud-internal-DNS + trusted-cert foundation. v1 workload clusters run the Kubernetes Dashboard with standard token auth; the `k8sKeystoneAuth` addon stays OFF; SSO is not validated on v1. + +**Rationale:** enabling it on v1 would produce a non-functional SSO path (TLS failure to the private-CA Keystone endpoint) plus apiserver webhook error noise -- a checked box that does not work -- and forcing it would require forking the addon or fighting CAAPH, neither of which carries forward to Roosevelt. + +**Finding (verified 2026-06-05):** k8s-keystone-auth 1.5.1 exposes no keystone-CA option, so it cannot trust a private-CA Keystone endpoint. + +**Status:** Decided (v1). **Related:** D-028 (same "land it on the proper foundation later" principle). + +--- + +## D-030: Management-cluster placement -- in-cloud (superseded twice; see D-033, D-035) + +**Decision (as taken 2026-06-06):** run the CAPI management plane IN-CLOUD for the v1 rehearsal (CAPI core + CAPO + cluster-api-addon-provider as VMs on the OpenStack cloud, following an Azimuth seed + HA pattern with a `clusterctl move` pivot to a self-hosted in-cloud management cluster). Out-of-cloud was recorded as a deferred alternative for Roosevelt. + +**Status:** SUPERSEDED. 
First by D-033 (out-of-cloud Canonical `k8s`-charm on MAAS); then -- after D-033's dual-homed node hit an unfixable pod-egress fault -- placement returned in-cloud in a simpler single-homed form under D-035. Retained here for lineage. + +**Related:** D-031, D-033, D-035. + +--- + +## D-031: Cluster-creation surface + engine -- Magnum + magnum-capi-helm + azimuth kubeadm charts + +**Decision:** the tenant Kubernetes service is built from three layers: +- Surface: OpenStack Magnum (`openstack coe cluster ...`), so tenants and operators manage clusters through the OpenStack API. +- Driver: the in-tree Cluster API Helm driver `magnum-capi-helm` (opendev.org/openstack/magnum-capi-helm), pip-installed into the Magnum conductor and pointed at a CAPI management cluster via `[capi_helm] kubeconfig_file`. +- Engine: the azimuth-cloud `capi-helm-charts` `openstack-cluster` chart (kubeadm-based: KubeadmControlPlane / KubeadmConfigTemplate + CAPO OpenStackCluster / OpenStackMachineTemplate + MachineDeployment), with addons (Cilium CNI, OpenStack CCM, Cinder CSI, and so on) installed by the cluster-api-addon-provider. +- Management-cluster placement: in-cloud for v1 (D-030, later refined by D-035). + +**Status:** Decided. Supersedes the do-07 direct-CAPI Canonical CK8s chart path; the CK8s-chart-specific findings (DOCFIX-022 ref patch, etc.) are retired for this path. + +**Related:** D-030 / D-035 (placement), D-034 (version constellation), D-036 / D-042 (driver/chart/core coherence). 
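As an illustration of the driver layer's `[capi_helm] kubeconfig_file` setting, the sketch below renders that one section of a conf.d fragment and refuses non-ASCII content, since this record notes elsewhere that non-ASCII characters in conf.d cause silent daemon failures. The path `/etc/magnum/kubeconfig` is the one named in this record; `render_capi_conf` is a hypothetical helper, and the fragment deliberately covers only the `[capi_helm]` section.

```python
def render_capi_conf(kubeconfig_file="/etc/magnum/kubeconfig"):
    """Render the [capi_helm] section of a Magnum conf.d fragment.

    Raises ValueError if the rendered text contains any non-ASCII
    character, so a bad fragment is caught before it is installed.
    """
    text = "[capi_helm]\nkubeconfig_file=" + kubeconfig_file + "\n"
    if not text.isascii():
        raise ValueError("non-ASCII character in conf.d fragment")
    return text
```

Running the check at render time keeps the failure loud and local, rather than surfacing later as a conductor that starts without the driver.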
+ +--- + +## D-033: Management cluster -- out-of-cloud Canonical k8s-charm on MAAS (superseded by D-035) + +**Decision (as taken 2026-06-07):** management cluster = a Canonical Kubernetes cluster deployed with the `k8s` / `k8s-worker` machine charms on MAAS, OUTSIDE OpenStack, made HA by the charms; CAPI layer via `clusterctl init --infrastructure openstack` + cluster-api-addon-provider, version-pinned to the capi-helm-charts release (NOT the D-022 do-07 pins); the management cluster does not run the OpenStack CCM for itself (CAPO reaches OpenStack through a `clouds.yaml` pointed at the public API endpoints); lifecycle via Juju. + +**Status:** SUPERSEDED by D-035. The chosen node (capi-mgmt MAAS VM) is necessarily dual-homed (MAAS PXE on metal, API VIPs on provider), and pod egress from that multi-NIC node to the API VIPs failed (the Cilium reverse-NAT reply was mis-forwarded out the wrong NIC instead of redirected into the pod). Retained here for lineage. + +**Supersedes:** D-030 (placement) + D-032 (azimuth-config tooling). **Builds on:** D-031. + +--- + +## D-034: CAPI version constellation pinned to capi-helm-charts dependencies.json + +**Decision:** pin the management-cluster CAPI constellation to the `dependencies.json` published with a chosen `capi-helm-charts` RELEASE TAG, read at deploy time on the jumphost with `jq` (dynamic lookup, no hand-picked versions). Retire D-022 "Option A" (driver 1.3.0 / CAPO v0.10.x / v1alpha6) as obsolete. + +**Rationale:** the magnum-capi-helm driver does not hand-pick component versions; its own CI installs the management CAPI stack by reading the per-release `dependencies.json` and running a fixed install sequence -- that file is the single matched-and-tested set. Hand-picking fights the upstream model, and v1alpha6 has been removed from current cluster-api-provider-openstack. 
(At tag 0.25.1 the set is CAPI v1.13.2, CAPO v0.14.4, cert-manager v1.20.2, ORC v2.5.0, addon-provider 0.12.0, janitor 0.11.0, helm v3.17.3; appendix-B carries the as-built snapshot.) + +**Status:** Adopted 2026-06-08. **Supersedes:** D-022. **Amended by:** D-042 (adds the driver<->core contract-coherence rule). **Related:** D-031, D-028 (CRD-contract note, now subsumed). + +--- + +## D-035: Management-cluster placement -- in-cloud single-homed tenant VM + +**Decision:** run the CAPI management cluster as a single-homed in-cloud tenant VM (`capi-mgmt-v2`): one NIC on the management tenant subnet (10.20.0.0/24), reached via a floating IP (10.12.7.40); k8s-snap (channel `1.32-classic/stable`), Cilium CNI; not CAPI-self-managed (no `clusterctl move`). + +**Rationale:** D-033's out-of-cloud node was necessarily dual-homed and its pod egress to the OpenStack API VIPs failed -- the Cilium reverse-NAT reply was emitted back out the second NIC instead of being redirected into the pod via `cilium_host` (a multi-NIC reverse-path fault; the `k8s` charm exposes too few Cilium annotations to repair it). A single-homed VM removes the second NIC and the fault entirely. The single-NIC pod-egress premise was then proven by the Phase 4 hard gate (an agnhost pod TCP probe to the Keystone VIP 10.12.4.50:5000 returning exitCode 0). + +**Status:** Adopted 2026-06-08; pod-egress premise validated. **Supersedes:** D-033 (revisits D-030 in simpler form). **Unaffected:** D-031, D-034. + +**Trade-off:** a single-node management cluster is a SPOF with no self-heal -- see D-041 (manual-start policy) and D-040 (the OOM that surfaced it). + +--- + +## D-036: magnum-capi-helm driver / chart / CAPO coherence (resolved) + +**Decision / correction:** a mid-session "rebuild Phase 5 on chart 0.10.1" framing -- premised on the GA driver (1.3.0) emitting the v1alpha6 OpenStackCluster CRD and clashing with the modern v1beta1 stack -- is WRONG and is retired. 
Chart 0.10.1 is the retired v1alpha6 path that D-034 superseded; rebuilding on it would have reversed D-034. + +**Verification:** the 1.3.0 driver is api_version-AGNOSTIC (driver.py has zero v1alpha6/v1beta1/apiVersion references; it helm-installs the chart and watches the CAPI `Cluster`, never writing OpenStackCluster directly). The OpenStackCluster apiVersion is set by the CHART: chart 0.25.1 emits `infrastructure.cluster.x-k8s.io/v1beta1`, matching the installed CAPO v0.14.4. The driver's built-in default chart is 0.10.1 (the v1alpha6-era chart); overriding `default_helm_chart_version` to 0.25.1 yields v1beta1. The "1.3.0 emits v1alpha6" claim was true only of the driver's DEFAULT chart, not of the driver pinned to chart 0.25.1. + +**Status:** Resolved 2026-06-08. Implements D-031 Phase 3 under the D-034 constellation. NOTE: a SEPARATE axis -- the driver-vs-core CONTRACT, not the chart's CRD string -- is what later required the 1.4.0 driver pin; see D-042. **Related:** D-031, D-034. + +--- + +## D-037: [capi_helm] config persistence on the charm-managed conductor + +**Decision:** keep the `[capi_helm]` section in an oslo.config drop-in directory and point the conductor at it: `/etc/magnum/magnum.conf.d/00-capi-helm.conf` (0644, no secrets; it references the 0600 kubeconfig by path), with magnum-conductor launched with `--config-dir /etc/magnum/magnum.conf.d` so oslo.config merges the drop-in over the charm-rendered `magnum.conf`. The charm manages neither the .conf.d directory nor the launch extension, so this survives charm hooks and reproduces on Roosevelt. + +**Problem:** the magnum charm (2024.1/stable rev 70) re-renders `magnum.conf` wholesale on hooks and exposes no conf-override option, so a `[capi_helm]` section written into `magnum.conf` would be clobbered. 
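The layering in the D-037 decision can be sketched in a few lines. This is a minimal illustration only: it uses Python's stdlib `configparser` as a stand-in for oslo.config (an assumption; the real conductor parses with oslo.config), and the kubeconfig path and the `effective_config` / `demo` helpers are hypothetical names introduced for the example. It shows why a section kept in a separate drop-in file is still present after the base file is rewritten wholesale.

```python
import configparser
import os
import tempfile


def effective_config(base_file, dropin_dir):
    """Read base_file first, then every *.conf in dropin_dir in sorted order.

    Files read later override earlier ones, which approximates a drop-in
    directory being merged over the main config file.
    """
    parser = configparser.ConfigParser(interpolation=None)
    dropins = sorted(
        os.path.join(dropin_dir, name)
        for name in os.listdir(dropin_dir)
        if name.endswith(".conf")
    )
    parser.read([base_file] + dropins, encoding="utf-8")
    return parser


def demo():
    """Re-render the base file twice; the drop-in section survives both times."""
    with tempfile.TemporaryDirectory() as root:
        base = os.path.join(root, "magnum.conf")
        conf_d = os.path.join(root, "magnum.conf.d")
        os.mkdir(conf_d)

        # Drop-in written once, outside the file the charm re-renders.
        # The kubeconfig path is a hypothetical placeholder.
        with open(os.path.join(conf_d, "00-capi-helm.conf"), "w", encoding="utf-8") as fh:
            fh.write("[capi_helm]\nkubeconfig_file = /hypothetical/kubeconfig\n")

        results = []
        # Simulate two charm hooks, each rewriting magnum.conf from scratch.
        for rendered in ("[DEFAULT]\ndebug = false\n", "[DEFAULT]\ndebug = true\n"):
            with open(base, "w", encoding="utf-8") as fh:
                fh.write(rendered)
            cfg = effective_config(base, conf_d)
            results.append(cfg.get("capi_helm", "kubeconfig_file"))
        return results
```

Calling `demo()` returns the same `kubeconfig_file` value for both simulated re-renders, because the base file never contained the `[capi_helm]` section in the first place. The sketch says nothing about how the conductor is made to read the drop-in directory; that depends on the actual launch arguments.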
+ +**Mechanism (load-bearing correction):** the conductor's ExecStart is NOT a direct binary -- it is `/etc/init.d/magnum-conductor systemd-start` (an LSB init script wrapped by systemd), so a systemd ExecStart drop-in appending `--config-dir` is inert (the flag reaches the init script as an ignored positional). The adopted method instead creates `/etc/default/magnum-conductor` (0644; the charm does not manage it) containing `DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"`; the init script sources `/etc/default/$NAME` after setting the base `DAEMON_ARGS`, then runs `exec $DAEMON $DAEMON_ARGS`. Verify behaviorally with `/etc/init.d/magnum-conductor show-args` and `ps -ww -C magnum-conductor -o args` (not string-presence). + +**Status:** Adopted 2026-06-08 (mechanism revised mid-implementation). **Residual:** breaks silently if a future charm hook writes `/etc/default/magnum-conductor` -- detect via the same show-args/ps check. **Related:** D-031 Phase 3, D-036. + +--- + +## D-040: Raise nova-compute reserved-host-memory on the hyperconverged hosts + +**Decision:** set `nova-compute reserved-host-memory` to 8192 MB (from the default 512) so Nova placement accounts for the non-Nova memory co-located on each hyperconverged host. Charm config -> survives redeploy. + +**Trigger / root cause:** during the first end-to-end Magnum workload-cluster create, openstack1 hit the kernel OOM-killer (no reboot; single boot since 2026-06-03) and killed a tenant qemu worker VM. The host co-locates nova-compute AND roughly 6 GiB of services invisible to Nova placement (mysqld [innodb-cluster member] ~2.9G, ceph-osd + ceph-mon ~1.2G, neutron workers ~0.7G, nova/apache/cinder/ovs ~1.4G) while Nova reserved only the default 512 MB; under the resulting memory pressure the host swap-thrashed (an ovsdb inactivity-probe storm made the workload API and Juju agent look "down" when the host was in fact thrashing, not down). + +**Status:** Adopted + APPLIED 2026-06-09. 
**Related:** D-035 (the mgmt-VM SPOF the OOM hit), D-041. + +--- + +## D-041: Non-HA deployments default to manual start + +**Decision:** non-HA deployments default to MANUAL START -- no automatic VM power-on / auto-recovery is configured by default. Any non-HA deployment must be documented as non-HA, with the rationale that manual-down surfaces incidents (auto-restart masks capacity/health defects). Auto-recovery is an explicit, out-of-band exception, never the silent default. + +**Trigger:** after the openstack1 OOM (D-040), CAPI's MachineHealthCheck self-healed the workload worker VMs automatically, but the single-node management VM (capi-mgmt-v2, D-035) was OOM-killed and stayed SHUTOFF -- it does not self-heal or auto-restart, which silently broke magnum reconcile/health and left workload nodes with the CAPI uninitialized taint until it was started by hand. The cost (downtime) was real, but the manual-down is also what forced the investigation that found the OOM root cause headed for Roosevelt. + +**Status:** Adopted 2026-06-09 (policy/governance). **Related:** D-035 (the SPOF), D-040 (the OOM). + +--- + +## D-042: magnum-capi-helm driver must be contract-coherent with the CAPI core + +**Decision (amends D-034):** the magnum-capi-helm driver pin (Layer B) MUST be contract-coherent with the CAPI core that `dependencies.json` installs (Layer A). When the Layer-A lockfile is a v1beta2-contract core (CAPI v1.13), the driver pin must be a build that understands v1beta2 references; verify this intersection at deploy. + +**Symptom / root cause:** capi-test-1 reached CREATE_COMPLETE with every real component healthy (3 Ready nodes, Calico, CCM/CSI/CoreDNS, API LB ACTIVE), yet magnum reported `health_status = UNHEALTHY` deterministically -- only the `infrastructure` sub-check failed ("Infrastructure resource not found"). 
The 1.3.0 driver reads `apiVersion` off the Cluster's `spec.infrastructureRef`, but under the v1beta2 contract that ref is version-less, so the health GET resolves nothing. The create path is unaffected (the chart templates the resource versions) -- a cosmetic health false-negative. The governing axis is the CAPI CONTRACT a provider implements toward core, not the CRD apiVersion string (per D-028); rolling back to a v1beta1 core would mean pinning an EOL CAPI for a Roosevelt rehearsal -- the wrong direction. + +**Fix:** pin a driver build carrying the per-kind `[capi_helm] api_resources` override and set it so the health lookups use the served versions. As of 2026-06-09, D-042 recorded this capability as UNRELEASED (development series only; released line then 1.1.0/1.2.0/1.3.0), with the interim = a current-series commit for the testcloud and a released-tag pin deferred to Roosevelt. + +**Subsequent update (driver-fix work):** the released `magnum-capi-helm==1.4.0` was then confirmed to ship the `api_resources` feature, so the released-tag pin is now available -- v1 pins 1.4.0 with an explicit `api_resources` and targets `health_status = HEALTHY` (installed in phase-07; as-built in appendix-B). This replaces D-042's interim dev-commit path. + +**Operational caveat (while any health false-negative persists):** do NOT wire magnum auto-healing to `health_status` -- a persistent false UNHEALTHY could misfire; CAPI MachineHealthCheck handles node healing independently. + +**Status:** Adopted 2026-06-09; fix landed via the 1.4.0 pin. **Amends:** D-034. **Related:** D-028 (the contract axis made concrete), D-031, D-035. --- @@ -438,4 +564,12 @@ | 2026-05-22 | D-015 v1/v2 fork added; D-004 and D-004a marked v2-scope; D-016 IPv4 tenant pool hybrid model added; D-014 updated with new repo name | v1/v2 fork session | | 2026-05-22 | D-017 CAPI bootstrap full-rebuild lifecycle added; D-018 MAAS-release-direct teardown added. D-013 marked superseded by D-018. 
D-007 Layer B updated to reference D-017 and `runbooks/04a-capi-bootstrap-cluster.md`. | Teardown planning + handoff session | | 2026-05-22 | D-002 hacluster row added (channel `2.4/stable`) per Canonical Charm Delivery table, verified against Charmhub. D-007 Layer B driver pin updated: `stackhpc/magnum-capi-helm` v0.13.0 → `openstack/magnum-capi-helm` 1.1.0 (PyPI; stackhpc fork archived Dec 2024). | Caracal channel verification + driver pin correction | -| 2026-05-27 | D-019 added (DNS scope reduction; Designate deferred to v2). D-008 marked superseded by D-019. D-011 amended to remove "Designate resolves" criterion. | Testcloud topology investigation + v1 scope refinement | +| 2026-05-22 | D-007 Layer B kubeconfig target corrected: bootstrap k3s → workload cluster (post-pivot per workstream 3b mandatory `clusterctl move`). CAPI mgmt plane paragraph updated accordingly. | Workstream 3 cleanup (post-pivot semantics) | +| 2026-05-29 | D-019 (Designate deferral) and D-020 (dual provider+metal API VIPs) recorded as already-taken; folded into this doc in the 2026-06-09 consolidation. | Deploy execution / handoff | +| 2026-05-30 | D-021 Octavia amphora pipeline (charm-native retrofit over internal endpoints; gss parked) added. | Octavia enablement | +| 2026-06-05 | D-028 (defer v1beta2-contract cutover) and D-029 (defer Keystone SSO) added. | CAPI path research | +| 2026-06-06 | D-030 (mgmt-cluster placement: in-cloud) and D-031 (Magnum + magnum-capi-helm + azimuth kubeadm engine) added. | Magnum/CAPI surface decisions | +| 2026-06-07 | D-033 (mgmt cluster: out-of-cloud k8s-charm on MAAS) added; supersedes D-030 and D-032. | Mgmt-cluster shape | +| 2026-06-08 | D-034 (CAPI constellation pinned to dependencies.json; supersedes D-022), D-035 (in-cloud single-homed mgmt VM; supersedes D-033), D-036 (driver/chart/CAPO coherence resolved), D-037 ([capi_helm] via /etc/default DAEMON_ARGS) added. 
| In-cloud mgmt pivot | +| 2026-06-09 | D-040 (reserved-host-memory 8192), D-041 (non-HA manual-start policy), D-042 (driver<->core contract coherence; 1.4.0 pin) added. | OOM incident + driver fix | +| 2026-06-09 | D-019..D-042 consolidated into this document (15 decisions). Existing D-001..D-018 left intact (em-dash style preserved); the new entries are ASCII. | Repo sanitation / doc refresh | diff --git a/fix-bundle-add-memcached.py b/fix-bundle-add-memcached.py deleted file mode 100644 index da7bebb..0000000 --- a/fix-bundle-add-memcached.py +++ /dev/null @@ -1,119 +0,0 @@ -#!/usr/bin/env python3 -""" -fix-bundle-add-memcached.py (BUNDLEFIX-004, part 2) - -Adds the `memcached` application AND the -`nova-cloud-controller:memcache <-> memcached:cache` relation to the Caracal -bundle, matching the live `juju deploy memcached` + `juju integrate` already -applied to the running model. - -Why: nova-cloud-controller treats `memcache` as a required relation. The Caracal -rebuild omitted memcached entirely, so a fresh `juju deploy` of the bundle would -leave nova-cc blocked on "Missing relations: memcache" (no instance scheduling). - -App block added (placement to: [lxd:8] = openstack0, where it landed live; metal -space; latest/stable, the only stable channel for the memcached charm): - - memcached: - charm: memcached - channel: latest/stable - num_units: 1 - to: [lxd:8] - bindings: *internal-bindings - constraints: arch=amd64 - -Relation added: - - [nova-cloud-controller:memcache, memcached:cache] - -Safe by construction: line edits (preserve anchors/comments/formatting), -timestamped .bak, unified diff, idempotent, yaml.safe_load verification. 
- -Usage: python3 fix-bundle-add-memcached.py [path/to/bundle.yaml] (default ./bundle.yaml) -""" -import sys, os, difflib, datetime - -DEFAULT = "bundle.yaml" - -APP_BLOCK = [ - "", - " # memcached: nova-cloud-controller token/cell caching (BUNDLEFIX-004)", - " memcached:", - " charm: memcached", - " channel: latest/stable", - " num_units: 1", - " to: [lxd:8]", - " bindings: *internal-bindings", - " constraints: arch=amd64", - "", -] -RELATION_LINE = " - [nova-cloud-controller:memcache, memcached:cache]" - - -def main(): - path = sys.argv[1] if len(sys.argv) > 1 else DEFAULT - if not os.path.isfile(path): - print(f"[ABORT] not found: {path}") - return 2 - - original = open(path, encoding="utf-8").read() - lines = original.splitlines() - - have_app = any(l.strip().startswith("memcached:") for l in lines) - have_rel = "memcached:cache" in original - if have_app and have_rel: - print("[OK/IDEMPOTENT] memcached app and relation already present; no change.") - return 0 - if have_app != have_rel: - print(f"[ABORT] partial state (app={have_app}, relation={have_rel}); fix by hand to avoid duplication.") - return 3 - - # Bundle order here is description -> variables -> machines -> applications -> relations, - # so `relations:` is the END of the applications section. Anchor BOTH inserts to it: - # the app block goes immediately before `relations:` (last app), the relation immediately after. 
- rel_idx = next((i for i, l in enumerate(lines) if l.rstrip() == "relations:"), None) - if rel_idx is None: - print("[ABORT] could not find top-level 'relations:' key.") - return 4 - - out = [] - for i, l in enumerate(lines): - if i == rel_idx: - out.extend(APP_BLOCK) # app block: end of applications (just before relations:) - out.append(l) - if i == rel_idx: - out.append(RELATION_LINE) # relation: first entry after relations: - new = "\n".join(out) + ("\n" if original.endswith("\n") else "") - - print("=== unified diff ===") - print("\n".join(difflib.unified_diff( - original.splitlines(), new.splitlines(), - fromfile=f"{path} (orig)", tofile=f"{path} (new)", lineterm=""))) - - try: - import yaml - d = yaml.safe_load(new) - a = d["applications"] - rels = d.get("relations", []) - assert "memcached" in a, "memcached app missing after edit" - assert a["memcached"].get("charm") == "memcached", "charm != memcached" - assert a["memcached"].get("bindings") == {"": "metal"}, f"bindings={a['memcached'].get('bindings')}" - mc = [r for r in rels if any("memcache" in str(x) for x in r)] - assert mc, "memcache relation missing after edit" - print(f"[VERIFY] OK: memcached app present, bindings {{'': 'metal'}}, relation {mc}") - print(f"[VERIFY] totals now: apps={len(a)} relations={len(rels)}") - except ImportError: - print("[WARN] PyYAML missing; skipped semantic verify (re-verify on jumphost after pull).") - except Exception as e: - print(f"[ABORT] verification failed: {e}") - return 5 - - ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") - bak = f"{path}.bak-{ts}" - open(bak, "w", encoding="utf-8").write(original) - open(path, "w", encoding="utf-8").write(new) - print(f"[WROTE] {path} (backup: {bak})") - return 0 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/fix-bundle-haclusters.py b/fix-bundle-haclusters.py deleted file mode 100644 index e3d70e5..0000000 --- a/fix-bundle-haclusters.py +++ /dev/null @@ -1,99 +0,0 @@ -#!/usr/bin/env python3 -""" 
-fix-bundle-haclusters.py - BUNDLEFIX-003 - -Add `options: { cluster_count: 1 }` to the 10 *active* testcloud haclusters so -the committed bundle matches the running model (we already set this at runtime -via `juju config`). Single-unit principals on the testcloud cannot form the -default 3-peer cluster; cluster_count=1 lets a 1-node cluster form and bring up -the (reachable, public->provider) VIP. Roosevelt's separate 3-unit bundle keeps -the default. - -Text/line based - never round-trips YAML, so anchors/comments/formatting are -preserved. Only touches the named, *uncommented* hacluster lines; the commented -v2-deferred ones (vault-hacluster, ceph-radosgw-hacluster, designate-hacluster) -are left untouched. Idempotent: skips a line that already has cluster_count, and -aborts cleanly if nothing needs changing. - -Usage: python3 fix-bundle-haclusters.py [path-to-bundle.yaml] (default ./bundle.yaml) -""" -import sys, os, re, shutil, difflib, datetime - -PATH = sys.argv[1] if len(sys.argv) > 1 else "bundle.yaml" -HACLUSTERS = ["keystone", "glance", "nova-cloud-controller", "neutron-api", - "cinder", "octavia", "barbican", "magnum", "placement", - "openstack-dashboard"] -INSERT_AFTER = "channel: 2.4/stable }" -INSERT_WITH = "channel: 2.4/stable, options: { cluster_count: 1 } }" - - -def abort(msg): - sys.stderr.write("ABORT (no changes written): %s\n" % msg) - sys.exit(1) - - -if not os.path.isfile(PATH): - abort("file not found: %s (run from the repo root, or pass the path)" % PATH) - -with open(PATH, "r", newline="") as fh: - orig = fh.readlines() -lines = list(orig) - -changed = [] -for name in HACLUSTERS: - # uncommented inline def line for this hacluster - pat = re.compile(r'^\s*%s-hacluster:\s*\{\s*charm:\s*hacluster' % re.escape(name)) - hits = [i for i, ln in enumerate(lines) - if pat.match(ln) and not ln.lstrip().startswith("#")] - if len(hits) != 1: - abort("expected exactly 1 uncommented '%s-hacluster' inline def, found %d" - % (name, len(hits))) - i = 
hits[0] - if "cluster_count" in lines[i]: - abort("%s-hacluster already has cluster_count - already applied? inspect." - % name) - if INSERT_AFTER not in lines[i]: - abort("%s-hacluster line not in expected inline shape: %r" - % (name, lines[i].strip())) - lines[i] = lines[i].replace(INSERT_AFTER, INSERT_WITH, 1) - changed.append(name) - -if len(changed) != len(HACLUSTERS): - abort("only changed %d of %d haclusters" % (len(changed), len(HACLUSTERS))) - -ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") -bak = "%s.bak-%s" % (PATH, ts) -shutil.copy2(PATH, bak) -with open(PATH, "w", newline="") as fh: - fh.writelines(lines) - -print("Backup written: %s" % bak) -print("=== unified diff ===") -sys.stdout.writelines(difflib.unified_diff( - orig, lines, fromfile="bundle.yaml (before)", tofile="bundle.yaml (after)")) -print("") - -try: - import yaml -except Exception: - print("NOTE: PyYAML not importable - semantic verification skipped; re-verify on jumphost.") - sys.exit(0) - -apps = yaml.safe_load(open(PATH))["applications"] -print("=== verification ===") -print("YAML parses: PASS") -ok = True -for name in HACLUSTERS: - a = apps.get("%s-hacluster" % name, {}) - cc = (a.get("options") or {}).get("cluster_count") - p = (cc == 1) - ok &= p - print(" %-30s cluster_count==1 : %s" % (name + "-hacluster", "PASS" if p else "FAIL (%r)" % cc)) -# deferred ones must NOT have appeared -for absent in ("vault-hacluster", "ceph-radosgw-hacluster", "designate-hacluster"): - p = absent not in apps - ok &= p - print(" %-30s stays absent : %s" % (absent, "PASS" if p else "FAIL")) -print("\nRESULT:", "ALL CHECKS PASS" - if ok else "FAILURES - revert: cp %s %s" % (bak, PATH)) -sys.exit(0 if ok else 2) diff --git a/fix-bundle-metal-vips.py b/fix-bundle-metal-vips.py deleted file mode 100644 index 0ac5049..0000000 --- a/fix-bundle-metal-vips.py +++ /dev/null @@ -1,125 +0,0 @@ -#!/usr/bin/env python3 -""" -BUNDLEFIX-006 (D-020): append the metal HA VIP to each clustered API charm's `vip` 
option. - -For every line of the form `vip: 10.12.4.<N>` where N is in the reserved provider API-VIP -range (224..254), rewrite it to `vip: "10.12.4.<N> 10.12.8.<N>"` so the charm advertises a -provider VIP (public endpoint) AND a metal VIP (internal/admin endpoints). This is the -spaces-native dual-VIP fix validated live on placement: internal/admin bindings = metal, so -resolve_address matches the metal VIP; public binding = provider, matches the provider VIP. -No binding/anchor change and no os-*-network needed. - -Safety properties (same pattern as the prior fix scripts): - - pure line edit; never round-trips YAML, so anchors/aliases/comments are preserved - - STRICT match: only single `10.12.4.<224-254>` values are rewritten; anything else (already - dual, out of range, unexpected format) is left untouched -> fail-safe, never mangles - - idempotent: lines already carrying a `10.12.4.x 10.12.8.x` pair are skipped - - timestamped .bak, unified diff to stdout, and a best-effort yaml.safe_load semantic check - (skipped where PyYAML is absent, e.g. the Windows workstation; the jumphost re-verifies) -""" -import sys -import re -import datetime -import shutil -import difflib - -PROVIDER_NET = "10.12.4." -METAL_NET = "10.12.8."
-VIP_LO, VIP_HI = 224, 254 # reserved API-VIP range (same last-octet on both nets) - -VIP_LINE = re.compile(r'^(?P<indent>\s*)vip:\s*(?P<q>["\']?)(?P<val>[^"\'\n]*)(?P=q)\s*$') -SINGLE = re.compile(r'^10\.12\.4\.(\d+)$') -DOUBLE = re.compile(r'^10\.12\.4\.(\d+)\s+10\.12\.8\.(\d+)$') - - -def main(): - if len(sys.argv) != 2: - print("usage: fix-bundle-metal-vips.py ") - return 2 - path = sys.argv[1] - try: - with open(path) as f: - original = f.read() - except OSError as e: - print(f"[ABORT] cannot read {path}: {e}") - return 3 - - lines = original.split("\n") - changed = 0 - skipped_already = 0 - untouched_unexpected = [] - - out = [] - for l in lines: - m = VIP_LINE.match(l) - if m: - val = m.group("val").strip() - if DOUBLE.match(val): - skipped_already += 1 - out.append(l) - continue - sm = SINGLE.match(val) - if sm: - octet = int(sm.group(1)) - if VIP_LO <= octet <= VIP_HI: - out.append(f'{m.group("indent")}vip: "{PROVIDER_NET}{octet} {METAL_NET}{octet}"') - changed += 1 - continue - # vip line, but not a single in-range provider VIP -> leave alone, but note it - untouched_unexpected.append(val) - out.append(l) - - if untouched_unexpected: - print(f"[NOTE] {len(untouched_unexpected)} vip line(s) left untouched (unexpected value/range): " - f"{untouched_unexpected}") - - if changed == 0: - if skipped_already: - print(f"[OK/IDEMPOTENT] {skipped_already} vip line(s) already carry a metal VIP; no change.") - return 0 - print("[ABORT] found no `vip: 10.12.4.224-254` lines to update.") - return 4 - - new = "\n".join(out) - if original.endswith("\n") and not new.endswith("\n"): - new += "\n" - - print("=== unified diff ===") - sys.stdout.writelines(difflib.unified_diff( - original.splitlines(keepends=True), - new.splitlines(keepends=True), - fromfile=f"{path} (orig)", tofile=f"{path} (new)")) - - try: - import yaml - d = yaml.safe_load(new) - apps = d.get("applications", {}) or {} - dual = sorted( - a for a, c in apps.items() - if isinstance(c, dict) and isinstance(c.get("options"),
dict) - and isinstance(c["options"].get("vip"), str) - and len(c["options"]["vip"].split()) == 2 - ) - print(f"\n[VERIFY] yaml parses OK; {len(dual)} charm(s) now have a 2-address vip:") - for a in dual: - print(f" {a}: {apps[a]['options']['vip']}") - except ImportError: - print("\n[VERIFY] PyYAML not present (Windows workstation) - semantic check skipped; " - "jumphost will re-verify after pull.") - except Exception as e: - print(f"\n[ABORT] yaml verify failed, not writing: {e}") - return 5 - - ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") - bak = f"{path}.bak-{ts}" - shutil.copy2(path, bak) - with open(path, "w") as f: - f.write(new) - print(f"\n[WROTE] {path} (backup: {bak})") - print(f"[SUMMARY] updated {changed} vip line(s); {skipped_already} already dual; " - f"{len(untouched_unexpected)} untouched.") - return 0 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/fix-bundle-router-bindings.py b/fix-bundle-router-bindings.py deleted file mode 100644 index 8fc8f0b..0000000 --- a/fix-bundle-router-bindings.py +++ /dev/null @@ -1,111 +0,0 @@ -#!/usr/bin/env python3 -""" -fix-bundle-router-bindings.py (BUNDLEFIX-005, part 2 / option A) - -Adds `bindings: *internal-bindings` to every mysql-router application block in the -Caracal bundle, so the router subordinates bind to the metal space -- matching the -live `juju bind metal` fix already applied to the running model. - -Why: without an explicit binding the routers default to the empty 'alpha' space, -which resolves to the container's PROVIDER address. The cluster then grants -mysqlrouteruser@, but the router's actual TCP connection to the -metal-only cluster egresses the metal interface -> grant host != source -> -"Access denied 1045" -> mysqlrouter never bootstraps. Binding to metal makes the -advertised address == the connection source. 
- -Safe by construction: - - pure line edits (NO YAML round-trip; preserves anchors, comments, formatting) - - timestamped .bak - - prints a unified diff - - idempotent (skips any router that already carries a bindings line) - - yaml.safe_load verification of the result, asserting every mysql-router app - resolves to bindings {'': 'metal'} via the *internal-bindings anchor - - aborts unless it finds the expected mysql-router blocks and they verify - -Usage: - python3 fix-bundle-router-bindings.py [path/to/bundle.yaml] (default ./bundle.yaml) -""" -import sys, os, difflib, datetime - -DEFAULT = "bundle.yaml" - - -def transform(lines): - """Insert `bindings: *internal-bindings` after the channel line of - every `charm: mysql-router` app block that doesn't already have a bindings line.""" - out = [] - prev_is_mr_charm = False - found = inserted = skipped = 0 - for idx, line in enumerate(lines): - out.append(line) - stripped = line.strip() - if prev_is_mr_charm and stripped.startswith("channel:"): - found += 1 - nxt = lines[idx + 1].strip() if idx + 1 < len(lines) else "" - if nxt.startswith("bindings:"): - skipped += 1 - else: - indent = line[: len(line) - len(line.lstrip())] - out.append(f"{indent}bindings: *internal-bindings") - inserted += 1 - prev_is_mr_charm = (stripped == "charm: mysql-router") - return out, found, inserted, skipped - - -def main(): - path = sys.argv[1] if len(sys.argv) > 1 else DEFAULT - if not os.path.isfile(path): - print(f"[ABORT] not found: {path}") - return 2 - - with open(path, "r", encoding="utf-8") as f: - original = f.read() - lines = original.splitlines() - - out, found, inserted, skipped = transform(lines) - new = "\n".join(out) + ("\n" if original.endswith("\n") else "") - - if found == 0: - print("[ABORT] no `charm: mysql-router` + `channel:` blocks found - unexpected structure.") - return 3 - if inserted == 0 and skipped == found: - print(f"[OK/IDEMPOTENT] all {found} mysql-router apps already bound; no change.") - return 0 - - 
print("=== unified diff ===") - diff = "\n".join(difflib.unified_diff( - original.splitlines(), new.splitlines(), - fromfile=f"{path} (orig)", tofile=f"{path} (new)", lineterm="")) - print(diff or "(no diff)") - print(f"=== mysql-router blocks: {found} | inserted: {inserted} | already-bound: {skipped} ===") - - # semantic verification (anchors resolve under safe_load) - try: - import yaml - doc = yaml.safe_load(new) - apps = (doc or {}).get("applications", {}) or {} - mr = {k: v for k, v in apps.items() - if isinstance(v, dict) and v.get("charm") == "mysql-router"} - bad = {k: v.get("bindings") for k, v in mr.items() if v.get("bindings") != {"": "metal"}} - if bad: - print(f"[ABORT] verification failed; not bound to {{'': 'metal'}}: {bad}") - return 4 - print(f"[VERIFY] yaml.safe_load OK; all {len(mr)} mysql-router apps -> bindings {{'': 'metal'}}.") - except ImportError: - print("[WARN] PyYAML missing; skipped semantic verify (re-verify on jumphost after pull).") - except Exception as e: - print(f"[ABORT] yaml verification error: {e}") - return 5 - - ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") - bak = f"{path}.bak-{ts}" - with open(bak, "w", encoding="utf-8") as f: - f.write(original) - with open(path, "w", encoding="utf-8") as f: - f.write(new) - print(f"[WROTE] {path} (backup: {bak})") - return 0 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/fix-bundle-v1.py b/fix-bundle-v1.py deleted file mode 100644 index 7c3200b..0000000 --- a/fix-bundle-v1.py +++ /dev/null @@ -1,187 +0,0 @@ -#!/usr/bin/env python3 -""" -fix-bundle-v1.py - Option-A bundle fix for the Caracal v1 deploy. - -SUPERSEDES fix-api-bindings.py (which did only edits 1-2). Run this against the -*original* bundle.yaml (i.e. after `git restore bundle.yaml` if the earlier -script was already applied). - -All edits are text/line based - YAML is never round-tripped, so anchors, -aliases, comments and formatting are preserved. Verification uses safe_load -(read only). 
Aborts WITHOUT writing if the bundle is not in the expected -pre-fix shape. - -Edits: - 1. Shrink &api-bindings to the two keys that carry meaning: - "": metal - public: provider - (drops admin/internal/shared-db/amqp/certificates/cluster/ha - all 'metal', - i.e. the "" default - which were causing 'unknown endpoint' deploy errors - on keystone / ceph-radosgw / openstack-dashboard.) - 2. vault bindings: *api-bindings -> *internal-bindings (vault is internal-only). - 3. Remove vault's options: block (vip + os-public-hostname) - vault is a - single unit on the testcloud (3 at Roosevelt); a provider VIP is both - unreachable from metal-bound vault and pointless at one unit. - 4. Comment out the vault-hacluster subordinate. - 5. Comment out the [vault:ha, vault-hacluster:ha] relation. - -Net effect on counts: VIPs 11->10, apps 51->50, relations 98->97. -Vault HA is restored at Roosevelt where it is genuinely 3-unit with a real -(metal) VIP from NetBox. - -Usage: python3 fix-bundle-v1.py [path-to-bundle.yaml] (default ./bundle.yaml) -""" -import sys, os, re, shutil, difflib, datetime - -PATH = sys.argv[1] if len(sys.argv) > 1 else "bundle.yaml" -DROP = {"admin", "internal", "shared-db", "amqp", "certificates", "cluster", "ha"} -KEEP = {'""', "public"} -VAULT_OPTS_EXPECTED = {"vip", "os-public-hostname"} - - -def abort(msg): - sys.stderr.write("ABORT (no changes written): %s\n" % msg) - sys.exit(1) - - -def indent_of(line): - return len(line) - len(line.lstrip()) - - -if not os.path.isfile(PATH): - abort("file not found: %s (run from the repo root, or pass the path)" % PATH) - -with open(PATH, "r", newline="") as fh: - orig = fh.readlines() -lines = list(orig) - -# ---------- Edit 1: shrink &api-bindings ---------- -anchor = next((i for i, ln in enumerate(lines) - if re.match(r'^api-bindings:\s*&api-bindings$', ln.strip())), None) -if anchor is None: - abort("could not locate 'api-bindings: &api-bindings'") -a_indent = indent_of(lines[anchor]) -j, kept, dropped = anchor 
+ 1, [], [] -while j < len(lines): - raw = lines[j] - if raw.strip() == "" or indent_of(raw) <= a_indent: - break - key = raw.strip().split(":", 1)[0].strip() - (dropped if key in DROP else kept).append(key if key in DROP else raw) - j += 1 -kept_keys = {l.strip().split(":", 1)[0].strip() for l in kept} -if kept_keys != KEEP or set(dropped) != DROP: - abort("api-bindings not in expected pre-fix shape (kept=%s dropped=%s)" - % (sorted(kept_keys), sorted(dropped))) -lines = lines[:anchor + 1] + kept + lines[j:] - -# ---------- locate vault app block (post edit-1 indices) ---------- -vault = next((i for i, ln in enumerate(lines) - if ln.strip() == "vault:" and indent_of(ln) == 2), None) -if vault is None: - abort("could not locate ' vault:' application block") -# block end = next line at indent <= 2 (non-blank) -vend = vault + 1 -while vend < len(lines): - if lines[vend].strip() and indent_of(lines[vend]) <= 2: - break - vend += 1 - -# ---------- Edit 2: vault bindings -> internal-bindings ---------- -b_fixed = False -for k in range(vault + 1, vend): - if re.match(r'^\s*bindings:\s*\*api-bindings\b', lines[k]): - lines[k] = lines[k].replace("*api-bindings", "*internal-bindings", 1) - b_fixed = True - break -if not b_fixed: - abort("vault 'bindings: *api-bindings' not found (already changed?)") - -# ---------- Edit 3: remove vault options: block ---------- -opt = next((k for k in range(vault + 1, vend) - if re.match(r'^\s{4}options:\s*$', lines[k])), None) -if opt is None: - abort("vault 'options:' line not found") -opt_indent = indent_of(lines[opt]) -c = opt + 1 -opt_children = [] -while c < vend and lines[c].strip() and indent_of(lines[c]) > opt_indent: - opt_children.append(lines[c].strip().split(":", 1)[0].strip()) - c += 1 -if set(opt_children) != VAULT_OPTS_EXPECTED: - abort("vault options are %s, expected %s - inspect by hand (won't blind-delete)" - % (sorted(opt_children), sorted(VAULT_OPTS_EXPECTED))) -del lines[opt:c] # remove 'options:' + its children - -# 
---------- Edit 4: comment out the vault-hacluster subordinate ---------- -hac = [i for i, ln in enumerate(lines) - if re.match(r'^\s*vault-hacluster:', ln) and not ln.lstrip().startswith("#")] -if len(hac) != 1: - abort("expected exactly 1 uncommented 'vault-hacluster:' line, found %d" % len(hac)) -i = hac[0] -ind = indent_of(lines[i]) -lines[i] = lines[i][:ind] + "# " + lines[i][ind:] - -# ---------- Edit 5: comment out the vault:ha relation ---------- -rel = [i for i, ln in enumerate(lines) - if ("vault:ha" in ln and "vault-hacluster:ha" in ln - and not ln.lstrip().startswith("#"))] -if len(rel) != 1: - abort("expected exactly 1 uncommented vault:ha relation, found %d" % len(rel)) -i = rel[0] -ind = indent_of(lines[i]) -lines[i] = lines[i][:ind] + "# " + lines[i][ind:] - -# ---------- Backup + write ---------- -ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S") -bak = "%s.bak-%s" % (PATH, ts) -shutil.copy2(PATH, bak) -with open(PATH, "w", newline="") as fh: - fh.writelines(lines) - -print("Backup written: %s" % bak) -print("=== unified diff ===") -sys.stdout.writelines(difflib.unified_diff( - orig, lines, fromfile="bundle.yaml (before)", tofile="bundle.yaml (after)")) -print("") - -# ---------- Verify ---------- -try: - import yaml -except Exception: - print("NOTE: PyYAML not importable - semantic verification skipped; re-verify on jumphost.") - sys.exit(0) - -doc = yaml.safe_load(open(PATH)) -apps = doc["applications"] -rels = doc.get("relations") or [] -MIN = {"": "metal", "public": "provider"} -INT = {"": "metal"} -on_min = ["keystone", "ceph-radosgw", "openstack-dashboard", "octavia", "glance", - "nova-cloud-controller", "placement", "neutron-api", "cinder", - "barbican", "magnum"] -print("=== verification ===") -print("YAML parses: PASS") -ok = True -for a in on_min: - p = apps.get(a, {}).get("bindings") == MIN - ok &= p - print(" %-22s minimal api-bindings : %s" % (a, "PASS" if p else "FAIL")) -checks = [ - ("vault bindings == internal-bindings", 
apps.get("vault", {}).get("bindings") == INT), - ("vault has no options block", apps.get("vault", {}).get("options") in (None, {})), - ("vault-hacluster removed from apps", "vault-hacluster" not in apps), - ("vault:ha relation removed", not any("vault:ha" in pair for pair in rels)), -] -for desc, p in checks: - ok &= p - print(" %-34s : %s" % (desc, "PASS" if p else "FAIL")) -nvip = sum(1 for ap in apps.values() - if isinstance(ap, dict) and isinstance(ap.get("options"), dict) - and str(ap["options"].get("vip", "")).startswith("10.12.4.")) -pv = (nvip == 10) -ok &= pv -print(" %-34s : %s" % ("VIP count == 10 (was 11)", "PASS (%d)" % nvip if pv else "FAIL (%d)" % nvip)) -print("\nRESULT:", "ALL CHECKS PASS" - if ok else "FAILURES - revert: cp %s %s" % (bak, PATH)) -sys.exit(0 if ok else 2) diff --git a/overlays/vr0-dc0-testcloud.yaml b/overlays/vr0-dc0-testcloud.yaml deleted file mode 100644 index f7c8c79..0000000 --- a/overlays/vr0-dc0-testcloud.yaml +++ /dev/null @@ -1,20 +0,0 @@ -# Testcloud overlay for VR0 DC0 Omega Cloud -# -# STATUS: PLACEHOLDER — drafted alongside bundle.yaml. -# -# This overlay pins values specific to the 4-VM KVM testcloud at jumphost -# vopenstack-jesse. Roosevelt bare-metal would use a different overlay -# (overlays/roosevelt-prod.yaml — not in this repo) that swaps num_units to 3+, -# adjusts machine constraints to MAAS tags, and removes any KVM-specific -# config tuned for libvirt bridges. -# -# Per D-009, hacluster relations remain in the main bundle.yaml even though -# num_units=1 on testcloud. The overlay only changes num_units, not the -# relation graph. 
-# -# TODO during bundle drafting: -# - [ ] num_units=1 overrides per API charm -# - [ ] machine constraints (system-id pinning for openstack0-3) -# - [ ] bridge-interface-mappings for libvirt virbr1 (provider) -# - [ ] storage-backend config for cinder/glance pointing at Ceph -# - [ ] Octavia lb-mgmt-* network values (per LBaaS Management VLAN/prefix) diff --git a/review-bundle.py b/review-bundle.py deleted file mode 100644 index 665b1fa..0000000 --- a/review-bundle.py +++ /dev/null @@ -1,546 +0,0 @@ -#!/usr/bin/env python3 -""" -review-bundle.py -- comprehensive pre-deploy review of the Charmed OpenStack -Caracal 2024.1 IPv4-only bundle (VR0 / DC0 / Omega test cloud). - -READ-ONLY. Encodes every lesson learned from the 2026-05-28/29/30 deploy -sessions as a fail-closed check. Superset of audit-bundle-fixes.py. - -Severities: - FAIL deploy-blocker or known regression -> exit 1 - WARN review item / possible issue -> exit 1 only under --strict - INFO informational summary -> never affects exit - -Dependencies: PyYAML only (already used by the existing fix scripts); rest stdlib. -ASCII-only output by design (non-ASCII has caused silent daemon failures here). - -Usage: - python3 review-bundle.py [BUNDLE] [--strict] [--quiet] - BUNDLE path to bundle.yaml (default: ./bundle.yaml) - --strict treat WARN as failing for exit code - --quiet suppress PASS/INFO lines (show only WARN/FAIL) -""" - -import sys -import argparse -import ipaddress - -try: - import yaml -except ImportError: - sys.stderr.write("ERROR: PyYAML not installed (pip install pyyaml --break-system-packages)\n") - sys.exit(2) - -# --------------------------------------------------------------------------- # -# Config -- the known-good baseline. Adjust here if the design changes. 
-# --------------------------------------------------------------------------- # -EXPECTED_APPS = 51 -EXPECTED_RELATIONS = 98 - -PROVIDER_NET = ipaddress.ip_network("10.12.4.0/22") -METAL_NET = ipaddress.ip_network("10.12.8.0/22") -VIP_OCTET_MIN = 224 # MAAS reserved metal VIP range 10.12.8.224-254 (D-020) -VIP_OCTET_MAX = 254 - -# BUNDLEFIX-001: the 7 per-endpoint binding keys that were phantom and removed. -# Final anchors are {"":metal} and {"":metal, public:provider} -> none of these -# should reappear in any app's effective bindings. -PHANTOM_BINDING_KEYS = { - "admin", "internal", "shared-db", "amqp", "certificates", "cluster", "ha", -} - -# D-020 clustered-API charm -> provider VIP last octet (metal mirrors it). -EXPECTED_CLUSTERED = { - "barbican": 224, "cinder": 226, "glance": 228, "keystone": 229, - "magnum": 230, "neutron-api": 231, "nova-cloud-controller": 232, - "octavia": 233, "openstack-dashboard": 234, "placement": 235, -} - -# Verified Caracal channel matrix (from prior charmhub verification). -# WARN-only: channels can be intentionally pinned; flag deviation, do not block. 
-OPENSTACK_CORE_CHANNEL = "2024.1/stable" -OPENSTACK_CORE_CHARMS = { - "keystone", "glance", "cinder", "cinder-ceph", "nova-cloud-controller", - "nova-compute", "neutron-api", "neutron-api-plugin-ovn", "placement", - "octavia", "barbican", "magnum", "magnum-dashboard", "openstack-dashboard", -} -CHANNEL_MATRIX = { - "ovn-central": "24.03/stable", "ovn-chassis": "24.03/stable", - "ceph-mon": "squid/stable", "ceph-osd": "squid/stable", - "ceph-fs": "squid/stable", "ceph-radosgw": "squid/stable", - "mysql-innodb-cluster": "8.0/stable", "mysql-router": "8.0/stable", - "rabbitmq-server": "3.9/stable", "vault": "1.8/stable", -} -EXPECTED_BASE = "ubuntu@22.04" # jammy; Caracal-bundle paradigm (not noble) - -MAC_RE = None # compiled below -import re -MAC_RE = re.compile(r"([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}") - -# --------------------------------------------------------------------------- # -# Duplicate-key-detecting YAML loader (PyYAML silently keeps the last dup). -# --------------------------------------------------------------------------- # -_DUP_KEYS = [] - - -class DupKeyLoader(yaml.SafeLoader): - def construct_mapping(self, node, deep=False): - seen = set() - for key_node, _ in node.value: - try: - key = self.construct_object(key_node, deep=deep) - except Exception: - continue - if isinstance(key, (str, int, float, bool)) or key is None: - if key in seen: - _DUP_KEYS.append((str(key), key_node.start_mark.line + 1)) - seen.add(key) - return super().construct_mapping(node, deep) - - -# --------------------------------------------------------------------------- # -# Reporter -# --------------------------------------------------------------------------- # -class Reporter: - def __init__(self, quiet=False): - self.quiet = quiet - self.rows = [] # (section, level, code, msg) - self.counts = {"PASS": 0, "WARN": 0, "FAIL": 0, "INFO": 0} - - def add(self, section, level, code, msg): - self.rows.append((section, level, code, msg)) - self.counts[level] = self.counts.get(level, 
0) + 1 - - def emit(self): - section = None - for sec, level, code, msg in self.rows: - if self.quiet and level in ("PASS", "INFO"): - continue - if sec != section: - print("\n--- %s ---" % sec) - section = sec - print(" [%-4s] %-10s %s" % (level, code, msg)) - print("\n==================== SUMMARY ====================") - print(" PASS=%d WARN=%d FAIL=%d INFO=%d" - % (self.counts["PASS"], self.counts["WARN"], - self.counts["FAIL"], self.counts["INFO"])) - - -# --------------------------------------------------------------------------- # -# Helpers -# --------------------------------------------------------------------------- # -def ep_app(endpoint): - """'keystone:shared-db' -> 'keystone'. Non-str -> None.""" - if not isinstance(endpoint, str): - return None - return endpoint.split(":", 1)[0] - - -def in_net(addr, net): - try: - return ipaddress.ip_address(addr) in net - except ValueError: - return False - - -# --------------------------------------------------------------------------- # -# Checks -# --------------------------------------------------------------------------- # -def check_ascii(R, text): - sec = "0. Structure / integrity" - bad = [] - for i, line in enumerate(text.splitlines(), 1): - for ch in line: - if ord(ch) > 127: - bad.append((i, repr(ch))) - break - if bad: - for ln, ch in bad[:20]: - R.add(sec, "WARN", "NON-ASCII", - "non-ASCII char %s on line %d (non-ASCII has caused silent daemon failures here)" % (ch, ln)) - if len(bad) > 20: - R.add(sec, "WARN", "NON-ASCII", "...and %d more non-ASCII line(s)" % (len(bad) - 20)) - else: - R.add(sec, "PASS", "ASCII", "file is pure ASCII") - - -def check_structure(R, doc): - sec = "0. 
Structure / integrity" - if not isinstance(doc, dict): - R.add(sec, "FAIL", "STRUCT-00", "top-level YAML is not a mapping") - return None, None - if _DUP_KEYS: - for k, ln in _DUP_KEYS: - R.add(sec, "FAIL", "DUPKEY", "duplicate key '%s' near line %d" % (k, ln)) - else: - R.add(sec, "PASS", "DUPKEY", "no duplicate keys") - - apps = doc.get("applications") - rels = doc.get("relations") - if not isinstance(apps, dict): - R.add(sec, "FAIL", "STRUCT-APPS", "no 'applications' mapping") - apps = {} - if not isinstance(rels, list): - R.add(sec, "FAIL", "STRUCT-RELS", "no 'relations' list") - rels = [] - - na, nr = len(apps), len(rels) - R.add(sec, "INFO" if na == EXPECTED_APPS else "WARN", "APP-COUNT", - "applications=%d (baseline %d)" % (na, EXPECTED_APPS)) - R.add(sec, "INFO" if nr == EXPECTED_RELATIONS else "WARN", "REL-COUNT", - "relations=%d (baseline %d)" % (nr, EXPECTED_RELATIONS)) - return apps, rels - - -def check_relations(R, apps, rels): - sec = "1. Relation integrity" - bad_shape = miss_colon = dangling = 0 - for r in rels: - if not (isinstance(r, list) and len(r) == 2): - R.add(sec, "FAIL", "REL-SHAPE", "relation not a 2-element list: %r" % (r,)) - bad_shape += 1 - continue - for e in r: - if not isinstance(e, str) or ":" not in e: - R.add(sec, "FAIL", "REL-COLON", "endpoint missing colon: %r in %r" % (e, r)) - miss_colon += 1 - else: - a = ep_app(e) - if a not in apps: - R.add(sec, "FAIL", "REL-DANGLE", - "endpoint references unknown app '%s' in %r" % (a, r)) - dangling += 1 - if not (bad_shape or miss_colon or dangling): - R.add(sec, "PASS", "REL-INT", - "all relations well-formed, colon-explicit, both ends resolve to apps") - - -def check_bindings_phantom(R, apps): - sec = "2. 
BUNDLEFIX-001 (phantom binding keys)" - hits = 0 - for name, spec in apps.items(): - b = (spec or {}).get("bindings") - if not isinstance(b, dict): - continue - bad = sorted(set(b.keys()) & PHANTOM_BINDING_KEYS) - if bad: - R.add(sec, "FAIL", "PHANTOM", - "%s has phantom per-endpoint binding key(s): %s" % (name, ", ".join(bad))) - hits += 1 - if not hits: - R.add(sec, "PASS", "PHANTOM", - "no app reintroduces a removed phantom binding key (%s)" - % ", ".join(sorted(PHANTOM_BINDING_KEYS))) - - -def check_vault(R, apps, rels): - sec = "3. BUNDLEFIX-002 (vault de-HA)" - v = apps.get("vault") - if v is None: - R.add(sec, "WARN", "VAULT", "no 'vault' app found") - return - opts = (v or {}).get("options") or {} - if "vip" in opts: - R.add(sec, "FAIL", "VAULT-VIP", "vault has a 'vip' option (must be de-HA'd): %r" % opts["vip"]) - else: - R.add(sec, "PASS", "VAULT-VIP", "vault has no vip") - if "os-public-hostname" in opts: - R.add(sec, "WARN", "VAULT-HOST", "vault has os-public-hostname (expected removed)") - if "vault-hacluster" in apps: - R.add(sec, "FAIL", "VAULT-HA", "vault-hacluster application is present (must be removed)") - else: - R.add(sec, "PASS", "VAULT-HA", "no vault-hacluster application") - for r in rels: - if isinstance(r, list) and any(isinstance(e, str) and e.startswith("vault:ha") for e in r): - R.add(sec, "FAIL", "VAULT-HAREL", "vault:ha relation present: %r" % (r,)) - - -def map_hacluster(apps, rels): - """principal -> hacluster_app_name, using charm==hacluster + the :ha relation.""" - hac_apps = {n for n, s in apps.items() if (s or {}).get("charm") == "hacluster"} - principal_of = {} - for r in rels: - if not (isinstance(r, list) and len(r) == 2): - continue - a0, a1 = ep_app(r[0]), ep_app(r[1]) - if a0 in hac_apps and a1 and a1 not in hac_apps: - principal_of[a1] = a0 - elif a1 in hac_apps and a0 and a0 not in hac_apps: - principal_of[a0] = a1 - return hac_apps, principal_of - - -def check_hacluster(R, apps, rels): - sec = "4. 
BUNDLEFIX-003 (hacluster cluster_count)" - hac_apps, principal_of = map_hacluster(apps, rels) - if not hac_apps: - R.add(sec, "WARN", "HAC", "no hacluster apps found") - return principal_of - principal_for_hac = {h: p for p, h in principal_of.items()} - ok = 0 - for h in sorted(hac_apps): - opts = (apps[h].get("options") or {}) - cc = opts.get("cluster_count") - prin = principal_for_hac.get(h) - nu = (apps.get(prin, {}) or {}).get("num_units") if prin else None - if cc is None: - R.add(sec, "FAIL", "HAC-CC", "%s missing cluster_count" % h) - continue - if not prin: - R.add(sec, "WARN", "HAC-PRIN", "%s has no principal via :ha relation" % h) - if isinstance(nu, int) and cc > nu: - R.add(sec, "FAIL", "HAC-OVER", - "%s cluster_count=%s > principal %s num_units=%s" % (h, cc, prin, nu)) - continue - if cc != 1: - R.add(sec, "WARN", "HAC-NE1", - "%s cluster_count=%s (testcloud baseline is 1)" % (h, cc)) - else: - ok += 1 - if ok: - R.add(sec, "PASS", "HAC", "%d hacluster app(s) cluster_count=1 and <= principal num_units" % ok) - - -def check_memcached(R, apps, rels): - sec = "5. BUNDLEFIX-004 (memcached)" - if "memcached" not in apps: - R.add(sec, "FAIL", "MEMCACHE-APP", "no 'memcached' application") - else: - R.add(sec, "PASS", "MEMCACHE-APP", "memcached application present") - found = False - for r in rels: - if not (isinstance(r, list) and len(r) == 2): - continue - s = set() - for e in r: - if isinstance(e, str): - s.add(e) - if {"nova-cloud-controller:memcache", "memcached:cache"} <= s: - found = True - R.add(sec, "PASS" if found else "FAIL", "MEMCACHE-REL", - "nova-cloud-controller:memcache <-> memcached:cache relation %s" - % ("present" if found else "MISSING")) - - -def check_router_bindings(R, apps): - sec = "6. 
BUNDLEFIX-005 (mysql-router metal binding)" - routers = [n for n, s in apps.items() if (s or {}).get("charm") == "mysql-router"] - if not routers: - R.add(sec, "WARN", "ROUTER", "no mysql-router apps found") - return - bad = 0 - for n in sorted(routers): - b = (apps[n].get("bindings") or {}) - # effective default space is the "" key; anchors already resolved by yaml - default = b.get("", None) - non_metal = {k: v for k, v in b.items() if v not in ("metal",)} - if default == "metal" and not non_metal: - continue - if default != "metal": - R.add(sec, "FAIL", "ROUTER-BIND", - "%s default space binding is %r (expected metal)" % (n, default)) - bad += 1 - elif non_metal: - R.add(sec, "WARN", "ROUTER-BIND", - "%s has non-metal endpoint binding(s): %r" % (n, non_metal)) - if not bad: - R.add(sec, "PASS", "ROUTER-BIND", - "%d mysql-router app(s) bound to metal" % len(routers)) - - -def check_vips(R, apps, rels): - sec = "7. BUNDLEFIX-006 / D-020 (dual provider+metal VIPs)" - _, principal_of = map_hacluster(apps, rels) - clustered = sorted(principal_of.keys()) - # set comparison vs expected D-020 clustered set - got = set(clustered) - exp = set(EXPECTED_CLUSTERED) - if got != exp: - if exp - got: - R.add(sec, "WARN", "VIP-SET", "expected-clustered apps NOT detected as clustered: %s" - % ", ".join(sorted(exp - got))) - if got - exp: - R.add(sec, "WARN", "VIP-SET", "clustered apps beyond the D-020 set: %s" - % ", ".join(sorted(got - exp))) - ok = 0 - for name in clustered: - opts = (apps[name].get("options") or {}) - vip = opts.get("vip") - if not vip: - R.add(sec, "FAIL", "VIP-MISS", "%s is clustered but has no vip" % name) - continue - parts = str(vip).split() - if len(parts) != 2: - R.add(sec, "FAIL", "VIP-DUAL", "%s vip is not dual (got %r)" % (name, vip)) - continue - prov, metal = parts - if not in_net(prov, PROVIDER_NET): - R.add(sec, "FAIL", "VIP-PROV", "%s provider vip %s not in %s" % (name, prov, PROVIDER_NET)) - continue - if not in_net(metal, METAL_NET): - 
R.add(sec, "FAIL", "VIP-METAL", "%s metal vip %s not in %s" % (name, metal, METAL_NET)) - continue - po, mo = int(prov.split(".")[-1]), int(metal.split(".")[-1]) - if po != mo: - R.add(sec, "FAIL", "VIP-MIRROR", "%s octets differ: provider .%d vs metal .%d" % (name, po, mo)) - continue - if not (VIP_OCTET_MIN <= mo <= VIP_OCTET_MAX): - R.add(sec, "FAIL", "VIP-RANGE", - "%s metal vip octet .%d outside reserved %d-%d" % (name, mo, VIP_OCTET_MIN, VIP_OCTET_MAX)) - continue - expected_octet = EXPECTED_CLUSTERED.get(name) - if expected_octet is not None and po != expected_octet: - R.add(sec, "WARN", "VIP-OCTET", - "%s vip octet .%d != D-020 map .%d" % (name, po, expected_octet)) - ok += 1 - if ok: - R.add(sec, "PASS", "VIP-DUAL", - "%d clustered API charm(s) have mirrored dual VIPs in the reserved range" % ok) - - -def check_osd(R, apps): - sec = "8. Anti-pattern: ceph-osd osd-devices" - osds = [n for n, s in apps.items() if (s or {}).get("charm") == "ceph-osd"] - if not osds: - R.add(sec, "WARN", "OSD", "no ceph-osd app found") - return - for n in osds: - dev = (apps[n].get("options") or {}).get("osd-devices") - if not dev or not isinstance(dev, str) or not dev.strip().startswith("/"): - R.add(sec, "FAIL", "OSD-DEV", "%s osd-devices not a real path: %r" % (n, dev)) - else: - note = "" - if "/dev/disk/by-" not in dev: - note = " (kernel-name; by-path/by-id is harder for bare metal -- Roosevelt note)" - R.add(sec, "PASS", "OSD-DEV", "%s osd-devices=%s%s" % (n, dev.strip(), note)) - - -def check_ovn(R, apps): - sec = "9. 
Anti-pattern: ovn-chassis mappings (MAC over NIC name)" - chassis = [n for n, s in apps.items() if (s or {}).get("charm") == "ovn-chassis"] - if not chassis: - R.add(sec, "WARN", "OVN", "no ovn-chassis app found") - return - for n in sorted(chassis): - opts = (apps[n].get("options") or {}) - bim = opts.get("bridge-interface-mappings") - if not bim: - R.add(sec, "INFO", "OVN-BIM", "%s has no bridge-interface-mappings (expected for octavia-side chassis)" % n) - continue - if MAC_RE.search(str(bim)): - R.add(sec, "PASS", "OVN-BIM", "%s bridge-interface-mappings is MAC-based" % n) - else: - R.add(sec, "WARN", "OVN-BIM", - "%s bridge-interface-mappings has no MAC (NIC-name? fragile): %r" % (n, bim)) - - -def check_os_networks(R, apps, rels): - sec = "10. D-020: spaces-native (no os-*-network pinning)" - _, principal_of = map_hacluster(apps, rels) - flagged = 0 - for name in sorted(principal_of): - opts = (apps[name].get("options") or {}) - for k in ("os-internal-network", "os-admin-network", "os-public-network"): - if k in opts: - R.add(sec, "WARN", "OS-NET", - "%s sets %s (D-020 found spaces-native resolve sufficient; verify intent)" % (name, k)) - flagged += 1 - if not flagged: - R.add(sec, "PASS", "OS-NET", "no clustered charm pins os-*-network (spaces-native, per D-020)") - - -def expected_channel(charm): - if charm in CHANNEL_MATRIX: - return CHANNEL_MATRIX[charm] - if charm in OPENSTACK_CORE_CHARMS: - return OPENSTACK_CORE_CHANNEL - return None - - -def check_channels_base(R, apps): - sec = "11. 
Channels / base (verified Caracal matrix; WARN-only)" - mismatch = 0 - for name, spec in sorted(apps.items()): - spec = spec or {} - charm = spec.get("charm") - ch = spec.get("channel") - exp = expected_channel(charm) - if exp and ch and ch != exp: - R.add(sec, "WARN", "CHANNEL", "%s (%s) channel=%s expected=%s" % (name, charm, ch, exp)) - mismatch += 1 - base = spec.get("base") - series = spec.get("series") - if base and base != EXPECTED_BASE: - R.add(sec, "WARN", "BASE", "%s base=%s expected=%s" % (name, base, EXPECTED_BASE)) - if series and series not in ("jammy",): - R.add(sec, "WARN", "SERIES", "%s series=%s expected=jammy" % (name, series)) - if not mismatch: - R.add(sec, "PASS", "CHANNEL", "no charm deviates from the known Caracal channel matrix") - - -def summary_tables(R, apps, rels): - sec = "12. Inventory (informational)" - _, principal_of = map_hacluster(apps, rels) - for name in sorted(principal_of): - vip = ((apps[name].get("options") or {}).get("vip")) - R.add(sec, "INFO", "CLUSTERED", "%-26s vip=%s" % (name, vip)) - routers = sorted(n for n, s in apps.items() if (s or {}).get("charm") == "mysql-router") - R.add(sec, "INFO", "ROUTERS", "%d mysql-router apps: %s" % (len(routers), ", ".join(routers))) - - -# --------------------------------------------------------------------------- # -# Main -# --------------------------------------------------------------------------- # -def main(): - ap = argparse.ArgumentParser(description="Comprehensive Caracal bundle reviewer (read-only).") - ap.add_argument("bundle", nargs="?", default="bundle.yaml") - ap.add_argument("--strict", action="store_true", help="treat WARN as failing for exit code") - ap.add_argument("--quiet", action="store_true", help="show only WARN/FAIL") - args = ap.parse_args() - - try: - with open(args.bundle, "r", encoding="utf-8", errors="replace") as fh: - text = fh.read() - except FileNotFoundError: - sys.stderr.write("ERROR: bundle not found: %s\n" % args.bundle) - return 2 - - try: - doc 
= yaml.load(text, Loader=DupKeyLoader) - except yaml.YAMLError as e: - sys.stderr.write("ERROR: YAML parse failed: %s\n" % e) - return 2 - - R = Reporter(quiet=args.quiet) - print("================ Caracal v1 bundle review: %s ================" % args.bundle) - - check_ascii(R, text) - apps, rels = check_structure(R, doc) - if apps is None: - R.emit() - return 1 - check_relations(R, apps, rels) - check_bindings_phantom(R, apps) - check_vault(R, apps, rels) - check_hacluster(R, apps, rels) - check_memcached(R, apps, rels) - check_router_bindings(R, apps) - check_vips(R, apps, rels) - check_osd(R, apps) - check_ovn(R, apps) - check_os_networks(R, apps, rels) - check_channels_base(R, apps) - summary_tables(R, apps, rels) - - R.emit() - fail = R.counts["FAIL"] > 0 - warn = R.counts["WARN"] > 0 - if fail or (args.strict and warn): - print("\nVERDICT: NOT CLEAN" + (" (--strict: WARN counts)" if (warn and not fail) else "")) - return 1 - print("\nVERDICT: CLEAN" + (" (with WARN review items)" if warn else "")) - return 0 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/runbooks/01-destroy-model.md b/runbooks/01-destroy-model.md deleted file mode 100644 index d158b84..0000000 --- a/runbooks/01-destroy-model.md +++ /dev/null @@ -1,99 +0,0 @@ -# Runbook 01 — Teardown of existing testcloud - -**Reference:** D-018 (skip graceful, MAAS-release-direct). Supersedes the -graceful-teardown approach formerly in D-013. - -**Pre-conditions:** - -- KVM snapshots of openstack0–3 exist as the safety net (pre-Magnum - baseline). With L3 full rebuild (D-017) we should not need them, but they - remain valid disaster recovery. -- Run from jumphost `vopenstack-jesse` as user `jessea123`. -- Authenticated Juju session active (`juju whoami` returns identity). -- MAAS CLI profile configured OR access to MAAS UI for releasing machines. -- This procedure destroys the entire `openstack` Juju model and wipes all 5 - MAAS-managed VMs. There is no undo short of restoring from snapshot. 
- -**Phase A — Pre-destroy capture (~30 sec)** - -```bash -BACKUP_DIR=~/backups/pre-caracal-destroy-$(date -u +%Y%m%dT%H%M%SZ) -mkdir -p "$BACKUP_DIR" -juju export-bundle > "$BACKUP_DIR/bundle-pre-destroy.yaml" -juju status --format=yaml > "$BACKUP_DIR/juju-status-pre-destroy.yaml" -juju models --format=yaml > "$BACKUP_DIR/juju-models-pre-destroy.yaml" -ls -la "$BACKUP_DIR" -``` - -This is reference material for diff-checking against the new Caracal bundle -later. Not used for restore. - -**Phase B — Force-destroy the Juju model (~1-2 min to return; ~5-10 min to fully reap in background)** - -```bash -juju destroy-model openstack --force --no-wait --destroy-storage --no-prompt -``` - -Flags: - -- `--force` — ignore charm hooks; don't wait for graceful shutdown -- `--no-wait` — return immediately; reaping continues in the background -- `--destroy-storage` — mark Juju-tracked persistent storage for cleanup -- `--no-prompt` — non-interactive - -The Juju controller on `juju.maas` is untouched. Only the `openstack` model -is destroyed. - -**Phase C — Release MAAS machines (parallel with Phase B; ~5 min)** - -Either path is acceptable. UI is faster for visual confirmation; CLI is -script-documented for Roosevelt. - -**Path 1 — MAAS UI:** Machines → select `openstack0`, `openstack1`, -`openstack2`, `openstack3`, `capi-mgmt` → Take action → Release. - -**Path 2 — MAAS CLI:** - -```bash -# Replace $PROFILE with your MAAS CLI profile name (e.g. "admin") -PROFILE=admin - -# Look up system IDs -maas $PROFILE machines read 2>/dev/null \ - | jq -r '.[] | select(.hostname | test("^(openstack[0-3]|capi-mgmt)$")) | "\(.hostname) \(.system_id) \(.status_name)"' - -# Release each by system_id -for SID in ; do - maas $PROFILE machine release "$SID" comment="Caracal rebuild teardown" -done -``` - -LXD VMs managed by MAAS are destroyed on release; the VMs go away and the -machine entries return to Ready state. 
- -**Phase D — Verification (~1 min)** - -```bash -# Juju side -juju models -# Expect: openstack model not listed - -# MAAS side — all 5 hostnames must report Ready -maas $PROFILE machines read 2>/dev/null \ - | jq -r '.[] | select(.hostname | test("^(openstack[0-3]|capi-mgmt)$")) | "\(.hostname) \(.status_name)"' -# Expect five lines, each ending in "Ready" -``` - -**If the Juju model is still listed as "destroying" after 10 minutes:** - -```bash -# Force-clean any orphan machine entries -juju machines -m openstack --format=yaml 2>/dev/null -# For each lingering machine: -juju remove-machine -m openstack --force -# Then attempt model removal again -juju destroy-model openstack --force --no-wait --no-prompt -``` - -**Exit criteria:** `juju models` does not show `openstack`. All 5 VMs show -`Ready` in MAAS. Proceed to `02-deploy.md`. diff --git a/runbooks/README.md b/runbooks/README.md new file mode 100644 index 0000000..2195bde --- /dev/null +++ b/runbooks/README.md @@ -0,0 +1,47 @@ +# v1 Deploy Runbook -- VR0 DC0 Omega Cloud (Caracal 2024.1, IPv4) + +The deploy is a gated sequence: run `phase-00` through `phase-08` in order. Each phase +ends in a hard gate (an explicit pass/fail check); do not start the next phase until the +current gate passes. The two appendices are reference, not steps. + +## Conventions + +- **RUN location.** Every command block is tagged with where it runs: `# RUN: jumphost` + (the `vopenstack-jesse` jumphost, with `juju` + the openstack CLI), `# RUN: mgmt VM` + (the in-cloud CAPI management VM, reached over SSH), or a charm unit via + `juju ssh -- '...' cause -> fix index, keyed by + D-NNN / DOCFIX-NNN / lesson. +- **appendix-B-asbuilt-version-lock.md** -- charm channels, the CAPI version + constellation, and the magnum-capi-helm driver pin. 
+
+## History
+
+This `phase-NN` set supersedes the earlier `v1-do-doc-NN-*` execution documents (and the
+older `NN-*.md` set and the `deprecated/` folder), which were removed in the repo
+sanitation sweep. Git history preserves them.
diff --git a/runbooks/appendix-A-troubleshooting.md b/runbooks/appendix-A-troubleshooting.md
new file mode 100644
index 0000000..3b3a9ee
--- /dev/null
+++ b/runbooks/appendix-A-troubleshooting.md
@@ -0,0 +1,366 @@
+# Appendix A -- Troubleshooting / Known-Issues Index
+
+Keyed by the same `D-NNN` / `DOCFIX-NNN` / `L-P6-N` identifiers used inline in the
+phase runbooks. This is an OPERATIONAL index (symptom -> cause -> fix), NOT the
+decision log: full rationale lives in `design-decisions.md` and the per-decision
+files (`D-0NN-*.md`); the driver fix has its own `magnum-capi-helm-driver-fix-runbook`.
+Each entry notes the phase(s) that reference it. ASCII-only.
+
+================================================================================
+## Remote execution / scripting
+================================================================================
+
+### DOCFIX-021 -- heredoc / stdin consumption (phase-06, phase-07)
+- Symptom: a multi-line `juju ssh`/`ssh ... bash -s` or remote `sudo` block dies
+  early or behaves as if truncated; later commands in the heredoc never run.
+- Cause: an inner `ssh`/`sudo`/`juju ssh` (or any stdin reader) consumes the rest
+  of the heredoc/pipe that was feeding the outer command.
+- Fix: append ` openssl "Unable to load
+  certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND. Fix: pull from the action JSON
+  (real newlines, no indent): `juju run vault/leader get-root-ca -m openstack
+  --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'`.
+  (Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)
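
The jq filter in the fix above walks the whole action JSON and keeps the first string that contains a PEM header. A minimal Python sketch of the same idea follows, for cases where the JSON is already loaded in a script. The `sample` payload and its key names (`vault/0`, `results`, `output`) are hypothetical stand-ins, not the verified layout of the `get-root-ca` action output, and the traversal order may differ from jq's when more than one certificate string is present.

```python
import json


def first_certificate(node):
    """Depth-first search for the first string that contains a PEM certificate header."""
    if isinstance(node, str):
        return node if "BEGIN CERTIFICATE" in node else None
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        # Numbers, booleans, and null cannot hold a certificate.
        return None
    for child in children:
        found = first_certificate(child)
        if found is not None:
            return found
    return None


# Hypothetical action output (assumption): the real key layout may differ.
sample = json.loads(
    '{"vault/0": {"results": {"output": '
    '"-----BEGIN CERTIFICATE-----\\nAAAA\\n-----END CERTIFICATE-----"}}}'
)
pem = first_certificate(sample)
```

Because the value comes from parsed JSON, `pem` carries real newlines and no display indentation, which is the property the fix relies on.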
+ +### L-P6-4 -- admin-kubeconfig / secret transfer (phase-07) +- Risk: staging the cluster-admin kubeconfig (or any secret) in `/tmp`, or letting a + PTY mangle it in transit. +- Fix: pipe base64 straight into a root-written file with `umask 077`, then `chown` + to the service user and `chmod 0600` -- never touch `/tmp`. (Pattern in phase-07 7.2.) +- Hardening (Roosevelt): replace the cluster-admin kubeconfig with a scoped + ServiceAccount kubeconfig carrying only the RBAC the driver needs. + +================================================================================ +## k8s-snap bootstrap (mgmt cluster) +================================================================================ + +### DOCFIX-024 -- bootstrap config missing the cluster-config block (phase-06) +- Symptom: `k8s bootstrap` "succeeds" but the node never reaches Ready; network and + DNS are silently disabled; CoreDNS/Cilium absent. +- Cause: a bootstrap `--file` whose top level lacks a `cluster-config:` block leaves + ALL features (network, dns, ...) at disabled defaults. Setting only `pod-cidr` / + `service-cidr` / `extra-sans` does NOT enable them. +- Fix: include an explicit block: + cluster-config: + network: { enabled: true } + dns: { enabled: true } + (See phase-06 6.4 for the full config.) Retry: `snap remove k8s --purge` then re-bootstrap. + +================================================================================ +## CAPI provider install (mgmt cluster) +================================================================================ + +### DOCFIX-025a -- cert-manager Helm flag (phase-06) +- Symptom: cert-manager install fails / CRDs absent when using `--set installCRDs=true`. +- Cause: `installCRDs` was removed from the cert-manager chart (~v1.18). The current + flag is `crds.enabled=true`. +- Fix: `helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true`. 
+ +### D-034 -- CAPI install ordering (ORC before clusterctl init) (phase-06) +- Symptom: after `clusterctl init`, `capo-controller-manager` CrashLoopBackOff + (observed ~6 restarts / ~15 min) before self-healing. +- Cause: CAPO v0.14.4's `openstackserver` controller hard-depends on ORC's + `Image.openstack.k-orc.cloud` CRD at startup. `clusterctl init` installs CAPO; if + ORC is not yet present, CAPO crash-loops until it appears. +- Fix: install ORC (its manifest provides the `Image` CRD) BEFORE `clusterctl init`. + Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor. +- Related rule: source every provider version from the chosen `capi-helm-charts` + tag's `dependencies.json` (read live with `jq`); do not hardcode semver. + (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.) + +================================================================================ +## Networking / pod egress +================================================================================ + +### D-035 -- dual-homed mgmt node pod-egress reverse-path failure (phase-06) +- Symptom (the prior D-033 architecture): a pod's egress TCP connect to an external + VIP hangs; the agnhost probe never reaches Completed. SYN leaves the correct NIC and + the SYN-ACK arrives, but the reply is emitted back out the NIC instead of being + redirected into the pod via `cilium_host` -- silent, asymmetric breakage. (The + "do-07 pattern.") +- Cause: Cilium reverse-path handling on a node with multiple NICs. +- Fix (chosen): D-035 single-homed in-cloud tenant VM avoids it entirely; phase-06 + GATE 2 (agnhost pod -> Keystone VIP, must Complete) is the explicit proof. (The + transferable alternative -- Cilium device pinning -- is a Roosevelt note, not v1.) 
+ +================================================================================ +## Magnum conductor +================================================================================ + +### D-037 -- conductor config-dir injection (NOT a systemd ExecStart drop-in) (phase-07) +- Symptom: the `[capi_helm]` conf.d drop-in is ignored; the conductor behaves as if it + was never written, even though a systemd drop-in "looks" applied. +- Cause: these OpenStack debs (openstack-pkg-tools) run the daemon through an LSB init + script wrapped by systemd `systemd-start`, NOT a direct `ExecStart=`. A systemd + drop-in appending `--config-dir` passes it as a positional arg to the init script, + which ignores it -- the flag never reaches the daemon. The args are assembled inside + the init script from `DAEMON_ARGS` (base `--config-file` first), extensible only via + `/etc/default/`. +- Fix: create `/etc/default/magnum-conductor` (0644; the charm does not manage it): + DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d" + Verify with the init script's own `show-args` (dry-run) AND `ps -ww -C + magnum-conductor -o args` on the live process -- behavioral, not string-presence. +- Residual: if a future charm hook ever writes `/etc/default/magnum-conductor`, the + append is lost and `[capi_helm]` silently stops being read. Re-check via show-args/ps. + +### L-P6-1 / L-P6-2 -- verify the launched cmdline, not the unit text (phase-07) +- Rule: never assume the systemd `ExecStart` shape for OpenStack debs, and never treat + "string present in the unit file" as "the daemon received the flag." Gate on the + assembled/launched cmdline (`show-args`, then `ps` on the live process). + +### L-P6-3 -- k8s version comes from the IMAGE, not a template label (phase-08) +- Symptom: cluster create fails in the driver before provisioning. 
+- Cause: the magnum-capi-helm driver reads `kube_version` from the Glance image + properties and routes on `os_distro`; it does NOT take k8s version from a template + label. +- Fix: the workload image (e.g. `ubuntu-jammy-kube-v1.32.13`) MUST carry + `kube_version` (e.g. v1.32.13) and `os_distro=ubuntu`. Verify before create (phase-08 8.0). + +================================================================================ +## Driver / cluster health +================================================================================ + +### D-042 -- driver contract-coherence; health "infrastructure: not found" (phase-07, phase-08, appendix-B) +- Symptom: `coe cluster show` reports `health_status = UNHEALTHY` deterministically + (survives a conductor restart); only the `infrastructure` sub-check fails + ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready. +- Cause: driver 1.3.0 reads `apiVersion` off `spec.infrastructureRef` to build its + health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with + NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the + resource versions); only the driver's direct health query breaks. +- Fix: upgrade to the RELEASED `magnum-capi-helm==1.4.0` (the "generalize-api-resources" + feature). 1.4.0 builds each health GET from an explicit api_version via its + `[capi_helm] api_resources` option, which DEFAULTS to v1beta1 for every CAPI kind -- + and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override + needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only. + Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the + Layer-A CAPI core. +- Operational caveat while unfixed: do NOT wire magnum auto-healing to `health_status` + (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently. 
+ +================================================================================ +## Cluster lifecycle / Octavia +================================================================================ + +### D-039 -- app-cred roles (load-balancer_member) / Octavia 403 (phase-08) +- Symptom: cluster create or delete wedges; CAPO gets 403 querying the Octavia LB. +- Cause: the Magnum-minted application credential lacks `load-balancer_member` + (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state). +- Fix: ensure the service path mints app-creds carrying `load-balancer_member` + (+ member, reader). Verify before acceptance (phase-08 prereqs). + +### stuck-delete -- wedged CAPI cluster delete (phase-08) +- Symptom: cluster stuck `DELETE_IN_PROGRESS`; helm release already gone; `Cluster` + and `OpenStackCluster` CRs stuck Deleting (often on an Octavia 403, see D-039). +- Recovery: clear the `OpenStackCluster` finalizer on the mgmt cluster -- + `kubectl -n patch openstackcluster - --type=merge + -p '{"metadata":{"finalizers":[]}}'`. The `Cluster` finalizer was only waiting on it, + so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron + resources in dependency order: router remove subnet -> router unset external-gateway + -> router delete -> subnet delete -> network delete -> security group delete. + +### LB-failover -- LB stuck provisioning_status=ERROR after a host event (phase-08) +- Symptom: the kube-api Octavia LB shows `operating_status ONLINE` but + `provisioning_status ERROR` after a host outage/OOM. +- Cause: a control-plane op on the amphora failed during the outage. +- Fix: `openstack loadbalancer failover ` in ADMIN-project scope (amphora / + failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE + (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE. 
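The ERROR -> PENDING_UPDATE -> ACTIVE watch in the LB-failover entry can be scripted as a bounded poll instead of being eyeballed. This is a minimal sketch, not part of the runbook: `lb_status` and `wait_for_lb_active` are hypothetical helper names, the LB ID is supplied by the operator, and the default of 30 polls at 10s is an assumption chosen to comfortably bracket the ~100s observed above.

```bash
#!/usr/bin/env bash
# Sketch only: poll a status getter until it reports ACTIVE, or give up.
# Helper names, poll count, and interval are illustrative assumptions.

# Hypothetical getter: prints the LB's provisioning_status.
# Must run in ADMIN-project scope, as the entry above requires.
lb_status() {
  openstack loadbalancer show "$1" -f value -c provisioning_status
}

# wait_for_lb_active GETTER LB_ID [TRIES] [DELAY_SECONDS]
# Returns 0 once GETTER prints ACTIVE, 1 if TRIES polls pass without it.
wait_for_lb_active() {
  local getter=$1 lb_id=$2 tries=${3:-30} delay=${4:-10}
  local i status
  for ((i = 1; i <= tries; i++)); do
    # A failed query is reported as "unknown" rather than aborting the watch.
    status=$("$getter" "$lb_id" 2>/dev/null || true)
    echo "poll $i/$tries: provisioning_status=${status:-unknown}"
    if [ "$status" = "ACTIVE" ]; then
      return 0
    fi
    # Skip the pause after the final poll.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

Typical use, after the failover has been issued: `wait_for_lb_active lb_status "$LB_ID"`. Passing the getter as an argument keeps the loop testable without a live cloud.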
+ +### uninitialized-taint -- workload addons Pending (phase-08) +- Symptom: new workload nodes are kubelet-Ready but addon pods (metrics-server, + node-feature-discovery, etc.) stay Pending; nodes carry + `node.cluster.x-k8s.io/uninitialized`. +- Cause: that taint is removed by the CAPI machine controller on the MANAGEMENT + cluster. If the mgmt cluster is down (see D-041), the taint persists. +- Fix: restore the mgmt cluster API; CAPI then removes the taint and addons schedule. + +### CNI-label -- network_driver vs the chart-default Calico (1.4.0) (phase-08) +- Note: under the as-FIRST-built driver 1.3.0 the legacy Magnum `network_driver` label + was IGNORED and the capi-helm `openstack-cluster` chart's default CNI (Calico) always + ran. Under the RELEASED 1.4.0 driver the `network_driver` template option IS honored + (it maps through to the chart). To keep the as-built CNI (Calico), the `capi-k8s-v1-32` + template OMITS `--network-driver` (phase-08); set `flannel` there only to intentionally + switch the CNI. (Mgmt cluster CNI is separately Cilium, via k8s-snap.) + +================================================================================ +## Hyperconverged host / mgmt-VM resilience +================================================================================ + +### D-040 -- host OOM from low reserved-host-memory (phase-08) +- Symptom: guests OOM-killed; a compute host may even present in Juju as + `State=down` (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent). +- Cause: `reserved-host-memory` default 512 MB does not cover the co-located + LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM. +- Fix: `reserved-host-memory = 8192` on all compute units (baked into the hardened + bundle). Diagnose a suspected OOM-vs-reboot with `who -b` / `uptime` (no recent boot) + and `journalctl -k | grep -i oom`; the ovsdb "no response to inactivity probe ... + disconnecting" storm is the swap-thrash signature. 
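The OOM-vs-reboot diagnosis in D-040 reduces to two questions: did the host reboot recently, and does the kernel log mention the OOM killer? A small sketch of the second check follows; `count_oom_lines` is a hypothetical helper name, and it deliberately mirrors the entry's loose `grep -i oom` match rather than a stricter pattern.

```bash
#!/usr/bin/env bash
# Sketch only: the helper name is hypothetical.

# Count kernel-log lines mentioning "oom" (case-insensitive), reading stdin.
# Like the entry's `grep -i oom`, this would also match unrelated words
# that happen to contain "oom".
count_oom_lines() {
  # grep -c prints 0 and exits 1 when nothing matches; 0 is a valid answer,
  # so the exit status is neutralised instead of being treated as a failure.
  grep -ci 'oom' || true
}
```

On a suspect host this would be fed from the kernel journal (`journalctl -k | count_oom_lines`) alongside `who -b` and `uptime`; a non-zero count with no recent boot points at OOM rather than a reboot.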
+ +### D-041 -- single-node mgmt cluster does not self-heal (phase-08) +- Symptom: after a host event the mgmt VM (`capi-mgmt-v2`) is SHUTOFF; FIP + unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see + uninitialized-taint). +- Cause: the D-035 single-node mgmt cluster is a SPOF with no MachineHealthCheck + (unlike the workload cluster). +- Fix: `openstack server start capi-mgmt-v2` (API serves ~40s later; a brief TLS + handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for + Roosevelt. + +### juju-macaroon -- "cannot get discharge ... EOF" (phase-07, phase-08) +- Symptom: `juju ssh` (or other juju calls) fail mid-session with a discharge/EOF error. +- Cause: the juju macaroon expired during a long session. +- Fix: re-run `juju login`, then retry. + +================================================================================ +## Teardown / MAAS reset (phase-00) +================================================================================ + +### DOCFIX-016 -- never `maas list` (API-key leak) (phase-00, phase-01, phase-04) +- Risk: `maas list` prints the stored API key to stdout (and into any transcript/log). +- Fix: the profile name is known (`admin`); call `maas admin ...` directly. Never run + `maas list` in a runbook or paste block. + +### DOCFIX-017 -- no `maas whoami`; hardcode the eyeballed system_ids (phase-00) +- Risk: scripting machine selection via `maas whoami` + owner filters is + fragile and, in this lab, unnecessary. +- Fix: the four host system_ids are fixed and eyeball-verified + (openstack0=4na83t, openstack1=qdbqd6, openstack2=h8frng, openstack3=tmsafc) -- + iterate those literals. (The older 01-destroy-model.md used `maas list`/`whoami` and + released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.) 
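DOCFIX-016 and DOCFIX-017 together amount to: use the known profile name and iterate the four literal system_ids. The sketch below shows that shape as a dry run that only prints commands. The profile name and IDs are the ones listed above; `print_release_plan` is a hypothetical helper, and using `machine release` as the per-host action is an assumption based on the entry's note that the rebuild releases 4 machines.

```bash
#!/usr/bin/env bash
# Sketch only: prints the per-host commands instead of running them.
# No `maas list`, no `maas ... whoami`: the profile and IDs are literals.

MAAS_PROFILE=admin
HOST_IDS=(
  "openstack0=4na83t"
  "openstack1=qdbqd6"
  "openstack2=h8frng"
  "openstack3=tmsafc"
)

# Emit one command line per host, annotated with the hostname for eyeballing.
print_release_plan() {
  local entry host id
  for entry in "${HOST_IDS[@]}"; do
    host=${entry%%=*}   # text before the first '='
    id=${entry#*=}      # text after the first '='
    echo "maas $MAAS_PROFILE machine release $id  # $host"
  done
}

print_release_plan
```

Printing first and executing only after an eyeball check keeps the fixed-literal approach auditable in a transcript without exposing the API key.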
+ +### R7 -- sudo for libvirt / qemu-img (phase-00, phase-01) +- The OSD qcow2 files (`/var/lib/libvirt/images/-1.qcow2`) are root:root / 600; + `qemu-img info|create`, `virsh domstate`, `stat`, `rm` against them all need `sudo`. + +### KI-P3-001 -- VIP / primary collision (phase-00, phase-04) +- Symptom: a charm `vip:` address equals a MAAS-auto-assigned machine/container + primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary). +- Cause: MAAS auto-static allocation was not excluded over the VIP block (provider had + NO VIP reservation), so MAAS handed primaries .225/.226/.227 onto the .224-.236 VIPs. +- Fix (durable): on EVERY space carrying VIPs (provider AND metal) reserve the + front-loaded VIP /26 in MAAS, distinct from the primary range and any neutron + allocation_pool (phase-00 Phase 4). A reserved range stops future auto-assign onto + a configured VIP. Negative test post-deploy: no service vip == any unit primary. + +================================================================================ +## Deploy-time (phase-01) +================================================================================ + +### R14 -- VIP relocation .224-.236 -> .50-.60 (phase-01) +- The public + internal API VIPs were front-loaded out of the old high-end .224-.236 + block into .50-.60 (inside the reserved .2-.63 /26). Every bundle `vip:` is a dual + provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider + VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud + consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed. + +### R15 -- the .10 phantom resolver (phase-01) +- Symptom: an unreachable region resolver `10.12.8.10` appears in a node's resolver + list (sometimes as Current DNS Server) despite the subnet dns_servers override. 
+- Cause: MAAS advertises its region/rack controller as a DNS server on the + MAAS-managed metal VLAN, independent of the subnet field; the override does not purge it. +- Impact: NON-BLOCKING -- systemd-resolved deprioritizes .10 and falls through to .1. + Latent fragility if .1 ever drops. Understand/eliminate for Roosevelt (no libvirt split there). + +### L1 -- no `set -e` on count-gate blocks; guard greps `|| true` (phase-01) +- A guarded `grep -c` returning 0 is a VALID answer, not a failure. Under `set -e` a + zero-count grep aborts the block. Pre-deploy verify blocks run WITHOUT `set -e`, and + every count grep ends `|| true`. (`bash -n` would not catch this -- it is behavior.) + +### L3 -- metal-side dual-VIP eyeball check (phase-01) +- The provider-side VIP guard greps only the first token of each dual `vip:`. The metal + side (second token, `10.12.8.5x`) must be eyeballed to confirm all 11 sit in .8.50-.60, + clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju). + +================================================================================ +## Vault / secrets (phase-02) +================================================================================ + +### DOCFIX-006 -- vault init is one-shot; stdout-only redirect loses the keys (phase-02) +- Symptom: `vault operator init ... > file` captures stdout only; if the key block went + to stderr (or the run is interrupted) you are left with an unusable/empty file and the + 5 shares + root token are GONE -- init runs exactly once and cannot be replayed. +- Fix: `vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt` + VERBATIM; gate on `grep -c '^Unseal Key' == 5` and `Initial Root Token` present; then + save the file OFF-HOST before anything else. Never improvise this command. 
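The DOCFIX-006 gate (exactly 5 `Unseal Key` lines plus an `Initial Root Token` line in the captured file) can be expressed as a small check that follows the L1 rule about guarded count greps. This is a sketch under stated assumptions: `check_vault_init` is a hypothetical function name, and it inspects only the capture file written by the verbatim `tee` command; it never runs `vault operator init` itself.

```bash
#!/usr/bin/env bash
# Sketch only: the function name is hypothetical.

# check_vault_init FILE
# Returns 0 only if FILE holds 5 unseal-key lines and a root-token line.
# Prints counts only, never the key or token material.
check_vault_init() {
  local file=$1 keys token
  # Per L1: a zero count is a valid answer, so each count grep ends `|| true`.
  keys=$(grep -c '^Unseal Key' "$file" 2>/dev/null || true)
  token=$(grep -c '^Initial Root Token' "$file" 2>/dev/null || true)
  if [ "${keys:-0}" -eq 5 ] && [ "${token:-0}" -ge 1 ]; then
    echo "PASS: 5 unseal keys and a root token captured in $file"
    return 0
  fi
  echo "FAIL: keys=${keys:-0} token_lines=${token:-0} in $file -- do NOT proceed"
  return 1
}
```

It would be called as `check_vault_init ~/vault-init/init.txt` immediately after the init, before the file is copied off-host. A missing or empty file yields counts of 0 and a FAIL, which is the case the entry warns about.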
+ +### DOCFIX-011 -- authorize-charm parameter is `token` (phase-02) +- The vault `authorize-charm` action takes `token` (a direct token string); there is no + `token-secret-id` variant in this charm rev. Confirm via `juju actions vault --schema`. + Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log). + +### DOCFIX-014 -- generate-root-ca is required (phase-02) +- Symptom: after authorize-charm, vault stays BLOCKED "Missing CA cert". +- Fix: run `juju run vault/leader generate-root-ca` -- it mints the charm-pki-local + root and clears the block straight to active. (Omitting it leaves vault hung.) + +### L4 -- vault unseal via hidden prompt, not key-on-argv (phase-02) +- Use Vault's own `vault operator unseal` (no argument) so it prompts hidden; the key is + never on the command line / in a var / in `ps` / in scrollback. Do NOT use + `vault operator unseal $KEY` (visible in `ps` on the unit). Unseal is re-runnable, so + the verbatim-reference rule is looser here, but the security gain is real. + +### R3 -- "HA Enabled false" is correct for vault-on-mysql (phase-02) +- Expected post-unseal: Initialized true / Sealed false / Storage Type mysql / + **HA Enabled false**. Single-unit vault on the mysql backend is non-HA by design; any + reference to "HA Enabled true (etcd backend)" is STALE (etcd was dropped). + +================================================================================ +## Identity / openrc (phase-03) +================================================================================ + +### DOCFIX-018 -- IP-only OS_AUTH_URL (phase-03) +- This cloud is IP-only (no FQDN, no cloud DNS). The admin openrc must point at the + keystone PUBLIC endpoint by IP: `OS_AUTH_URL=https://10.12.4.50:5000/v3`, with the + vault root CA in `OS_CACERT` (B5 IP-SAN certs validate). No /etc/hosts, no FQDN. 
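The IP-only rule in DOCFIX-018 is easy to regress by pasting an FQDN-based openrc from another cloud. A minimal sketch of the two relevant exports plus a guard follows. The auth URL is the one given above; the CA file path is a hypothetical location, and `auth_url_is_ip_only` is an illustrative helper, not something the runbook defines.

```bash
#!/usr/bin/env bash
# Sketch only: CA path and helper name are assumptions.

export OS_AUTH_URL=https://10.12.4.50:5000/v3
export OS_CACERT="$HOME/vault-root-ca.pem"  # hypothetical: wherever the vault root CA was saved

# auth_url_is_ip_only URL
# Returns 0 if URL is https:// followed by a dotted-quad IPv4 host.
auth_url_is_ip_only() {
  local url=$1
  local re='^https://([0-9]{1,3}\.){3}[0-9]{1,3}(:[0-9]+)?(/.*)?$'
  [[ $url =~ $re ]]
}

if auth_url_is_ip_only "$OS_AUTH_URL"; then
  echo "OS_AUTH_URL is IP-only"
else
  echo "OS_AUTH_URL is not IP-only -- fix before sourcing"
fi
```

The guard checks the URL's shape only; whether the certificate's IP SAN actually validates is still proven by a real token request.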
+ +### DOCFIX-022 -- discover the admin project; do not hardcode it (phase-03) +- Symptom: with TLS working, keystone returns HTTP 401. +- Cause: wrong project scope. The scoping project name varies by charm rev (here it is + `admin`, living in domain `admin_domain`; an older doc's `OS_PROJECT_NAME=admin_domain` + 401s). Credential good, scope wrong. +- Fix: a candidate loop -- try each of "admin admin_domain"; the first that issues a + SCOPED token wins (phase-03 3.2). Costs 2 extra token requests; self-corrects across + revs instead of re-introducing the 401-by-hardcode. + +================================================================================ +## Octavia enablement (phase-05) +================================================================================ + +### L7 -- the openstack snap cannot read /tmp (phase-05, also phase-01 PKI sanity) +- Symptom: `openstack image create --file /tmp/...` -> "[Errno 2] No such file or + directory" even though `sha256sum` just read the same path. +- Cause: the openstack CLI snap is confined and cannot read `/tmp`; it CAN read `$HOME` + (home interface). +- Fix: stage any file the snap must read under `$HOME` (e.g. `$HOME/amphora-base/...`), + never `/tmp`. + +### octavia-configure-resources -- long-running action; o-hm0 transient is normal (phase-05) +- `configure-resources` is long-running: juju's default action wait may time out + ("timed out waiting for results") while the hook KEEPS RUNNING -- do NOT treat the + wait-timeout as failure or re-fire blindly. Use a bound `--wait` and confirm completion + via `juju show-operation ` (authoritative), not the streamed log. +- NORMAL (not faults) during/after: lb-mgmt-net is IPv6-ULA (fc00::/..) by design; a + "Virtual network for access to Amphorae is down" transient self-heals as o-hm0 comes + up; the lb-mgmt `network:distributed` port shows DOWN (logical OVN port, never chassis-bound). 
+ +### amp-image-tag-mismatch -- LP#1937003 (phase-05) +- Octavia looks up the amphora image by `octavia amp-image-tag`; it MUST equal the tag + the retrofit stamps (`octavia-diskimage-retrofit amp-image-tag`), both `octavia-amphora`. + A mismatch means octavia cannot find the image even though it is built and ACTIVE. + The amphora pipeline gate asserts the two are equal before building (phase-05 5.2). + +================================================================================ +## Notes +================================================================================ +- This index covers phases 00-08. It grows the same way for any future phase: keyed by + D-NNN / DOCFIX-NNN / L-N / R-N / named-symptom, each entry symptom -> cause -> fix + with a "phase NN" back-reference, and decision rationale left to design-decisions.md. +- memcached track drift is recorded in appendix-B (B.1), not here (it is a + version-lock note, not a troubleshooting entry). diff --git a/runbooks/appendix-B-asbuilt-version-lock.md b/runbooks/appendix-B-asbuilt-version-lock.md new file mode 100644 index 0000000..add0350 --- /dev/null +++ b/runbooks/appendix-B-asbuilt-version-lock.md @@ -0,0 +1,139 @@ +# Appendix B -- As-Built Version / Channel / Revision Lock + +Source: `juju export-bundle` (model `openstack`) + the in-cloud mgmt-cluster +captures, 2026-06-09. ASCII-only. + +POLICY (D-002 + consolidation prompt): the bundle PINS CHANNELS, not revisions. +This appendix records the as-built REVISIONS as the known-good baseline. A fresh +deploy resolving a channel to a higher revision than below is EXPECTED -- treat +this as "last-known-good," verify against Charmhub at pre-flight, and refresh the +table on a successful validated deploy. 
+ +## B.1 Charm channels + as-built revisions + +| Application | Charm | Channel (pinned) | As-built rev | +| ------------------------------- | -------------------------- | ------------------ | ------------ | +| barbican | barbican | 2024.1/stable | 209 | +| barbican-hacluster | hacluster | 2.4/stable | 131 | +| barbican-mysql-router | mysql-router | 8.0/stable | 1154 | +| barbican-vault | barbican-vault | 2024.1/stable | 75 | +| ceph-mon | ceph-mon | squid/stable | 268 | +| ceph-osd | ceph-osd | squid/stable | 632 | +| ceph-radosgw | ceph-radosgw | squid/stable | 600 | +| ceph-radosgw-hacluster | hacluster | 2.4/stable | 131 | +| cinder | cinder | 2024.1/stable | 733 | +| cinder-ceph | cinder-ceph | 2024.1/stable | 533 | +| cinder-hacluster | hacluster | 2.4/stable | 131 | +| cinder-mysql-router | mysql-router | 8.0/stable | 1154 | +| dashboard-mysql-router | mysql-router | 8.0/stable | 1136 | +| glance | glance | 2024.1/stable | 642 | +| glance-hacluster | hacluster | 2.4/stable | 131 | +| glance-mysql-router | mysql-router | 8.0/stable | 1154 | +| glance-simplestreams-sync | glance-simplestreams-sync | 2024.1/stable | 124 | +| keystone | keystone | 2024.1/stable | 778 | +| keystone-hacluster | hacluster | 2.4/stable | 131 | +| keystone-mysql-router | mysql-router | 8.0/stable | 1154 | +| magnum | magnum | 2024.1/stable | 70 | +| magnum-dashboard | magnum-dashboard | 2024.1/stable | 59 | +| magnum-hacluster | hacluster | 2.4/stable | 131 | +| magnum-mysql-router | mysql-router | 8.0/stable | 1154 | +| memcached | memcached | latest/stable | 39 | +| mysql-innodb-cluster | mysql-innodb-cluster | 8.0/stable | 159 | +| ncc-mysql-router | mysql-router | 8.0/stable | 1136 | +| neutron-api | neutron-api | 2024.1/stable | 650 | +| neutron-api-hacluster | hacluster | 2.4/stable | 131 | +| neutron-api-mysql-router | mysql-router | 8.0/stable | 1154 | +| neutron-api-plugin-ovn | neutron-api-plugin-ovn | 2024.1/stable | 178 | +| nova-cloud-controller | nova-cloud-controller 
| 2024.1/stable | 795 | +| nova-cloud-controller-hacluster | hacluster | 2.4/stable | 131 | +| nova-compute | nova-compute | 2024.1/stable | 827 | +| octavia | octavia | 2024.1/stable | 441 | +| octavia-dashboard | octavia-dashboard | 2024.1/stable | 120 | +| octavia-diskimage-retrofit | octavia-diskimage-retrofit | 2024.1/stable | 196 | +| octavia-hacluster | hacluster | 2.4/stable | 131 | +| octavia-mysql-router | mysql-router | 8.0/stable | 1154 | +| openstack-dashboard | openstack-dashboard | 2024.1/stable | 728 | +| openstack-dashboard-hacluster | hacluster | 2.4/stable | 131 | +| ovn-central | ovn-central | 24.03/stable | 311 | +| ovn-chassis | ovn-chassis | 24.03/stable | 396 | +| ovn-chassis-octavia | ovn-chassis | 24.03/stable | 396 | +| placement | placement | 2024.1/stable | 125 | +| placement-hacluster | hacluster | 2.4/stable | 131 | +| placement-mysql-router | mysql-router | 8.0/stable | 1154 | +| rabbitmq-server | rabbitmq-server | 3.9/stable | 295 | +| vault | vault | 1.8/stable | 372 | +| vault-mysql-router | mysql-router | 8.0/stable | 1136 | + +Notes: +- memcached is on `latest/stable` (rev 39) -- the only charm not on a versioned + track. AT PRE-FLIGHT run `juju info memcached` to list available tracks; if no + stable versioned track exists, either pin revision 39 explicitly in the bundle + or accept `latest/stable` knowingly. Flagged as a drift candidate. +- mysql-router subordinates show mixed as-built revisions (most 1154; the + ncc/dashboard/vault routers at 1136) on the SAME `8.0/stable` channel. This is + benign under channel-pinning (all resolve to current `8.0/stable` on redeploy); + recorded only for completeness. +- EXCLUDED from the bundle: the `k8s` charm (channel `1.32/stable`) deployed on + Juju machine 4 / MAAS `capi-mgmt` (10.12.4.100). That is the retired D-033 + out-of-cloud node, slated for Phase 7 teardown; the in-cloud mgmt cluster + (D-035) replaces it. It is intentionally absent here. 
+ +## B.2 In-cloud management cluster + CAPI constellation (D-034 / D-035 / D-037) + +Node `capi-mgmt-v2` (FIP 10.12.7.40, internal 10.20.0.45), single-node, non-CAPI-managed: +- k8s-snap: channel `1.32-classic/stable`, rev 5326, k8s v1.32.13 (classic confinement) +- CAPI core + kubeadm-bootstrap + kubeadm-control-plane: v1.13.2 +- CAPO (infra provider): v0.14.4 +- cert-manager: v1.20.2 +- ORC: v2.5.0 [install BEFORE `clusterctl init` -- CAPO v0.14.4 hard-deps the ORC Image CRD] +- CAAPH (cluster-api-addon-provider): chart 0.12.0 (`helm --version`, from dependencies.json; deploys image 62f7c00) +- cluster-api-janitor-openstack: chart 0.11.0 (`helm --version`, from dependencies.json; deploys image d527847) +- cluster-autoscaler (per-workload): v1.30.4 +- Mgmt CNI: Cilium 1.17.12-ck0. Workload-cluster CNI: Calico (chart default). + +VERSION-SOURCE RULE (D-034): every provider ref above is read live from the chosen +`capi-helm-charts` release tag's `dependencies.json` via `jq`. DO NOT hardcode +semver in IaC -- this table is a snapshot for redeploy comparison only. + +## B.3 Magnum driver + chart (Layer B -- outside Juju channels, manually pinned) + +- magnum-capi-helm driver: 1.3.0 was the AS-FIRST-BUILT pin; the v1 TARGET is the + RELEASED `magnum-capi-helm==1.4.0` (D-042). 1.3.0 is contract-INCOHERENT with the + Layer-A core -- it reads `apiVersion` off the infrastructureRef, which CAPI v1.13 + (v1beta2 contract) no longer carries, so the driver's `infrastructure` health GET + returns "not found" (cosmetic only -- the create path is unaffected; the chart + templates resource versions). (1.3.0 also supersedes D-007's `1.1.0` and the late-May + `1.2.0` note -- both stale; Review-later: reconcile design-decisions.md.) +- DRIVER DECISION (D-042, amends D-034): pin the RELEASED `magnum-capi-helm==1.4.0` + (the "generalize-api-resources" feature; released line 1.0.0/1.1.0/1.2.0/1.2.1/1.3.0/ + 1.4.0). 
1.4.0 resolves each resource query as + `api_resources.get(<kind>, {}).get("api_version", <code default>)`; the driver's CODE + defaults are v1beta1 for the CAPI core kinds, but the `api_resources` OPTION itself + defaults to an EMPTY map `{}` (the v1beta1 values are code-level fallbacks, NOT option + defaults). CAPI v1.13.2 / CAPO v0.14.4 serve v1beta1, so an empty map yields matching + v1beta1 lookups -- set `api_resources = {}` EXPLICITLY (phase-07 7.5: the option's + registered default is a dict and the driver runs `json.loads()` on it; an explicit string `{}` + avoids the oslo coercion question). Override a kind only if it serves v1beta2-only. + Same pin for testcloud and Roosevelt. RULE: the Layer-B + driver pin MUST be contract-coherent with the Layer-A CAPI core; verify that + intersection at deploy. Install: phase-07 7.3-7.6. +- chart repo: https://azimuth-cloud.github.io/capi-helm-charts +- chart name: openstack-cluster ; default_helm_chart_version: 0.25.1 +- conf.d drop-in: /etc/magnum/magnum.conf.d/00-capi-helm.conf (D-037) +- note (CNI): the `capi-k8s-v1-32` template OMITS the Magnum `network_driver` field, so + the workload cluster gets the chart-default Calico (the as-built CNI). Whether 1.4.0 + honors `network_driver` is unverified and not relied on -- omitting the field is what + guarantees Calico (appendix-A: CNI-label; phase-08). +- v1 END STATE: 1.4.0 installed and `health_status = HEALTHY` (D-011). 1.3.0 is only a + TEMPORARY rollback/holding state (phase-07 Rollback), never a v1 completion. Either + way, do NOT wire magnum auto-heal to health_status (CAPI MachineHealthCheck handles + healing independently -- proven during the D-040 OOM recovery). + +## B.4 Pre-flight checklist (redeploy) + +1. `scripts/pre-flight-checks.sh` -- verify every channel above still resolves on Charmhub. +2. `juju info memcached` -- confirm track decision (see B.1 note). +3. Read CAPI constellation live from `dependencies.json` (D-034); compare to B.2. +4.
Driver (D-042): pin the RELEASED `magnum-capi-helm==1.4.0` (contract-coherent with the + Layer-A CAPI core; `api_resources` defaults to v1beta1, which CAPI v1.13.2 serves). + Confirm 1.4.0 still resolves on PyPI and that the cluster serves v1beta1 (phase-07 7.3). diff --git a/runbooks/deprecated/00-pre-deploy.md b/runbooks/deprecated/00-pre-deploy.md deleted file mode 100644 index aa07d86..0000000 --- a/runbooks/deprecated/00-pre-deploy.md +++ /dev/null @@ -1,142 +0,0 @@ -# Runbook 00 — Pre-Deploy - -## Purpose - -Prepare for a clean Caracal rebuild of the VR0 DC0 Omega Cloud. Capture all -state needed for rollback, gracefully tear down dependent workloads, and verify -the destination environment is ready before destroying the existing OpenStack -model. - -## Prerequisites - -- SSH access to jumphost `vopenstack-jesse` as `jessea123` -- `admin-openrc` and `user1-openrc` available in `$HOME` -- Access to the Juju controller hosting the `openstack` model -- Access to the capi-mgmt.maas k3s cluster (kubeconfig present) -- NetBox IPv4 imports completed (per `netbox/ipv4-prefixes-import.py`) -- NetBox VLAN imports completed (per `netbox/vlans-import.py`) - -## Phase 1 — Verify NetBox readiness (gating) - -Run the verification path of the NetBox import scripts. Confirm all entries -appear correctly scoped to VR0 DC0. - -```bash -cd ~/vr0-dc0-caracal -NETBOX_URL=https://netbox.baldurkeep.com NETBOX_TOKEN= \ - python3 netbox/ipv4-prefixes-import.py --verify-only -NETBOX_URL=https://netbox.baldurkeep.com NETBOX_TOKEN= \ - python3 netbox/vlans-import.py --verify-only -``` - -Expected: all prefixes and VLANs report scope-OK, no MISSING entries. 
- -## Phase 2 — Capture current state - -Backups needed for potential rollback: - -```bash -# Vault unseal keys and root CA cert -juju ssh vault/0 -- sudo cat /var/snap/vault/common/vault.crt > ~/backups/$(date +%F)/vault-root-ca.crt -# (Unseal keys MUST be on file from initial Vault setup; verify presence) -ls -la ~/.vault-keys - -# Export current bundle -juju export-bundle --model openstack > ~/backups/$(date +%F)/bundle-pre-rebuild.yaml - -# Snapshot of current 'juju status' -juju status --model openstack --format=yaml > ~/backups/$(date +%F)/juju-status-pre-rebuild.yaml - -# Inventory of FIPs and tenant resources we might want to recreate -source ~/admin-openrc -openstack floating ip list -c "Floating IP Address" -c "Fixed IP Address" \ - -c "Project" -f csv > ~/backups/$(date +%F)/floating-ips.csv -openstack server list --all-projects -c ID -c Name -c Project -c Status -f csv \ - > ~/backups/$(date +%F)/servers.csv -openstack network list --all-projects -c ID -c Name -c Project -f csv \ - > ~/backups/$(date +%F)/networks.csv -openstack loadbalancer list -c id -c name -c project_id -c vip_address -f csv \ - > ~/backups/$(date +%F)/loadbalancers.csv -``` - -## Phase 3 — KVM snapshots of openstack0-3 - -From the jumphost (which is the hypervisor): - -```bash -for vm in openstack0 openstack1 openstack2 openstack3; do - sudo virsh snapshot-create-as --domain "$vm" \ - --name "pre-caracal-rebuild-$(date +%F)" \ - --description "Pre-Caracal rebuild baseline" \ - --atomic -done -sudo virsh snapshot-list openstack0 -``` - -These snapshots are the disaster-recovery point. - -## Phase 4 — Graceful CAPI workload teardown (D-013) - -Delete the CAPI workload cluster cleanly so its OpenStack resources (LBs, FIPs, -volumes, Octavia members) are released by CAPI controllers before model destroy. 
- -```bash -export KUBECONFIG=~/magnum-capi/phase3/capi-mgmt-cluster.kubeconfig -# (Adjust path if kubeconfig has moved) - -# Delete the workload cluster — CAPI handles tenant OpenStack cleanup -kubectl delete cluster capi-mgmt-cluster -n default -# Wait for finalizers; this may take ~10 minutes -kubectl wait --for=delete cluster/capi-mgmt-cluster -n default --timeout=15m -``` - -Verify on the OpenStack side that resources were released: - -```bash -source ~/admin-openrc -openstack server list --all-projects | grep -i capi || echo "No CAPI servers remaining" -openstack loadbalancer list | grep -i capi || echo "No CAPI LBs remaining" -openstack floating ip list -c "Floating IP Address" -c "Fixed IP Address" -f csv -``` - -## Phase 5 — Preserve capi-mgmt.maas itself - -The bootstrap k3s + CAPI controllers on `capi-mgmt.maas` are NOT destroyed — -they will be re-used post-rebuild as the Magnum CAPI mgmt plane. Verify the -controllers are still healthy: - -```bash -ssh capi-mgmt.maas -- sudo kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml \ - get pods -A -``` - -Confirm: -- `capi-system` namespace pods Running -- `capo-system` (CAPI OpenStack provider) pods Running -- `cert-manager` pods Running -- `orc-system` (OpenStack Resource Controller) pods Running - -## Phase 6 — Final go/no-go checklist - -Do not proceed to `runbooks/01-destroy-model.md` until all of the following pass: - -- [ ] NetBox verification clean -- [ ] Vault unseal keys backed up and verified readable -- [ ] `bundle-pre-rebuild.yaml` exists and is non-empty -- [ ] `juju-status-pre-rebuild.yaml` shows desired-pre-destroy state captured -- [ ] All four KVM snapshots created (`virsh snapshot-list` confirms) -- [ ] CAPI workload cluster deletion completed (`kubectl get cluster` returns - "no resources found") -- [ ] OpenStack-side resources from CAPI workload are released (no orphaned LBs, - FIPs, volumes) -- [ ] capi-mgmt.maas k3s cluster controllers all Running - -## Notes - -- Snapshot disk space 
consumption can grow significantly during the rebuild - window. Verify free space on `/var/lib/libvirt/images` prior to running - the rebuild deploy. -- If Vault unseal keys cannot be located, STOP. A failed Vault re-init without - the original keys means lost issued certificates and is destructive to any - data sealed under the existing root key. This MUST be confirmed before model - destroy. diff --git a/runbooks/deprecated/01a-octavia-pki-generation.md b/runbooks/deprecated/01a-octavia-pki-generation.md deleted file mode 100644 index 65bd707..0000000 --- a/runbooks/deprecated/01a-octavia-pki-generation.md +++ /dev/null @@ -1,650 +0,0 @@ -# Runbook 01a — Octavia LBaaS PKI generation - -**Status:** Pre-deploy execution. Runs between `01-destroy-model.md` and `02-deploy.md`. -**Numbering rationale:** Octavia PKI artifacts must exist on the deploy host before -`juju deploy` is invoked (the values are referenced by the overlay file). Placing -this between destroy and deploy aligns generation with the "fresh rebuild" framing. - -**Cross-references:** -- D-007 (Octavia in bundle from day one) -- Bundle `octavia.options` PKI material section -- `overlays/octavia-pki.yaml` (gitignored — output of this runbook) -- Workstream 3a decision (2026-05-22): generate fresh, EC P-384 CAs, overlay-file approach - ---- - -## 1. Purpose & scope - -This runbook generates a complete two-tier PKI for Charmed Octavia's -amphora load-balancer trust domain: - -- **Issuing CA** — Octavia uses this to sign each amphora's server certificate - at LB-creation time. Octavia receives the **private key** and **passphrase**. -- **Controller CA** — amphorae's trust anchor for connections **from** the - Octavia controller. Octavia only receives the **cert** (no key needed at - runtime); the controller's identity is proved by: -- **Controller certificate** — signed by Controller CA, presented by the - Octavia controller to each amphora. Bundled as cert + key into a single - PEM blob. 
- -Five charm options consume the artifacts (`octavia` application): - -| Charm option | Content | Format | -|---|---|---| -| `lb-mgmt-issuing-cacert` | Issuing CA certificate | base64-encoded PEM | -| `lb-mgmt-issuing-ca-private-key` | Issuing CA encrypted private key | base64-encoded PEM (already encrypted with passphrase) | -| `lb-mgmt-issuing-ca-key-passphrase` | Issuing CA key passphrase | plain string (NOT base64) | -| `lb-mgmt-controller-cacert` | Controller CA certificate | base64-encoded PEM | -| `lb-mgmt-controller-cert` | Controller cert + key, concatenated | base64-encoded PEM bundle | - -**Scope:** v1 testcloud (VR0 DC0 Omega Cloud). Roosevelt deltas documented in -section 16. - -**Out of scope:** Octavia API TLS (issued by Vault via `octavia:certificates` -relation); rotation procedure (deferred to Roosevelt runbook). - ---- - -## 2. Decisions captured - -Per workstream 3a sign-off (2026-05-22): - -| Decision | Choice | Roosevelt parallel | -|---|---|---| -| Cert provenance | Generate fresh (no Bobcat-backup copy) | Vault PKI engine | -| CA key algorithm | EC P-384 | EC P-384 (Vault root) | -| Controller cert algorithm | EC P-256 | EC P-256 | -| CA validity | 10 years | 5-year intermediate, Vault-rotated | -| Controller cert validity | 2 years | 90 days, auto-rotated | -| Distribution method | Juju overlay file (gitignored) | Vault-injected at deploy | -| Storage path on jumphost | `$HOME/octavia-pki/` | Vault PKI mounts | -| Passphrase strength | 32 random bytes, base64-encoded (44 chars) | Vault-generated | - -**Naming convention:** - -- Issuing CA CN: `VR0 DC0 Omega Cloud Octavia Issuing CA` -- Controller CA CN: `VR0 DC0 Omega Cloud Octavia Controller CA` -- Controller cert CN: `octavia-controller.omega.dc0.vr0.cloud.neumatrix.local` -- Controller cert SANs: above CN, plus `octavia.omega.dc0.vr0.cloud.neumatrix.local`, plus `10.12.4.233` (the Octavia API VIP per workstream 2) -- Organization (O): `Neumatrix` - ---- - -## 3.
Prerequisites - -- Executor is on jumphost `vopenstack-jesse` as `jessea123`. -- `openssl` version 3.x or later installed (`openssl version` to confirm). -- `$HOME` is writable (snap-confined `openstackclients` cannot read `/tmp`; - all paths must resolve under `$HOME`). -- Git repository `openstack-caracal-ipv4` cloned on jumphost at a known path - (referred to as `$REPO` throughout). Set this in the executor's shell: - ```bash - export REPO=$HOME/repos/openstack-caracal-ipv4 # adjust to actual clone path - ``` -- Repository is on `main` branch and clean (`cd $REPO && git status` shows clean tree). -- Previous workstream 2 commit has been pushed (bundle has the VIP assignments and - active hacluster stack — verify with `grep -c "^ vip: 10.12.4." "$REPO/bundle.yaml"`, - expect 12). - ---- - -## 4. Pre-flight: gitignore patch (DO THIS FIRST) - -**Critical:** the `.gitignore` patch goes in BEFORE any private key material -exists on disk. This minimizes the race window for an accidental commit. - -```bash -cd "$REPO" - -# Append to .gitignore (idempotent — check if already present first) -grep -q "octavia-pki.yaml" .gitignore || cat >> .gitignore <<'EOF' - -# Octavia PKI artifacts — NEVER commit -overlays/octavia-pki.yaml -octavia-pki/ -*.key -*.key.enc -passphrase.txt -EOF - -# Review the diff -git diff .gitignore - -# Commit and push BEFORE generating any keys -git add .gitignore -git commit -m "gitignore: octavia PKI artifacts and overlay (runbook 01a)" -git push origin main -``` - -**Verify the gitignore is effective:** - -```bash -# This should NOT show overlays/octavia-pki.yaml even as untracked -touch overlays/octavia-pki.yaml -git status --short overlays/ # expect: empty output for octavia-pki.yaml -rm overlays/octavia-pki.yaml -``` - -If the test file does show as untracked, **STOP** and fix the gitignore syntax before -generating any secrets. - ---- - -## 5. 
Workspace setup - -```bash -WORKDIR=$HOME/octavia-pki -mkdir -p "$WORKDIR"/{issuing-ca,controller-ca,controller,overlay-build} -chmod 700 "$WORKDIR" -cd "$WORKDIR" -echo "Working in: $WORKDIR" -``` - -Resulting layout: - -``` -$HOME/octavia-pki/ -├── issuing-ca/ # passphrase.txt, .key.enc, .cert.pem -├── controller-ca/ # passphrase.txt, .key.enc, .cert.pem -├── controller/ # .key, .csr, .cert.pem, .bundle.pem, .cnf -└── overlay-build/ # base64 intermediates → consumed by step 10 -``` - ---- - -## 6. Generate Issuing CA - -EC P-384 key encrypted with random 32-byte passphrase. Self-signed cert, 10y validity. - -```bash -cd "$WORKDIR/issuing-ca" - -# Generate passphrase (no trailing newline — required for clean YAML embedding) -openssl rand -base64 32 | tr -d '\n' > passphrase.txt -chmod 600 passphrase.txt - -# Sanity-check -test $(wc -c < passphrase.txt) -eq 44 || { echo "ERROR: passphrase length wrong"; exit 1; } - -# Generate EC P-384 private key, encrypted with passphrase -openssl genpkey -algorithm EC \ - -pkeyopt ec_paramgen_curve:P-384 \ - -aes-256-cbc \ - -pass file:passphrase.txt \ - -out issuing-ca.key.enc -chmod 600 issuing-ca.key.enc - -# Self-sign cert (10 years, SHA-384) -openssl req -new -x509 -sha384 \ - -key issuing-ca.key.enc \ - -passin file:passphrase.txt \ - -days 3650 \ - -subj "/CN=VR0 DC0 Omega Cloud Octavia Issuing CA/O=Neumatrix" \ - -out issuing-ca.cert.pem - -# Verify -openssl x509 -in issuing-ca.cert.pem -noout -dates -subject -openssl verify -CAfile issuing-ca.cert.pem issuing-ca.cert.pem -# Expect: issuing-ca.cert.pem: OK - -ls -la -``` - ---- - -## 7. Generate Controller CA - -Identical pattern; different CN. 
- -```bash -cd "$WORKDIR/controller-ca" - -openssl rand -base64 32 | tr -d '\n' > passphrase.txt -chmod 600 passphrase.txt -test $(wc -c < passphrase.txt) -eq 44 || { echo "ERROR: passphrase length wrong"; exit 1; } - -openssl genpkey -algorithm EC \ - -pkeyopt ec_paramgen_curve:P-384 \ - -aes-256-cbc \ - -pass file:passphrase.txt \ - -out controller-ca.key.enc -chmod 600 controller-ca.key.enc - -openssl req -new -x509 -sha384 \ - -key controller-ca.key.enc \ - -passin file:passphrase.txt \ - -days 3650 \ - -subj "/CN=VR0 DC0 Omega Cloud Octavia Controller CA/O=Neumatrix" \ - -out controller-ca.cert.pem - -openssl x509 -in controller-ca.cert.pem -noout -dates -subject -openssl verify -CAfile controller-ca.cert.pem controller-ca.cert.pem -# Expect: controller-ca.cert.pem: OK -``` - -**Why Controller CA's key is encrypted even though Octavia never uses it:** -The Controller CA key is needed for future rotations of the controller cert. -Encrypting it (with its own passphrase, separate from Issuing CA's) is defense -in depth — if the jumphost is compromised, the key still requires the -passphrase to be useful for forging controller certs. - ---- - -## 8. Generate Controller certificate - -EC P-256 key (no encryption — Octavia must read it at startup), CSR with SAN -extensions, signed by Controller CA, 2y validity. 
- -```bash -cd "$WORKDIR/controller" - -# Generate unencrypted EC P-256 key -openssl genpkey -algorithm EC \ - -pkeyopt ec_paramgen_curve:P-256 \ - -out controller.key -chmod 600 controller.key - -# CSR config with SAN extensions -cat > controller.cnf <<'EOF' -[req] -distinguished_name = req_distinguished_name -req_extensions = v3_req -prompt = no - -[req_distinguished_name] -CN = octavia-controller.omega.dc0.vr0.cloud.neumatrix.local -O = Neumatrix - -[v3_req] -keyUsage = critical, digitalSignature, keyEncipherment -extendedKeyUsage = clientAuth, serverAuth -subjectAltName = @alt_names - -[alt_names] -DNS.1 = octavia-controller.omega.dc0.vr0.cloud.neumatrix.local -DNS.2 = octavia.omega.dc0.vr0.cloud.neumatrix.local -IP.1 = 10.12.4.233 -EOF - -# Generate CSR -openssl req -new -sha256 \ - -key controller.key \ - -config controller.cnf \ - -out controller.csr - -# Sign with Controller CA (2 years) -openssl x509 -req -sha256 \ - -in controller.csr \ - -CA "$WORKDIR/controller-ca/controller-ca.cert.pem" \ - -CAkey "$WORKDIR/controller-ca/controller-ca.key.enc" \ - -passin file:"$WORKDIR/controller-ca/passphrase.txt" \ - -CAcreateserial \ - -days 730 \ - -extfile controller.cnf \ - -extensions v3_req \ - -out controller.cert.pem - -# Bundle cert + key (the lb-mgmt-controller-cert option expects both in one PEM) -cat controller.cert.pem controller.key > controller.bundle.pem -chmod 600 controller.bundle.pem -``` - -**Verify the chain and SAN:** - -```bash -# Chain verifies -openssl verify -CAfile "$WORKDIR/controller-ca/controller-ca.cert.pem" controller.cert.pem -# Expect: controller.cert.pem: OK - -# SAN extensions present -openssl x509 -in controller.cert.pem -noout -ext subjectAltName -# Expect: -# DNS:octavia-controller.omega.dc0.vr0.cloud.neumatrix.local, -# DNS:octavia.omega.dc0.vr0.cloud.neumatrix.local, -# IP Address:10.12.4.233 - -# Validity -openssl x509 -in controller.cert.pem -noout -dates -# Expect: notAfter ~2 years from today - -# Bundle integrity (cert + 
key match) -openssl x509 -in controller.bundle.pem -noout -pubkey > /tmp/cert.pub -openssl pkey -in controller.bundle.pem -pubout > /tmp/key.pub -diff /tmp/cert.pub /tmp/key.pub && echo "Bundle cert/key match" -rm /tmp/cert.pub /tmp/key.pub -``` - ---- - -## 9. Final chain verification - -A standalone block to confirm the full chain is sound before consuming for Octavia: - -```bash -cd "$WORKDIR" - -echo "=== Issuing CA ===" -openssl x509 -in issuing-ca/issuing-ca.cert.pem -noout -subject -dates -openssl verify -CAfile issuing-ca/issuing-ca.cert.pem issuing-ca/issuing-ca.cert.pem - -echo "" -echo "=== Controller CA ===" -openssl x509 -in controller-ca/controller-ca.cert.pem -noout -subject -dates -openssl verify -CAfile controller-ca/controller-ca.cert.pem controller-ca/controller-ca.cert.pem - -echo "" -echo "=== Controller cert ===" -openssl x509 -in controller/controller.cert.pem -noout -subject -dates -openssl verify -CAfile controller-ca/controller-ca.cert.pem controller/controller.cert.pem -``` - -All three "verify" lines must show `: OK`. If any do not, **STOP** and investigate -before proceeding. - ---- - -## 10. Base64-encode artifacts - -Each base64 file is a single line (no wrapping); each becomes one YAML value. - -```bash -cd "$WORKDIR/overlay-build" - -# Issuing CA cert (base64) -base64 -w0 "$WORKDIR/issuing-ca/issuing-ca.cert.pem" > issuing-cacert.b64 - -# Issuing CA private key (already encrypted PEM → base64) -base64 -w0 "$WORKDIR/issuing-ca/issuing-ca.key.enc" > issuing-ca-private-key.b64 - -# Controller CA cert -base64 -w0 "$WORKDIR/controller-ca/controller-ca.cert.pem" > controller-cacert.b64 - -# Controller cert + key bundle -base64 -w0 "$WORKDIR/controller/controller.bundle.pem" > controller-cert.b64 - -# Sanity-check sizes (expect 500-2000 chars each) -wc -c *.b64 -``` - ---- - -## 11. 
Assemble the overlay file - -```bash -# Read each artifact into shell variables -ISSUING_CACERT=$(cat "$WORKDIR/overlay-build/issuing-cacert.b64") -ISSUING_CA_KEY=$(cat "$WORKDIR/overlay-build/issuing-ca-private-key.b64") -ISSUING_CA_PASS=$(cat "$WORKDIR/issuing-ca/passphrase.txt") -CONTROLLER_CACERT=$(cat "$WORKDIR/overlay-build/controller-cacert.b64") -CONTROLLER_CERT=$(cat "$WORKDIR/overlay-build/controller-cert.b64") - -# Assemble overlay (note: passphrase is YAML-quoted; cert blobs are not — they're -# guaranteed-safe base64 without special chars) -mkdir -p "$REPO/overlays" -cat > "$REPO/overlays/octavia-pki.yaml" < - # lb-mgmt-controller-cert: - # lb-mgmt-issuing-ca-key-passphrase: - # lb-mgmt-issuing-ca-private-key: - # lb-mgmt-issuing-cacert: -``` - -**With this block:** - -```yaml - # ----- PKI material ------------------------------------------------- - # 5 lb-mgmt-* options are supplied via overlays/octavia-pki.yaml - # (gitignored). Generated per runbooks/01a-octavia-pki-generation.md. - # Deploy with: - # juju deploy ./bundle.yaml \ - # --overlay overlays/vr0-dc0-testcloud.yaml \ - # --overlay overlays/octavia-pki.yaml -``` - -Commit this bundle change separately from the overlay generation work: - -```bash -cd "$REPO" -git diff bundle.yaml -git add bundle.yaml -git commit -m "bundle: octavia PKI moves to overlay (runbook 01a) - -Remove inline placeholders + TODO(octavia-cert) block. PKI values now -supplied via overlays/octavia-pki.yaml (gitignored), generated per -runbooks/01a-octavia-pki-generation.md. Decision per workstream 3a -(2026-05-22): industry-best-practice secret handling on testcloud -to rehearse Roosevelt's Vault-PKI-backed posture." -git push origin main -``` - ---- - -## 13. Sensitive-file backup - -The Issuing CA private key + its passphrase are the crown jewels of the LB trust -domain. Loss → cannot sign new amphora certs (LBs gradually break). Exposure → -attacker can forge amphora identities and intercept tenant LB traffic. 
- -**Minimum backup for testcloud:** - -```bash -cd $HOME -BACKUP_NAME="octavia-pki-backup-$(date +%Y%m%d-%H%M%S).tar.gz" - -tar -czf "$BACKUP_NAME" -C $HOME octavia-pki/ - -# Encrypt with strong symmetric cipher -gpg --symmetric --cipher-algo AES256 --output "${BACKUP_NAME}.gpg" "$BACKUP_NAME" - -# Shred the unencrypted tar -shred -uvz "$BACKUP_NAME" - -ls -la "${BACKUP_NAME}.gpg" -``` - -**Move `${BACKUP_NAME}.gpg` off-host** (your decision — admin workstation -encrypted drive, password-manager attachment, dedicated secrets vault, etc.). -Do NOT leave it sitting in $HOME on the jumphost long-term — that's a single -point of compromise. - -**Roosevelt note:** Vault PKI engine stores all of this — no manual backup -required; Vault's own backup mechanism covers it. The procedure above is -testcloud-only. - ---- - -## 14. Cleanup of intermediates - -After successful deploy + verification (section 15), shred files that are not -needed for future rotation: - -```bash -# Optional: shred the base64 intermediates (regeneratable from PEM sources) -shred -uvz "$WORKDIR/overlay-build/"*.b64 -rmdir "$WORKDIR/overlay-build" - -# Optional: shred the CSR (regeneratable if needed) -shred -uvz "$WORKDIR/controller/controller.csr" - -# DO NOT shred any of the following — they are needed for future operations: -# - issuing-ca/{issuing-ca.cert.pem, issuing-ca.key.enc, passphrase.txt} -# - controller-ca/{controller-ca.cert.pem, controller-ca.key.enc, passphrase.txt} -# - controller/{controller.key, controller.cert.pem, controller.bundle.pem, controller.cnf} -# -# Specifically: -# - Issuing CA artifacts: required for signing new amphorae (Octavia uses them at runtime) -# - Controller CA artifacts: required for signing new controller certs (rotation) -# - Controller cert/key: required to repopulate the overlay if jumphost is rebuilt -``` - ---- - -## 15.
Post-deploy verification - -After `runbooks/02-deploy.md` completes (`juju deploy` with the overlay), -verify Octavia is healthy and the PKI plumbing works. - -```bash -# Octavia charm active/idle -juju status octavia -# Expect: octavia/0 active idle - -# Octavia services running -juju ssh octavia/0 -- sudo systemctl is-active octavia-api octavia-worker octavia-housekeeping -# Expect: 3x "active" - -# Confirm PKI files landed on the unit -juju ssh octavia/0 -- sudo ls -la /etc/octavia/certs/ -# Expect: server_ca.cert.pem, server_ca.key.pem, client_ca.cert.pem, client.cert-and-key.pem -# (filenames are charm-controlled; presence is what matters) - -# Confirm Octavia can use them — verbose health-check from the API -juju ssh octavia/0 -- sudo journalctl -u octavia-api --since "5 minutes ago" \ - | grep -iE "(cert|ssl|tls|amphora)" | head -20 -# Expect: no errors related to cert loading -``` - -**Smoketest — create a test LB once amphora image is available:** - -```bash -# After `octavia-diskimage-retrofit` has populated Glance with the amphora image, -# and the LBaaS Mgmt network is wired (these are downstream runbook steps), -# a test LB creation exercises the full PKI chain: - -source ~/admin-openrc -openstack loadbalancer create --name pki-smoketest --vip-subnet-id - -# Watch for amphora spawn (3-5 minutes typical) -watch -n5 'openstack loadbalancer show pki-smoketest' -# Wait for: provisioning_status=ACTIVE, operating_status=ONLINE - -# Octavia-worker log should show successful amphora handshake (signed by Issuing CA, -# trusted via Controller CA): -juju ssh octavia/0 -- sudo journalctl -u octavia-worker --since "10 minutes ago" \ - | grep -iE "(amphora|cert)" | tail -20 -# Expect: "amphora connection established" or similar -# Expect: no TLS handshake errors, no cert validation errors - -# Cleanup the smoketest LB -openstack loadbalancer delete pki-smoketest --cascade -``` - -If amphora handshake fails with cert errors, the most likely causes are: - -1. 
SAN mismatch — the controller's connection to amphora uses the cert's CN/SAN; - verify the controller cert SAN covers all addresses Octavia uses to reach amphorae. -2. Bundle/key mismatch — `lb-mgmt-controller-cert` bundle should contain BOTH the - cert and the matching private key; if they're for different keys, handshake fails. -3. Encrypted Issuing CA key + wrong passphrase — verify the passphrase string in - the overlay matches what was used at generation. - ---- - -## 16. Roosevelt deltas (forward-look) - -When this runbook is adapted for Roosevelt bare-metal deploy: - -| Aspect | Testcloud (v1) | Roosevelt | -|---|---|---| -| Issuing CA root | Self-signed | Intermediate signed by Vault root CA | -| CA storage | Filesystem on jumphost | Vault PKI engine, encrypted at rest | -| Controller cert validity | 2 years | 90 days | -| Rotation | Manual (this runbook re-run) | Automated via Vault + cron + bundle redeploy | -| Backup | gpg tarball, off-host | Vault's own backup mechanism | -| Amphora image signing | Out of scope for v1 | Image signed by Vault PKI as well | -| Procedure file | `runbooks/01a-octavia-pki-generation.md` | New runbook in Roosevelt repo | - -The procedure structure (generate Issuing CA → Controller CA → Controller cert → -encode → overlay → backup → deploy) remains identical. Roosevelt just sources -the CA root from Vault instead of self-signing. - ---- - -## 17. Rotation/renewal pointer - -For testcloud, the 2-year controller cert and 10-year CAs are intentionally -"set and forget" — they will outlive the cloud at this scale. - -If rotation IS needed before testcloud teardown (e.g., a key leak event), the -re-run procedure is: - -1. Generate new Controller cert signed by **existing** Controller CA (re-run - sections 8-9 only). -2. Regenerate the overlay (section 11) with the new Controller cert; leave all - other values unchanged. -3. 
`juju config octavia lb-mgmt-controller-cert=` (single-option - update; does not require full bundle redeploy). -4. Octavia services may need a restart: `juju ssh octavia/0 -- sudo systemctl restart octavia-api octavia-worker octavia-housekeeping`. -5. Existing amphorae will need to reconnect using the new cert; in-flight LBs - may briefly drop. This is acceptable for a security-event rotation. - -For Roosevelt, this whole procedure is replaced by Vault automated rotation — -see Roosevelt runbook (TBD). - ---- - -## 18. Change log - -| Date | Change | Reference | -|---|---|---| -| 2026-05-22 | Document created. Fresh-generate, EC P-384 CAs, EC P-256 controller cert, overlay-file distribution. | Workstream 3a | diff --git a/runbooks/deprecated/02-deploy.md b/runbooks/deprecated/02-deploy.md deleted file mode 100644 index 4a52845..0000000 --- a/runbooks/deprecated/02-deploy.md +++ /dev/null @@ -1,23 +0,0 @@ -# Runbook 02 — Deploy New Caracal Bundle - -**STATUS: PLACEHOLDER** — drafted alongside bundle.yaml. - -## Purpose - -Deploy the new Charmed OpenStack Caracal bundle and wait for the cloud to -settle in `active/idle`. 
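The settle wait can be bounded so that an unattended deploy does not hang indefinitely. The following is a minimal sketch of a generic polling helper in plain bash. `wait_until` is a hypothetical name, and the predicate passed to it (how "settled" is actually detected) is left to the operator rather than assumed here.

```shell
# Hypothetical helper: poll a command until it succeeds or a deadline passes.
# usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  # SECONDS is bash's built-in counter of seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Usage (cloud_settled is an operator-supplied function, not defined here):
#   wait_until 3600 30 cloud_settled || { echo "cloud did not settle"; exit 1; }
```

The predicate runs in the current shell, so it can be a shell function as well as an external command.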
- -## Prerequisites - -- Runbook 01 complete (model destroyed, MAAS state clean) -- `bundle.yaml` and `overlays/vr0-dc0-testcloud.yaml` drafted and reviewed -- `scripts/pre-flight-checks.sh` passes - -## TODO - -- [ ] `juju add-model openstack` -- [ ] `juju deploy ./bundle.yaml --overlay overlays/vr0-dc0-testcloud.yaml --trust` -- [ ] Wait for settle (`juju-wait` or `juju status --watch 30s`) -- [ ] Pause-points for Vault init (per Runbook 03) -- [ ] Acceptance: all charms `active/idle` modulo Vault (sealed) and any - charms waiting on Vault certificates diff --git a/runbooks/deprecated/03-vault-init.md b/runbooks/deprecated/03-vault-init.md deleted file mode 100644 index 40db379..0000000 --- a/runbooks/deprecated/03-vault-init.md +++ /dev/null @@ -1,24 +0,0 @@ -# Runbook 03 — Vault Initialization - -**STATUS: PLACEHOLDER** — drafted during deploy phase. - -## Purpose - -Initialize the Vault instance(s), unseal, authorize, and let certificate -relations resolve so dependent charms reach `active/idle`. - -## Prerequisites - -- Bundle deployed; Vault charm in `blocked` waiting for init -- etcd cluster in `active/idle` (Vault HA backend per D-006) -- easyrsa active (TLS bootstrap) - -## TODO - -- [ ] `juju run vault/leader generate-root-ca` — capture root CA cert -- [ ] `vault operator init -key-shares=5 -key-threshold=3` — capture keys -- [ ] Unseal with 3 of 5 keys -- [ ] `juju run vault/leader authorize-charm token=` -- [ ] Verify all `:certificates` relations complete (no charms stuck - waiting on certs) -- [ ] Store unseal keys in `~/.vault-keys/` (chmod 600); back up diff --git a/runbooks/deprecated/04-magnum-domain.md b/runbooks/deprecated/04-magnum-domain.md deleted file mode 100644 index b400a18..0000000 --- a/runbooks/deprecated/04-magnum-domain.md +++ /dev/null @@ -1,21 +0,0 @@ -# Runbook 04 — Magnum Keystone Domain Setup - -**STATUS: PLACEHOLDER** — drafted post-deploy. 
- -## Purpose - -Run the magnum charm's `domain-setup` action to create the Keystone domain, -trust role, and service user that Magnum requires for cluster operations. - -## Prerequisites - -- Magnum charm reached `active/idle` post Vault init -- Keystone reachable from jumphost via FQDN - -## TODO - -- [ ] `juju run magnum/leader domain-setup --wait=10m` -- [ ] Verify creation in Keystone: - `openstack domain show magnum` - `openstack user show magnum_domain_admin --domain magnum` -- [ ] Acceptance: domain present, trust role assigned, charm in active/idle diff --git a/runbooks/deprecated/04a-capi-bootstrap-cluster.md b/runbooks/deprecated/04a-capi-bootstrap-cluster.md deleted file mode 100644 index d98f20c..0000000 --- a/runbooks/deprecated/04a-capi-bootstrap-cluster.md +++ /dev/null @@ -1,1056 +0,0 @@ -# Runbook 04a — CAPI bootstrap cluster - -**Status:** Executes after `02-deploy.md` (cloud up + all charms active/idle) -and `03-vault-init.md` (Vault initialized + root CA available). Precedes -`05-magnum-capi-driver.md` (driver graft consumes the workload kubeconfig -produced here). - -**D-017 posture:** L3 full teardown and rebuild every deployment cycle. -Nothing is preserved across cycles. capi-mgmt is wiped to MAAS Ready on -teardown; rebuilt from scratch by this runbook. - -**Cross-references:** -- D-017 (CAPI bootstrap cluster lifecycle) -- D-007 (Magnum two-layer install) -- D-002 (channel matrix — informs Vault CA chain) -- Workstream 3b decision (2026-05-22): ship Vault CA (no tls-insecure); pivot mandatory - ---- - -## 1. Purpose & scope - -This runbook stands up the CAPI bootstrap cluster on `capi-mgmt.maas` and -pivots cluster state into a self-managing workload cluster. Output: - -1. **Workload K8s cluster** (`capi-mgmt-cluster`) running in tenant VMs on - the cloud, self-managing post-pivot. -2. **Workload kubeconfig** copied to jumphost at a known path. Consumed by - `runbooks/05-magnum-capi-driver.md` for the Magnum CAPI Helm driver - graft. -3. 
**No remaining state** on the bootstrap k3s VM after pivot. capi-mgmt - becomes a disposable jump host. - -**Scope:** v1 testcloud. Roosevelt deltas in section 20. - -**Out of scope:** - -- Magnum-side configuration (runbook 05). -- Workload cluster's tenant lifecycle (Magnum's job, not this runbook's). -- Backup / DR for the workload cluster (Roosevelt concern). - ---- - -## 2. Decisions captured - -Per workstream 3b sign-off (2026-05-22): - -| Decision | Choice | Roosevelt parallel | -|---|---|---| -| Version pinning | Pin-at-execution with discovery in §4 | Same pattern; pins captured in deploy record | -| Cloud TLS trust | Ship Vault CA to capi-mgmt + workload nodes (no `tls-insecure`) | Image-baked CA; CK8sConfig redundancy | -| `clusterctl move` pivot | Mandatory; workload cluster becomes self-managing | Same | -| K8s flavor | Canonical Kubernetes (CK8s) | Same | -| OpenStack auth | v3applicationcredential | Same | -| Pod CIDR | `10.244.0.0/16` | Same (does not conflict with cloud `10.12.0.0/16` or tenant pool `10.20.0.0/16`) | -| Service CIDR | `10.96.0.0/12` | Same | -| Workload cluster name | `capi-mgmt-cluster` | Same | -| Workload node SSH user | `ubuntu` (MAAS/cloud-init convention) | Same | - -**Naming convention:** - -- Keystone project for CAPI: `capi-mgmt` (in `admin_domain`) -- Keystone user for CAPI: `capo` (CAPO operator) -- App credential: `capo-app-cred` -- Workload image (Glance): `noble-amd64` (existing; do NOT duplicate as `ubuntu-24.04-capi` — Bobcat lesson) -- Workload flavor: `capi-mgmt-node` (4 vCPU / 4 GiB / 30 GB) — control plane node sizing - ---- - -## 3. 
Prerequisites - -| Prereq | Verification | -|---|---| -| Cloud deployed; all charms `active/idle` per D-011 | `juju status --color\| grep -v "active.*idle"` returns only the header | -| Vault initialized + unsealed | `juju ssh vault/leader -- sudo vault status` shows `Sealed=false` | -| Vault root CA available on jumphost | `test -f $HOME/vault-pki/root-ca.pem && openssl x509 -in $HOME/vault-pki/root-ca.pem -noout -subject` | -| Keystone reachable via FQDN | `curl -sf --cacert $HOME/vault-pki/root-ca.pem https://keystone.omega.dc0.vr0.cloud.neumatrix.local:5000/v3 \| jq .version.id` returns `"v3.14"` or current | -| capi-mgmt VM exists in MAAS as Ready | `maas $MAAS_PROFILE machines read \| jq '.[] \| select(.hostname=="capi-mgmt") \| .status_name'` returns `"Ready"` | -| Admin openrc available | `test -f $HOME/admin-openrc && source $HOME/admin-openrc && openstack token issue \| head -3` | -| Workspace path under $HOME (snap confinement) | `WORK=$HOME/capi-bootstrap; mkdir -p "$WORK"; cd "$WORK"; pwd` shows under home | - -**Set shell context for the runbook:** - -```bash -export REPO=$HOME/repos/openstack-caracal-ipv4 # adjust if your clone is elsewhere -export WORK=$HOME/capi-bootstrap # runbook scratch dir -export VAULT_CA=$HOME/vault-pki/root-ca.pem # Vault root CA (from runbook 03) -export CAPI_MGMT_METAL_IP=10.12.8.21 # capi-mgmt metal interface -export CAPI_MGMT_PROVIDER_IP=10.12.4.21 # capi-mgmt provider interface -export CLUSTER_NAME=capi-mgmt-cluster -mkdir -p "$WORK" -cd "$WORK" -``` - ---- - -## 4. Version discovery (set pins) - -Bobcat ran "dynamic latest." This runbook pins explicit versions captured at -execution time, with the discovery procedure documented inline so each -rebuild's pins are reproducible AND traceable. - -**GitHub API: authenticated vs unauthenticated.** Unauth has 60 req/hr; -authenticated has 5000. 
For multiple rebuilds in a day, set a token: - -```bash -# Optional but recommended — avoids rate-limit headaches during rebuild -export GITHUB_TOKEN= -# Or skip if you can tolerate ~10 API calls slowly -``` - -**Discover current stable releases:** - -```bash -cd "$WORK" - -# Helper: fetch latest stable release tag from a GitHub repo -gh_latest() { - local repo=$1 - # Array keeps the header as ONE curl argument; a plain string would word-split - local auth=() - [ -n "${GITHUB_TOKEN:-}" ] && auth=(-H "Authorization: Bearer $GITHUB_TOKEN") - curl -sfL "${auth[@]}" "https://api.github.com/repos/$repo/releases/latest" \ - | jq -r '.tag_name' -} - -# Pin captures (one file per pin, for the deploy-record convention) -mkdir -p pins -gh_latest "kubernetes-sigs/cluster-api" | tee pins/CAPI_VERSION -gh_latest "kubernetes-sigs/cluster-api-provider-openstack" | tee pins/CAPO_VERSION -gh_latest "canonical/cluster-api-k8s" | tee pins/CK8S_VERSION -gh_latest "cert-manager/cert-manager" | tee pins/CERT_MANAGER_VERSION -gh_latest "k-orc/openstack-resource-controller" | tee pins/ORC_VERSION -gh_latest "k3s-io/k3s" | tee pins/K3S_VERSION -gh_latest "helm/helm" | tee pins/HELM_VERSION - -# Load into shell -export CAPI_VERSION=$(cat pins/CAPI_VERSION) -export CAPO_VERSION=$(cat pins/CAPO_VERSION) -export CK8S_VERSION=$(cat pins/CK8S_VERSION) -export CERT_MANAGER_VERSION=$(cat pins/CERT_MANAGER_VERSION) -export ORC_VERSION=$(cat pins/ORC_VERSION) -export K3S_VERSION=$(cat pins/K3S_VERSION) -export HELM_VERSION=$(cat pins/HELM_VERSION) - -# Display for the deploy log -cat pins/*_VERSION | paste -d= <(ls pins/) - -``` - -**Sanity check:** all values should look like `v1.X.Y` or `v0.X.Y`. If any -returned `null` or empty, the GitHub API call failed — most likely -rate-limited. Wait an hour or set `$GITHUB_TOKEN` and retry. - -**Capture pins as deploy record:** - -The pin files in `$WORK/pins/` should be appended to a deploy-log artifact -(NOT committed to the repo — these are deploy-time captures). Suggested -location: `$HOME/deploy-records/$(date +%Y%m%d-%H%M%S)/capi-pins/`.
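Before the pins are copied into the deploy record, the sanity check described above can be scripted so that a rate-limited discovery run is caught early. The following is a minimal sketch, assuming the pins live one per file as `*_VERSION` under a single directory. `check_pins` is a hypothetical helper name, and the pattern only checks for a leading `vX.Y.Z`, so suffixed tags are accepted.

```shell
# Hypothetical helper: verify every *_VERSION file holds a tag starting vX.Y.Z.
# Empty files and the literal "null" (what jq prints for a missing tag_name)
# both fail the pattern and are reported.
check_pins() {
  local dir=$1 f v rc=0
  local re='^v[0-9]+\.[0-9]+\.[0-9]+'
  for f in "$dir"/*_VERSION; do
    v=$(cat "$f" 2>/dev/null || true)
    if [[ ! $v =~ $re ]]; then
      echo "BAD PIN: $f -> '$v'" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_pins "$WORK/pins" || { echo "fix pins before proceeding"; exit 1; }
```

If the directory holds no `*_VERSION` files at all, the unexpanded glob is reported as a bad pin, which is the conservative outcome.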
- -```bash -DEPLOY_RECORD=$HOME/deploy-records/$(date +%Y%m%d-%H%M%S)/capi-pins -mkdir -p "$DEPLOY_RECORD" -cp pins/*_VERSION "$DEPLOY_RECORD/" -ls -la "$DEPLOY_RECORD/" -``` - ---- - -## 5. MAAS-deploy capi-mgmt - -Prerequisite: capi-mgmt MAAS machine is in `Ready` state (see §3). -Network config in MAAS: - -- **eth0** on metal fabric, DHCP → `10.12.8.21` (MAAS-pinned static lease) -- **eth1** on provider fabric, static → `10.12.4.21` - -Deploy Ubuntu 24.04 (Noble): - -```bash -# Get the capi-mgmt system_id from MAAS -CAPI_MGMT_SYSTEM_ID=$(maas $MAAS_PROFILE machines read \ - | jq -r '.[] | select(.hostname=="capi-mgmt") | .system_id') -echo "capi-mgmt system_id: $CAPI_MGMT_SYSTEM_ID" - -# Deploy -maas $MAAS_PROFILE machine deploy "$CAPI_MGMT_SYSTEM_ID" \ - distro_series=noble \ - hwe_kernel=ga-24.04 -``` - -Poll for `Deployed`: - -```bash -while true; do - STATUS=$(maas $MAAS_PROFILE machine read "$CAPI_MGMT_SYSTEM_ID" \ - | jq -r '.status_name') - echo "$(date -Is) capi-mgmt status: $STATUS" - [ "$STATUS" = "Deployed" ] && break - [ "$STATUS" = "Failed deployment" ] && { echo "FAILED"; exit 1; } - sleep 30 -done -``` - -Typical deploy time: 5-8 minutes on this hardware. - -**SSH reachability:** - -```bash -# MAAS .maas zone may not resolve from jumphost — use IP directly per handoff lessons -ssh -o StrictHostKeyChecking=accept-new ubuntu@$CAPI_MGMT_METAL_IP -- hostname -# Expect: capi-mgmt -``` - -> **Gotcha:** MAAS-deployed Ubuntu uses the `ubuntu` user, not `jessea123`. -> See handoff "recurring technical pitfalls." - ---- - -## 6. 
SSH bootstrap + Vault CA install - -On the jumphost, prepare a transport bundle of essentials: - -```bash -mkdir -p "$WORK/bootstrap-bundle" -cp "$VAULT_CA" "$WORK/bootstrap-bundle/vault-ca.crt" -chmod 644 "$WORK/bootstrap-bundle/vault-ca.crt" - -# Bundle pin files so capi-mgmt can read versions -cp -r "$WORK/pins" "$WORK/bootstrap-bundle/" -``` - -SCP and install Vault CA on capi-mgmt: - -```bash -scp -r "$WORK/bootstrap-bundle" ubuntu@$CAPI_MGMT_METAL_IP:/home/ubuntu/ - -ssh ubuntu@$CAPI_MGMT_METAL_IP <<'EOF' -set -euo pipefail - -# Install Vault CA as a system-trusted root -sudo cp /home/ubuntu/bootstrap-bundle/vault-ca.crt /usr/local/share/ca-certificates/ -sudo update-ca-certificates 2>&1 | tail -3 - -# Verify (stdin from /dev/null so s_client exits and does not consume the rest of this heredoc) -openssl s_client -connect keystone.omega.dc0.vr0.cloud.neumatrix.local:5000 \ - -CApath /etc/ssl/certs -verify_return_error </dev/null 2>&1 \ - | grep -E "(Verify return code|subject=)" || \ - { echo "TLS chain verify failed against Keystone -- investigate before proceeding"; exit 1; } - -# Update apt + base utilities -sudo apt-get update -qq -sudo apt-get install -y -qq jq curl yq - -# Confirm -which jq curl yq -EOF -``` - -**Expected:** - -- `update-ca-certificates` reports "1 added" -- `openssl s_client` shows `Verify return code: 0 (ok)` and a Keystone cert - whose chain terminates at the Vault CA - -> **Why this matters:** Bobcat used `tls-insecure=true` in cloud.conf which -> skipped this entire trust path. Our workstream 3b decision (ship Vault CA) -> means OCCM and CAPO will validate certs against this trust store. If TLS -> verify fails here, OCCM will crashloop later. - ---- - -## 7. 
k3s install - -On capi-mgmt: - -```bash -ssh ubuntu@$CAPI_MGMT_METAL_IP "K3S_VERSION=$K3S_VERSION CAPI_MGMT_METAL_IP=$CAPI_MGMT_METAL_IP bash -s" <<'REMOTE_EOF' -set -euo pipefail - -# Install k3s with explicit bind/advertise/SAN flags -curl -sfL https://get.k3s.io | \ - INSTALL_K3S_VERSION="$K3S_VERSION" \ - sh -s - server \ - --bind-address="$CAPI_MGMT_METAL_IP" \ - --advertise-address="$CAPI_MGMT_METAL_IP" \ - --node-ip="$CAPI_MGMT_METAL_IP" \ - --tls-san="$CAPI_MGMT_METAL_IP" \ - --tls-san=capi-mgmt.maas \ - --write-kubeconfig-mode=0644 \ - --disable=traefik - -# Wait for k3s API to respond -for i in $(seq 1 30); do - if sudo kubectl get nodes 2>/dev/null | grep -q "Ready"; then - echo "k3s ready"; break - fi - echo "Waiting for k3s API... ($i/30)" - sleep 5 -done - -sudo kubectl get nodes -sudo kubectl get pods -A -REMOTE_EOF -``` - -> **Gotcha:** `--bind-address=$IP` makes k3s listen ONLY on that IP — not -> also on 127.0.0.1. The default kubeconfig at -> `/etc/rancher/k3s/k3s.yaml` has `server: https://127.0.0.1:6443` and will -> NOT work as-is. Sed-rewrite below. - ---- - -## 8. Kubeconfig server-URL rewrite - -```bash -ssh ubuntu@$CAPI_MGMT_METAL_IP "CAPI_MGMT_METAL_IP=$CAPI_MGMT_METAL_IP bash -s" <<'REMOTE_EOF' -set -euo pipefail - -# Copy k3s kubeconfig to ubuntu user; rewrite server URL -mkdir -p /home/ubuntu/.kube -sudo cp /etc/rancher/k3s/k3s.yaml /home/ubuntu/.kube/config -sudo chown ubuntu:ubuntu /home/ubuntu/.kube/config -chmod 600 /home/ubuntu/.kube/config - -# Rewrite 127.0.0.1 → metal IP -sed -i "s|server: https://127.0.0.1:6443|server: https://$CAPI_MGMT_METAL_IP:6443|" \ - /home/ubuntu/.kube/config - -# Verify rewrite -grep "server:" /home/ubuntu/.kube/config -# Expect: server: https://10.12.8.21:6443 - -# Confirm kubectl works as ubuntu user (no sudo) -kubectl get nodes -REMOTE_EOF -``` - ---- - -## 9. 
helm + clusterctl install - -```bash -ssh ubuntu@$CAPI_MGMT_METAL_IP "HELM_VERSION=$HELM_VERSION CAPI_VERSION=$CAPI_VERSION bash -s" <<'REMOTE_EOF' -set -euo pipefail - -# helm install (get-helm-3 fetches the version we specify) -cd /tmp -curl -sfL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \ - | DESIRED_VERSION="$HELM_VERSION" bash -helm version --short - -# clusterctl install -CLUSTERCTL_URL="https://github.com/kubernetes-sigs/cluster-api/releases/download/${CAPI_VERSION}/clusterctl-linux-amd64" -sudo curl -sfL "$CLUSTERCTL_URL" -o /usr/local/bin/clusterctl -sudo chmod +x /usr/local/bin/clusterctl -clusterctl version -REMOTE_EOF -``` - ---- - -## 10. clusterctl init (CAPI controllers + cert-manager + ORC + CAPO + CK8s) - -```bash -ssh ubuntu@$CAPI_MGMT_METAL_IP "CK8S_VERSION=$CK8S_VERSION CERT_MANAGER_VERSION=$CERT_MANAGER_VERSION ORC_VERSION=$ORC_VERSION CAPO_VERSION=$CAPO_VERSION bash -s" <<'REMOTE_EOF' -set -euo pipefail - -# Configure clusterctl with provider URLs -mkdir -p ~/.cluster-api -cat > ~/.cluster-api/clusterctl.yaml </dev/null || true -kubectl wait --for=condition=Available --timeout=5m \ - deployment --all -n capo-system -kubectl wait --for=condition=Available --timeout=5m \ - deployment --all -n cert-manager - -# Install ORC -kubectl apply -f "https://github.com/k-orc/openstack-resource-controller/releases/${ORC_VERSION}/orc.yaml" -kubectl wait --for=condition=Available --timeout=5m \ - deployment --all -n orc-system - -# Confirm all controllers -kubectl get pods -A | grep -v "Running\|Completed" | grep -v NAME -# Expected: empty output (all pods Running or no abnormal state) -REMOTE_EOF -``` - -> **Gotcha:** the actual namespace names (`capi-system`, `capo-system`, etc.) -> are conventions. If a controller fails to land in the expected namespace, -> `kubectl get deployment -A` lists all deployments — diagnose from there. - ---- - -## 11. 
Cloud-side prep (Keystone, Nova, Glance) - -Back on the jumphost: - -```bash -source $HOME/admin-openrc - -# Inventory existing resources FIRST (Bobcat lesson: don't create duplicates) -echo "=== Existing images ===" -openstack image list -c ID -c Name -f json | jq -r '.[] | "\(.Name)\t\(.ID)"' -echo "" -echo "=== Existing flavors ===" -openstack flavor list -c Name -c ID -c RAM -c VCPUs -c Disk -f json \ - | jq -r '.[] | "\(.Name)\tRAM=\(.RAM)\tCPU=\(.VCPUs)\tDisk=\(.Disk)\tID=\(.ID)"' -echo "" -echo "=== Existing keypairs ===" -openstack keypair list -echo "" -echo "=== Existing projects in admin_domain ===" -openstack project list --domain admin_domain -``` - -**Create / verify resources:** - -```bash -# Keystone project + user -openstack project show capi-mgmt --domain admin_domain 2>/dev/null \ - || openstack project create capi-mgmt --domain admin_domain --description "CAPI management plane" - -openstack user show capo --domain admin_domain 2>/dev/null \ - || openstack user create capo --domain admin_domain --password-prompt --description "CAPO operator" - -# Role assignments (CAPO needs member + load-balancer_member at minimum; -# admin works for testcloud — Roosevelt should use least-privilege) -openstack role add --user capo --user-domain admin_domain \ - --project capi-mgmt --project-domain admin_domain \ - member - -openstack role add --user capo --user-domain admin_domain \ - --project capi-mgmt --project-domain admin_domain \ - load-balancer_member 2>/dev/null || \ - echo "(load-balancer_member role may not exist if Octavia not deployed yet)" - -# Application credential — captured to file under $HOME (snap confinement) -APP_CRED_FILE=$WORK/capo-app-cred.json -openstack --os-username capo --os-user-domain-name admin_domain \ - --os-project-name capi-mgmt --os-project-domain-name admin_domain \ - application credential create capo-app-cred \ - --description "CAPO operator app credential" \ - -f json > "$APP_CRED_FILE" -chmod 600 "$APP_CRED_FILE" - -# 
Extract credential ID + secret -export APP_CRED_ID=$(jq -r '.id' "$APP_CRED_FILE") -export APP_CRED_SECRET=$(jq -r '.secret' "$APP_CRED_FILE") -echo "App cred ID: $APP_CRED_ID" -``` - -**Nova keypair (workload node SSH key):** - -```bash -# Generate fresh keypair locally (do NOT reuse jumphost personal key) -ssh-keygen -t ed25519 -N '' -f "$WORK/capi-workload-key" \ - -C "capi-workload-$(date +%Y%m%d)" -chmod 600 "$WORK/capi-workload-key" - -# Upload public key to Keystone as a Nova keypair -openstack keypair create --public-key "$WORK/capi-workload-key.pub" capi-workload-key -openstack keypair show capi-workload-key -``` - -**Workload image:** - -```bash -# Inventory check — use noble-amd64 if it exists (Bobcat lesson: do NOT create ubuntu-24.04-capi as a dup) -NOBLE_IMAGE_ID=$(openstack image show noble-amd64 -c id -f value 2>/dev/null || echo "") - -if [ -z "$NOBLE_IMAGE_ID" ]; then - echo "noble-amd64 image not found — upload required." - echo "(Pull from https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img" - echo " then: openstack image create --disk-format qcow2 --container-format bare \\" - echo " --public --file noble-server-cloudimg-amd64.img noble-amd64)" - exit 1 -fi -echo "Using image: noble-amd64 ($NOBLE_IMAGE_ID)" -export WORKLOAD_IMAGE_ID=$NOBLE_IMAGE_ID -``` - -**Workload flavor:** - -```bash -openstack flavor show capi-mgmt-node 2>/dev/null \ - || openstack flavor create capi-mgmt-node \ - --vcpus 4 --ram 4096 --disk 30 \ - --description "CAPI workload node (control plane sizing)" - -export WORKLOAD_FLAVOR=capi-mgmt-node -``` - ---- - -## 12. clouds.yaml + cloud.conf composition (with Vault CA, no tls-insecure) - -The workload cluster's OCCM (OpenStack Cloud Controller Manager) and CAPO both -need to call OpenStack APIs. 
Two files: - -- `clouds.yaml` — CAPO's view of how to reach OpenStack (used at cluster - creation time on capi-mgmt) -- `cloud.conf` — OCCM's view, injected into the workload cluster's k8s - Secret (used continuously by OCCM running in the workload cluster) - -**Compose clouds.yaml:** - -```bash -cat > "$WORK/clouds.yaml" < "$WORK/clouds.yaml.b64" -``` - -**Compose cloud.conf** (INI format, NOT YAML): - -```bash -cat > "$WORK/cloud.conf" < "$WORK/cloud.conf.b64" -``` - -> **Critical delta from Bobcat:** the `ca-file` line replaces `tls-insecure=true`. -> The path `/usr/local/share/ca-certificates/vault-ca.crt` exists on capi-mgmt -> (from §6) AND will be injected into workload nodes via CK8sConfig in §13. - -**base64-encode Vault CA for CK8sConfig injection:** - -```bash -base64 -w0 "$VAULT_CA" > "$WORK/vault-ca.crt.b64" -wc -c "$WORK/vault-ca.crt.b64" -``` - ---- - -## 13. Cluster template rendering (with Vault CA injection) - -The cluster template defines: - -- Cluster object -- OpenStackCluster (CAPO infrastructure) -- CK8sControlPlane -- CK8sConfigTemplate (control plane bootstrap — includes Vault CA injection) -- MachineDeployment + CK8sConfigTemplate (workers — includes Vault CA injection) -- Secrets for clouds.yaml and cloud.conf - -Variables (18 total): - -```bash -export CLUSTER_NAME=capi-mgmt-cluster -export CLUSTER_NAMESPACE=default -export KUBERNETES_VERSION=v1.31.4 # adjust to CK8s-supported -export CONTROL_PLANE_MACHINE_COUNT=1 # 3 for HA on Roosevelt -export WORKER_MACHINE_COUNT=2 # 3 on Roosevelt -export OPENSTACK_DNS_NAMESERVERS=10.12.4.227 # designate VIP -export OPENSTACK_FAILURE_DOMAIN=nova -export OPENSTACK_EXTERNAL_NETWORK_ID=$(openstack network show ext_net -c id -f value) -export OPENSTACK_IMAGE_NAME=noble-amd64 -export OPENSTACK_FLAVOR=capi-mgmt-node -export OPENSTACK_SSH_KEY_NAME=capi-workload-key -export POD_CIDR=10.244.0.0/16 -export SERVICE_CIDR=10.96.0.0/12 -export CLOUDS_YAML_B64=$(cat "$WORK/clouds.yaml.b64") -export 
CLOUD_CONF_B64=$(cat "$WORK/cloud.conf.b64") -export VAULT_CA_B64=$(cat "$WORK/vault-ca.crt.b64") -export CLUSTER_DOMAIN=cluster.local -export OPENSTACK_CLOUD=capi-mgmt - -# Sanity print -env | grep -E "^(CLUSTER|KUBERNETES|CONTROL_PLANE|WORKER|OPENSTACK|POD|SERVICE|VAULT|CLOUD)" \ - | grep -v "B64\|SECRET\|PASS" | sort -``` - -**Render the cluster template:** - -```bash -cat > "$WORK/cluster-template.yaml" <<'TEMPLATE_EOF' -apiVersion: v1 -kind: Secret -metadata: - name: ${CLUSTER_NAME}-cloud-config - namespace: ${CLUSTER_NAMESPACE} -type: Opaque -data: - clouds.yaml: ${CLOUDS_YAML_B64} - cloud.conf: ${CLOUD_CONF_B64} - cacert: ${VAULT_CA_B64} ---- -apiVersion: cluster.x-k8s.io/v1beta1 -kind: Cluster -metadata: - name: ${CLUSTER_NAME} - namespace: ${CLUSTER_NAMESPACE} -spec: - clusterNetwork: - pods: - cidrBlocks: - - ${POD_CIDR} - services: - cidrBlocks: - - ${SERVICE_CIDR} - serviceDomain: ${CLUSTER_DOMAIN} - infrastructureRef: - apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 - kind: OpenStackCluster - name: ${CLUSTER_NAME} - controlPlaneRef: - apiVersion: controlplane.cluster.x-k8s.io/v1beta2 - kind: CK8sControlPlane - name: ${CLUSTER_NAME}-control-plane ---- -apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 -kind: OpenStackCluster -metadata: - name: ${CLUSTER_NAME} - namespace: ${CLUSTER_NAMESPACE} -spec: - identityRef: - name: ${CLUSTER_NAME}-cloud-config - cloudName: ${OPENSTACK_CLOUD} - externalNetwork: - id: ${OPENSTACK_EXTERNAL_NETWORK_ID} - managedSecurityGroups: - allowAllInClusterTraffic: true - apiServerLoadBalancer: - enabled: true ---- -apiVersion: controlplane.cluster.x-k8s.io/v1beta2 -kind: CK8sControlPlane -metadata: - name: ${CLUSTER_NAME}-control-plane - namespace: ${CLUSTER_NAMESPACE} -spec: - replicas: ${CONTROL_PLANE_MACHINE_COUNT} - version: ${KUBERNETES_VERSION} - machineTemplate: - infrastructureTemplate: - apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 - kind: OpenStackMachineTemplate - name: ${CLUSTER_NAME}-control-plane - 
spec: - files: - - path: /usr/local/share/ca-certificates/vault-ca.crt - owner: root:root - permissions: "0644" - contentFrom: - secret: - name: ${CLUSTER_NAME}-cloud-config - key: cacert - preRunCommands: - - update-ca-certificates - extraKubeAPIServerArgs: - "--cloud-provider": external ---- -apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 -kind: OpenStackMachineTemplate -metadata: - name: ${CLUSTER_NAME}-control-plane - namespace: ${CLUSTER_NAMESPACE} -spec: - template: - spec: - flavor: ${OPENSTACK_FLAVOR} - image: - filter: - name: ${OPENSTACK_IMAGE_NAME} - sshKeyName: ${OPENSTACK_SSH_KEY_NAME} - identityRef: - name: ${CLUSTER_NAME}-cloud-config - cloudName: ${OPENSTACK_CLOUD} ---- -apiVersion: cluster.x-k8s.io/v1beta1 -kind: MachineDeployment -metadata: - name: ${CLUSTER_NAME}-md-0 - namespace: ${CLUSTER_NAMESPACE} -spec: - clusterName: ${CLUSTER_NAME} - replicas: ${WORKER_MACHINE_COUNT} - selector: - matchLabels: {} - template: - spec: - clusterName: ${CLUSTER_NAME} - version: ${KUBERNETES_VERSION} - bootstrap: - configRef: - apiVersion: bootstrap.cluster.x-k8s.io/v1beta2 - kind: CK8sConfigTemplate - name: ${CLUSTER_NAME}-md-0 - infrastructureRef: - apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 - kind: OpenStackMachineTemplate - name: ${CLUSTER_NAME}-md-0 ---- -apiVersion: infrastructure.cluster.x-k8s.io/v1beta1 -kind: OpenStackMachineTemplate -metadata: - name: ${CLUSTER_NAME}-md-0 - namespace: ${CLUSTER_NAMESPACE} -spec: - template: - spec: - flavor: ${OPENSTACK_FLAVOR} - image: - filter: - name: ${OPENSTACK_IMAGE_NAME} - sshKeyName: ${OPENSTACK_SSH_KEY_NAME} - identityRef: - name: ${CLUSTER_NAME}-cloud-config - cloudName: ${OPENSTACK_CLOUD} ---- -apiVersion: bootstrap.cluster.x-k8s.io/v1beta2 -kind: CK8sConfigTemplate -metadata: - name: ${CLUSTER_NAME}-md-0 - namespace: ${CLUSTER_NAMESPACE} -spec: - template: - spec: - files: - - path: /usr/local/share/ca-certificates/vault-ca.crt - owner: root:root - permissions: "0644" - contentFrom: - secret: 
- name: ${CLUSTER_NAME}-cloud-config - key: cacert - preRunCommands: - - update-ca-certificates -TEMPLATE_EOF - -# envsubst to render -envsubst < "$WORK/cluster-template.yaml" > "$WORK/cluster-rendered.yaml" - -# Validate as YAML -python3 -c "import yaml; list(yaml.safe_load_all(open('$WORK/cluster-rendered.yaml'))); print('YAML OK')" - -# Quick visual check — no leftover ${...} markers -grep -n '\${' "$WORK/cluster-rendered.yaml" || echo "No unsubstituted variables — good" -``` - -> **CK8sConfig field name caveat:** the exact field names (`files`, -> `preRunCommands`) and their `contentFrom.secret` schema are CK8s-version- -> dependent. If `clusterctl init` failed earlier with schema warnings, -> consult the CK8s release notes for the pinned `$CK8S_VERSION`. - ---- - -## 14. Apply + poll-to-Ready - -Transfer rendered template to capi-mgmt and apply: - -```bash -scp "$WORK/cluster-rendered.yaml" ubuntu@$CAPI_MGMT_METAL_IP:/home/ubuntu/cluster.yaml - -ssh ubuntu@$CAPI_MGMT_METAL_IP <<'EOF' -set -euo pipefail -kubectl apply -f /home/ubuntu/cluster.yaml -echo "Applied. Waiting for cluster Available status (15-min timeout)..." - -for i in $(seq 1 90); do - STATUS=$(kubectl get cluster capi-mgmt-cluster -o json 2>/dev/null \ - | jq -r '.status.phase // "Unknown"') - READY=$(kubectl get cluster capi-mgmt-cluster -o json 2>/dev/null \ - | jq -r '.status.conditions[]? 
| select(.type=="Ready") | .status' \ - | head -1) - echo "$(date -Is) phase=$STATUS ready=$READY" - [ "$READY" = "True" ] && { echo "Cluster Ready"; break; } - sleep 10 -done - -# This cluster uses the CK8s control-plane provider, so list ck8scontrolplane (not kubeadmcontrolplane) -kubectl get cluster,machines,ck8scontrolplane,machinedeployment -A -EOF -``` - -**If the poll times out before Ready,** typical diagnosis: - -```bash -ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl describe cluster capi-mgmt-cluster -ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get machines -A -ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl logs -n capo-system deployment/capo-controller-manager --tail=100 -``` - -Common causes: - -- OpenStack API unreachable from capi-mgmt → check Vault CA install on capi-mgmt (§6) -- Image / flavor / network ID wrong in cluster template → re-check §11 variables -- Security group rules block kube-api LB → CAPO usually handles this; check OpenStackCluster status -- Application credential expired / wrong → re-check `$APP_CRED_ID` - ---- - -## 15. Extract workload kubeconfig - -```bash -ssh ubuntu@$CAPI_MGMT_METAL_IP -- clusterctl get kubeconfig capi-mgmt-cluster \ - > "$WORK/capi-mgmt-cluster.kubeconfig" -chmod 600 "$WORK/capi-mgmt-cluster.kubeconfig" - -# Sanity-check the workload cluster is reachable -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get nodes -# Expect: 1 control plane + 2 workers, all Ready -``` - -If `get nodes` times out, the cluster's API LB may not have allocated its -external IP yet, or the firewall rules don't permit jumphost → workload API: - -```bash -# What IP is the cluster's API LB on? -ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get openstackcluster capi-mgmt-cluster \ - -o json | jq '.status.externalNetwork, .status.controlPlaneEndpoint' - -# Test reachability -curl -sk --max-time 10 "https://:6443/version" && echo " ← reachable" || echo "API LB unreachable" -``` - ---- - -## 16. `clusterctl init` on target (workload cluster) - -The workload cluster must have the same CAPI providers installed before `move`. 
- -```bash -# Run from jumphost using the workload kubeconfig -KUBECONFIG="$WORK/capi-mgmt-cluster.kubeconfig" clusterctl init \ - --core "cluster-api:${CAPI_VERSION}" \ - --infrastructure "openstack:${CAPO_VERSION}" \ - --bootstrap "canonical-kubernetes:${CK8S_VERSION}" \ - --control-plane "canonical-kubernetes:${CK8S_VERSION}" \ - --cert-manager-version "${CERT_MANAGER_VERSION}" - -# ORC into workload cluster too -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" apply \ - -f "https://github.com/k-orc/openstack-resource-controller/releases/${ORC_VERSION}/orc.yaml" - -# Wait for everything Available -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" wait \ - --for=condition=Available --timeout=5m \ - deployment --all -n capi-system -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" wait \ - --for=condition=Available --timeout=5m \ - deployment --all -n capo-system -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" wait \ - --for=condition=Available --timeout=5m \ - deployment --all -n cert-manager -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" wait \ - --for=condition=Available --timeout=5m \ - deployment --all -n orc-system -``` - -> **cert-manager double-install caveat:** if CK8s already installed -> cert-manager during workload bootstrap, the second `clusterctl init` may -> warn or skip. Check existing cert-manager version against `$CERT_MANAGER_VERSION` -> — if they differ, version-skew issues may surface post-pivot. Adjust the -> pin in §4 or accept the existing version. Roosevelt's standard practice -> is to install cert-manager via `clusterctl init` only (don't pre-install -> via CK8s) — same approach valid here if you want clean version control. - ---- - -## 17. 
`clusterctl move` pivot - -Move all CAPI CRs from bootstrap k3s → workload cluster: - -```bash -# Stage the target kubeconfig on capi-mgmt (where clusterctl move runs) -scp "$WORK/capi-mgmt-cluster.kubeconfig" ubuntu@$CAPI_MGMT_METAL_IP:/home/ubuntu/target.kubeconfig - -# Dry-run first to catch issues before commit -ssh ubuntu@$CAPI_MGMT_METAL_IP -- clusterctl move \ - --to-kubeconfig=/home/ubuntu/target.kubeconfig \ - --dry-run - -# Inspect dry-run output: list of objects to be moved. Should include: -# - Cluster, OpenStackCluster, OpenStackClusterTemplate -# - Secrets (cloud-config) -# - Machine objects, OpenStackMachineTemplate -# - CK8sControlPlane, CK8sConfigTemplate -# - MachineDeployment -# Should NOT include cert-manager state (cert-manager manages its own state -# on each cluster independently) -``` - -**If dry-run looks correct, execute the move:** - -```bash -ssh ubuntu@$CAPI_MGMT_METAL_IP -- clusterctl move \ - --to-kubeconfig=/home/ubuntu/target.kubeconfig - -# Move can take several minutes. Output ends with: "moved successfully" -``` - ---- - -## 18. 
Post-pivot verification - -```bash -echo "=== Bootstrap k3s (should now be empty of cluster CRs) ===" -ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get cluster -A -# Expect: No resources found (or only a header) - -ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get machines -A -# Expect: No resources found - -ssh ubuntu@$CAPI_MGMT_METAL_IP -- kubectl get openstackcluster -A -# Expect: No resources found - -echo "" -echo "=== Workload cluster (should now own its own cluster CRs) ===" -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get cluster -A -# Expect: capi-mgmt-cluster shown, phase=Provisioned - -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get machines -A -# Expect: 3 machines (1 control-plane + 2 workers), all Running - -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get openstackcluster -A - -echo "" -echo "=== CAPI controllers in workload ===" -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get pods -A \ - | grep -E "(capi|capo|orc|cert-manager)" | grep -v "Running\|Completed" -# Expect: empty (all controller pods Running) - -echo "" -echo "=== OCCM not crash-looping (CRITICAL — main goal of TLS-verify work) ===" -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" get pods -n kube-system \ - -l k8s-app=openstack-cloud-controller-manager -# Expect: 1 pod Running, NOT CrashLoopBackOff - -kubectl --kubeconfig "$WORK/capi-mgmt-cluster.kubeconfig" logs -n kube-system \ - -l k8s-app=openstack-cloud-controller-manager --tail=50 \ - | grep -iE "(tls|cert|error)" | head -20 -# Expect: no TLS/cert errors; OCCM should be healthy -``` - -> **If OCCM crash-loops with "x509: certificate signed by unknown authority":** -> Vault CA distribution failed. Check (a) `/usr/local/share/ca-certificates/vault-ca.crt` -> exists on workload nodes; (b) `update-ca-certificates` ran (check `/etc/ssl/certs/ca-certificates.crt` -> for the Vault CA's subject); (c) the secret reference in CK8sConfigTemplate -> matched the secret name. 
SSH into a worker from the jumphost with the workload key generated in §11 (`ssh -i -> $WORK/capi-workload-key ubuntu@`) to diagnose. - ---- - -## 19. Handoff to runbook 05 - -The workload kubeconfig at `$WORK/capi-mgmt-cluster.kubeconfig` is the input to -`runbooks/05-magnum-capi-driver.md`. Copy it to a stable path: - -```bash -mkdir -p $HOME/magnum-capi -cp "$WORK/capi-mgmt-cluster.kubeconfig" $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig -chmod 600 $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig -echo "Workload kubeconfig staged at: $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig" -``` - -> **Important -- post-pivot semantic shift from Bobcat:** Magnum's -> `kubeconfig_file` setting (under `[capi_helm]` in -> `/etc/magnum/magnum.conf.d/99-capi.conf`, per D-007) now points to the -> workload cluster, not the bootstrap k3s. Bobcat had Magnum pointing at -> bootstrap k3s because the pivot was never executed. With pivot mandatory, -> Magnum's CAPI calls flow: -> -> ``` -> Magnum/leader → workload cluster API → CAPI controllers (running in workload) -> → create new Cluster CRs (tenant Magnum clusters) -> ``` -> -> The bootstrap k3s on capi-mgmt is now disposable. You could -> destroy capi-mgmt entirely at this point, because the workload cluster manages -> itself. (Roosevelt may actually do this for cost savings.) For v1 testcloud, -> leave capi-mgmt running so its k3s can be inspected for diagnostics. - ---- - -## 20. 
Roosevelt deltas (forward-look) - -| Aspect | Testcloud (v1) | Roosevelt | -|---|---|---| -| Workload image | Default `noble-amd64` from cloud-images.ubuntu.com | Custom image baked with Vault CA pre-installed (no runtime install step) | -| Vault CA distribution | CK8sConfig `files:` + `preRunCommands:` (this runbook) | Image-baked + CK8sConfig (defense in depth) | -| App credential lifetime | No expiry set (testcloud) | Short-lived rotating credentials via Vault auth method | -| Workload cluster control plane | 1 node | 3 nodes (HA) | -| Workload cluster workers | 2 nodes | Per-tenant sizing; HPA-driven | -| `clusterctl init --cert-manager-version` | Pin from §4 | Pin to Vault PKI cert-manager profile (separate Roosevelt prep) | -| capi-mgmt VM lifecycle post-pivot | Kept running for diagnostics | Destroyed (cost savings; pivot makes it disposable) | -| Version pinning record | `$HOME/deploy-records//capi-pins/` | Same pattern, captured in Vault as audit artifact | -| Authentication to GitHub API | Optional PAT | Mandatory PAT (avoid rate-limit during automated rebuilds) | - ---- - -## 21. Rotation/refresh of pins - -The pins captured in §4 will age. Recommended cadence: - -- **Per rebuild:** re-discover all pins (Step 1 of next execution will catch - natural drift). -- **Out-of-band patch:** if a CVE drops for any pinned component, run §4 - discovery alone and capture the new pin into `$DEPLOY_RECORD/`. Then for - the affected component only, follow the upgrade procedure from its - upstream docs (does NOT necessarily require this whole runbook re-run). - -For Roosevelt, this becomes a tracked maintenance window task. - ---- - -## 22. Change log - -| Date | Change | Reference | -|---|---|---| -| 2026-05-22 | Document created. Vault CA distribution (no tls-insecure), mandatory `clusterctl move` pivot, pin-at-execution version model. 
| Workstream 3b | diff --git a/runbooks/deprecated/05-magnum-capi-driver.md b/runbooks/deprecated/05-magnum-capi-driver.md deleted file mode 100644 index e1414e1..0000000 --- a/runbooks/deprecated/05-magnum-capi-driver.md +++ /dev/null @@ -1,529 +0,0 @@ -# Runbook 05 — Magnum CAPI Helm driver install - -**Status:** Executes after `04-magnum-domain.md` (Keystone wiring) and -`04a-capi-bootstrap-cluster.md` (workload cluster + kubeconfig staged). -Final post-deploy step to make Magnum capable of creating CAPI-managed -tenant K8s clusters. - -**Cross-references:** -- D-007 Layer B (Magnum two-layer install) -- D-017 (CAPI bootstrap cluster lifecycle) -- Runbook 04a §19 (workload kubeconfig handoff) -- Workstream 3c decision (2026-05-22): magnum-capi-helm 1.1.0 from PyPI; workload-cluster kubeconfig (NOT bootstrap k3s) - -**Known doc inconsistency (tracked for cleanup):** -D-007's Layer B currently states the kubeconfig points at "capi-mgmt.maas -bootstrap k3s". That language is correct for Bobcat (no pivot) but obsolete -post-workstream-3b (pivot mandatory). This runbook uses the workload cluster -kubeconfig as the canonical target. D-007 patch to follow in a workstream-3 -cleanup commit. - ---- - -## 1. Purpose & scope - -Graft the CAPI Helm driver onto the Charmed Magnum deployment so that -`openstack coe cluster create` provisions tenant K8s clusters via CAPI (in -the workload cluster) instead of via the deprecated Heat driver. - -**Output of this runbook:** - -- `magnum-capi-helm==1.1.0` installed on the magnum unit's system Python. -- `/etc/magnum/kubeconfig` populated with the workload cluster's - kubeconfig (post-pivot CAPI controller plane). -- `/etc/magnum/magnum.conf.d/99-capi.conf` configured with - `enabled_drivers = k8s_capi_helm_v1` and `[capi_helm] kubeconfig_file=`. -- Systemd overrides on `magnum-api` and `magnum-conductor` that replace - the init.d wrapper's ExecStart with explicit `--config-dir` invocation. 
-- Both services running cleanly with the CAPI driver loaded. - -**Scope:** v1 testcloud. Roosevelt deltas in §12. - -**Out of scope:** -- Magnum domain setup (runbook 04) -- Workload cluster lifecycle (runbook 04a) -- Smoketest tenant cluster creation is OPTIONAL (§11) — full validation - framework belongs in runbook 08. - ---- - -## 2. Decisions captured - -| Decision | Choice | Reason | -|---|---|---| -| Driver pin | `magnum-capi-helm==1.1.0` from PyPI | D-007 correction (stackhpc fork archived Dec 2024; canonical project on opendev/PyPI; 1.1.0 is last Caracal-cycle release) | -| Install method | `pip3 install --break-system-packages` | PEP 668 — Ubuntu 22.04+ requires explicit override for system-site-packages install | -| Install scope | System Python on magnum unit (not venv) | Magnum charm uses system-packaged python at `/usr/lib/python3/dist-packages/magnum/`; driver must import from same site | -| Kubeconfig target | Workload cluster (post-pivot) | Workstream 3b — bootstrap k3s is empty post-pivot; CAPI controllers live in workload | -| Kubeconfig source | `$HOME/magnum-capi/capi-mgmt-cluster.kubeconfig` (staged by 04a §19) | Documented handoff | -| Driver entry-point name | `k8s_capi_helm_v1` | Per upstream magnum-capi-helm 1.1.0; verify in §10 | -| Conf.d filename | `99-capi.conf` | Numeric prefix ensures it loads AFTER any charm-managed conf, so `enabled_drivers` override wins | -| File encoding | ASCII-only | Non-ASCII in conf.d causes silent magnum daemon failures (handoff lesson; cf. Horizon `local_settings.d` issue) | -| Trustee credential | Existing magnum-shared user (charm-managed) | Roosevelt will use app-credential pattern | - ---- - -## 3. 
Prerequisites - -| Prereq | Verification | -|---|---| -| Magnum charm active/idle | `juju status magnum \| grep magnum/0` shows `active idle` | -| Magnum domain setup completed (runbook 04) | `openstack domain show magnum \| grep enabled` returns `True` | -| Workload cluster reachable from jumphost | `kubectl --kubeconfig $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig get nodes` returns Ready nodes | -| CAPI controllers running in workload cluster | `kubectl --kubeconfig $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig get pods -n capi-system \| grep -v Running \| grep -v NAME` empty | -| Workload kubeconfig staged at expected path | `test -r $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig && stat -c %a $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig` shows `600` | -| `juju exec` works to magnum/leader (use exec, NOT ssh, for non-interactive — handoff lesson) | `juju exec --unit magnum/leader -- hostname` returns the unit hostname | - -**Set shell context:** - -```bash -export WORK=$HOME/magnum-capi -export WORKLOAD_KUBECONFIG=$WORK/capi-mgmt-cluster.kubeconfig -export DRIVER_VERSION=magnum-capi-helm==1.1.0 # per D-007 correction -cd "$WORK" -``` - -> **`juju ssh` vs `juju exec` choice:** the handoff lessons explicitly call -> out that `juju ssh` hangs when stdout is redirected (PTY allocation issue). -> This runbook uses `juju exec` for all non-interactive command execution and -> reserves `juju ssh` only for cases where you actually want an interactive -> shell. - ---- - -## 4. Pre-flight: capture current state - -Capture the magnum unit's state BEFORE making changes. Useful for diagnosis -if anything goes wrong, and as a record of what was changed. 
- -```bash -mkdir -p "$WORK/pre-state" - -# Service unit files (as managed by charm) -juju exec --unit magnum/leader -- \ - 'sudo systemctl cat magnum-api magnum-conductor 2>&1' \ - > "$WORK/pre-state/systemd-units.txt" - -# Currently-enabled drivers -juju exec --unit magnum/leader -- \ - 'sudo grep -r enabled_drivers /etc/magnum/ 2>/dev/null || echo "(no enabled_drivers found — charm default applies)"' \ - > "$WORK/pre-state/drivers-pre.txt" - -# Python site-packages — see what's already installed -juju exec --unit magnum/leader -- \ - 'sudo pip3 list 2>/dev/null | grep -iE "magnum|cluster|helm|kubernetes" || true' \ - > "$WORK/pre-state/pip-pre.txt" - -# conf.d state -juju exec --unit magnum/leader -- \ - 'sudo ls -la /etc/magnum/magnum.conf.d/ 2>/dev/null || echo "(no conf.d directory)"' \ - > "$WORK/pre-state/confd-pre.txt" - -# Service running state -juju exec --unit magnum/leader -- \ - 'sudo systemctl is-active magnum-api magnum-conductor' \ - > "$WORK/pre-state/service-state-pre.txt" - -# Display the captured state -cat "$WORK/pre-state/"*.txt -``` - -> **What to look for in pre-state:** the charm-managed `enabled_drivers` value -> probably includes Heat-based drivers (`heat_kubernetes`, etc.). The 99-capi.conf -> override in §7 replaces this with the single CAPI driver. The pre-state -> capture documents what was active before the override took effect. - ---- - -## 5. 
Install magnum-capi-helm 1.1.0 - -```bash -juju exec --unit magnum/leader -- \ - "sudo pip3 install $DRIVER_VERSION --break-system-packages" -``` - -**Verify install:** - -```bash -juju exec --unit magnum/leader -- \ - 'sudo pip3 show magnum-capi-helm | head -10' -# Expect: Name: magnum-capi-helm -# Version: 1.1.0 -# Location: /usr/lib/python3/dist-packages - -juju exec --unit magnum/leader -- \ - 'sudo python3 -c "import magnum_capi_helm; print(magnum_capi_helm.__file__)"' -# Expect: /usr/lib/python3/dist-packages/magnum_capi_helm/__init__.py -``` - -**Check that the driver entry point is registered:** - -```bash -juju exec --unit magnum/leader -- \ - 'sudo python3 -c " -from stevedore import driver -mgr = driver.DriverManager( - namespace=\"magnum.drivers\", - name=\"k8s_capi_helm_v1\", - invoke_on_load=False -) -print(\"Driver class:\", mgr.driver) -"' -# Expect: Driver class: -# (or similar — the actual class path is package-version-dependent) -``` - -> If the entry point check fails with "No 'k8s_capi_helm_v1' driver found", -> the driver name in 1.1.0 may differ from what D-007 documented. Inspect the -> installed package's `entry_points.txt`: -> -> ```bash -> juju exec --unit magnum/leader -- \ -> 'sudo cat /usr/lib/python3/dist-packages/magnum_capi_helm*.dist-info/entry_points.txt 2>/dev/null' -> ``` -> -> Find the entry under `[magnum.drivers]` — use that exact name in §7. - ---- - -## 6. Stage workload kubeconfig on magnum unit - -```bash -# Transfer kubeconfig from jumphost to magnum unit -juju scp "$WORKLOAD_KUBECONFIG" magnum/leader:/tmp/kubeconfig - -# Install with correct ownership/mode in one atomic step -juju exec --unit magnum/leader -- \ - 'sudo install -m 0640 -o root -g magnum /tmp/kubeconfig /etc/magnum/kubeconfig && sudo rm /tmp/kubeconfig' -``` - -**Verify:** - -```bash -juju exec --unit magnum/leader -- \ - 'sudo ls -la /etc/magnum/kubeconfig' -# Expect: -rw-r----- 1 root magnum ... 
/etc/magnum/kubeconfig - -# Confirm magnum user can read it -juju exec --unit magnum/leader -- \ - 'sudo -u magnum cat /etc/magnum/kubeconfig | head -3' -# Expect: apiVersion: v1 / clusters: / - cluster: - -# Confirm kubectl can use it from the magnum unit (sanity check on API reachability) -juju exec --unit magnum/leader -- \ - 'sudo -u magnum kubectl --kubeconfig /etc/magnum/kubeconfig get nodes 2>&1 | head -10' -# Expect: NAME ... STATUS=Ready for control plane + workers -# OR: kubectl not installed (acceptable — magnum-capi-helm uses Python client, not kubectl) -``` - -> **Why mode 0640 and group magnum:** kubeconfig contains auth tokens. Mode -> 0600 (owner-only) wouldn't let the `magnum` system user (which runs -> magnum-api/conductor) read it. Mode 0640 with `group: magnum` is the -> minimum-permission setup that works. NOT 0644 — keeps it off other users -> on the unit. - ---- - -## 7. Configure `/etc/magnum/magnum.conf.d/99-capi.conf` - -Generate the conf locally first (snap confinement does not apply to plain -bash on jumphost, but we keep paths under `$HOME` for consistency), then -transfer. - -**ASCII-only verification is critical** — the handoff documents non-ASCII -characters in `conf.d` files causing silent daemon failures (cf. Horizon -`local_settings.d`). Use plain straight quotes, ASCII dashes, no smart -typography. 
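The ASCII-only requirement above can also be checked byte by byte rather than relying on the `file` heuristic alone. The following is a minimal sketch, not part of the original procedure: `is_ascii_only` is a hypothetical helper name, and it assumes a `grep` that honours `LC_ALL=C` so that each byte is compared individually.

```bash
# Hypothetical helper (an assumption, not from the runbook): succeed only if
# every byte in the file is printable ASCII or ASCII whitespace.
is_ascii_only() {
  local f="$1"
  [ -r "$f" ] || { echo "cannot read: $f" >&2; return 2; }
  # In the C locale, [:print:] is 0x20-0x7e and [:space:] covers tab, newline
  # and similar; any other byte (including all UTF-8 multi-byte sequences)
  # matches the negated class.
  if LC_ALL=C grep -q '[^[:print:][:space:]]' "$f"; then
    return 1
  fi
  return 0
}
```

Typical use on the jumphost would be `is_ascii_only "$WORK/99-capi.conf" || echo "non-ASCII bytes present"`, run before the file is transferred to the unit.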
- -```bash -# Write locally -cat > "$WORK/99-capi.conf" <<'EOF' -[DEFAULT] -enabled_drivers = k8s_capi_helm_v1 - -[capi_helm] -kubeconfig_file = /etc/magnum/kubeconfig -EOF - -# Verify it is pure ASCII (no UTF-8 sneakers) -file "$WORK/99-capi.conf" -# Expect: ASCII text -# If it says "UTF-8 Unicode text", STOP and rewrite by hand — even one stray -# em-dash or smart quote will silently break magnum - -# Hex dump check (paranoid mode) -xxd "$WORK/99-capi.conf" | grep -v "^[0-9a-f]*: [0-9a-f ]* [a-zA-Z0-9 \[\]=._/]*$" | head -5 -# Expect: empty output (all bytes are printable ASCII) -``` - -**Stage and install:** - -```bash -juju scp "$WORK/99-capi.conf" magnum/leader:/tmp/99-capi.conf - -juju exec --unit magnum/leader -- \ - 'sudo mkdir -p /etc/magnum/magnum.conf.d && sudo install -m 0644 -o root -g root /tmp/99-capi.conf /etc/magnum/magnum.conf.d/99-capi.conf && sudo rm /tmp/99-capi.conf' - -# Verify -juju exec --unit magnum/leader -- \ - 'sudo ls -la /etc/magnum/magnum.conf.d/ && sudo cat /etc/magnum/magnum.conf.d/99-capi.conf' -# Expect: file listed; content matches what was written -``` - ---- - -## 8. Systemd override on magnum-api + magnum-conductor - -The Charmed Magnum unit files use a wrapper pattern: - -``` -ExecStart=/etc/init.d/magnum-api systemd-start -``` - -The wrapper does NOT pass `--config-dir` to magnum-api, so `/etc/magnum/magnum.conf.d/` -is never loaded. The 99-capi.conf would have no effect. - -Override with explicit `--config-file` + `--config-dir` invocation. 
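Whether a service is actually started with `--config-dir` can be read from the effective unit. The sketch below is an illustration only: `has_config_dir` is a hypothetical helper that reads the output of `systemctl show <unit> -p ExecStart` on stdin and reports whether the command line carries the flag. It should fail against the init.d wrapper and succeed once an explicit invocation is in place.

```bash
# Hypothetical helper (an assumption, not from the runbook): exit 0 if the
# ExecStart text on stdin passes --config-dir, non-zero otherwise.
has_config_dir() {
  # "--" ends option parsing so the pattern is not read as a grep flag.
  grep -q -- '--config-dir[= ]'
}
```

For example: `juju exec --unit magnum/leader -- 'sudo systemctl show magnum-api -p ExecStart' | has_config_dir && echo "conf.d is loaded"`.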
- -**Generate override files locally:** - -```bash -cat > "$WORK/magnum-api-override.conf" <<'EOF' -[Service] -ExecStart= -ExecStart=/usr/bin/magnum-api --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d -EOF - -cat > "$WORK/magnum-conductor-override.conf" <<'EOF' -[Service] -ExecStart= -ExecStart=/usr/bin/magnum-conductor --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d -EOF - -# ASCII check -file "$WORK/magnum-api-override.conf" "$WORK/magnum-conductor-override.conf" -# Expect: ASCII text x2 -``` - -> **The empty `ExecStart=` line is critical.** Systemd accumulates ExecStart -> directives by default; an empty assignment is required to CLEAR the inherited -> directive before setting the replacement. Without the empty line, the unit -> would have BOTH the init.d wrapper AND the new direct invocation, and would -> likely fail to start. - -**Install on the unit:** - -```bash -juju scp "$WORK/magnum-api-override.conf" magnum/leader:/tmp/magnum-api-override.conf -juju scp "$WORK/magnum-conductor-override.conf" magnum/leader:/tmp/magnum-conductor-override.conf - -juju exec --unit magnum/leader -- \ - 'sudo mkdir -p /etc/systemd/system/magnum-api.service.d /etc/systemd/system/magnum-conductor.service.d && \ - sudo install -m 0644 -o root -g root /tmp/magnum-api-override.conf /etc/systemd/system/magnum-api.service.d/override.conf && \ - sudo install -m 0644 -o root -g root /tmp/magnum-conductor-override.conf /etc/systemd/system/magnum-conductor.service.d/override.conf && \ - sudo rm /tmp/magnum-api-override.conf /tmp/magnum-conductor-override.conf' - -# Reload systemd to pick up the overrides -juju exec --unit magnum/leader -- 'sudo systemctl daemon-reload' - -# Verify the overrides are effective (systemctl cat shows combined unit + overrides) -juju exec --unit magnum/leader -- 'sudo systemctl cat magnum-api | grep -A1 ExecStart' -# Expect: TWO ExecStart= lines — the empty clear-line and the new /usr/bin/magnum-api 
invocation -juju exec --unit magnum/leader -- 'sudo systemctl cat magnum-conductor | grep -A1 ExecStart' -# Expect: TWO ExecStart= lines as above for magnum-conductor -``` - -> **Charm reconciliation note:** the Magnum charm may rewrite its own systemd -> units on config changes or upgrades. The drop-in override at -> `/etc/systemd/system/magnum-api.service.d/override.conf` is OUTSIDE the -> charm's writable zone and should survive. Verify after any `juju refresh` or -> `juju config magnum` command by re-running the `systemctl cat` check above. - ---- - -## 9. Restart services + verify health - -```bash -juju exec --unit magnum/leader -- \ - 'sudo systemctl restart magnum-api magnum-conductor' - -# Wait briefly for services to initialize -sleep 5 - -# Check active state -juju exec --unit magnum/leader -- \ - 'sudo systemctl is-active magnum-api magnum-conductor' -# Expect: active (x2) - -# Examine recent journal for errors (the critical step — magnum's silent failure -# mode means we must read logs, not just trust is-active) -juju exec --unit magnum/leader -- \ - 'sudo journalctl -u magnum-api --since "2 minutes ago" --no-pager | tail -50' -juju exec --unit magnum/leader -- \ - 'sudo journalctl -u magnum-conductor --since "2 minutes ago" --no-pager | tail -50' -``` - -**Look for these red flags in the logs:** - -| Symptom | Likely cause | Remediation | -|---|---|---| -| `ImportError: No module named magnum_capi_helm` | §5 pip install failed | Re-run §5; check pip3 output | -| `EntryPointError: No 'k8s_capi_helm_v1' driver` | Driver entry-point name mismatch | Verify name per §5 footnote; update §7 | -| Service repeatedly restarts (look for "Started" appearing twice in 10s) | Likely a config error in 99-capi.conf | Re-check ASCII-only; check magnum.conf.d permissions | -| `kubeconfig_file` not honored | --config-dir not being passed | §8 override not active; re-run `systemctl daemon-reload` | -| Silent: no error but driver also not loading | Non-ASCII char snuck 
into a conf | `file /etc/magnum/magnum.conf.d/99-capi.conf` — if it says UTF-8, regenerate | - ---- - -## 10. CAPI driver enablement check - -Verify the driver is actually loaded by Magnum and reachable via the API. - -```bash -source $HOME/admin-openrc - -# List supported COE drivers via the Magnum API -openstack coe cluster template list -f json -# (empty templates list is fine — we are checking the endpoint responds) - -# Direct check on the unit: scan the service's loaded drivers -juju exec --unit magnum/leader -- \ - 'sudo journalctl -u magnum-conductor --since "5 minutes ago" --no-pager | grep -iE "driver|enabled" | head -20' -# Expect: a line mentioning k8s_capi_helm_v1 having been loaded -# (Magnum logs the loaded drivers at startup) - -# Definitive check: try creating a cluster template that requires the CAPI driver -openstack coe cluster template create magnum-capi-driver-check \ - --image noble-amd64 \ - --keypair capi-workload-key \ - --external-network ext_net \ - --master-flavor capi-mgmt-node \ - --flavor capi-mgmt-node \ - --coe kubernetes \ - --network-driver calico \ - --labels kube_tag=v1.31.4 - -openstack coe cluster template show magnum-capi-driver-check -c name -c coe -c labels -``` - -> **If template create fails with "driver not enabled" or similar:** the -> Magnum API process is not loading the conf.d. Verify the systemd override -> took effect — `sudo systemctl show magnum-api -p ExecStart` on the unit -> should show the explicit `--config-dir` invocation. If it still shows the -> init.d wrapper, the daemon-reload + restart did not pick up the override. - -**Cleanup the driver-check template:** - -```bash -openstack coe cluster template delete magnum-capi-driver-check -``` - ---- - -## 11. Optional smoketest — create a tenant CAPI cluster - -This step is **optional**. Full validation belongs in runbook 08. 
Use this -smoketest only if you want immediate confirmation that the entire chain -(Magnum API -> conductor -> magnum-capi-helm -> CAPI controllers in workload -cluster -> tenant K8s cluster on tenant VMs) works end-to-end. - -```bash -# Create a cluster template tuned for testcloud smoketest -openstack coe cluster template create magnum-smoketest-template \ - --image noble-amd64 \ - --keypair capi-workload-key \ - --external-network ext_net \ - --master-flavor capi-mgmt-node \ - --flavor capi-mgmt-node \ - --coe kubernetes \ - --network-driver calico \ - --labels boot_volume_size=20,kube_tag=v1.31.4,octavia_provider=ovn - -# Create a 1+1 cluster (minimum for smoketest) -openstack coe cluster create magnum-smoketest \ - --cluster-template magnum-smoketest-template \ - --master-count 1 \ - --node-count 1 - -# Poll for status (15-20 min typical; CAPI provisions tenant VMs end-to-end) -for i in $(seq 1 60); do - STATUS=$(openstack coe cluster show magnum-smoketest -c status -f value) - echo "$(date -Is) status=$STATUS" - case "$STATUS" in - CREATE_COMPLETE) echo "Smoketest passed"; break ;; - CREATE_FAILED) echo "Smoketest FAILED"; openstack coe cluster show magnum-smoketest; exit 1 ;; - esac - sleep 30 -done - -# Retrieve the smoketest cluster's kubeconfig -openstack coe cluster config magnum-smoketest --dir "$WORK/smoketest-kubeconfig" - -# Sanity-check the smoketest cluster -KUBECONFIG="$WORK/smoketest-kubeconfig/config" kubectl get nodes -KUBECONFIG="$WORK/smoketest-kubeconfig/config" kubectl get pods -A | head -20 - -# Cleanup the smoketest cluster -openstack coe cluster delete magnum-smoketest -openstack coe cluster template delete magnum-smoketest-template -``` - -> **What success looks like:** the CAPI controllers in the workload cluster -> receive the new Cluster CR (created by magnum-capi-helm in response to the -> Magnum API call), CAPO talks to OpenStack to provision tenant VMs, the -> tenant VMs join the new K8s cluster, and the new cluster has 1 control 
-> plane + 1 worker Ready. Octavia provides the API server LB (visible as a -> Floating IP in the tenant project). - ---- - -## 12. Roosevelt deltas (forward-look) - -| Aspect | Testcloud (v1) | Roosevelt | -|---|---|---| -| Driver pin source | PyPI `magnum-capi-helm==1.1.0` | Internal mirror with checksum verification | -| Driver pin record | Implicit in this runbook | Captured in Vault as audit artifact alongside CAPI pins | -| Kubeconfig source | Workload cluster (post-pivot per 04a §17) | Same | -| Kubeconfig rotation | Manual on capi-mgmt rebuild | Automated when workload cluster cert rotates | -| Trustee credential | Charm-default magnum-shared user | Per-tenant app credentials via Vault auth method | -| Magnum HA | num_units=1 (per D-009 testcloud) | num_units=3 with hacluster + provider VIP | -| Driver upgrade discipline | Manual re-run of §5 | Tracked maintenance window; Vault audit log | -| Systemd override | Drop-in at `/etc/systemd/system/magnum-*.service.d/override.conf` | Same — but provided via a charm overlay package, not manual file install | -| ASCII-only enforcement | Manual check (§7, §8) | Pre-flight lint in `scripts/pre-flight-checks.sh` | - ---- - -## 13. Documented runtime gotchas (carry-forward from handoff) - -These gotchas burned cycles during the Bobcat Magnum CAPI work. Each is -explicitly handled in this runbook; collecting them here for visibility: - -1. **PEP 668 `--break-system-packages`** (§5). Ubuntu 22.04+ refuses - `pip install` against system Python by default. The flag is required for - the magnum-capi-helm install path used by Charmed Magnum. -2. **`juju ssh` hangs on stdout redirect.** PTY allocation issue. - This runbook uses `juju exec` for all non-interactive command execution. -3. **Heredoc nesting in `juju ssh` is fragile.** This runbook writes - conf files locally first and uses `juju scp` + `juju exec install` to - transfer — single-level only. -4. 
**Non-ASCII characters in `conf.d` files cause silent daemon failures.** - §7 and §8 both include `file ` ASCII verification before transfer. -5. **`openstack -f value -c X -c Y` outputs in alphabetical field order, - not flag order.** This runbook uses single-column queries or `-f json | - jq` throughout. -6. **Charm-managed `enabled_drivers` is overridden, not appended.** The - `enabled_drivers = k8s_capi_helm_v1` line in 99-capi.conf REPLACES the - charm-default value (which would include the deprecated Heat drivers). -7. **The systemd override empty `ExecStart=` line is required** to clear - the inherited ExecStart before setting the replacement (§8). -8. **Snap-confined `openstack` CLI cannot read `/tmp`.** This runbook stages - files under `$WORK=$HOME/magnum-capi`. The smoketest in §11 also writes - to `$WORK/smoketest-kubeconfig`. - ---- - -## 14. Change log - -| Date | Change | Reference | -|---|---|---| -| 2026-05-22 | Document created. magnum-capi-helm 1.1.0 from PyPI; workload-cluster kubeconfig (post-pivot per workstream 3b); systemd override pattern; ASCII-only conf.d. | Workstream 3c | diff --git a/runbooks/deprecated/06-tenant-setup.md b/runbooks/deprecated/06-tenant-setup.md deleted file mode 100644 index 3915229..0000000 --- a/runbooks/deprecated/06-tenant-setup.md +++ /dev/null @@ -1,41 +0,0 @@ -# Runbook 06 — Tenant Resource Recreation - -**STATUS: PLACEHOLDER** — drafted post-deploy. - -## Purpose - -Recreate the standard testcloud tenant resources (domain, project, user, -networks, images, keypairs, flavors) using a proper IPAM-aligned design -per D-010 + D-016 (not the ad-hoc `user1` pattern from the original test -cloud). 
- -## Prerequisites - -- Cloud fully deployed and validated -- DNS zones populated (Runbook 07 may precede this if Designate-via-tenant - DNS is in scope at tenant create time) -- NetBox IPv4 tenant pool prefix present (per D-016; default `10.20.0.0/16`) - -## TODO - -- [ ] Create domain `domain1` -- [ ] Create project `project1` in domain `domain1` -- [ ] Create user `user1` in project1 (member role + load-balancer_member - role for Octavia) -- [ ] Tenant network with CIDR carved from NetBox IPv4 tenant pool - - Suggested convention: `10.20..0/24` per D-016 - - project1 → `10.20.1.0/24` - - Per D-016 hybrid model, the per-project /24 is Neutron-managed and - NOT added back to NetBox -- [ ] Tenant router connected to ext_net (Provider 10.12.4.0/22) -- [ ] Glance image: noble-amd64 (cloud-init enabled) -- [ ] Flavor m1.small (1 vCPU, 2 GiB RAM, 20 GiB root) -- [ ] Keypair for user1 -- [ ] openrc files: `~/admin-openrc`, `~/user1-openrc` -- [ ] Application credentials for user1 (audit trail) -- [ ] Take second KVM snapshot (per D-012 Snapshot 2) - -## v1 vs. v2 note - -In v1, tenant networks are IPv4-only. v2 adds IPv6 tenant subnets carved -from the v2 IPv6 tenant pool (currently reservation status in NetBox). diff --git a/runbooks/deprecated/07-dns-zones.md b/runbooks/deprecated/07-dns-zones.md deleted file mode 100644 index 3b780de..0000000 --- a/runbooks/deprecated/07-dns-zones.md +++ /dev/null @@ -1,36 +0,0 @@ -# Runbook 07 — Designate Zones and Records (v1: A records only) - -**STATUS: PLACEHOLDER** — drafted post-deploy. - -## Purpose - -Create the cloud's DNS zones in Designate, populate API VIP A records -(v1: IPv4 only), and configure Neutron defaults to push Designate as -tenant DNS resolver. 
- -## Prerequisites - -- Designate charm in `active/idle` -- Keystone, Neutron API reachable -- API VIP hostnames already in `/etc/hosts` on all OpenStack nodes - (per D-008 Layer 0 bootstrap) - -## TODO - -- [ ] Create primary zone: - `openstack zone create --email admin@neumatrix.local \ - omega.dc0.vr0.cloud.neumatrix.local.` -- [ ] Populate API VIP **A** records for each public service: - - keystone, glance, nova, neutron, cinder, placement, octavia, - barbican, magnum, horizon, designate - - **v1: A records only** (IPv4 VIPs from the Provider API VIP range - 10.12.4.224-.254) - - **v2 will add AAAA records when IPv6 Provider VIPs become active** -- [ ] Configure Neutron defaults: - `juju config neutron-api default-dns-domain=omega.dc0.vr0.cloud.neumatrix.local.` - `juju config neutron-api dns-domain=omega.dc0.vr0.cloud.neumatrix.local.` -- [ ] Configure Neutron DHCP to push Designate as resolver: - `juju config neutron-api dns-servers=` -- [ ] Verify from a test tenant VM: - `nslookup keystone.omega.dc0.vr0.cloud.neumatrix.local` - resolves to Provider API VIP diff --git a/runbooks/deprecated/08-validate.md b/runbooks/deprecated/08-validate.md deleted file mode 100644 index 7f7acfc..0000000 --- a/runbooks/deprecated/08-validate.md +++ /dev/null @@ -1,33 +0,0 @@ -# Runbook 08 — Validation (Roosevelt-Rehearsal Bar) - -**STATUS: PLACEHOLDER** — drafted with scripts/validate.sh. - -## Purpose - -Execute the validation criteria from D-011 and confirm the cloud is ready to -be considered a successful rebuild. 
- -## Prerequisites - -- All prior runbooks complete - -## Validation criteria (per D-011) - -- [ ] All charms `active/idle` in `juju status` -- [ ] All public API VIPs respond on FQDN from jumphost -- [ ] All public API VIPs respond on FQDN from a tenant VM (Option B path) -- [ ] Octavia LB pattern passes: create LB, two members, round-robin verified, - failover verified, recovery verified -- [ ] Magnum CAPI cluster create end-to-end: cluster template + cluster create, - OCCM does not crash-loop, cluster reaches CREATE_COMPLETE -- [ ] Vault unseal + auto-unseal-after-reboot pattern: reboot vault unit, - confirm auto-unseal via etcd (or manual unseal per HA pattern) -- [ ] Designate resolves API hostnames from tenant subnet -- [ ] Snapshot 1 (post-deploy, pre-tenant) taken (per D-012) -- [ ] Snapshot 2 (post-tenant) taken (per D-012) - -## TODO - -- [ ] Run `scripts/validate.sh` and capture output -- [ ] Document any divergences from validation criteria in - `docs/design-decisions.md` change log diff --git a/runbooks/deprecated/README.md b/runbooks/deprecated/README.md deleted file mode 100644 index b884f0e..0000000 --- a/runbooks/deprecated/README.md +++ /dev/null @@ -1,28 +0,0 @@ -# Deprecated v1 Runbooks - -The runbooks in this directory have been superseded by the -`runbooks/v1-do-doc-NN-*.md` execution documents (or, in the case of -`07-dns-zones.md`, deferred to v2 entirely per D-019). - -They are preserved here so the audit trail from the early v1 drafting -phase remains accessible. **Do not execute them.** The v1 deploy is -gated through the do-document set. 
- -## Replacement map - -| Deprecated runbook | Replacement | -|---|---| -| `00-pre-deploy.md` | superseded by D-017 + D-018 (no per-cycle backups; direct MAAS teardown); `v1-do-doc-01-prep.md` covers prep | -| `01a-octavia-pki-generation.md` | `v1-do-doc-02-pki.md` | -| `02-deploy.md` | `v1-do-doc-04-deploy.md` | -| `03-vault-init.md` | `v1-do-doc-05-vault-init.md` | -| `04-magnum-domain.md` | `v1-do-doc-06-magnum-domain.md` | -| `04a-capi-bootstrap-cluster.md` | `v1-do-doc-07-capi-bootstrap.md` | -| `05-magnum-capi-driver.md` | `v1-do-doc-08-magnum-driver.md` | -| `06-tenant-setup.md` | `v1-do-doc-09-tenant.md` | -| `07-dns-zones.md` | **deferred to v2 per D-019** (no v1 replacement) | -| `08-validate.md` | `v1-do-doc-10-validate.md` | - -`01-destroy-model.md` is **not** in this directory - it remains active in -`runbooks/` and is referenced as a conditional sub-procedure by -`v1-do-doc-03-destroy.md`. \ No newline at end of file diff --git a/runbooks/phase-00-teardown-maas-reset.md b/runbooks/phase-00-teardown-maas-reset.md new file mode 100644 index 0000000..94bf9ba --- /dev/null +++ b/runbooks/phase-00-teardown-maas-reset.md @@ -0,0 +1,243 @@ +# Phase 00 -- Teardown + MAAS Reset + +Destroy the `openstack` Juju model and reset the four MAAS hosts to a clean, +deploy-ready state: OSD secondary disks wiped, storage-class NICs linked, and the +MAAS VIP/FIP address carve in place. This is the rebuild-prep window -- it runs +BEFORE phase-01, because the VIP block must be MAAS-reserved before the bundle +deploys onto it, and `link-subnet` only works on a Ready (not Deployed) machine. + +Decisions: D-018 (skip graceful; MAAS-release-direct; supersedes D-013), D-017 +(full rebuild every cycle, nothing preserved), KI-P3-001 (the VIP carve fix). +Troubleshooting: appendix-A -- DOCFIX-016 (never `maas list` -- API-key leak), +DOCFIX-017 (no `maas whoami`; hardcode the eyeballed system_ids), R7 (sudo for +libvirt/qemu-img), KI-P3-001. + +!!! DESTRUCTIVE. 
Phase 1 (destroy-model + release) and Phase 2 (OSD wipe) are + irreversible short of the KVM snapshots (the D-017 safety net). Each destructive + step is DISCRETE and individually gated -- do not batch. + +CAPI-MGMT NOTE: this teardown releases the FOUR openstack hosts only. The MAAS +`capi-mgmt` VM is the RETIRED D-033 out-of-cloud node; the in-cloud `capi-mgmt-v2` +tenant VM (phase-06) replaces it. Leave `capi-mgmt` Ready (its separate Phase-7 +teardown is out of scope here). (The older 01-destroy-model.md released 5 VMs incl. +capi-mgmt -- that was the D-033 era; do NOT release it on the current rebuild.) + +--- + +## Prerequisites +- KVM snapshots of openstack0-3 exist (safety net). Authenticated juju session + (`juju whoami`). MAAS CLI logged in as profile `admin`. +- Run from jumphost `vopenstack-jesse` (user `jessea123`, sudo; also the libvirt hypervisor). + +## Constants and env-literals +- MAAS profile: `admin` (DOCFIX-016: NEVER `maas list` -- it prints the API key). +- system_ids (hardcode; DOCFIX-017, no `maas whoami`): openstack0=`4na83t`, + openstack1=`qdbqd6`, openstack2=`h8frng`, openstack3=`tmsafc`. +- MAAS subnet ids: 1=provider 10.12.4.0/22, 2=metal 10.12.8.0/22, 6=data 10.12.12.0/22, + 7=storage 10.12.16.0/22, 8=replication 10.12.20.0/22. +- per-host storage NIC octet = 40 + index: data 10.12.12.4N, storage 10.12.16.4N, replication 10.12.20.4N. + +## Run-location legend +- `# RUN: jumphost` -- `juju` + `maas admin`; the jumphost is also the libvirt hypervisor (sudo). + +--- + +## Phase 0 -- Pre-flight (READ-ONLY; run before teardown) +`# RUN: jumphost` +```bash +( { + echo "=== 0a. five network spaces (hard blocker if absent) ===" + juju spaces # expect metal 10.12.8.0/22 | provider 10.12.4.0/22 | data 10.12.12.0/22 | storage 10.12.16.0/22 | replication 10.12.20.0/22 + + echo "=== 0b. 
VIP ipranges (note the front-loaded ones to KEEP + the stale .224-.254 to remove) ===" + maas admin ipranges read \ + | jq -r '.[] | "id=\(.id)\ttype=\(.type)\t\(.start_ip)-\(.end_ip)\tsubnet=\(.subnet.cidr // "?")\t\(.comment // "")"' | sort + # KEEP: provider 10.12.4.2-.63, metal 10.12.8.2-.63 (bundle VIPs live here), provider FIP 10.12.5.0-10.12.7.254. + # STALE: metal 10.12.8.224-.254 (old scheme) -> its id feeds Phase 4 (this arc: id=2). + + echo "=== 0c. storage-class NIC link state on all four hosts (drives Phase 3) ===" + for SID in 4na83t qdbqd6 h8frng tmsafc; do echo " -- $SID --" + maas admin interfaces read "$SID" | jq -r '.[] | select(.name|test("^enp(8|9|10)s0$")) + | " \(.name)\tid=\(.id)\tlinks=\([.links[]?|{(.subnet.cidr):.ip_address}])"' + done # enp8s0(data) is the one KNOWN unlinked + a HARD deploy prereq; enp9s0/enp10s0 usually already linked +} ) +``` +```bash +# 0d. OSD-wipe pre-flight gate -- post-teardown these are "shut off"; vdb is root:root / 600. (R7: sudo) +for host in openstack0 openstack1 openstack2 openstack3; do + f="/var/lib/libvirt/images/${host}-1.qcow2" + printf '%-46s state=%s owner=%s mode=%s\n' "$f" \ + "$(sudo virsh -c qemu:///system domstate "$host" 2>/dev/null)" \ + "$(sudo stat -c '%U:%G' "$f" 2>/dev/null)" "$(sudo stat -c '%a' "$f" 2>/dev/null)" +done # expect (AFTER Phase 1 release): 4 lines, state=shut off, owner=root:root, mode=600 +``` + +## Phase 1 -- Teardown (D-018) DISCRETE / DESTRUCTIVE +`# RUN: jumphost` +```bash +# A. pre-destroy capture (reference only; NOT for restore) +TS=$(date -u +%Y%m%dT%H%M%SZ); BACKUP_DIR=$HOME/backups/pre-caracal-destroy-$TS; mkdir -p "$BACKUP_DIR" +juju export-bundle > "$BACKUP_DIR/bundle-pre-destroy.yaml" +juju status --format=yaml > "$BACKUP_DIR/juju-status-pre-destroy.yaml" +for f in "$BACKUP_DIR"/*.yaml; do [ -s "$f" ] || echo "WARNING: $f empty"; done +echo "$BACKUP_DIR" > "$HOME/.last-pre-caracal-destroy-backup"; ls -la "$BACKUP_DIR" +``` +```bash +# B. 
destroy the openstack model (returns ~1-2 min; reaping ~5-10 min background). Controller untouched. +juju destroy-model openstack --force --no-wait --destroy-storage --no-prompt +``` +```bash +# C. release the FOUR openstack hosts by system_id (DOCFIX-017: hardcoded ids, no whoami). NOT capi-mgmt. +for SID in 4na83t qdbqd6 h8frng tmsafc; do + echo "Releasing $SID..."; maas admin machine release "$SID" comment="Caracal rebuild teardown $TS" +done +``` +```bash +# D. verify +juju models # expect: no 'openstack' (allow a few min) +maas admin machines read \ + | jq -r '.[] | select(.hostname|test("^openstack[0-3]$")) | "\(.hostname)\t\(.status_name)"' | sort + # expect four lines, each ending "Ready" +``` +GATE: `juju models` shows no `openstack`; openstack0-3 all Ready. (`link-subnet` is +REJECTED on a Deployed machine -- Phases 2-3 REQUIRE Ready.) If the model is still +`destroying` after ~10 min: `juju machines -m openstack --format=yaml`, then +`juju remove-machine -m openstack --force ` for each lingering id, then re-run the +destroy-model in B. + +## Phase 2 -- OSD secondary-disk wipe (clean-slate Ceph) DISCRETE / DESTRUCTIVE +`# RUN: jumphost (libvirt host; R7 sudo)` Only after Phase 0d is GREEN (all "shut +off") AND explicit go. vda (the OS disk) is NOT touched -- MAAS reinstalls it on +deploy; only vdb (the OSD target) is recreated blank. +```bash +OWNER="root:root"; MODE="600" +for host in openstack0 openstack1 openstack2 openstack3; do + f="/var/lib/libvirt/images/${host}-1.qcow2" + echo "=== Wiping $f ===" + sudo rm -f "$f" + sudo qemu-img create -f qcow2 "$f" 512G + sudo chown "$OWNER" "$f"; sudo chmod "$MODE" "$f" + sudo ls -lh "$f" +done +# verify +for host in openstack0 openstack1 openstack2 openstack3; do + sudo qemu-img info "/var/lib/libvirt/images/${host}-1.qcow2" | grep -E 'virtual size|disk size' +done +``` +GATE: 4 files, ~200 KiB actual / 512 GiB virtual, root:root mode 600. 
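
The ownership and mode half of this gate can be checked mechanically instead of by eye. The sketch below is one possible approach under stated assumptions: `check_osd_image` is a hypothetical helper, the optional `SUDO` variable is an assumed convention for prefixing `stat` with `sudo` (R7), and GNU `stat -c` is assumed, as in the Phase 0d pre-flight.

```bash
# Hypothetical helper (an assumption, not from the runbook): compare one
# image file's owner and mode against the expected values.
# Usage: check_osd_image <path> <owner:group> <octal-mode>
check_osd_image() {
  local f="$1" want_owner="$2" want_mode="$3" owner mode
  # ${SUDO:-} expands to nothing unless the caller sets SUDO=sudo.
  owner=$(${SUDO:-} stat -c '%U:%G' "$f") || return 2
  mode=$(${SUDO:-} stat -c '%a' "$f") || return 2
  if [ "$owner" = "$want_owner" ] && [ "$mode" = "$want_mode" ]; then
    echo "OK   $f ($owner, $mode)"
  else
    echo "FAIL $f (got $owner/$mode, want $want_owner/$want_mode)"
    return 1
  fi
}
```

On the libvirt host this would be called once per image, for example `SUDO=sudo check_osd_image /var/lib/libvirt/images/openstack0-1.qcow2 root:root 600`. The virtual and actual sizes still need the `qemu-img info` check shown in the wipe block.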
+ +## Phase 3 -- Storage-class NIC links (idempotent; machines Ready) +`# RUN: jumphost` Links every storage-class NIC to its space's subnet. enp8s0 (data) +is the one KNOWN unlinked and a HARD deploy prereq (nova-compute:neutron-plugin->data, +octavia:ovsdb-cms->data, chassis data bindings). enp9s0/enp10s0 back the C2 Ceph +public/cluster bindings; this links them too only if not already linked. +```bash +declare -A NIC_CIDR=( [enp8s0]=10.12.12.0/22 [enp9s0]=10.12.16.0/22 [enp10s0]=10.12.20.0/22 ) +declare -A HOST_OCTET=( [4na83t]=40 [qdbqd6]=41 [h8frng]=42 [tmsafc]=43 ) +declare -A HN=( [4na83t]=openstack0 [qdbqd6]=openstack1 [h8frng]=openstack2 [tmsafc]=openstack3 ) + +for SID in 4na83t qdbqd6 h8frng tmsafc; do + echo "=== ${HN[$SID]} ($SID) ===" + IFJSON=$(maas admin interfaces read "$SID") + for NIC in enp8s0 enp9s0 enp10s0; do + cidr="${NIC_CIDR[$NIC]}"; prefix="${cidr%.0/22}"; ip="${prefix}.${HOST_OCTET[$SID]}" + ifid=$(echo "$IFJSON" | jq -r --arg n "$NIC" '.[]|select(.name==$n)|.id') + if [ -z "$ifid" ]; then echo " $NIC: NOT FOUND -- inspect 'maas admin interfaces read $SID'"; continue; fi + linked=$(echo "$IFJSON" | jq -r --arg c "$cidr" --argjson id "$ifid" \ + '[.[]|select(.id==$id).links[]?|select(.subnet.cidr==$c)]|length') + if [ "$linked" != "0" ]; then echo " $NIC id=$ifid already on $cidr -- SKIP"; continue; fi + subid=$(maas admin subnets read | jq -r --arg c "$cidr" '.[]|select(.cidr==$c)|.id') + echo " $NIC id=$ifid -> $ip (subnet id=$subid, $cidr)" + maas admin interface link-subnet "$SID" "$ifid" mode=STATIC subnet="$subid" ip_address="$ip" + done +done + +# verify -- every host should now show data/storage/replication links +for SID in 4na83t qdbqd6 h8frng tmsafc; do + echo "=== ${HN[$SID]} ($SID) ===" + maas admin interfaces read "$SID" \ + | jq -r '.[] | select(.name|test("^enp(8|9|10)s0$")) | " \(.name)\t\([.links[]?|{(.subnet.cidr):.ip_address}])"' +done +``` +GATE: each host's enp8s0/enp9s0/enp10s0 shows a 10.12.{12,16,20}.4N STATIC 
link. + +## Phase 4 -- MAAS VIP/FIP address carve (mutation; confirm-first) +`# RUN: jumphost` The bundle's VIPs live in the front-loaded /26 blocks; the FIP +pool (phase-04) lives at 10.12.5.0-10.12.7.254. These MAAS reservations persist +across teardown, so on a repeat rebuild they usually already exist -- verify, create +only if absent, and delete the stale old-scheme reservation. (KI-P3-001: a reserved +range stops MAAS auto-static landing a primary on a configured VIP.) +```bash +# 4a. verify current state +maas admin ipranges read | jq -r '.[] | "id=\(.id)\t\(.type)\t\(.start_ip)-\(.end_ip)\tsubnet=\(.subnet.cidr // "?")\t\(.comment // "")"' | sort +# want present: provider .4.2-.63 (subnet 1), metal .8.2-.63 (subnet 2), provider FIP .5.0-.7.254. +# want absent : metal .8.224-.254 (stale). +``` +```bash +# 4b. create the front-loaded VIP reservations ONLY if absent (idempotent; carve doc section 8) +( { + RANGES="$(maas admin ipranges read)" + [ -n "$RANGES" ] || { echo "ipranges read failed/empty -- ABORT (do not create blind)"; exit 1; } + # provider VIPs 10.12.4.2-.63 (subnet 1) + if printf '%s' "$RANGES" | jq -e '.[]|select(.start_ip=="10.12.4.2" and .end_ip=="10.12.4.63")' >/dev/null; then + echo "provider .4.2-.63 present -- SKIP" + else + maas admin ipranges create type=reserved subnet=1 start_ip=10.12.4.2 end_ip=10.12.4.63 \ + comment="OpenStack public API HA VIPs (front-loaded /26; supersedes .224-.236)" + fi + # metal VIPs 10.12.8.2-.63 (subnet 2) + if printf '%s' "$RANGES" | jq -e '.[]|select(.start_ip=="10.12.8.2" and .end_ip=="10.12.8.63")' >/dev/null; then + echo "metal .8.2-.63 present -- SKIP" + else + maas admin ipranges create type=reserved subnet=2 start_ip=10.12.8.2 end_ip=10.12.8.63 \ + comment="OpenStack internal/admin API HA VIPs (front-loaded /26; supersedes D-020 .224-.254)" + fi +} ) +``` +```bash +# 4c. 
delete the stale .224-.254 metal reservation -- CONFIRM the id from 4a first (this arc: id=2) +# maas admin iprange delete <id> +``` +GATE: `ipranges read` shows provider FIP + provider VIPs .4.2-.63 + metal VIPs +.8.2-.63; the metal .8.224-.254 reservation is gone; the metal DHCP dynamic +(10.12.9.0-10.12.11.254) is unchanged. + +## Phase 5 -- Post-prep verification (READ-ONLY gate before deploy) +`# RUN: jumphost` +```bash +( { + juju spaces # 5 spaces present + maas admin machines read | jq -r '.[]|select(.hostname|test("^openstack[0-3]$"))|"\(.hostname)\t\(.status_name)"' | sort # all Ready + for SID in 4na83t qdbqd6 h8frng tmsafc; do echo "-- $SID --" + maas admin interfaces read "$SID" | jq -r '.[]|select(.name|test("^enp(8|9|10)s0$"))|" \(.name)\t\([.links[]?|{(.subnet.cidr):.ip_address}])"' + done # data/storage/replication links on all four + for host in openstack0 openstack1 openstack2 openstack3; do + sudo qemu-img info "/var/lib/libvirt/images/${host}-1.qcow2" | grep -E 'virtual size|disk size' + done # OSD 512G blank +} ) +``` + +--- + +## EXIT GATE (phase-00 complete) +- `juju models` shows no `openstack`; openstack0-3 all Ready. +- OSD vdb files 512 GiB blank (root:root, 600) on all four hosts. +- enp8s0/enp9s0/enp10s0 linked (10.12.{12,16,20}.4N STATIC) on all four. +- MAAS carve: front-loaded VIP /26 reserved on provider + metal; FIP pool reserved; + stale .224-.254 gone. +- Clean slate ready for phase-01 (deploy). NOTE: the deploy uses ONE overlay + (octavia-pki only) -- NOT the vr0-dc0-testcloud overlay (R10; that overlay's intent + is folded into the hardened base bundle). + +## As-built reference (rebuild-prep arc -- audit trail) +- Teardown D-018: `juju destroy-model openstack --force --no-wait --destroy-storage + --no-prompt`; release the four hosts by system_id (capi-mgmt left Ready). +- OSD wipe proven 2026-05-22, re-run 2026-05-30: 512G blank, root:root, 600.
+- NIC links: enp8s0 found UNLINKED this arc (the hard prereq); enp9s0/enp10s0 already + linked. Reference enp8s0 ids (arc): openstack1=26, openstack2=32, openstack3=38; + openstack0 resolved dynamically (the block does not depend on these). +- MAAS carve: front-loaded .2-.63 reservations created earlier and persistent; stale + metal .224-.254 was iprange id=2 (deleted after confirmation). + +## Next +phase-01 -- bundle deploy. diff --git a/runbooks/phase-01-bundle-deploy.md b/runbooks/phase-01-bundle-deploy.md new file mode 100644 index 0000000..a6a28b9 --- /dev/null +++ b/runbooks/phase-01-bundle-deploy.md @@ -0,0 +1,297 @@ +# Phase 01 -- Bundle Deploy + +Deploy the hardened bundle + the octavia-pki overlay onto the freshly-prepped MAAS +machines, and verify it settles to the expected PRE-vault-init state (zero errors, +vault awaiting init, the TLS consumers awaiting vault certs). Vault init is phase-02. + +Decisions: B5 (IP-only), D-019 (no designate), D-020 (dual provider+metal VIPs), +R14 (VIPs front-loaded to .50-.60), Section-G NIC bindings. Troubleshooting: +appendix-A -- R14 (VIP relocation), R15 (.10 phantom resolver), L1 (no `set -e` on +count-gate blocks), L3 (metal-side dual-VIP eyeball check), DOCFIX-016 (maas list leak). + +--- + +## Prerequisites (must be true entering phase-01) +- phase-00 done: 4 machines Ready/power=off; MAAS carve applied (front-loaded VIP /26 + reserved, FIP pool reserved, stale iprange gone); enp8s0 data NIC linked on ALL four + hosts; OSD `/dev/vdb` wiped blank. +- `overlays/octavia-pki.yaml` present (Step 1.0). +- Hardened `bundle.yaml` in the working dir (channels pinned; VIPs `.50-.60`; + reserved-host-memory 8192; image-conversion; use-policyd-override). + +## Constants and env-literals +- MAAS system_ids: openstack0=`4na83t`, openstack1=`qdbqd6`, openstack2=`h8frng`, openstack3=`tmsafc`. 
+- MAAS subnet ids: 1=provider 10.12.4.0/22, 2=metal 10.12.8.0/22, 6=data 10.12.12.0/22, + 7=storage 10.12.16.0/22, 8=replication 10.12.20.0/22, 9=lbaas 10.12.32.0/22. +- expected plan: 50 apps, 97 relations, 4 machines (bundle 8/9/10/11 -> juju 0/1/2/3), 24 LXD. + +## Run-location legend +- `# RUN: jumphost` -- `juju` + `maas admin` (MAAS profile is `admin`; never `maas list` -- DOCFIX-016). + +--- + +## Step 1.0 -- Octavia PKI overlay (secret-handling prereq) DISCRETE +`overlays/octavia-pki.yaml` carries the 5 lb-mgmt-* PKI keys (controller CA/cert, +issuing CA key+passphrase+cert). It is the ONLY overlay in the deploy command and is +secret-safe + ASCII. PRIMARY path: reuse the existing validated overlay (the CAs are +10y, so it survives rebuilds). REGENERATION path (fresh CAs): run the discrete secret +procedure inlined as "Step 1.0-GEN" at the end of this phase. Either way, confirm the +overlay parses and contains exactly the 5 keys (sanity block below) before deploying. +```bash +# RUN: jumphost -- sanity only (does NOT print key material) +[ -f overlays/octavia-pki.yaml ] && grep -cE 'lb-mgmt-' overlays/octavia-pki.yaml # expect 5 keys +LC_ALL=C grep -nP '[^\x00-\x7F]' overlays/octavia-pki.yaml && echo "NON-ASCII" || echo "ASCII clean" +``` + +## Step 1.1 -- Pre-deploy verify (read-only; 4 checks) +`# RUN: jumphost` One consolidated read-only block. NO `set -e` (a guarded count of +0 is a valid answer, not a failure -- appendix-A: L1); count greps are `|| true`. +```bash +( { + echo "=== CHECK 1: bundle VIPs (quote-tolerant, octet-anchored) ===" + grep -nE '^[[:space:]]+vip:' bundle.yaml + TOT=$(grep -cE '^[[:space:]]+vip:[[:space:]]*"?10\.12\.4\.' 
bundle.yaml || true) + HI=$(grep -cE '^[[:space:]]+vip:[[:space:]]*"?10\.12\.4\.(5[0-9]|60)("|$|[[:space:]])' bundle.yaml || true) + LO=$(grep -cE '^[[:space:]]+vip:[[:space:]]*"?10\.12\.4\.(1[0-9]|20)("|$|[[:space:]])' bundle.yaml || true) + echo " provider VIPs total=$TOT in .50-.60=$HI in .10-.20(stale)=$LO (want 11/11/0)" + # metal side is the second token of each dual vip; eyeball that all 11 are .8.50-.60, + # clear of metal infra .8.10(maas)/.8.20(lxd)/.8.21(capi)/.8.30(juju) -- appendix-A: L3. + + echo "=== CHECK 2: enp8s0 data NIC linked on ALL FOUR hosts (10.12.12.0/22) ===" + for SID in 4na83t qdbqd6 h8frng tmsafc; do + echo -n " $SID: " + maas admin interfaces read "$SID" | jq -r '.[] | select(.name=="enp8s0") + | [.links[]? | select(.subnet.cidr=="10.12.12.0/22") | .ip_address] | join(",")' + done # expect 10.12.12.40 / .41 / .42 / .43 (select by .subnet.cidr -> robust to id drift) + + echo "=== CHECK 3: subnet DNS resolvers ===" + for ID in 1 2 6 7 8 9; do maas admin subnet read "$ID" | jq -c '{id,cidr,dns_servers}'; done + # expect subnet 1 (provider) -> [10.12.4.1]; 2/6/7/8/9 -> [10.12.8.1] + + echo "=== CHECK 4a: nodes Ready / power off ===" + maas admin machines read | jq -r '.[] | select(.system_id|IN("4na83t","qdbqd6","h8frng","tmsafc")) + | "\(.hostname) \(.status_name) power=\(.power_state)"' +} ) +``` +```bash +# CHECK 4b: OSD /dev/vdb blank (run on each host; sudo required -- appendix-A: R7) +for h in openstack0 openstack1 openstack2 openstack3; do + echo "== $h ==" + ssh jessea123@$h "sudo qemu-img info /var/lib/libvirt/images/${h}-1.qcow2 | grep -E 'virtual size|disk size'" &1 | tee /tmp/jmodels.txt + if grep -qE '(^|[[:space:]]|/)openstack([[:space:]*]|$)' /tmp/jmodels.txt; then + echo "ABORT: an 'openstack' model already exists (teardown is phase-00)"; + elif [ ! 
-f overlays/octavia-pki.yaml ]; then + echo "ABORT: overlays/octavia-pki.yaml missing (Step 1.0)"; + else + juju add-model openstack + juju deploy ./bundle.yaml --overlay overlays/octavia-pki.yaml -m openstack --dry-run + fi +} ) +``` +GATE (from the plan): 50 apps, 97 relations, 4 machines (8/9/10/11 -> 0/1/2/3), 24 LXD; +ceph-osd/0-3 one per node; nova-compute/0-2 on machines 1/2/3 ONLY (machine 0 = +OSD+LXD host, no compute); channels match the matrix; relations include +`octavia:certificates - vault:certificates`, `vault:shared-db - vault-mysql-router`, +`mysql-innodb-cluster:certificates - vault:certificates`; NO `vault:ha`, NO designate +(D-019). Only the two benign R11 warnings (L34 `name`, L55 `variables`). + +## Step 1.3 -- Deploy (VIP-guarded) +`# RUN: jumphost` Re-run the VIP guard inline (the dry-run never echoes vip values), +then deploy only if 11/11/0. +```bash +( { + TOT=$(grep -cE '^[[:space:]]+vip:[[:space:]]*"?10\.12\.4\.' bundle.yaml || true) + HI=$(grep -cE '^[[:space:]]+vip:[[:space:]]*"?10\.12\.4\.(5[0-9]|60)("|$|[[:space:]])' bundle.yaml || true) + LO=$(grep -cE '^[[:space:]]+vip:[[:space:]]*"?10\.12\.4\.(1[0-9]|20)("|$|[[:space:]])' bundle.yaml || true) + if [ "$TOT" = 11 ] && [ "$HI" = 11 ] && [ "$LO" = 0 ]; then + juju deploy ./bundle.yaml --overlay overlays/octavia-pki.yaml -m openstack + else + echo "ABORT: VIP guard failed (total=$TOT hi=$HI lo=$LO; want 11/11/0)" + fi +} ) +``` + +## Step 1.4 -- DNS gate during deploy (as machines come up) +`# RUN: jumphost` Run when machine 0 reaches `started`, then per LXD unit as they +appear (flag BEFORE the target; logic inside the remote quotes; no outer 2>/dev/null): +```bash +juju ssh -m openstack 0 -- 'resolvectl status | grep -i "DNS Server"; getent hosts api.snapcraft.io && echo OK || echo FAIL' +# repeat for ceph-mon/0, mysql-innodb-cluster/0 as they appear +``` +GATE: each returns OK (api.snapcraft.io resolves -> the snap install storm proceeds +clean). 
FINDING (non-blocking, R15): the unreachable region resolver `10.12.8.10` +(MAAS region/rack controller, advertised on the metal VLAN independent of the subnet +field) may still appear in a node's resolver list -- resolution succeeds because +systemd-resolved deprioritizes `.10` and falls through to `.1`. Latent fragility if +`.1` ever drops; understand/eliminate for Roosevelt. (appendix-A: R15.) + +--- + +## EXIT GATE (phase-01 complete) +- Deploy settled to the PRE-vault-init end state: + * ZERO units in `error`. + * mysql-innodb-cluster x3 ACTIVE ("Cluster is ONLINE"). + * vault/0 BLOCKED "Vault needs to be initialized" (the phase-02 trigger, not a fault). + * Waiting on vault certs (expected pre-init): ovn-central x3, ovn-chassis x3 + (incl nova-compute subordinates), ovn-chassis-octavia, neutron-api-plugin-ovn, barbican-vault. + * octavia BLOCKED "Awaiting configure-resources" (D-021); gss unknown (pre-run). +- Section-G NIC payoff confirmed (no subset/binding errors): ceph-mon -> storage 10.12.16.x; + octavia -> data 10.12.12.1; nova-compute -> data 10.12.12.4x; vault -> metal 10.12.8.x. +- Proceed to phase-02 (vault init). + +## As-built reference (2026-06-03 second redeploy -- audit trail) +- `juju deploy ./bundle.yaml --overlay overlays/octavia-pki.yaml -m openstack` on maas/default (cred maas-api). +- Plan: 50 apps / 97 relations / 4 machines / 24 LXD; placement as above. +- Pre-deploy verify: VIPs 11/11/0; enp8s0 -> 10.12.12.40-43 (all 4); subnet DNS as above; nodes Ready; OSD blank. +- Settled: zero errors; mysql /0 R/W (10.12.8.173), /1 (.179) /2 (.185) R/O; vault blocked needs-init. + +## Next +phase-02 -- vault bring-up. + +--- + +## Step 1.0-GEN -- Octavia management-PKI generation (regeneration path) DISCRETE / SECRET +Run ONLY if you are not reusing an existing `overlays/octavia-pki.yaml`. Produces the +two-tier EC PKI for Charmed Octavia's amphora trust domain and writes the overlay. 
+Decisions (Workstream 3a, 2026-05-22): fresh generation; EC P-384 CAs (SHA-384, 10y); +EC P-256 controller cert (2y); overlay-file distribution (gitignored); artifacts under +`$HOME/octavia-pki/`; passphrases = 32 random bytes base64 (44 chars). SECRET step -- +do NOT echo key material; the only printed values are cert dates/subjects and verify OK. + +The five `octavia` charm options the overlay sets: +- `lb-mgmt-issuing-cacert` = base64(issuing CA cert) +- `lb-mgmt-issuing-ca-private-key` = base64(issuing CA ENCRYPTED key) +- `lb-mgmt-issuing-ca-key-passphrase` = the issuing CA passphrase (PLAIN string, NOT base64) +- `lb-mgmt-controller-cacert` = base64(controller CA cert) +- `lb-mgmt-controller-cert` = base64(controller cert + key, concatenated) + +### 1.0-GEN.0 -- workspace (openssl 3.x; $HOME only -- snap home-confinement, never /tmp) +```bash +# RUN: jumphost +WORKDIR="$HOME/octavia-pki" +mkdir -p "$WORKDIR"/issuing-ca "$WORKDIR"/controller-ca "$WORKDIR"/controller +chmod 700 "$WORKDIR" +openssl version # expect OpenSSL 3.x +``` + +### 1.0-GEN.a -- Issuing CA (EC P-384, AES-256 encrypted key, self-signed 10y) +```bash +( { + WORKDIR="$HOME/octavia-pki"; cd "$WORKDIR/issuing-ca" || exit 1 # dir from 1.0-GEN.0 + openssl rand -base64 32 | tr -d '\n' > passphrase.txt + chmod 600 passphrase.txt + test "$(wc -c < passphrase.txt)" -eq 44 || { echo "ABORT: issuing passphrase length != 44"; exit 1; } + openssl genpkey -algorithm EC -pkeyopt ec_paramgen_curve:P-384 \ + -aes-256-cbc -pass file:passphrase.txt -out issuing-ca.key.enc + chmod 600 issuing-ca.key.enc + openssl req -new -x509 -sha384 -key issuing-ca.key.enc -passin file:passphrase.txt \ + -days 3650 -subj "/CN=VR0 DC0 Omega Cloud Octavia Issuing CA/O=Neumatrix" \ + -out issuing-ca.cert.pem + openssl x509 -in issuing-ca.cert.pem -noout -dates -subject + openssl verify -CAfile issuing-ca.cert.pem issuing-ca.cert.pem # expect: OK +} ) +``` + +### 1.0-GEN.b -- Controller CA (EC P-384, AES-256 encrypted key,
self-signed 10y; own passphrase) +The controller CA key is encrypted (its own passphrase) for future controller-cert +rotation -- Octavia never receives this key, only the controller CA cert. +```bash +( { + WORKDIR="$HOME/octavia-pki"; cd "$WORKDIR/controller-ca" || exit 1 # dir from 1.0-GEN.0 + openssl rand -base64 32 | tr -d '\n' > passphrase.txt + chmod 600 passphrase.txt + test "$(wc -c < passphrase.txt)" -eq 44 || { echo "ABORT: controller passphrase length != 44"; exit 1; } + openssl genpkey -algorithm EC -pkeyopt ec_paramgen_curve:P-384 \ + -aes-256-cbc -pass file:passphrase.txt -out controller-ca.key.enc + chmod 600 controller-ca.key.enc + openssl req -new -x509 -sha384 -key controller-ca.key.enc -passin file:passphrase.txt \ + -days 3650 -subj "/CN=VR0 DC0 Omega Cloud Octavia Controller CA/O=Neumatrix" \ + -out controller-ca.cert.pem + openssl x509 -in controller-ca.cert.pem -noout -dates -subject + openssl verify -CAfile controller-ca.cert.pem controller-ca.cert.pem # expect: OK +} ) +``` + +### 1.0-GEN.c -- Controller cert (EC P-256 UNENCRYPTED, SAN, signed by Controller CA, 2y) +The P-256 key is unencrypted -- Octavia reads it at startup. SAN carries the controller +FQDN, the octavia API FQDN, and the Octavia API VIP 10.12.4.233.
+```bash +( { + WORKDIR="$HOME/octavia-pki"; cd "$WORKDIR/controller" || exit 1 # dir from 1.0-GEN.0 + openssl genpkey -algorithm EC -pkeyopt ec_paramgen_curve:P-256 -out controller.key + chmod 600 controller.key + cat > controller.cnf <<'CNF' +[req] +distinguished_name = req_distinguished_name +req_extensions = v3_req +prompt = no + +[req_distinguished_name] +CN = octavia-controller.omega.dc0.vr0.cloud.neumatrix.local +O = Neumatrix + +[v3_req] +keyUsage = critical, digitalSignature, keyEncipherment +extendedKeyUsage = clientAuth, serverAuth +subjectAltName = @alt_names + +[alt_names] +DNS.1 = octavia-controller.omega.dc0.vr0.cloud.neumatrix.local +DNS.2 = octavia.omega.dc0.vr0.cloud.neumatrix.local +IP.1 = 10.12.4.233 +CNF + openssl req -new -sha256 -key controller.key -config controller.cnf -out controller.csr + openssl x509 -req -sha256 -in controller.csr \ + -CA ../controller-ca/controller-ca.cert.pem \ + -CAkey ../controller-ca/controller-ca.key.enc \ + -passin file:../controller-ca/passphrase.txt \ + -CAcreateserial -days 730 \ + -extfile controller.cnf -extensions v3_req \ + -out controller.cert.pem + openssl verify -CAfile ../controller-ca/controller-ca.cert.pem controller.cert.pem # expect: OK + openssl x509 -in controller.cert.pem -noout -ext subjectAltName # DNS x2 + IP present + openssl x509 -in controller.cert.pem -noout -dates + cat controller.cert.pem controller.key > controller.bundle.pem + chmod 600 controller.bundle.pem +} ) +``` + +### 1.0-GEN.d -- Write overlays/octavia-pki.yaml (base64 blobs + plaintext passphrase) +Four values are base64(PEM); the issuing-CA passphrase is a PLAIN string. The file is +gitignored. Set `$REPO` to the jumphost clone (the dir holding bundle.yaml + overlays/).
+```bash +( { + WORKDIR="$HOME/octavia-pki"; cd "$WORKDIR" || exit 1 # dir from 1.0-GEN.0 + REPO="${REPO:-$HOME/openstack-caracal-ipv4}" # adjust to the actual clone path + mkdir -p "$REPO/overlays" + ISS_CERT=$(base64 -w0 issuing-ca/issuing-ca.cert.pem) + ISS_KEY=$(base64 -w0 issuing-ca/issuing-ca.key.enc) + ISS_PASS=$(cat issuing-ca/passphrase.txt) + CON_CACERT=$(base64 -w0 controller-ca/controller-ca.cert.pem) + CON_CERT=$(base64 -w0 controller/controller.bundle.pem) + cat > "$REPO/overlays/octavia-pki.yaml" <