diff --git a/docs/design-decisions.md b/docs/design-decisions.md
index 691bf75..45452c2 100644
--- a/docs/design-decisions.md
+++ b/docs/design-decisions.md
@@ -1053,3 +1053,109 @@
 (matching its own `machines:` 8-11 declaration; the live 0-3 numbering was a deploy-time artifact).
 
 **Related:** D-009 (3 units on Roosevelt), BUNDLEFIX (bundle reverted to 3-unit).
+
+---
+
+## D-063: PROPOSED / OPEN -- tighten the capi-mgmt-cluster security group ingress for Roosevelt
+
+**Status:** PROPOSED / OPEN (recorded 2026-07-01, phase-07 conductor graft). No action taken on v1.
+
+**Context:** the phase-06 in-cloud CAPI management VM (`capi-mgmt-v2`, D-035) sits behind the
+`capi-mgmt-sg` security group. As built by phase-06 (a CAPI/cluster-api default posture), that SG
+opens BOTH `tcp/6443` (apiserver) AND `tcp/22` (ssh) to `0.0.0.0/0`. Phase-07 Step 7.1 relies on
+this: the conductor reaches the apiserver via the FIP with no per-conductor rule (which is why the
+old hardcoded `10.12.4.76/32` rule-add was dropped in DOCFIX-063).
+
+**Question:** on v1 (single-DC virtual rehearsal) the FIP is the sole access point and `0.0.0.0/0`
+on 6443/22 is acceptable. For ROOSEVELT (commercial, multi-DC, HARD tenant isolation) an apiserver
+and ssh open to any source that can route to the FIP is too broad.
+
+**Options (unresolved):**
+- (a) Tighten `tcp/6443` ingress to the magnum-conductor's measured source (and any operator
+  tooling), and `tcp/22` to the ops/jumphost source only. Requires knowing the conductor's
+  post-NAT source as the mgmt VM sees it (measured, not the tenant/provider literal).
+- (b) Front the mgmt apiserver with a scoped ingress (e.g. a dedicated LB/SG chain) rather than a
+  raw FIP with a wide SG.
+- (c) Leave wide on the rehearsal, and make the tightened SG a phase-06 build step on Roosevelt.
+
+No decision made -- recorded as an open hardening item (cf. D-043, D-050, also pending). Whichever
+option, phase-07 Step 7.1 stays verify-first (it does not depend on the SG being wide; if 6443 is
+NOT already permitted it measures the source and adds exactly that rule).
+
+**Related:** D-035 (the mgmt VM), D-039 (per-cluster app-creds), phase-06 (SG creation),
+phase-07 Step 7.1 (DOCFIX-063 verify-first).
+
+
+---
+
+## D-064: Reconcile D-051 to the SCS Domain Manager standard (scs-0302); fix create-op policy templating
+
+**Status:** ADOPTED 2026-07-01 (phase-08 keystone-policy blocker). Reconciles D-051 to the
+authoritative SCS reference and discharges D-051's [LIVE-READ PENDING] base_* alignment gate.
+Behavioral acceptance = Magnum trustee create + tenant G3 (below).
+
+**Context / trigger:** phase-08 workload-cluster create failed at the FIRST step --
+create_trustee_and_trust -> identity:create_user HTTP 403. magnum_domain_admin holds Admin on
+the magnum domain (correct per Magnum docs + D-046) and was authenticated (401 would be auth;
+403 = authorization). Root cause: this cloud's charm-rendered /etc/keystone/policy.json
+(old-style defaults, enforce_scope=False) defines the create-op helper
+admin_and_matching_user_domain_id with legacy templating domain_id:%(user.domain_id)s, which
+does NOT resolve for create_user on Caracal (keystone populates target.user.domain_id, not the
+bare user.domain_id). The admin-on-domain create branch silently evaluates false and only
+cloud_admin passes -- the same defect that blocked tenant domain-admin self-service in the
+2026-06-22 northwind rehearsal. One bug, two symptoms (Magnum trustee + tenant self-service).
+
+**Finding (why D-051 alone did not fix it):** D-051 deliberately deviated from scs-0302 -- it
+dropped Section A (base_*) and pointed fallthroughs at the LIVE helpers by name, including the
+broken admin_and_matching_user_domain_id, and dropped the `or rule:admin_required` tail.
+Consequently D-051's create_user fallthrough inherited the broken helper; only its manager
+branch (correct token.domain.id:%(target.user.domain_id)s templating) worked. Attaching D-051
+as-staged would have fixed tenant self-service (manager) but NOT Magnum (admin-on-domain trustee).
+
+**Decision:** reconcile to scs-0302 by correcting the create-op fallthroughs to the cloud's
+CORRECT %(target.*.domain_id)s helpers (mirroring the standard's base_create_*) and complete the
+base_* alignment against the now fully-read live policy. Exactly 7 rules change:
+- create_user / create_project / create_group: fallthrough admin_and_matching__domain_id
+  (broken %(.domain_id)s) -> admin_and_matching_target__domain_id (correct
+  %(target..domain_id)s). BUG-FIX; restores the documented admin-on-domain create path and
+  unblocks Magnum. Confined (domain match); does NOT touch tenants (they hold manager, not
+  domain-admin).
+- add_user_to_group: admin_and_matching_group_domain_id -> admin_and_matching_target_group_domain_id.
+  ALIGNMENT (restore live default; D-051 had drifted to the broken helper).
+- list_users / list_projects / list_groups: fallthrough admin_required (D-051 conservative
+  [PENDING-LIVE-READ] guess, BROADER than live) -> cloud_admin or admin_and_matching_domain_id
+  (the live default). ALIGNMENT; behavior-preserving; removes an unintended widening.
+list_roles keeps admin_required (its live default). The manager persona (Section B),
+is_domain_manager = role:manager, and is_domain_managed_role = member + load-balancer_member
+(no admin, no manager -- anti-escalation) are UNCHANGED from D-051 / scs-0302.
+
+**Consequence:** with create_user's admin fallthrough fixed, magnum_domain_admin creates
+per-cluster trustees via the standard admin-on-domain path -- NO manager grant on the magnum
+service account is required (an in-session Option A now moot). Magnum keeps its documented
+trustee setup; the policy fix alone unblocks it.
+
+**Validation:** oslo.policy parses all 37 rules (the charm validates YAML only -- a malformed
+rule passes YAML and silently no-ops, cf. D-046 "reports OK while broken"); YAML + ASCII clean;
+malformed-connector lint clean; every fallthrough tail diffed against the live effective default.
+Behavioral gate (the real acceptance): (a) Magnum -- re-run phase-08 8.1, trustee create_user
+passes, cluster converges; (b) tenant G3 -- a manager-on-domain user self-services user/project
++ member/load-balancer_member within its domain (PASS), is denied admin/cross-domain (DENY),
+cloud-admin unaffected.
+
+**Known SCS limitation (carried, per standard):** the manager persona grants list_domains and
+list_roles cloud-wide (needed to resolve domain/role names). A tenant manager can ENUMERATE
+domain names/ids + role names (not access resources). Inherent to the scs-0302 transitional
+policy (upstream RBAC-scoping of domain list is a pending fix); the native 2024.2 persona closes
+it. On upgrade to 2024.2+ this override MUST be removed (scs-0302 + D-051 caution).
+
+**Mechanism:** zip overrides.zip domain-manager-policy.yaml -> juju attach-resource keystone
+policyd-override=overrides.zip (use-policyd-override already true). keystone PO (broken) -> PO:.
+Rollback: juju config keystone use-policyd-override=false. Full attach + G3 procedure in
+runbooks/appendix-C-identity-rbac.md.
+
+**Roosevelt:** single portable policy file replicated per-DC keystone; the attach + G3 gate
+becomes a deploy step. Remove on any 2024.2+ upgrade.
+
+**Related:** D-051 (reconciled here), D-046 (magnum trustee domain), D-039 (per-cluster app-cred
+roles), D-050 (resolved by supplying the zip), scs-0302-w1 (the authoritative standard),
+appendix-C (identity/RBAC reference). **Supersedes:** D-051's [LIVE-READ PENDING] gate (discharged).
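Reviewer sketch (not part of this change): D-064's validation relies on a lint plus a diff against the live defaults. A minimal, hypothetical pre-attach check in that spirit could look like the following. It assumes one rule per line in the `"identity:<op>": "<expr>"` form used by policies/domain-manager-policy.yaml; the function name `lint_create_fallthroughs` is illustrative and is not an existing script in this repo. It flags any of the four create-op rules named in D-064 that still reference a legacy helper lacking the `target_` infix.

```bash
#!/usr/bin/env bash
# Hypothetical lint (assumption: not part of the repo). Flags create-op rules
# that still fall through to a legacy helper without the target_ infix.
# Assumes one rule per line: "identity:<op>": "<expression>".
lint_create_fallthroughs() {
  local policy_file="$1"
  # The create-op rules D-064 says must use the target_ helper.
  local ops='create_user|create_project|create_group|add_user_to_group'
  # Legacy helper names (no target_ infix) that D-064 replaces.
  local legacy='rule:admin_and_matching_(user|project|group)_domain_id'
  # Select the create-op rules, then print any that still use a legacy helper.
  if grep -E "^\"identity:(${ops})\":" "$policy_file" | grep -E "$legacy"; then
    echo "LINT-FAIL: legacy helper on a create-op rule" >&2
    return 1
  fi
  echo "LINT-OK"
}
```

Run against the staged file before zipping, a non-zero return means a create-op rule still points at a legacy helper. This only covers the naming defect; it does not replace the oslo.policy parse or the behavioral gate.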
diff --git a/docs/v1-redeploy-changelog.md b/docs/v1-redeploy-changelog.md
index 1b03912..b32e51a 100644
--- a/docs/v1-redeploy-changelog.md
+++ b/docs/v1-redeploy-changelog.md
@@ -1153,7 +1153,163 @@
 DISCIPLINE (operator-directed 2026-06-30): reconcile scripts + commands + this changelog at the SUCCESSFUL
 completion of EACH phase, before starting the next. Deliver the per-phase reconciliation as a repo-relative ZIP.
 
+### 2026-07-01 -- Phase-06 executed + swept (in-cloud CAPI mgmt cluster, D-035) -- PASS; DOCFIX-062
+
+PHASE-06 progress this session (all steps executed clean; full sweep completed at phase-06 completion per the
+per-phase discipline above):
+- 6.0-BOOT (phase-06-bootstrap.sh) -- domain capi / project capi-mgmt / roles member+load-balancer_member+
+  reader on admin@admin_domain (D-039) / 5 flavors / image ubuntu-24.04-noble active raw public. PASS.
+- 6.0+6.1 (phase-06-net-setup.sh) -- keypair capi-mgmt-key, SG capi-mgmt-sg (22+6443), net capi-mgmt-net,
+  subnet capi-mgmt-subnet 10.20.0.0/24, router capi-mgmt-router ACTIVE gw set. PASS.
+- 6.2 (phase-06-mgmt-vm.sh) -- VM capi-mgmt-v2 ACTIVE (gp.large/noble). FIP 10.12.7.222, TENANT 10.20.0.207
+  -> ~/capi-mgmt-net.env (per-rebuild, non-deterministic; DOCFIX-038). PASS.
+- 6.3 GATE 1 (inline ssh egress probe) -- VIP-OK (Keystone VIP 10.12.4.50:5000) + NET-OK (1.1.1.1:443). PASS.
+- 6.4 k8s bootstrap (inline ssh nested-heredoc) -- k8s v1.32.13 (1.32-classic/stable); bootstrap-config.yaml
+  with cluster-config block (DOCFIX-024) + extra-sans 10.12.7.222 / 10.20.0.207; cluster ready, 1 voter node,
+  network+dns enabled. PASS.
+- 6.5 GATE 2 THE D-035 GATE (inline) -- agnhost pod-egress probe -> Keystone VIP exitCode:0 (Succeeded).
+  Single-NIC pod egress proven (the exact test the dual-homed D-033 node failed). PASS.
+- 6.6a-6.6d (inline ssh-to-VM) -- CAPI provider stack on the mgmt VM: tooling pinned from
+  capi-helm-charts@0.25.1 dependencies.json (D-034: CAPI v1.13.2 / CAPO v0.14.4 / CERT v1.20.2 / ORC v2.5.0 /
+  CAAPH 0.12.0 / JANITOR 0.11.0 / HELM v3.17.3); cert-manager v1.20.2 (crds.enabled=true, DOCFIX-025a);
+  ORC v2.5.0 server-side apply (images.openstack.k-orc.cloud CRD present) BEFORE clusterctl init;
+  clusterctl init core+kubeadm+CAPO all condition met (capo-system Available first pass -- ORC-first order
+  correct). PASS through 6.6d.
+- 6.6e (inline) -- CAAPH (cluster-api-addon-provider 0.12.0) + janitor (cluster-api-janitor-openstack 0.11.0)
+  via azimuth helm charts; both Running (addon-provider took one benign first-boot restart while cert-manager
+  minted its webhook cert, then stable 1/1). PASS.
+- 6.6f (inline) -- verify: clusterctl v1.13.2; all controllers 1/1 Running (cert-manager x3, capi core/
+  bootstrap/control-plane, capo, orc, addon, janitor); all 4 key CRDs present (clusters / openstackclusters /
+  kubeadmcontrolplanes / images.openstack.k-orc.cloud). Phase-06 EXIT GATE green. PASS.
+
+SWEEP DONE (phase-06 reconciliation; delivered as a repo-relative ZIP, committed at the sweep):
+
+DOCFIX-062 -- 6.5 kubeconfig-pull defect (ASSIGNED this sweep; grep-before-assign confirmed next-free):
+- `sudo k8s config server=` does NOT override the emitted apiserver URL on k8s-snap 1.32.13; it writes
+  the node tenant IP (10.20.0.207:6443), unroutable from the jumphost, so `kubectl get nodes` i/o-timed-out.
+- FIX (applied live, now baked into the runbook + script): pull the RAW admin config (`sudo k8s config reverted to the as-run literal 10.12.4.50:5000 (still KEYSTONE_HOSTPORT-overridable per site,
+matching the runbook's ENV(keystone-vip) convention); (2) an untested ready-skip idempotency guard in 6.4 ->
+removed (the as-run block ran the bootstrap unconditionally; retry = purge-and-re-run). DOCFIX-062 (6.5
+kubeconfig server-rewrite) is KEPT -- it was applied live and confirmed. The unused fake `openstack` test stubs
+were dropped and a no-dynamic-discovery fidelity assertion added to both affected suites. capi-stack.sh was
+already faithful (pins from dependencies.json; no discovery). All three suites re-pass.
+
+### 2026-07-01 -- Phase-07 conductor graft EXECUTED (Pattern-A rebuild); DOCFIX-063 + D-063
+Grafted the magnum-capi-helm CAPI driver onto the charm-managed conductor `magnum/0` (now at
+10.12.12.107 on metal-internal, D-052) via gated copy-paste, paste-back confirmed each step:
+- 7.0 domain-setup (D-046) PASS: domain `magnum` (d9d0a4a8...) + user `magnum_domain_admin`
+  (0885dca3...); `coe service list` = 1 row `up`, no 403.
+- 7.1 SG authorize: NO-OP -- the phase-06 capi-mgmt-sg already opens tcp/6443 to 0.0.0.0/0;
+  conductor->FIP:6443 = TCP-OK with no new rule (the hardcoded 10.12.4.76/32 rule was stale AND
+  unnecessary -> DOCFIX-063 verify-first).
+- 7.2 kubeconfig -> conductor PASS: /etc/magnum/kubeconfig 0600 magnum:magnum, sha256 match both
+  sides (26ed1091...6c11); helm auth-proof (run post-7.4) = 6 mgmt releases deployed, versions
+  match D-034 dependencies.json.
+- 7.3 served versions PASS: `kubectl api-versions` shows v1beta1 SERVED for all core CAPI groups
+  (alongside preferred v1beta2); empty `api_resources={}` correct -> D-042 premise confirmed.
+- 7.4 driver + helm PASS: helm v3.17.3 on the restricted init PATH (/usr/bin/helm, DOCFIX-035);
+  magnum-capi-helm 1.4.0; entry point k8s_capi_helm_v1.
+- 7.6 [capi_helm] drop-in PASS (after dir-create); 7.7 conductor --config-dir PASS; 7.7b
+  keystone-v3 drop-in (D-047) PASS (v3 URLs derived from live config: public 10.12.4.50:5000/v3,
+  admin 10.12.8.50:35357/v3).
+- 7.8 restart PASS: magnum-conductor + magnum-api both active, both live cmdlines carry
+  --config-dir; `magnum-driver-manage list-drivers` lists k8s_capi_helm_v1 (enabled).
+- HEALTHY poll (7.8 tail) + 7.9 regression DEFERRED to phase-08 per fresh-deploy routing (no
+  cluster / capi-k8s-v1-34 template exists yet).
+
+DOCFIX-063 (phase-07 runbook reconciliation, six fixes folded from the as-run):
+(1) Step 7.1 -> verify-first (drop the stale/unnecessary hardcoded 10.12.4.76/32 SG rule-add;
+  reachability check is the gate; a measured source rule is a fallback only);
+(2) Step 7.2 helm auth-proof MOVED after Step 7.4 (helm is installed there; absent on a fresh
+  conductor -- integrity sha256 + 7.1 TCP already gate 7.2 without it);
+(3) Step 7.3 probe -> `kubectl api-versions | grep cluster.x-k8s.io/` (api-resources shows only
+  the PREFERRED version, a FALSE "v1beta1 not served" when core groups prefer v1beta2);
+(4) Step 7.6 -> `install -d /etc/magnum/magnum.conf.d` before the tee (the .conf.d dir is absent
+  on a fresh rebuild; the deb ships magnum.conf only; also the 7.7 --config-dir target);
+(5) Step 7.5/7.6 ASCII checks -> `sudo` the grep (a non-sudo read of the root-owned /etc/magnum
+  path gave a FALSE "ASCII clean" on Permission-denied);
+(6) Step 7.4 helm egress pre-check -> hit a real asset (bare get.helm.sh/ 404s misleadingly).
+As-built refreshed: conductor magnum/0 10.12.4.76 -> 10.12.12.107; mgmt FIP example
+10.12.5.103 -> 10.12.7.222 (per-rebuild; commands stay dynamic-from-env).
+
+D-063 (PROPOSED/OPEN): the phase-06 capi-mgmt-sg opens tcp/6443 AND tcp/22 to 0.0.0.0/0
+(CAPI default; fine for a single-DC rehearsal where the FIP is the access point) -- tighten for
+Roosevelt (commercial hard-isolation). Recorded, no action on v1.
+
+NEW (phase-07 now matches the per-phase encapsulation pattern):
+- scripts/phase-07-conductor-graft.sh -- encapsulates 7.0-7.8 with DOCFIX-063 baked in
+  (verify-first 7.1; auth-proof after helm install; api-versions probe 7.3; install -d before
+  the tee; sudo ASCII; real-asset helm pre-check). [SENSITIVE] base64-pipes the FIP kubeconfig
+  to a 0600 magnum-owned file with a sha256 both-sides gate; every step is fail-loud (exit 1
+  gate / exit 2 precondition). All as-run values are env-tunable but DEFAULT to the measured
+  values (MODEL, CONDUCTOR unit, DRIVER_VERSION 1.4.0, HELM_VERSION v3.17.3, CHART_VERSION
+  0.25.1, ENVFILE ~/capi-mgmt-net.env). Health poll + regression are phase-08 (not in-script).
+- tests/phase-07-conductor-graft/ -- fakebin-stubbed suite (juju/openstack/kubectl/base64/sha256sum)
+  asserting: 7.1 verify-first is a no-op when 6443 is 0.0.0.0/0; the v3-URL derivation (unversioned
+  and /v2.0 inputs -> /v3 in both sections); install -d precedes the tee; api-versions probe; the
+  sha256-mismatch gate fails loud; k8s_capi_helm_v1 enabled gate; precondition failures. ALL PASS.
+
+## 2026-07-01 -- phase-08 keystone-policy blocker -> D-064 (reconcile D-051 to scs-0302)
+
+phase-08 8.1 cluster create failed at create_trustee_and_trust: identity:create_user 403
+(magnum_domain_admin has Admin on the magnum domain, authenticated -> authorization gap, not
+auth). Root cause: charm-rendered old-style policy.json defines the create-op helper
+admin_and_matching_user_domain_id with legacy domain_id:%(user.domain_id)s templating that does
+not resolve for create_user on Caracal (keystone populates target.user.domain_id). Same defect
+blocked the 2026-06-22 northwind tenant self-service rehearsal. D-051 as-staged would NOT have
+fixed Magnum (its create_user admin fallthrough inherited the broken helper; only its manager
+branch worked).
+
+Reconciled the staged policies/domain-manager-policy.yaml to scs-0302 (D-064). 7 rules changed:
+- create_user / create_project / create_group: broken %(.domain_id)s helper ->
+  %(target..domain_id)s helper (BUG-FIX; unblocks Magnum via documented admin-on-domain path).
+- add_user_to_group: -> admin_and_matching_target_group_domain_id (ALIGN; restore live default;
+  D-051 drift).
+- list_users / list_projects / list_groups: admin_required (conservative, broader than live) ->
+  cloud_admin or admin_and_matching_domain_id (ALIGN to live default; discharges D-051
+  [LIVE-READ PENDING]). list_roles kept at admin_required (its live default).
+Manager persona (is_domain_manager=role:manager; is_domain_managed_role=member +
+load-balancer_member) unchanged. Consequence: NO manager grant on magnum_domain_admin needed.
+Validated: oslo.policy parses all 37 rules (charm checks YAML only); YAML+ASCII+connector lint
+clean; every fallthrough tail diffed vs the live effective policy. Behavioral gate pending live
+attach: Magnum 8.1 re-run + tenant G3. New reference: runbooks/appendix-C-identity-rbac.md
+(role/policy/account assignment tables + attach + G3 procedure). Known SCS limitation carried:
+manager can enumerate domain + role names cloud-wide (list_domains/list_roles); remove override
+on any 2024.2+ upgrade.
+
 ### Next-free numbers
-Design decision: D-063. Doc fix: DOCFIX-062. (DOCFIX-061 phase-05 as-built reconciliation recorded above;
-DOCFIX-059 internal-cert SAN gate, DOCFIX-060 phase-04 md drift; D-061 teardown, D-062 mysql; DOCFIX-057
-old-teardown deprecation, DOCFIX-058 phase-03 3.3 HTTP-upstream recorded earlier.)
+Design decision: D-065. Doc fix: DOCFIX-065. (D-064 ASSIGNED above = reconcile D-051 to scs-0302
++ create-op templating fix. DOCFIX-064 RESERVED = phase-08 runbook sweep (image --public; seed
+retry/timeout + poll hard-gate + post-active property re-verify; image-absent guard; template
+capi-mgmt scope preamble + flavor floor; 8.1 D-039 role + keypair pre-checks; octavia prereq
+real-exit capture), to be written at phase-08 close. D-063 = capi-mgmt-sg 0.0.0.0/0 hardening,
+PROPOSED/OPEN. DOCFIX-063 = phase-07 reconciliation, six fixes.)
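Reviewer sketch (not part of this change): the changelog describes a sha256 both-sides gate with fail-loud exit codes (exit 1 gate / exit 2 precondition) for the kubeconfig transfer. A minimal illustration of that pattern follows, assuming both copies are readable local paths. The real script compares the jumphost file with the copy on the conductor over `juju ssh`, which is not reproduced here, and the function name `sha256_gate` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper (assumption: not the repo's implementation).
# Compares the sha256 of a source file and its copy and fails loud on mismatch.
# Return codes mirror the convention in the changelog: 1 = gate, 2 = precondition.
sha256_gate() {
  local src="$1" dst="$2"
  # Precondition: an empty or missing source must never reach the comparison.
  [ -s "$src" ] || { echo "PRECONDITION: $src missing or empty" >&2; return 2; }
  [ -r "$dst" ] || { echo "PRECONDITION: $dst not readable" >&2; return 2; }
  local a b
  a=$(sha256sum "$src" | awk '{print $1}')
  b=$(sha256sum "$dst" | awk '{print $1}')
  if [ "$a" = "$b" ]; then
    echo "SHA256-MATCH $a"
  else
    echo "SHA256-MISMATCH src=$a dst=$b" >&2
    return 1
  fi
}
```

Separating the precondition failure from the gate failure keeps "nothing was transferred" distinguishable from "the transfer was truncated or altered", which is the point of the both-sides check.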
diff --git a/policies/domain-manager-policy.yaml b/policies/domain-manager-policy.yaml
index 667275b..893b9df 100644
--- a/policies/domain-manager-policy.yaml
+++ b/policies/domain-manager-policy.yaml
@@ -81,17 +81,17 @@
 
 # --- Users (manager branch + verbatim live default) ---
 # [PENDING-LIVE-READ] list_users default not explicit in dump -> conservative admin_required
-"identity:list_users": "(rule:is_domain_manager and token.domain.id:%(target.domain_id)s) or rule:admin_required"
+"identity:list_users": "(rule:is_domain_manager and token.domain.id:%(target.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_domain_id"
 "identity:get_user": "(rule:is_domain_manager and token.domain.id:%(target.user.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_user_domain_id or rule:owner"
-"identity:create_user": "(rule:is_domain_manager and token.domain.id:%(target.user.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_user_domain_id"
+"identity:create_user": "(rule:is_domain_manager and token.domain.id:%(target.user.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_user_domain_id"
 "identity:update_user": "(rule:is_domain_manager and token.domain.id:%(target.user.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_user_domain_id"
 "identity:delete_user": "(rule:is_domain_manager and token.domain.id:%(target.user.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_user_domain_id"
 
 # --- Projects (manager branch + verbatim live default) ---
 # [PENDING-LIVE-READ] list_projects default not explicit in dump -> conservative admin_required
-"identity:list_projects": "(rule:is_domain_manager and token.domain.id:%(target.domain_id)s) or rule:admin_required"
+"identity:list_projects": "(rule:is_domain_manager and token.domain.id:%(target.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_domain_id"
 "identity:get_project": "(rule:is_domain_manager and token.domain.id:%(target.project.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_project_domain_id or project_id:%(target.project.id)s"
-"identity:create_project": "(rule:is_domain_manager and token.domain.id:%(target.project.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_project_domain_id"
+"identity:create_project": "(rule:is_domain_manager and token.domain.id:%(target.project.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_project_domain_id"
 "identity:update_project": "(rule:is_domain_manager and token.domain.id:%(target.project.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_project_domain_id"
 "identity:delete_project": "(rule:is_domain_manager and token.domain.id:%(target.project.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_project_domain_id"
 "identity:list_user_projects": "(rule:is_domain_manager and token.domain.id:%(target.user.domain_id)s) or rule:owner or rule:admin_and_matching_domain_id"
@@ -105,13 +105,13 @@
 
 # --- Groups (manager branch + verbatim live default) ---
 # [PENDING-LIVE-READ] list_groups default not explicit in dump -> conservative admin_required
-"identity:list_groups": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s) or rule:admin_required"
+"identity:list_groups": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_domain_id"
 "identity:get_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_group_domain_id"
-"identity:create_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_group_domain_id"
+"identity:create_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_group_domain_id"
 "identity:update_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_group_domain_id"
 "identity:delete_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_group_domain_id"
 "identity:list_groups_for_user": "(rule:is_domain_manager and token.domain.id:%(target.user.domain_id)s) or rule:owner or rule:admin_and_matching_target_user_domain_id"
 "identity:list_users_in_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_group_domain_id"
 "identity:remove_user_from_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s and token.domain.id:%(target.user.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_group_domain_id"
 "identity:check_user_in_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s and token.domain.id:%(target.user.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_group_domain_id"
-"identity:add_user_to_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s and token.domain.id:%(target.user.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_group_domain_id"
+"identity:add_user_to_group": "(rule:is_domain_manager and token.domain.id:%(target.group.domain_id)s and token.domain.id:%(target.user.domain_id)s) or rule:cloud_admin or rule:admin_and_matching_target_group_domain_id"
diff --git a/runbooks/appendix-A-troubleshooting.md b/runbooks/appendix-A-troubleshooting.md
index 9d43136..9e8fb9e 100644
--- a/runbooks/appendix-A-troubleshooting.md
+++ b/runbooks/appendix-A-troubleshooting.md
@@ -449,6 +449,22 @@
 (FINDING-3), so it is unusable for kube images; (3) for ubuntu cloud-images it works on the
 hardened bundle (the 2026-06-08 403 was transient/pre-hardening). Use only as an expedient.
 
+## Mgmt-cluster bootstrap (phase-06)
+
+### DOCFIX-062 -- `k8s config server=` is ignored on k8s-snap 1.32.13; rewrite the kubeconfig (phase-06)
+- Symptom: `sudo k8s config server=https://:6443` still emits a kubeconfig whose
+  `server:` is the node's TENANT IP (10.20.0.x). From the jumphost (off the tenant plane)
+  every `kubectl` call then i/o-times-out -- Step 6.5 GATE 2 never runs.
+- Cause: on this k8s-snap rev the `server=` key-value arg to `k8s config` is not honored;
+  the emitted apiserver URL is always the node's own address.
+- Fix: pull the RAW config (`sudo k8s config :6443`. The FIP is in the
+  cert extra-sans written at bootstrap (6.4), so TLS validates against it. Gate the rewrite:
+  `grep -E '^\s*server:' ~/capi-mgmt.kubeconfig` must show the FIP, not the tenant IP.
+  Encapsulated in `scripts/phase-06-kubeconfig-gate.sh` (verifies the rewrite took before
+  proceeding to the egress gate).
+
 ================================================================================
 ## Notes
 ================================================================================
diff --git a/runbooks/phase-06-incloud-mgmt-cluster.md b/runbooks/phase-06-incloud-mgmt-cluster.md
index 9563ad7..21b507d 100644
--- a/runbooks/phase-06-incloud-mgmt-cluster.md
+++ b/runbooks/phase-06-incloud-mgmt-cluster.md
@@ -8,7 +8,15 @@
 Decisions: D-035 (in-cloud single-homed tenant VM; retires D-033/D-017), D-034 (CAPI versions
 sourced from the capi-helm-charts tag's dependencies.json, never hardcoded), D-031 (Magnum +
 magnum-capi-helm + capi-helm-charts engine).
-Troubleshooting: appendix-A entries DOCFIX-021, DOCFIX-024, DOCFIX-025a, D-035.
+Troubleshooting: appendix-A entries DOCFIX-021, DOCFIX-024, DOCFIX-025a, DOCFIX-062, D-035.
+
+Canonical scripts (D-056; the paste blocks below are the reference source-of-truth,
+the scripts are the rehearsed executors -- prefer the script on rebuild):
+- Steps 6.3 + 6.4 -> `scripts/phase-06-k8s-bootstrap.sh` (GATE 1 egress + k8s bootstrap)
+- Step 6.5 -> `scripts/phase-06-kubeconfig-gate.sh` (kubeconfig pull+rewrite + GATE 2; DOCFIX-062)
+- Step 6.6 (a-f) -> `scripts/phase-06-capi-stack.sh` (CAPI provider stack, ORC-before-init)
+Each discovers the Keystone endpoint dynamically, is idempotent where safe, and
+sources `~/capi-mgmt-net.env`. Steps 6.0-BOOT..6.2 already have their own scripts.
 
 ---
 
@@ -345,12 +353,22 @@
 
 **RUN -- jumphost -> mgmt VM**
 ```bash
-# RUN: jumphost (ssh to the mgmt VM; the kubeconfig lands on the jumphost). server = the FIP, not tenant IP
+# RUN: jumphost. DOCFIX-062: k8s-snap 1.32.13 IGNORES `k8s config server=` and
+# still writes the node's TENANT IP (10.20.0.x, unroutable from the jumphost) ->
+# kubectl i/o-times-out. Pull the RAW config, then rewrite the server to the FIP
+# with `kubectl config set-cluster` (a local file op). The FIP is in the cert
+# extra-sans written by 6.4, so TLS holds against it.
 source ~/capi-mgmt-net.env # MGMT_FIP
+umask 077
 ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \
   -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@"$MGMT_FIP" \
-  "sudo k8s config server=https://$MGMT_FIP:6443 ~/capi-mgmt.kubeconfig
+  'sudo k8s config ~/capi-mgmt.kubeconfig
+chmod 600 ~/capi-mgmt.kubeconfig
 # [SENSITIVE] ~/capi-mgmt.kubeconfig contains a cluster-admin credential.
+export KUBECONFIG="$HOME/capi-mgmt.kubeconfig"
+CLUSTER=$(kubectl config view -o jsonpath='{.clusters[0].name}')
+kubectl config set-cluster "$CLUSTER" --server="https://$MGMT_FIP:6443"
+grep -E '^[[:space:]]*server:' ~/capi-mgmt.kubeconfig # expect https://:6443, NOT the tenant IP
 wc -l ~/capi-mgmt.kubeconfig ; head -1 ~/capi-mgmt.kubeconfig # expect >0 lines, "apiVersion: v1"
 ```
diff --git a/runbooks/phase-07-conductor-graft.md b/runbooks/phase-07-conductor-graft.md
index 124c452..b0ef98e 100644
--- a/runbooks/phase-07-conductor-graft.md
+++ b/runbooks/phase-07-conductor-graft.md
@@ -15,6 +15,18 @@
 D-047 (keystone v3 drop-in for magnum-api -- Step 7.7b).
 Troubleshooting: appendix-A DOCFIX-021, D-037, D-042, and lessons L-P6-1..4.
 
+DOCFIX-063 (2026-07-01 as-run reconciliation, fresh Pattern-A rebuild): Step 7.1 rewritten
+verify-first (the phase-06 capi-mgmt-sg already opens 6443, so the hardcoded per-conductor
+rule-add is dropped to a measured fallback); Step 7.2 helm auth-proof moved AFTER 7.4 (helm is
+installed there, absent on a fresh conductor); Step 7.3 probe switched to `kubectl api-versions`
+(api-resources shows only the PREFERRED version, giving a false "v1beta1 not served" when the
+core groups prefer v1beta2); Step 7.6 now creates /etc/magnum/magnum.conf.d before the tee
+(absent on a fresh deploy); the conf.d ASCII checks now `sudo` the grep (a non-sudo read of the
+root-owned path gave a false "ASCII clean"); the Step 7.4 helm egress pre-check points at a real
+asset. As-built refreshed: conductor magnum/0 10.12.4.76 -> 10.12.12.107 (metal-internal, D-052).
+D-063 (open): the phase-06 capi-mgmt-sg opens 6443+22 to 0.0.0.0/0 -- fine for a single-DC
+rehearsal, tighten for Roosevelt.
+
 ---
 
 ## Prerequisites (must be true entering phase-07)
@@ -31,9 +43,9 @@
 - `admin-openrc` on the jumphost; `juju` (model openstack); `jq`.
 ## Constants and env-literals (TAG: confirm per site on rebuild)
-- `ENV(conductor-unit)` magnum/0 (LXD 1/lxd/2 on openstack1; addr 10.12.4.76 -- confirm per site)
-- `ENV(conductor-src)` 10.12.4.76/32 (the conductor's provider IP; SG source -- confirm per site)
-- `ENV(mgmt-fip)` per-rebuild (mgmt apiserver / kubeconfig server; source ~/capi-mgmt-net.env from phase-06 -- this rebuild 10.12.5.103; the old 10.12.7.40 is dead -- DOCFIX-038)
+- `ENV(conductor-unit)` magnum/0 (LXD 1/lxd/2 on openstack1; addr 10.12.12.107 on metal-internal per D-052 -- confirm per site; was 10.12.4.76 pre-D-052)
+- `ENV(conductor-src)` n/a (DOCFIX-063: verify-first 7.1 no longer adds a per-conductor SG rule -- the phase-06 capi-mgmt-sg already opens 6443; a source rule is a FALLBACK only, measured not hardcoded)
+- `ENV(mgmt-fip)` per-rebuild (mgmt apiserver / kubeconfig server; source ~/capi-mgmt-net.env from phase-06 -- this rebuild 10.12.7.222; per-rebuild, DOCFIX-038)
 - `ENV(mgmt-sg)` capi-mgmt-sg (in the capi-mgmt project)
 - `ENV(project)` capi-mgmt (resolve by name; this rebuild id d5bc125c7c1841d389b76cd0a7b0a915, domain capi)
 - `ENV(magnum-ns)` magnum- (driver namespace per project; this rebuild magnum-d5bc125c7c1841d389b76cd0a7b0a915)
@@ -98,39 +110,48 @@
 Step A) or the magnum.conf `[trust]` names differ from the created domain/user.
 (Benign "No domain/user exists" idempotency lines may appear in the action output.)
 
-## Step 7.1 -- Authorize the conductor source on the mgmt-cluster SG
-(scoped to the capi-mgmt project). Idempotent.
+## Step 7.1 -- Authorize the conductor source on the mgmt-cluster SG (VERIFY-FIRST; DOCFIX-063)
+(scoped to the capi-mgmt project). DOCFIX-063: do NOT hardcode a per-conductor source rule.
+The phase-06 capi-mgmt-sg already opens `tcp/6443` to `0.0.0.0/0` (the FIP is the access
+point), so the conductor reaches the apiserver with NO new rule. Inspect the SG + prove
+reachability FIRST; add a rule ONLY if 6443 is not already permitted, and then with the source
+the mgmt VM actually SEES (measured, never the pre-D-052 provider literal 10.12.4.76).
 
-**RUN -- jumphost**
+**CHECK (read-only) -- jumphost**
 ```bash
 ( { set -u
-  # scope openstack CLI to the capi-mgmt project (id form -- robust to name/domain)
   source ~/admin-openrc
-  # resolve the capi-mgmt project id while still admin-scoped, THEN narrow scope to it (id form)
   CAPI_PID=$(openstack project show capi-mgmt --domain capi -f value -c id) # ENV(project); resolve, never hardcode
   unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID
   export OS_PROJECT_ID="$CAPI_PID"
   SG=$(openstack security group show capi-mgmt-sg -f value -c id) # ENV(mgmt-sg)
   echo "SG=$SG"
-  echo "=== add ingress tcp/6443 from the conductor 10.12.4.76/32 (if absent) ==="
-  openstack security group rule list "$SG" -f value -c "IP Range" -c "Port Range" \
-    | grep -q '10.12.4.76/32 6443:6443' \
-    || openstack security group rule create --proto tcp --dst-port 6443 \
-      --remote-ip 10.12.4.76/32 "$SG"
-  openstack security group rule list "$SG" -f value -c Protocol -c "Port Range" -c "IP Range"
+  echo "=== current ingress rules (JSON -- avoids the -c column-swap trap) ==="
+  openstack security group rule list "$SG" -f json
 } )
 ```
 
-Then prove conductor -> mgmt apiserver reachability:
+Then prove conductor -> mgmt apiserver reachability (FIP from phase-06's env, never hardcoded):
 
 **CHECK (read-only) -- jumphost -> magnum/0**
 ```bash
-# RUN: jumphost -> magnum/0 (FIP from phase-06's ~/capi-mgmt-net.env -- never hardcode; DOCFIX-038)
-source ~/capi-mgmt-net.env # MGMT_FIP
+source ~/capi-mgmt-net.env # MGMT_FIP (DOCFIX-038: per-rebuild)
 juju ssh -m openstack magnum/0 \
   "timeout 6 bash -c 'exec 3<>/dev/tcp/$MGMT_FIP/6443' && echo TCP-OK || echo TCP-FAIL" /32 is the source the mgmt VM sees; do NOT guess it.
+( { source ~/admin-openrc + CAPI_PID=$(openstack project show capi-mgmt --domain capi -f value -c id) + unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID; export OS_PROJECT_ID="$CAPI_PID" + SG=$(openstack security group show capi-mgmt-sg -f value -c id) + openstack security group rule create --proto tcp --dst-port 6443 --remote-ip /32 "$SG" +} ) +``` ## Step 7.2 -- Place the mgmt kubeconfig on the conductor [SENSITIVE; not batched] The source `~/capi-mgmt.kubeconfig` already has its @@ -159,12 +180,15 @@ ``` **GATE:** the two sha256 hashes are identical (an empty or truncated transfer fails here, not three steps later as a confusing conductor auth error). -End-to-end proof (the conductor user authenticates to the mgmt cluster via the FIP): +End-to-end proof (the conductor user authenticates to the mgmt cluster via the FIP) -- +DOCFIX-063: helm is installed in Step 7.4, so on a fresh conductor it is ABSENT here. RUN THIS +CHECK AFTER STEP 7.4 (integrity above + the 7.1 TCP reachability already gate 7.2 without it): -**CHECK (read-only) -- jumphost -> magnum/0** +**CHECK (read-only; run AFTER Step 7.4) -- jumphost -> magnum/0** ```bash juju ssh -m openstack magnum/0 \ - 'sudo -u magnum env HOME=/tmp helm --kubeconfig /etc/magnum/kubeconfig list -A' /dev/null 2>&1 || { echo "helm MISSING -- run this after Step 7.4"; exit 0; }; \ + sudo -u magnum env HOME=/tmp helm --kubeconfig /etc/magnum/kubeconfig list -A' /dev/null | awk 'NR==1 || /v1beta1/' - done + kubectl api-versions | grep -E 'cluster\.x-k8s\.io/' | sort } ) -# Expect v1beta1 for: cluster.x-k8s.io (Cluster/MachineDeployment/Machine), -# controlplane.cluster.x-k8s.io (KubeadmControlPlane), infrastructure.cluster.x-k8s.io -# (OpenStackCluster -- verified anchor). If a CORE kind serves ONLY v1beta2, override -# just that kind via api_resources in Step 7.6; otherwise the defaults work as-is. 
+# Expect v1beta1 SERVED for the core groups (alongside v1beta2 as preferred): +# cluster.x-k8s.io/v1beta1, controlplane.cluster.x-k8s.io/v1beta1, +# bootstrap.cluster.x-k8s.io/v1beta1, infrastructure.cluster.x-k8s.io/v1beta1, +# addons.cluster.x-k8s.io/v1beta1. If a CORE group serves ONLY v1beta2 (v1beta1 ABSENT), +# override just that kind via api_resources in Step 7.6; otherwise the empty default works. ``` ## Step 7.4 -- Install the driver (1.4.0) + helm in the conductor container @@ -208,10 +234,11 @@ **RUN -- jumphost -> magnum/0** ```bash -# egress pre-check +# egress pre-check (DOCFIX-063: hit a REAL asset -- bare https://get.helm.sh/ 404s (no root +# index) and looks like a failure; the versioned sha256sum URL is a true 200 reachability probe) juju ssh -m openstack magnum/0 \ 'curl -s -o /dev/null -w "pypi:%{http_code}\n" https://pypi.org/simple/ ; \ - curl -s -o /dev/null -w "helm:%{http_code}\n" https://get.helm.sh/' magnum/0** ```bash +# DOCFIX-063: /etc/magnum/magnum.conf.d/ does NOT exist on a fresh rebuild (the deb ships +# magnum.conf, not the .conf.d dir; tee cannot create a missing parent). Create it first +# (root:root 0755, magnum-traversable for --config-dir; also the Step 7.7 config-dir target). +juju ssh -m openstack magnum/0 'sudo install -d -o root -g root -m 0755 /etc/magnum/magnum.conf.d' /dev/null <<'CONF' [capi_helm] kubeconfig_file = /etc/magnum/kubeconfig @@ -292,8 +324,10 @@ **CHECK (read-only) -- jumphost -> magnum/0** ```bash +# DOCFIX-063: sudo the grep -- /etc/magnum is root-owned (0750 root:magnum); a non-sudo read +# gets "Permission denied" and the `|| echo` prints a FALSE "ASCII clean". juju ssh -m openstack magnum/0 \ - 'LC_ALL=C grep -nP "[^\x00-\x7F]" /etc/magnum/magnum.conf.d/00-capi-helm.conf && echo NON-ASCII || echo "ASCII clean"' read the version-less v1beta2 ref -> health UNHEALTHY (D-042). - PHASE-07 BASELINE supersedes this with the RELEASED magnum-capi-helm==1.4.0 (api_resources; default v1beta1). 
-- kubeconfig: /etc/magnum/kubeconfig, -rw------- magnum, ~5657 bytes, server = the mgmt FIP:6443 (per-rebuild; this rebuild 10.12.5.103, old 10.12.7.40 dead). +## As-built reference (2026-07-01 Pattern-A rebuild graft -- audit trail; supersedes the 2026-06-08/09 pre-D-052 run) +- magnum/0: LXD 1/lxd/2 on openstack1, addr 10.12.12.107 (metal-internal per D-052; was 10.12.4.76 pre-D-052), + charm magnum 2024.1/stable rev 70, DEB magnum 18.0.1, python3.10, container ubuntu 22.04; conductor user `magnum`. +- Driver: RELEASED magnum-capi-helm==1.4.0 (pip --no-deps; api_resources={} explicit -> code-default v1beta1, + served by CAPI v1.13.2 / CAPO v0.14.4). This is the v1 baseline; the pre-D-052 run's interim 1.3.0 + (version-less v1beta2 ref -> cosmetic UNHEALTHY, D-042) is superseded. +- kubeconfig: /etc/magnum/kubeconfig, -rw------- magnum, 5641 bytes this rebuild (sha256 26ed1091...6c11), + server = the mgmt FIP:6443 (per-rebuild; this rebuild 10.12.7.222 -- DOCFIX-038). - conf.d drop-in /etc/magnum/magnum.conf.d/00-capi-helm.conf: kubeconfig_file, helm_chart_repo (azimuth), helm_chart_name openstack-cluster, default_helm_chart_version 0.25.1 (api_resources left default -- v1beta1 served by CAPI v1.13.2 / CAPO v0.14.4). diff --git a/scripts/phase-06-capi-stack.sh b/scripts/phase-06-capi-stack.sh new file mode 100644 index 0000000..e5f8e6e --- /dev/null +++ b/scripts/phase-06-capi-stack.sh @@ -0,0 +1,169 @@ +#!/usr/bin/env bash +# scripts/phase-06-capi-stack.sh +# +# Phase-06 Step 6.6 (a-f) encapsulated (D-056). Runs on the jumphost; installs the +# CAPI provider stack ON the mgmt VM (all helm/clusterctl/kubectl run VM-side +# against the local apiserver -- matched 1.32.13 kubectl, no jumphost skew). +# +# HARDENED ORDER (D-034 install-ordering): pins -> cert-manager -> ORC -> +# clusterctl init -> CAAPH -> janitor -> verify. 
ORC precedes `clusterctl init` +# because CAPO's openstackserver controller hard-depends on ORC's +# Image.openstack.k-orc.cloud CRD; installing CAPO first crash-loops until ORC lands. +# +# Versions are READ from the chart tag's dependencies.json at runtime (D-034; +# NEVER hardcoded). The as-built cross-check (CAPI v1.13.2 / CAPO v0.14.4 / +# CERT v1.20.2 / ORC v2.5.0 / CAAPH 0.12.0 / JANITOR 0.11.0 / HELM v3.17.3) is +# informational only. KUBECTL_VERSION tracks the cluster's k8s (the CHANNEL in +# phase-06-k8s-bootstrap.sh); keep them in step. +# +# Each sub-step is gated on the remote block's own exit status (its `--wait` / +# `wait` / `get crd` fail the remote, ssh propagates non-zero, we stop). DOCFIX-021: +# not needed here (no interactive `sudo`; blocks are non-interactive helm/kubectl). +# +# Tunables via env: ENVFILE SSH_KEY CHART_TAG KUBECTL_VERSION +# Requires: jumphost; ssh + the VM key. (jq/curl are installed VM-side by 6.6a.) +# Usage: bash scripts/phase-06-capi-stack.sh +# Exit: 0 stack up + verified | 1 a sub-step gate failed | 2 precondition +# ASCII + LF. + +set -euo pipefail +shopt -s inherit_errexit 2>/dev/null || true + +ENVFILE="${ENVFILE:-$HOME/capi-mgmt-net.env}" +SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519}" +CHART_TAG="${CHART_TAG:-0.25.1}" +KUBECTL_VERSION="${KUBECTL_VERSION:-v1.32.13}" + +command -v ssh >/dev/null 2>&1 || { echo "FAIL: ssh not found" >&2; exit 2; } +[ -f "$ENVFILE" ] || { echo "FAIL: $ENVFILE not found (run phase-06-mgmt-vm.sh first)" >&2; exit 2; } +# shellcheck disable=SC1090 +. 
"$ENVFILE" +[ -n "${MGMT_FIP:-}" ] || { echo "FAIL: MGMT_FIP unset in $ENVFILE" >&2; exit 2; } +[ -f "$SSH_KEY" ] || { echo "FAIL: ssh key $SSH_KEY not found" >&2; exit 2; } + +MGMT_VM="$MGMT_FIP" +SSH_OPTS=(-i "$SSH_KEY" -o BatchMode=yes -o StrictHostKeyChecking=no \ + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10) + +# run_step LABEL -- reads the remote block from stdin, tees indented output, +# gates on the REMOTE block's exit status (PIPESTATUS[0], captured via `||` so +# set -e + pipefail cannot pre-empt the gate); args after the label go to the remote `bash -s`. +run_step() { + local label="$1"; shift + echo "=== $label ===" + local rc=0 + ssh "${SSH_OPTS[@]}" ubuntu@"$MGMT_VM" bash -s "$@" 2>&1 | sed 's/^/ /' || rc=${PIPESTATUS[0]} + [ "$rc" -eq 0 ] || { echo "GATE FAIL: $label (remote rc=$rc)" >&2; exit 1; } + echo "[OK] $label" +} + +# --- 6.6a: tooling + pins (read dependencies.json @ CHART_TAG) --- +run_step "6.6a tooling + pins (chart $CHART_TAG, kubectl $KUBECTL_VERSION)" "$CHART_TAG" "$KUBECTL_VERSION" <<'REOF' +set -euo pipefail +TAG="$1"; KVER="$2" +sudo apt-get update -qq "$HOME/.kube/config"; chmod 600 "$HOME/.kube/config" + +# egress pre-check (informational; a 404 at a host root still proves reachability) +for h in https://raw.githubusercontent.com https://get.helm.sh https://github.com https://dl.k8s.io; do + printf '%s -> ' "$h"; curl -s -o /dev/null -w '%{http_code}\n' "$h" || echo FAIL +done + +# version constellation from the chart tag's dependencies.json (D-034) +curl -fsSL "https://raw.githubusercontent.com/azimuth-cloud/capi-helm-charts/${TAG}/dependencies.json" -o "$HOME/deps.json" +CAPI=$(jq -r '."cluster-api"' "$HOME/deps.json") +CAPO=$(jq -r '."cluster-api-provider-openstack"' "$HOME/deps.json") +CERT=$(jq -r '."cert-manager"' "$HOME/deps.json") +ORC=$(jq -r '."openstack-resource-controller"' "$HOME/deps.json") +CAAPH=$(jq -r '."addon-provider"' "$HOME/deps.json") +JANITOR=$(jq -r '."cluster-api-janitor-openstack"' "$HOME/deps.json") +HELM=$(jq -r '.helm' "$HOME/deps.json") +{
echo "CAPI=$CAPI"; echo "CAPO=$CAPO"; echo "CERT=$CERT"; echo "ORC=$ORC"; \ + echo "CAAPH=$CAAPH"; echo "JANITOR=$JANITOR"; echo "HELM=$HELM"; } > "$HOME/capi-pins.env" +echo "== pins (cross-check: CAPI v1.13.2 CAPO v0.14.4 CERT v1.20.2 ORC v2.5.0 CAAPH 0.12.0 JANITOR 0.11.0 HELM v3.17.3) ==" +cat "$HOME/capi-pins.env" +# gate: every pin resolved (non-empty, non-null) -- a moved/renamed key must fail loud +for k in CAPI CAPO CERT ORC CAAPH JANITOR HELM; do v="${!k}"; [ -n "$v" ] && [ "$v" != null ] || { echo "PIN-FAIL: $k=$v" >&2; exit 1; }; done + +curl -fsSL "https://get.helm.sh/helm-${HELM}-linux-amd64.tar.gz" -o /tmp/helm.tgz +sudo tar -xzf /tmp/helm.tgz -C /usr/local/bin --strip-components=1 linux-amd64/helm /dev/null | head -1 +REOF + +# --- 6.6b: cert-manager (DOCFIX-025a: crds.enabled=true) --- +run_step "6.6b cert-manager" <<'REOF' +set -euo pipefail +source "$HOME/capi-pins.env" +helm repo add jetstack https://charts.jetstack.io +helm repo update +helm upgrade --install cert-manager jetstack/cert-manager \ + --namespace cert-manager --create-namespace \ + --version "$CERT" --set crds.enabled=true --wait --timeout 5m +kubectl -n cert-manager wait --for=condition=Available deploy --all --timeout=180s +kubectl -n cert-manager get pods +REOF + +# --- 6.6c: ORC (BEFORE clusterctl init) --- +run_step "6.6c ORC (before clusterctl init)" <<'REOF' +set -euo pipefail +source "$HOME/capi-pins.env" +kubectl apply --server-side -f \ + "https://github.com/k-orc/openstack-resource-controller/releases/download/${ORC}/install.yaml" +kubectl -n orc-system wait --for=condition=Available deploy --all --timeout=180s +kubectl get crd images.openstack.k-orc.cloud +REOF + +# --- 6.6d: clusterctl init --- +run_step "6.6d clusterctl init" <<'REOF' +set -euo pipefail +source "$HOME/capi-pins.env" +clusterctl init \ + --core "cluster-api:${CAPI}" \ + --bootstrap "kubeadm:${CAPI}" \ + --control-plane "kubeadm:${CAPI}" \ + --infrastructure "openstack:${CAPO}" +for ns in capi-system 
capi-kubeadm-bootstrap-system capi-kubeadm-control-plane-system capo-system; do + echo "== $ns =="; kubectl -n "$ns" wait --for=condition=Available deploy --all --timeout=240s +done +REOF + +# --- 6.6e: CAAPH + janitor --- +run_step "6.6e CAAPH + janitor" <<'REOF' +set -euo pipefail +source "$HOME/capi-pins.env" +helm repo add capi-addon https://azimuth-cloud.github.io/cluster-api-addon-provider +helm repo add capi-janitor https://azimuth-cloud.github.io/cluster-api-janitor-openstack +helm repo update +helm upgrade --install cluster-api-addon-provider capi-addon/cluster-api-addon-provider \ + --namespace capi-addon-system --create-namespace --version "$CAAPH" --wait --timeout 5m +helm upgrade --install cluster-api-janitor-openstack capi-janitor/cluster-api-janitor-openstack \ + --namespace capi-janitor-system --create-namespace --version "$JANITOR" --wait --timeout 5m +kubectl -n capi-addon-system get pods +kubectl -n capi-janitor-system get pods +REOF + +# --- 6.6f: verify the stack (EXIT GATE) --- +run_step "6.6f verify stack (all controllers Running + key CRDs)" <<'REOF' +set -euo pipefail +clusterctl version +echo "== controllers ==" +kubectl get pods -A | grep -E 'capi-|capo-|cert-manager|orc-system|janitor|addon' || true +notready=$(kubectl get pods -A --no-headers 2>/dev/null \ + | grep -E 'capi-|capo-|cert-manager|orc-system|janitor|addon' \ + | awk '$4!="Running"{print $1"/"$2" "$4}') +if [ -n "$notready" ]; then echo "NOT-RUNNING:"; echo "$notready"; exit 1; fi +echo "== key CRDs ==" +kubectl get crd clusters.cluster.x-k8s.io \ + openstackclusters.infrastructure.cluster.x-k8s.io \ + kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \ + images.openstack.k-orc.cloud +echo "STACK: OK" +REOF + +echo "Summary: CAPI provider stack installed + verified on the mgmt VM (chart $CHART_TAG pins; ORC-before-init order). Phase-06 complete." 
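Reviewer sketch (not part of the patch): the `run_step` gate in scripts/phase-06-capi-stack.sh can be exercised without a VM. The block below is a minimal sketch under stated assumptions: `run_step_local` is a hypothetical name, a local `bash -s` stands in for the ssh hop, and bash 4.4 or newer is assumed. It shows why the block's status is read from `PIPESTATUS[0]` (sed's own status is 0) and why the read is attached with `||`: with errexit and pipefail both set, a bare failing pipeline would end the script before any later status check could run.

```shell
#!/usr/bin/env bash
# Sketch only. run_step_local is a hypothetical stand-in for run_step: a local
# `bash -s` replaces the ssh hop so the gate can be exercised without a VM.
set -eo pipefail

run_step_local() {
  local label="$1"; shift
  local rc=0
  echo "=== $label ==="
  # The block arrives on stdin; sed indents its output, as in the script.
  # `|| rc=...` keeps errexit from firing on the failed pipeline and records
  # the status of `bash -s` (PIPESTATUS[0]) rather than sed's.
  bash -s "$@" 2>&1 | sed 's/^/    /' || rc=${PIPESTATUS[0]}
  if [ "$rc" -ne 0 ]; then
    echo "GATE FAIL: $label (rc=$rc)" >&2
    return 1
  fi
  echo "[OK] $label"
}

# A passing block, then a failing one; the `if` keeps the demo itself alive.
run_step_local "demo pass" hello <<'REOF'
echo "arg=$1"
REOF

if ! run_step_local "demo fail" <<'REOF'
echo "about to fail"; exit 3
REOF
then
  echo "second step was gated, as intended"
fi
```

The sketch returns 1 from the function instead of calling `exit 1`, so a caller can decide whether a failed step is fatal; the script itself exits on the first failed gate.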
diff --git a/scripts/phase-06-k8s-bootstrap.sh b/scripts/phase-06-k8s-bootstrap.sh new file mode 100644 index 0000000..2c0af71 --- /dev/null +++ b/scripts/phase-06-k8s-bootstrap.sh @@ -0,0 +1,119 @@ +#!/usr/bin/env bash +# scripts/phase-06-k8s-bootstrap.sh +# +# Phase-06 Steps 6.3 + 6.4 encapsulated (D-056). Runs on the jumphost; drives the +# in-cloud CAPI management VM over ssh. +# 6.3 GATE 1 -- prove the single-homed VM's egress: it can reach the OpenStack +# public API (the D-035 premise) and the internet (image pulls). The API +# target is the Keystone PUBLIC endpoint -- the as-run literal 10.12.4.50:5000 +# (6.3 tagged ENV(keystone-vip)); env-overridable per site via KEYSTONE_HOSTPORT. +# 6.4 -- install k8s-snap on the VM and bootstrap it. The bootstrap config MUST +# carry a cluster-config block (DOCFIX-024 -- without it network+dns are +# disabled and the node never goes Ready). extra-sans MUST be the real +# FIP + tenant IP (from ~/capi-mgmt-net.env, per-rebuild, DOCFIX-038). +# +# One-shot -- matches the as-run 6.4 block verbatim (NO idempotency guard): install + +# bootstrap run unconditionally. Re-run is not safe; purge on the VM first (retry hint +# below), exactly the runbook's documented retry path. +# DOCFIX-021: every remote `sudo` gets /dev/null || true + +ENVFILE="${ENVFILE:-$HOME/capi-mgmt-net.env}" +SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519}" +CHANNEL="${CHANNEL:-1.32-classic/stable}" +POD_CIDR="${POD_CIDR:-10.1.0.0/16}" +SVC_CIDR="${SVC_CIDR:-10.152.183.0/24}" +CLUSTER_NAME="${CLUSTER_NAME:-capi-mgmt-v2}" +INET_PROBE="${INET_PROBE:-1.1.1.1:443}" +PROBE_TIMEOUT="${PROBE_TIMEOUT:-6}" +BOOT_TIMEOUT="${BOOT_TIMEOUT:-10m}" +READY_TIMEOUT="${READY_TIMEOUT:-5m}" + +command -v ssh >/dev/null 2>&1 || { echo "FAIL: ssh not found" >&2; exit 2; } +[ -f "$ENVFILE" ] || { echo "FAIL: $ENVFILE not found (run phase-06-mgmt-vm.sh first)" >&2; exit 2; } +# shellcheck disable=SC1090 +. 
"$ENVFILE" +[ -n "${MGMT_FIP:-}" ] || { echo "FAIL: MGMT_FIP unset in $ENVFILE" >&2; exit 2; } +[ -n "${MGMT_TENANT_IP:-}" ] || { echo "FAIL: MGMT_TENANT_IP unset in $ENVFILE" >&2; exit 2; } +[ -f "$SSH_KEY" ] || { echo "FAIL: ssh key $SSH_KEY not found" >&2; exit 2; } + +MGMT_VM="$MGMT_FIP" +SSH_OPTS=(-i "$SSH_KEY" -o BatchMode=yes -o StrictHostKeyChecking=no \ + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10) + +# --- Keystone public host:port -- the as-run literal (6.3 tagged ENV(keystone-vip)); +# env-overridable per site. NOT discovered -- this is the value that ran verbatim. --- +KEYSTONE_HOSTPORT="${KEYSTONE_HOSTPORT:-10.12.4.50:5000}" +echo "[OK] Keystone public endpoint: $KEYSTONE_HOSTPORT" +KHOST="${KEYSTONE_HOSTPORT%%:*}"; KPORT="${KEYSTONE_HOSTPORT##*:}" +IHOST="${INET_PROBE%%:*}"; IPORT="${INET_PROBE##*:}" +if [ -z "$KHOST" ] || [ -z "$KPORT" ] || [ "$KHOST" = "$KPORT" ]; then + echo "FAIL: bad KEYSTONE_HOSTPORT '$KEYSTONE_HOSTPORT' (want host:port)" >&2; exit 2 +fi + +# --- 6.3 GATE 1: VM egress (API VIP + internet) --- +echo "=== 6.3 GATE 1: VM -> Keystone $KHOST:$KPORT + internet $IHOST:$IPORT ===" +g1=$(ssh "${SSH_OPTS[@]}" ubuntu@"$MGMT_VM" \ + bash -s "$KHOST" "$KPORT" "$IHOST" "$IPORT" "$PROBE_TIMEOUT" <<'REOF' 2>&1 || true +set -u +khost="$1"; kport="$2"; ihost="$3"; iport="$4"; t="$5"; ok=1 +if timeout "$t" bash -c "exec 3<>/dev/tcp/$khost/$kport" 2>/dev/null; then echo "VIP-OK $khost:$kport"; else echo "VIP-FAIL $khost:$kport"; ok=0; fi +if timeout "$t" bash -c "exec 3<>/dev/tcp/$ihost/$iport" 2>/dev/null; then echo "NET-OK $ihost:$iport"; else echo "NET-FAIL $ihost:$iport"; ok=0; fi +[ "$ok" = 1 ] && echo "GATE1: PASS" || echo "GATE1: FAIL" +REOF +) +printf '%s\n' "$g1" | sed 's/^/ /' +printf '%s\n' "$g1" | grep -q 'GATE1: PASS' || { echo "GATE FAIL: VM egress probe did not pass (see above)" >&2; exit 1; } +echo "[OK] GATE 1 passed -- single-NIC VM egress to the OpenStack public API works (D-035 premise)" + +# --- 6.4 k8s-snap install 
+ bootstrap --- +echo "=== 6.4 k8s-snap install + bootstrap ($CHANNEL) ===" +b=$(ssh "${SSH_OPTS[@]}" ubuntu@"$MGMT_VM" \ + bash -s "$MGMT_FIP" "$MGMT_TENANT_IP" "$CHANNEL" "$POD_CIDR" "$SVC_CIDR" "$CLUSTER_NAME" "$BOOT_TIMEOUT" "$READY_TIMEOUT" <<'REOF' 2>&1 || true +set -euo pipefail +FIP="$1"; TENANT="$2"; CH="$3"; POD="$4"; SVC="$5"; NAME="$6"; BT="$7"; RT="$8" + +echo "=== install k8s snap $CH ===" +sudo snap install k8s --classic --channel="$CH" /dev/null <&2 + echo " Retry on the VM: sudo snap remove k8s --purge &2 + exit 1; } + +echo "Summary: GATE 1 PASS; k8s ($CHANNEL) bootstrapped and ready on $CLUSTER_NAME (FIP $MGMT_FIP / tenant $MGMT_TENANT_IP)." diff --git a/scripts/phase-06-kubeconfig-gate.sh b/scripts/phase-06-kubeconfig-gate.sh new file mode 100644 index 0000000..64ac72d --- /dev/null +++ b/scripts/phase-06-kubeconfig-gate.sh @@ -0,0 +1,116 @@ +#!/usr/bin/env bash +# scripts/phase-06-kubeconfig-gate.sh +# +# Phase-06 Step 6.5 encapsulated (D-056) with the DOCFIX-062 fix baked in. +# Runs on the jumphost. +# 1. Pull the mgmt cluster's admin kubeconfig to the jumphost. +# 2. DOCFIX-062: k8s-snap 1.32.13's `k8s config server=` does NOT override +# the emitted apiserver URL -- it writes the node's TENANT IP (unroutable from +# the jumphost), so kubectl i/o-times-out. Fix: pull the RAW config, then +# rewrite the server field to the FIP with `kubectl config set-cluster +# --server` (a local file op; the cluster name is read dynamically). The FIP +# is in the cert extra-sans (written by 6.4), so TLS holds against it. +# 3. Node check + GATE 2: the agnhost pod-egress probe to the Keystone PUBLIC +# endpoint -- the exact test the dual-homed D-033 node FAILED; on this +# single-NIC VM it must Complete with exitCode 0. Keystone host:port is the +# as-run literal 10.12.4.50:5000 (6.5 tagged it verbatim); env-overridable +# per site via KEYSTONE_HOSTPORT. 
+# +# [SENSITIVE] the kubeconfig it writes ($KUBECONFIG_OUT) holds a cluster-admin +# credential; it is created with mode 600 and kept on the jumphost. +# The throwaway probe pod is always cleaned up (even on gate failure). +# +# Tunables via env: ENVFILE SSH_KEY KUBECONFIG_OUT API_PORT KEYSTONE_HOSTPORT +# AGNHOST_IMAGE PROBE_TRIES PROBE_SLEEP +# Requires: jumphost; ssh + the VM key; kubectl; ~/capi-mgmt-net.env (from +# phase-06-mgmt-vm.sh). All tunables DEFAULT to the as-run values. +# Usage: bash scripts/phase-06-kubeconfig-gate.sh +# Exit: 0 GATE 2 pass (kubeconfig usable + pod egress works) | 1 gate fail | 2 precondition +# ASCII + LF. + +set -euo pipefail +shopt -s inherit_errexit 2>/dev/null || true + +ENVFILE="${ENVFILE:-$HOME/capi-mgmt-net.env}" +SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519}" +KUBECONFIG_OUT="${KUBECONFIG_OUT:-$HOME/capi-mgmt.kubeconfig}" +API_PORT="${API_PORT:-6443}" +AGNHOST_IMAGE="${AGNHOST_IMAGE:-registry.k8s.io/e2e-test-images/agnhost:2.40}" +PROBE_TRIES="${PROBE_TRIES:-20}" +PROBE_SLEEP="${PROBE_SLEEP:-10}" + +for c in ssh kubectl; do command -v "$c" >/dev/null 2>&1 || { echo "FAIL: $c not found" >&2; exit 2; }; done +[ -f "$ENVFILE" ] || { echo "FAIL: $ENVFILE not found (run phase-06-mgmt-vm.sh first)" >&2; exit 2; } +# shellcheck disable=SC1090 +. "$ENVFILE" +[ -n "${MGMT_FIP:-}" ] || { echo "FAIL: MGMT_FIP unset in $ENVFILE" >&2; exit 2; } +[ -f "$SSH_KEY" ] || { echo "FAIL: ssh key $SSH_KEY not found" >&2; exit 2; } + +MGMT_VM="$MGMT_FIP" +SSH_OPTS=(-i "$SSH_KEY" -o BatchMode=yes -o StrictHostKeyChecking=no \ + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10) + +# --- Keystone public host:port -- the as-run literal (6.5 tagged it verbatim); +# env-overridable per site. NOT discovered. --- +KEYSTONE_HOSTPORT="${KEYSTONE_HOSTPORT:-10.12.4.50:5000}" +echo "[OK] Keystone public endpoint: $KEYSTONE_HOSTPORT" + +# --- 1. 
pull the RAW admin kubeconfig (no server= arg; we rewrite locally) --- +echo "=== pull kubeconfig -> $KUBECONFIG_OUT ===" +umask 077 +if ! ssh "${SSH_OPTS[@]}" ubuntu@"$MGMT_VM" 'sudo k8s config "$KUBECONFIG_OUT" 2>/dev/null; then + echo "GATE FAIL: could not pull kubeconfig from the mgmt VM" >&2; exit 1 +fi +chmod 600 "$KUBECONFIG_OUT" +[ -s "$KUBECONFIG_OUT" ] || { echo "GATE FAIL: $KUBECONFIG_OUT is empty" >&2; exit 1; } +head -1 "$KUBECONFIG_OUT" | grep -q 'apiVersion: v1' || { echo "GATE FAIL: $KUBECONFIG_OUT does not look like a kubeconfig" >&2; exit 1; } +echo "[OK] kubeconfig pulled ($(wc -l < "$KUBECONFIG_OUT") lines)" + +# --- 2. DOCFIX-062: rewrite the server field to the FIP (routable; cert carries the FIP SAN) --- +export KUBECONFIG="$KUBECONFIG_OUT" +CLUSTER=$(kubectl config view -o jsonpath='{.clusters[0].name}' 2>/dev/null) +[ -n "$CLUSTER" ] || { echo "GATE FAIL: no cluster entry in $KUBECONFIG_OUT" >&2; exit 1; } +kubectl config set-cluster "$CLUSTER" --server="https://${MGMT_FIP}:${API_PORT}" >/dev/null +grep -qE "^[[:space:]]*server:[[:space:]]*https://${MGMT_FIP//./\\.}:${API_PORT}\$" "$KUBECONFIG_OUT" \ + || { echo "GATE FAIL: server rewrite to https://${MGMT_FIP}:${API_PORT} did not take (DOCFIX-062)" >&2; exit 1; } +echo "[OK] kubeconfig server rewritten to https://${MGMT_FIP}:${API_PORT} (cluster '$CLUSTER')" + +# --- 3a. node check --- +echo "=== node check ===" +if ! nodes=$(kubectl get nodes -o wide 2>&1); then + printf '%s\n' "$nodes" | sed 's/^/ /' + echo "GATE FAIL: kubectl cannot reach the apiserver via the FIP" >&2; exit 1 +fi +printf '%s\n' "$nodes" | sed 's/^/ /' +printf '%s\n' "$nodes" | awk 'NR>1 && $2!="Ready"{bad=1} END{exit bad?1:0}' \ + || { echo "GATE FAIL: a node is not Ready" >&2; exit 1; } +echo "[OK] node(s) Ready" + +# --- 3b. 
GATE 2: agnhost pod-egress probe to the Keystone public endpoint --- +echo "=== GATE 2: agnhost pod-egress probe -> $KEYSTONE_HOSTPORT ===" +cleanup() { kubectl delete pod egress-test --now --ignore-not-found >/dev/null 2>&1 || true; } +trap cleanup EXIT +kubectl delete pod egress-test --now --ignore-not-found >/dev/null 2>&1 || true +kubectl run egress-test --image="$AGNHOST_IMAGE" --restart=Never \ + --command -- /agnhost connect "$KEYSTONE_HOSTPORT" --timeout=5s >/dev/null + +phase=""; state="" +for i in $(seq 1 "$PROBE_TRIES"); do + phase=$(kubectl get pod egress-test -o jsonpath='{.status.phase}' 2>/dev/null || echo '?') + state=$(kubectl get pod egress-test -o jsonpath='{.status.containerStatuses[0].state}' 2>/dev/null || echo '') + echo " [$i] phase=$phase state=$state" + case "$phase" in + Succeeded) break ;; + Failed) echo "GATE FAIL: probe pod Failed (egress to $KEYSTONE_HOSTPORT blocked)" >&2; exit 1 ;; + esac + sleep "$PROBE_SLEEP" +done + +if [ "$phase" = Succeeded ] && printf '%s' "$state" | grep -q '"exitCode":0'; then + echo "[OK] GATE 2 passed -- pod egress to $KEYSTONE_HOSTPORT returned exitCode 0 (D-035 proof)" +else + echo "GATE FAIL: probe pod did not reach Succeeded/exitCode 0 in $((PROBE_TRIES*PROBE_SLEEP))s (last: phase=$phase state=$state)" >&2 + exit 1 +fi + +echo "Summary: kubeconfig usable via FIP; GATE 2 pod-egress proof passed. $KUBECONFIG_OUT ready for phase-07." diff --git a/scripts/phase-07-conductor-graft.sh b/scripts/phase-07-conductor-graft.sh new file mode 100644 index 0000000..781817c --- /dev/null +++ b/scripts/phase-07-conductor-graft.sh @@ -0,0 +1,247 @@ +#!/usr/bin/env bash +# scripts/phase-07-conductor-graft.sh +# +# Phase-07 -- Magnum conductor graft (D-031 / D-037 / D-042 / D-046 / D-047), +# encapsulating the validated 2026-07-01 as-run (Steps 7.0-7.8) with DOCFIX-063 +# baked in. Runs on the jumphost; every conductor-side action ships via +# `juju ssh -m ... a false "v1beta1 not served"). +# 4. 
7.6 runs `install -d /etc/magnum/magnum.conf.d` before the tee (absent on a +# fresh deploy; also the 7.7 --config-dir target). +# 5. ASCII checks `sudo` the grep (a non-sudo read of the root-owned path gave a +# false "ASCII clean"). +# 6. the helm egress pre-check hits a REAL asset (bare get.helm.sh/ 404s). +# +# [SENSITIVE] Step 7.2 base64-pipes the FIP-rewritten kubeconfig into a root-written +# 0600 file owned by the conductor user (magnum) and gates on a sha256 both-sides +# match. The kubeconfig holds a cluster-admin credential; it is never staged in /tmp. +# +# Health poll + create/delete regression are NOT in this script -- on a fresh deploy +# no cluster/template exists yet; phase-08 (D-011) is the superset acceptance. +# +# Tunables via env (all DEFAULT to the as-run measured values): +# MODEL CONDUCTOR ENVFILE KUBECONFIG_SRC DRIVER_VERSION HELM_VERSION CHART_VERSION +# API_PORT SG_NAME CAPI_PROJECT CAPI_PROJECT_DOMAIN ADMIN_OPENRC +# Requires: jumphost; juju (model reachable); openstack (admin-openrc); base64; +# sha256sum; ~/capi-mgmt-net.env (MGMT_FIP, from phase-06); ~/capi-mgmt.kubeconfig. +# Usage: bash scripts/phase-07-conductor-graft.sh +# Exit: 0 all phase-07 mechanisms in place | 1 gate fail | 2 precondition fail +# ASCII + LF. 
+ +# shellcheck disable=SC1090 # $ADMIN_OPENRC / $ENVFILE are intentionally dynamic source paths +set -euo pipefail +shopt -s inherit_errexit 2>/dev/null || true + +MODEL="${MODEL:-openstack}" +CONDUCTOR="${CONDUCTOR:-magnum/0}" +MAGNUM_APP="${CONDUCTOR%%/*}" +ENVFILE="${ENVFILE:-$HOME/capi-mgmt-net.env}" +KUBECONFIG_SRC="${KUBECONFIG_SRC:-$HOME/capi-mgmt.kubeconfig}" +DRIVER_VERSION="${DRIVER_VERSION:-1.4.0}" +HELM_VERSION="${HELM_VERSION:-v3.17.3}" +CHART_VERSION="${CHART_VERSION:-0.25.1}" +API_PORT="${API_PORT:-6443}" +SG_NAME="${SG_NAME:-capi-mgmt-sg}" +CAPI_PROJECT="${CAPI_PROJECT:-capi-mgmt}" +CAPI_PROJECT_DOMAIN="${CAPI_PROJECT_DOMAIN:-capi}" +ADMIN_OPENRC="${ADMIN_OPENRC:-$HOME/admin-openrc}" +CONF_DIR="/etc/magnum/magnum.conf.d" + +say() { printf '\n=== %s ===\n' "$*"; } +ok() { printf '[OK] %s\n' "$*"; } +die1() { printf 'GATE FAIL: %s\n' "$*" >&2; exit 1; } +die2() { printf 'PRECONDITION FAIL: %s\n' "$*" >&2; exit 2; } + +# ---- helper: run a command string on the conductor (stdin closed; DOCFIX-021) ---- +rc() { juju ssh -m "$MODEL" "$CONDUCTOR" "$1" /dev/null || true; } + +# ============================ Preconditions ============================ +for c in juju openstack base64 sha256sum; do + command -v "$c" >/dev/null 2>&1 || die2 "$c not found on the jumphost" +done +[ -f "$ADMIN_OPENRC" ] || die2 "$ADMIN_OPENRC not found" +[ -f "$ENVFILE" ] || die2 "$ENVFILE not found (run phase-06 first)" +[ -s "$KUBECONFIG_SRC" ] || die2 "$KUBECONFIG_SRC not found/empty (run phase-06 6.5 first)" +# shellcheck disable=SC1090 +. 
"$ENVFILE" +[ -n "${MGMT_FIP:-}" ] || die2 "MGMT_FIP unset in $ENVFILE" +grep -qE "^[[:space:]]*server:[[:space:]]*https://${MGMT_FIP//./\\.}:${API_PORT}\$" "$KUBECONFIG_SRC" \ + || die2 "$KUBECONFIG_SRC server is not the FIP https://${MGMT_FIP}:${API_PORT} (phase-06 DOCFIX-062 rewrite missing)" +ok "preconditions met (model=$MODEL conductor=$CONDUCTOR MGMT_FIP=$MGMT_FIP driver=$DRIVER_VERSION)" + +# ============================ 7.0 domain-setup (D-046) ============================ +say "7.0 magnum trustee domain-setup (D-046; idempotent)" +juju run "${MAGNUM_APP}/leader" domain-setup /dev/null /dev/null &1 no SIGPIPE gate race) +grep -q 'magnum-conductor' <<<"$COE" || die1 "coe service list did not return magnum-conductor (trustee 403?)" +ok "domain magnum=$DOM_ID user magnum_domain_admin=$USR_ID; coe service list OK" + +# ============================ 7.1 reachability (VERIFY-FIRST; DOCFIX-063) ============================ +say "7.1 conductor -> mgmt apiserver reachability (verify-first; no hardcoded SG rule)" +CAPI_PID=$( ( . "$ADMIN_OPENRC"; openstack project show "$CAPI_PROJECT" --domain "$CAPI_PROJECT_DOMAIN" -f value -c id ) 2>/dev/null /dev/tcp/${MGMT_FIP}/${API_PORT}' && echo TCP-OK || echo TCP-FAIL" || true) +case "$TCP" in + *TCP-OK*) ok "conductor reaches ${MGMT_FIP}:${API_PORT} (phase-06 capi-mgmt-sg already permits it; no rule added)" ;; + *) die1 "conductor cannot reach ${MGMT_FIP}:${API_PORT}. DOCFIX-063 fallback (manual): scope to project $CAPI_PID, \ +MEASURE the source the mgmt VM sees from $CONDUCTOR (conntrack/listener on the VM), then \ +'openstack security group rule create --proto tcp --dst-port ${API_PORT} --remote-ip /32 $SG_NAME'. Never guess the source." 
;;
+esac
+
+# ============================ 7.2 kubeconfig -> conductor [SENSITIVE] ============================
+say "7.2 place the FIP kubeconfig on the conductor [SENSITIVE]"
+CUSER=$(rc "systemctl show magnum-conductor -p User --value" | tr -d '\r')
+[ -z "$CUSER" ] && CUSER=$(rc "ps -eo user:32,args | awk '/[m]agnum-conductor/{print \$1; exit}'" | tr -d '\r')
+[ -n "$CUSER" ] || die1 "could not determine the conductor service user"
+rc "getent passwd $CUSER >/dev/null" || die1 "conductor user '$CUSER' does not exist on the conductor"
+ok "conductor user = $CUSER"
+# base64-pipe: stdin IS the payload -> NO /etc/magnum/kubeconfig && \
+  getent passwd $CUSER >/dev/null && chown $CUSER: /etc/magnum/kubeconfig && \
+  chmod 0600 /etc/magnum/kubeconfig'" \
+  || die1 "kubeconfig transfer to the conductor failed"
+L_SHA=$(sha256sum "$KUBECONFIG_SRC" | cut -d' ' -f1)
+R_SHA=$(rc "sudo sha256sum /etc/magnum/kubeconfig" | cut -d' ' -f1)
+[ -n "$R_SHA" ] && [ "$L_SHA" = "$R_SHA" ] || die1 "kubeconfig sha256 mismatch (local=$L_SHA remote=$R_SHA)"
+ok "kubeconfig on conductor: 0600 $CUSER, sha256 match ($L_SHA)"
+
+# ============================ 7.3 served CAPI versions (DOCFIX-063 probe) ============================
+say "7.3 confirm v1beta1 is SERVED per core CAPI group (kubectl api-versions)"
+SERVED=$(KUBECONFIG="$KUBECONFIG_SRC" kubectl api-versions 2>/dev/null | grep -E 'cluster\.x-k8s\.io/' | sort || true)
+[ -n "$SERVED" ] || die1 "no cluster.x-k8s.io api-versions returned (mgmt cluster unreachable via $KUBECONFIG_SRC)"
+printf '%s\n' "$SERVED"
+for g in cluster.x-k8s.io controlplane.cluster.x-k8s.io bootstrap.cluster.x-k8s.io infrastructure.cluster.x-k8s.io; do
+  printf '%s\n' "$SERVED" | grep -qx "${g}/v1beta1" \
+    || die1 "core group ${g} does NOT serve v1beta1 -- set an api_resources override for it in 7.6 (edit CHART/driver map)"
+done
+ok "v1beta1 served for all core CAPI groups; empty api_resources={} is correct (D-042 premise)"
+
+# ============================ 7.4 driver + helm install ============================
+say "7.4 install helm $HELM_VERSION + magnum-capi-helm $DRIVER_VERSION on the conductor"
+# (a) egress pre-check -- REAL assets (DOCFIX-063: bare get.helm.sh/ 404s)
+rc "curl -s -o /dev/null -w 'pypi:%{http_code}\n' https://pypi.org/simple/ ; \
+    curl -s -o /dev/null -w 'helm:%{http_code}\n' https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz.sha256sum"
+# (b) helm -- checksum-verified; /usr/local/bin + /usr/bin symlink (DOCFIX-035). WANT injected from the local tunable.
+juju ssh -m "$MODEL" "$CONDUCTOR" "WANT='$HELM_VERSION'; "'set -e
+  if [ -x /usr/bin/helm ] && /usr/bin/helm version --short 2>/dev/null | grep -q "$WANT"; then
+    echo "[SKIP] /usr/bin/helm already $WANT"
+  else
+    T=helm-$WANT-linux-amd64.tar.gz; D=$(mktemp -d); cd "$D"
+    curl -fsSLO "https://get.helm.sh/$T"
+    EXP=$(curl -fsSL "https://get.helm.sh/$T.sha256sum" | cut -d" " -f1)
+    GOT=$(sha256sum "$T" | cut -d" " -f1)
+    [ -n "$EXP" ] && [ "$EXP" = "$GOT" ] || { echo "GATE FAIL: helm checksum exp=$EXP got=$GOT"; exit 1; }
+    tar xzf "$T"
+    sudo install -o root -g root -m 0755 linux-amd64/helm /usr/local/bin/helm
+    sudo ln -sfn /usr/local/bin/helm /usr/bin/helm
+    cd /; rm -rf "$D"; echo "[OK] installed $(/usr/bin/helm version --short)"
+  fi' /dev/null | grep -E '^Version:'")" \
+  || die1 "installed magnum-capi-helm is not $DRIVER_VERSION"
+grep -q 'k8s_capi_helm_v1' <<<"$(rcap "python3 -c \"import importlib.metadata as m; print([e.name for e in m.entry_points(group='magnum.drivers')])\"")" \
+  || die1 "k8s_capi_helm_v1 entry point missing after install"
+ok "helm $HELM_VERSION (restricted PATH) + magnum-capi-helm $DRIVER_VERSION; entry point present"
+
+# ---- moved 7.2 auth-proof (helm now present) ----
+say "7.2/7.4 end-to-end auth proof (helm list -A as $CUSER via the FIP)"
+AUTH=$(rc "sudo -u $CUSER env HOME=/tmp helm --kubeconfig /etc/magnum/kubeconfig list -A" || true)
+printf '%s\n' "$AUTH"
+grep -q 'cert-manager' <<<"$AUTH" || die1 "conductor could not auth/list mgmt-cluster releases (expected cert-manager et al.)"
+ok "conductor authenticates to the mgmt cluster; releases listed"
+
+# ============================ 7.6 [capi_helm] drop-in (D-037) ============================
+say "7.6 stage the [capi_helm] conf.d drop-in (D-037)"
+rc "sudo install -d -o root -g root -m 0755 $CONF_DIR" # DOCFIX-063: dir absent on fresh deploy
+CONF_CONTENT="[capi_helm]
+kubeconfig_file = /etc/magnum/kubeconfig
+helm_chart_repo = https://azimuth-cloud.github.io/capi-helm-charts
+helm_chart_name = openstack-cluster
+default_helm_chart_version = $CHART_VERSION
+api_resources = {}"
+# stdin IS the payload -> NO /dev/null" || die1 "writing 00-capi-helm.conf failed"
+rc "sudo chmod 0644 $CONF_DIR/00-capi-helm.conf"
+# verify content + perms + ASCII (DOCFIX-063: sudo the grep)
+rc "sudo grep -q '^default_helm_chart_version = $CHART_VERSION\$' $CONF_DIR/00-capi-helm.conf" \
+  || die1 "00-capi-helm.conf missing default_helm_chart_version = $CHART_VERSION"
+rc "sudo env LC_ALL=C grep -nP '[^\x00-\x7F]' $CONF_DIR/00-capi-helm.conf" \
+  && die1 "00-capi-helm.conf has non-ASCII bytes" || true
+ok "00-capi-helm.conf staged (chart $CHART_VERSION, api_resources={}, ASCII clean)"
+
+# ============================ 7.7 conductor --config-dir (D-037) ============================
+say "7.7 wire --config-dir into the conductor via /etc/default (LSB init)"
+juju ssh -m "$MODEL" "$CONDUCTOR" \
+  "echo 'DAEMON_ARGS=\"\$DAEMON_ARGS --config-dir $CONF_DIR\"' | sudo tee /etc/default/magnum-conductor >/dev/null && \
+   sudo chmod 0644 /etc/default/magnum-conductor" /dev/null \
+  || echo 'DAEMON_ARGS="\$DAEMON_ARGS --config-dir $CONF_DIR"' >> /etc/default/magnum-api
+chmod 0644 /etc/default/magnum-api
+WWW=\$(awk -F'= ' '/^\[keystone_authtoken\]/{s=1} s&&/^www_authenticate_uri/{print \$2; exit}' /etc/magnum/magnum.conf)
+AURL=\$(awk -F'= ' '/^\[keystone_authtoken\]/{s=1} s&&/^auth_url/{print \$2; exit}' /etc/magnum/magnum.conf)
+WWW3=\${WWW/\/v2.0//v3}; case "\$WWW3" in */v3) ;; *) WWW3="\${WWW3%/}/v3";; esac
+AURL3=\${AURL/\/v2.0//v3}; case "\$AURL3" in */v3) ;; *) AURL3="\${AURL3%/}/v3";; esac
+printf '[keystone_authtoken]\nauth_version = v3\nwww_authenticate_uri = %s\nauth_url = %s\n[keystone_auth]\nauth_version = v3\nwww_authenticate_uri = %s\nauth_url = %s\n' \
+  "\$WWW3" "\$AURL3" "\$WWW3" "\$AURL3" > $CONF_DIR/50-keystone-v3-override.conf
+chmod 0644 $CONF_DIR/50-keystone-v3-override.conf
+REOF
+[ "$(rcap "sudo grep -c '^auth_version = v3\$' $CONF_DIR/50-keystone-v3-override.conf")" = 2 ] \
+  || die1 "50-keystone-v3-override.conf missing auth_version=v3 in both sections"
+[ "$(rcap "sudo grep -c -- '--config-dir $CONF_DIR' /etc/default/magnum-api")" = 1 ] \
+  || die1 "/etc/default/magnum-api does not carry exactly one --config-dir line"
+ok "keystone-v3 override written (both sections v3); magnum-api /etc/default wired"
+
+# ============================ 7.8 restart + driver enabled ============================
+say "7.8 restart conductor + api; verify both live cmdlines carry --config-dir"
+ACT=$(rcap "sudo systemctl restart magnum-conductor magnum-api && sleep 3 && systemctl is-active magnum-conductor magnum-api")
+[ "$(grep -c '^active$' <<<"$ACT")" = 2 ] || die1 "magnum-conductor and/or magnum-api not active after restart"
+grep -q -- "--config-dir $CONF_DIR" <<<"$(rcap "ps -ww -C magnum-conductor -o args= | head -1")" \
+  || die1 "running conductor cmdline lacks --config-dir after restart"
+grep -q -- "--config-dir $CONF_DIR" <<<"$(rcap "ps -ww -C magnum-api -o args= | head -1")" \
+  || die1 "running magnum-api cmdline lacks --config-dir after restart"
+grep -q 'k8s_capi_helm_v1' <<<"$(rcap "sudo magnum-driver-manage list-drivers 2>/dev/null")" \
+  || die1 "k8s_capi_helm_v1 not enabled in magnum-driver-manage list-drivers"
+ok "both services active with --config-dir; k8s_capi_helm_v1 enabled"
+
+say "PHASE-07 COMPLETE"
+echo "All conductor-graft mechanisms in place. HEALTHY poll + create/delete regression are"
+echo "phase-08 (D-011) -- no cluster/template exists yet on a fresh deploy."
+exit 0
diff --git a/tests/phase-06-capi-stack/fakebin/ssh b/tests/phase-06-capi-stack/fakebin/ssh
new file mode 100644
index 0000000..cedb150
--- /dev/null
+++ b/tests/phase-06-capi-stack/fakebin/ssh
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+# fake ssh for phase-06-capi-stack.sh tests. Reads the remote block from stdin,
+# identifies the sub-step by a distinctive token, appends it to $ORDER_FILE (so
+# tests can assert ORC-before-init), and emits canned output + exit code.
+# Steered by env: A_FAIL B_FAIL C_FAIL D_FAIL E_FAIL F_FAIL.
+body="$(cat 2>/dev/null || true)"
+log() { [ -n "${ORDER_FILE:-}" ] && printf '%s\n' "$1" >> "$ORDER_FILE"; }
+
+if printf '%s' "$body" | grep -q 'dependencies.json'; then
+  log a
+  [ "${A_FAIL:-0}" = 1 ] && { echo "PIN-FAIL: CAPO=null"; exit 1; }
+  echo "== pins =="; echo "CAPI=v1.13.2"; echo "== tooling =="; echo "clusterctl v1.13.2"; exit 0
+elif printf '%s' "$body" | grep -q 'jetstack/cert-manager'; then
+  log b
+  [ "${B_FAIL:-0}" = 1 ] && { echo "Error: timed out waiting for the condition"; exit 1; }
+  echo "cert-manager deployed"; exit 0
+elif printf '%s' "$body" | grep -q 'server-side'; then
+  log c
+  [ "${C_FAIL:-0}" = 1 ] && { echo "error: no matches for kind Image"; exit 1; }
+  echo "images.openstack.k-orc.cloud"; exit 0
+elif printf '%s' "$body" | grep -q 'clusterctl init'; then
+  log d
+  [ "${D_FAIL:-0}" = 1 ] && { echo "capo-system deploy not Available"; exit 1; }
+  echo "Your management cluster has been initialized successfully!"; exit 0
+elif printf '%s' "$body" | grep -q 'cluster-api-addon-provider'; then
+  log e
+  [ "${E_FAIL:-0}" = 1 ] && { echo "Error: helm timeout"; exit 1; }
+  echo "addon + janitor Running"; exit 0
+elif printf '%s' "$body" | grep -q 'STACK: OK'; then
+  log f
+  [ "${F_FAIL:-0}" = 1 ] && { echo "NOT-RUNNING:"; echo "capo-system/capo-controller CrashLoopBackOff"; exit 1; }
+  echo "STACK: OK"; exit 0
+fi
+echo "fake-ssh: unrecognized block" >&2
+exit 0
diff --git a/tests/phase-06-capi-stack/run-tests.sh b/tests/phase-06-capi-stack/run-tests.sh
new file mode 100644
index 0000000..670e98b
--- /dev/null
+++ b/tests/phase-06-capi-stack/run-tests.sh
@@ -0,0 +1,75 @@
+#!/usr/bin/env bash
+# tests/phase-06-capi-stack/run-tests.sh -- offline regression for
+# phase-06-capi-stack.sh. Fake ssh; real bash.
+# Key assertions: (1) sub-steps run a,b,c,d,e,f in order; (2) ORC (c) precedes
+# clusterctl init (d); (3) if ORC fails, init must NOT run (the hardened order
+# exists precisely to stop CAPO crash-looping on a missing ORC CRD).
+set -euo pipefail
+IFS=$'\n\t'
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SCRIPTS="$(cd "$HERE/../../scripts" && pwd)"
+TARGET="$SCRIPTS/phase-06-capi-stack.sh"
+BIN="$HERE/fakebin"
+[ -f "$TARGET" ] || { echo "FAIL: $TARGET missing" >&2; exit 1; }
+chmod +x "$BIN"/* 2>/dev/null || true
+WORK="$(mktemp -d)"; trap 'rm -rf "$WORK"' EXIT
+rc_all=0
+
+mkenv() { printf 'MGMT_FIP=%s\n' '10.12.7.222' > "$WORK/net.env"; }
+: > "$WORK/id_key"
+ORDER="$WORK/order"
+
+run() { # want_rc out_regex want_order(comma or -) label [extra env...]
+  local want="$1" re="$2" order_want="$3" label="$4"; shift 4
+  local rc order_got
+  : > "$ORDER"
+  set +e
+  PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+    ORDER_FILE="$ORDER" env "$@" bash "$TARGET" >"$WORK/out" 2>&1
+  rc=$?
+  set -e
+  order_got=$(paste -sd, "$ORDER" 2>/dev/null || true)
+  local ok=1
+  [ "$rc" -eq "$want" ] || ok=0
+  grep -qE "$re" "$WORK/out" || ok=0
+  if [ "$order_want" != '-' ] && [ "$order_got" != "$order_want" ]; then ok=0; fi
+  if [ "$ok" = 1 ]; then
+    printf ' [OK] %-46s exit %s order=[%s]\n' "$label" "$rc" "$order_got"
+  else
+    printf ' [XX] %-46s exit %s (want %s; /%s/; order [%s] want [%s])\n' \
+      "$label" "$rc" "$want" "$re" "$order_got" "$order_want"
+    sed 's/^/ /' "$WORK/out"; rc_all=1
+  fi
+}
+
+echo "=== phase-06-capi-stack.sh ==="
+mkenv
+run 0 'Phase-06 complete' 'a,b,c,d,e,f' "happy path -> full stack, ordered"
+run 1 '6.6a' 'a' "6.6a pin fail -> stop at a" A_FAIL=1
+run 1 '6.6b' 'a,b' "6.6b cert-manager fail -> stop at b" B_FAIL=1
+run 1 '6.6c' 'a,b,c' "6.6c ORC fail -> init (d) NOT run" C_FAIL=1
+run 1 '6.6d' 'a,b,c,d' "6.6d init fail -> stop at d" D_FAIL=1
+run 1 '6.6e' 'a,b,c,d,e' "6.6e CAAPH/janitor fail -> stop at e" E_FAIL=1
+run 1 '6.6f' 'a,b,c,d,e,f' "6.6f verify fail -> stop at f" F_FAIL=1
+
+# preconditions
+run 2 'not found' '-' "precondition: no ENVFILE -> exit 2" ENVFILE="$WORK/nope.env"
+: > "$WORK/net.env"
+run 2 'MGMT_FIP unset' '-' "precondition: MGMT_FIP unset -> exit 2"
+mkenv
+
+echo "=== assert: ORC (c) strictly precedes clusterctl init (d) on happy path ==="
+: > "$ORDER"
+PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+  ORDER_FILE="$ORDER" bash "$TARGET" >/dev/null 2>&1 || true
+ci=$(grep -n '^c$' "$ORDER" | head -1 | cut -d: -f1)
+di=$(grep -n '^d$' "$ORDER" | head -1 | cut -d: -f1)
+if [ -n "$ci" ] && [ -n "$di" ] && [ "$ci" -lt "$di" ]; then
+  echo " [OK] ORC at step $ci precedes clusterctl init at step $di"
+else
+  echo " [XX] ORC/init ordering wrong (c=$ci d=$di)"; rc_all=1
+fi
+
+echo
+[ "$rc_all" -eq 0 ] && echo "ALL PASS" || echo "SOME FAILED"
+exit "$rc_all"
diff --git a/tests/phase-06-k8s-bootstrap/fakebin/ssh b/tests/phase-06-k8s-bootstrap/fakebin/ssh
new file mode 100644
index 0000000..7c0d69e
--- /dev/null
+++ b/tests/phase-06-k8s-bootstrap/fakebin/ssh
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+# fake ssh for phase-06-k8s-bootstrap.sh tests.
+# Collects positionals after 'bash -s' to tell GATE 1 (5 args) from 6.4 (8 args).
+# Steered by env: VIP_FAIL NET_FAIL BOOT_FAIL.
+pos=(); after=0
+for a in "$@"; do
+  if [ "$after" = 1 ]; then pos+=("$a"); continue; fi
+  [ "$a" = "-s" ] && after=1
+done
+cat >/dev/null 2>&1 || true # discard the heredoc on stdin
+case "${#pos[@]}" in
+  5) # GATE 1 egress probe: khost kport ihost iport timeout
+    khost="${pos[0]}"; kport="${pos[1]}"; ihost="${pos[2]}"; iport="${pos[3]}"
+    if [ "${VIP_FAIL:-0}" = 1 ]; then echo "VIP-FAIL $khost:$kport"; else echo "VIP-OK $khost:$kport"; fi
+    if [ "${NET_FAIL:-0}" = 1 ]; then echo "NET-FAIL $ihost:$iport"; else echo "NET-OK $ihost:$iport"; fi
+    if [ "${VIP_FAIL:-0}" = 1 ] || [ "${NET_FAIL:-0}" = 1 ]; then echo "GATE1: FAIL"; else echo "GATE1: PASS"; fi
+    ;;
+  8) # 6.4 bootstrap
+    if [ "${BOOT_FAIL:-0}" = 1 ]; then
+      echo "=== bootstrap ==="; echo "Error: bootstrap failed"
+    else
+      echo "cluster status: ready"; echo "network: enabled"; echo "BOOT: READY"
+    fi
+    ;;
+  *) echo "fake-ssh: unexpected positional count ${#pos[@]}" >&2 ;;
+esac
+exit 0
diff --git a/tests/phase-06-k8s-bootstrap/run-tests.sh b/tests/phase-06-k8s-bootstrap/run-tests.sh
new file mode 100644
index 0000000..882f725
--- /dev/null
+++ b/tests/phase-06-k8s-bootstrap/run-tests.sh
@@ -0,0 +1,66 @@
+#!/usr/bin/env bash
+# tests/phase-06-k8s-bootstrap/run-tests.sh -- offline regression for
+# phase-06-k8s-bootstrap.sh. Fake ssh; real bash.
+set -euo pipefail
+IFS=$'\n\t'
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SCRIPTS="$(cd "$HERE/../../scripts" && pwd)"
+TARGET="$SCRIPTS/phase-06-k8s-bootstrap.sh"
+BIN="$HERE/fakebin"
+[ -f "$TARGET" ] || { echo "FAIL: $TARGET missing" >&2; exit 1; }
+chmod +x "$BIN"/* 2>/dev/null || true
+WORK="$(mktemp -d)"; trap 'rm -rf "$WORK"' EXIT
+rc_all=0
+
+# baseline fixtures: env file + fake ssh key
+mkenv() { printf 'MGMT_FIP=%s\nMGMT_TENANT_IP=%s\n' "${1:-10.12.7.222}" "${2:-10.20.0.207}" > "$WORK/net.env"; }
+: > "$WORK/id_key"
+
+run() { # want_rc regex label [extra env assignments...]
+  local want="$1" re="$2" label="$3"; shift 3
+  local rc
+  set +e
+  PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+    PROBE_TIMEOUT=1 BOOT_TIMEOUT=1m READY_TIMEOUT=1m \
+    env "$@" bash "$TARGET" >"$WORK/out" 2>&1
+  rc=$?
+  set -e
+  if [ "$rc" -eq "$want" ] && grep -qE "$re" "$WORK/out"; then
+    printf ' [OK] %-46s exit %s\n' "$label" "$rc"
+  else
+    printf ' [XX] %-46s exit %s (want %s; /%s/)\n' "$label" "$rc" "$want" "$re"
+    sed 's/^/ /' "$WORK/out"; rc_all=1
+  fi
+}
+
+echo "=== phase-06-k8s-bootstrap.sh ==="
+mkenv
+run 0 'GATE 1 passed' "happy path (literal keystone + bootstrap)"
+run 0 'Keystone public endpoint: 10.12.4.50:5000' "keystone = as-run literal default (no discovery)"
+run 0 'bootstrapped and ready' "6.4 reaches ready"
+run 0 'Keystone public endpoint: 1.2.3.4:5000' "KEYSTONE_HOSTPORT override honored" KEYSTONE_HOSTPORT=1.2.3.4:5000
+run 1 'VM egress probe did not pass' "GATE 1 VIP fail -> exit 1" VIP_FAIL=1
+run 1 'VM egress probe did not pass' "GATE 1 NET fail -> exit 1" NET_FAIL=1
+run 1 'did not reach ready' "6.4 bootstrap fail -> exit 1" BOOT_FAIL=1
+
+# preconditions
+run 2 'not found' "precondition: no ENVFILE -> exit 2" ENVFILE="$WORK/nope.env"
+mkenv "" ""; : > "$WORK/net.env" # empty env file (no MGMT_FIP)
+run 2 'MGMT_FIP unset' "precondition: MGMT_FIP unset -> exit 2"
+mkenv
+run 2 'ssh key' "precondition: missing ssh key -> exit 2" SSH_KEY="$WORK/nokey"
+
+# as-run fidelity: the script must NOT dynamically discover Keystone (uses the literal)
+set +e
+PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+  PROBE_TIMEOUT=1 BOOT_TIMEOUT=1m READY_TIMEOUT=1m bash "$TARGET" >"$WORK/fid" 2>&1
+set -e
+if grep -qiE 'discovered|endpoint list' "$WORK/fid"; then
+  printf ' [XX] %-46s (performed discovery; must use as-run literal)\n' "fidelity: no dynamic discovery"; rc_all=1
+else
+  printf ' [OK] %-46s\n' "fidelity: no dynamic discovery (as-run literal)"
+fi
+
+echo
+[ "$rc_all" -eq 0 ] && echo "ALL PASS" || echo "SOME FAILED"
+exit "$rc_all"
diff --git a/tests/phase-06-kubeconfig-gate/fakebin/kubectl b/tests/phase-06-kubeconfig-gate/fakebin/kubectl
new file mode 100644
index 0000000..dc5ec11
--- /dev/null
+++ b/tests/phase-06-kubeconfig-gate/fakebin/kubectl
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+# fake kubectl for phase-06-kubeconfig-gate.sh tests.
+# Steered by env: CLUSTER_NAME_OUT SET_CLUSTER_NOOP NODE_NOTREADY POD_PHASE POD_STATE.
+a1="${1:-}"; a2="${2:-}"; rest=" $* "
+case "$a1 $a2" in
+  "config view")
+    echo "${CLUSTER_NAME_OUT:-k8s}" ;;
+  "config set-cluster")
+    srv=""
+    for a in "$@"; do case "$a" in --server=*) srv="${a#--server=}";; esac; done
+    if [ "${SET_CLUSTER_NOOP:-0}" != 1 ] && [ -n "${KUBECONFIG:-}" ] && [ -f "${KUBECONFIG:-}" ]; then
+      sed -i -E "s#^([[:space:]]*server:).*#\1 $srv#" "$KUBECONFIG"
+    fi
+    echo "Cluster set." ;;
+  "get nodes")
+    st="Ready"; [ "${NODE_NOTREADY:-0}" = 1 ] && st="NotReady"
+    echo "NAME STATUS ROLES AGE VERSION"
+    echo "capi-mgmt-v2 $st control-plane,worker 12m v1.32.13" ;;
+  "get pod")
+    if printf '%s' "$rest" | grep -q 'status.phase'; then
+      echo "${POD_PHASE:-Succeeded}"
+    elif printf '%s' "$rest" | grep -q 'containerStatuses'; then
+      echo "${POD_STATE:-{\"terminated\":{\"reason\":\"Completed\",\"exitCode\":0}}}"
+    fi ;;
+  "delete pod") exit 0 ;;
+  "run egress-test") exit 0 ;;
+esac
+exit 0
diff --git a/tests/phase-06-kubeconfig-gate/fakebin/ssh b/tests/phase-06-kubeconfig-gate/fakebin/ssh
new file mode 100644
index 0000000..496f3a9
--- /dev/null
+++ b/tests/phase-06-kubeconfig-gate/fakebin/ssh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+# fake ssh for phase-06-kubeconfig-gate.sh: only the 'sudo k8s config' pull is used.
+# Emits a kubeconfig whose server is the TENANT IP (10.20.0.207) -- the exact
+# DOCFIX-062 defect the script must rewrite to the FIP.
+# Steered by env: PULL_FAIL PULL_EMPTY PULL_BADHEAD.
+want_config=0
+for a in "$@"; do case "$a" in *"k8s config"*) want_config=1;; esac; done
+if [ "$want_config" = 1 ]; then
+  [ "${PULL_FAIL:-0}" = 1 ] && exit 1
+  [ "${PULL_EMPTY:-0}" = 1 ] && exit 0
+  if [ "${PULL_BADHEAD:-0}" = 1 ]; then echo "not-a-kubeconfig"; exit 0; fi
+  cat <<'KC'
+apiVersion: v1
+clusters:
+- cluster:
+    server: https://10.20.0.207:6443
+  name: k8s
+contexts:
+- context:
+    cluster: k8s
+    user: admin
+  name: k8s
+current-context: k8s
+kind: Config
+users:
+- name: admin
+KC
+fi
+exit 0
diff --git a/tests/phase-06-kubeconfig-gate/run-tests.sh b/tests/phase-06-kubeconfig-gate/run-tests.sh
new file mode 100644
index 0000000..6cc0506
--- /dev/null
+++ b/tests/phase-06-kubeconfig-gate/run-tests.sh
@@ -0,0 +1,89 @@
+#!/usr/bin/env bash
+# tests/phase-06-kubeconfig-gate/run-tests.sh -- offline regression for
+# phase-06-kubeconfig-gate.sh. Fake ssh + kubectl; real bash.
+# Key assertion: DOCFIX-062 -- the emitted kubeconfig server (tenant IP) is
+# rewritten to the FIP before the gate runs.
+set -euo pipefail
+IFS=$'\n\t'
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SCRIPTS="$(cd "$HERE/../../scripts" && pwd)"
+TARGET="$SCRIPTS/phase-06-kubeconfig-gate.sh"
+BIN="$HERE/fakebin"
+[ -f "$TARGET" ] || { echo "FAIL: $TARGET missing" >&2; exit 1; }
+chmod +x "$BIN"/* 2>/dev/null || true
+WORK="$(mktemp -d)"; trap 'rm -rf "$WORK"' EXIT
+rc_all=0
+FIP=10.12.7.222
+
+mkenv() { printf 'MGMT_FIP=%s\n' "$FIP" > "$WORK/net.env"; }
+: > "$WORK/id_key"
+
+run() { # want_rc regex label [extra env...]
+  local want="$1" re="$2" label="$3"; shift 3
+  local rc
+  rm -f "$WORK/kc"
+  set +e
+  PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+    KUBECONFIG_OUT="$WORK/kc" PROBE_TRIES=2 PROBE_SLEEP=0 \
+    env "$@" bash "$TARGET" >"$WORK/out" 2>&1
+  rc=$?
+  set -e
+  if [ "$rc" -eq "$want" ] && grep -qE "$re" "$WORK/out"; then
+    printf ' [OK] %-48s exit %s\n' "$label" "$rc"
+  else
+    printf ' [XX] %-48s exit %s (want %s; /%s/)\n' "$label" "$rc" "$want" "$re"
+    sed 's/^/ /' "$WORK/out"; rc_all=1
+  fi
+}
+
+echo "=== phase-06-kubeconfig-gate.sh ==="
+mkenv
+run 0 'GATE 2 passed' "happy path (pull + rewrite + probe)"
+run 0 'Keystone public endpoint: 10.12.4.50:5000' "keystone = as-run literal default (no discovery)"
+run 0 'server rewritten to https' "DOCFIX-062 rewrite message"
+run 0 'GATE 2 passed' "KEYSTONE_HOSTPORT override" KEYSTONE_HOSTPORT=1.2.3.4:5000
+run 1 'could not pull kubeconfig' "pull fail -> exit 1" PULL_FAIL=1
+run 1 'is empty' "empty kubeconfig -> exit 1" PULL_EMPTY=1
+run 1 'does not look like' "bad head -> exit 1" PULL_BADHEAD=1
+run 1 'did not take' "set-cluster no-op -> exit 1 (DOCFIX-062 guard)" SET_CLUSTER_NOOP=1
+run 1 'node is not Ready' "node NotReady -> exit 1" NODE_NOTREADY=1
+run 1 'probe pod Failed' "GATE 2 pod Failed -> exit 1" POD_PHASE=Failed
+run 1 'did not reach Succeeded' "GATE 2 exitCode!=0 -> exit 1" POD_STATE='{"terminated":{"reason":"Error","exitCode":1}}'
+run 1 'did not reach Succeeded' "GATE 2 Pending timeout -> exit 1" POD_PHASE=Pending
+
+# preconditions
+run 2 'not found' "precondition: no ENVFILE -> exit 2" ENVFILE="$WORK/nope.env"
+: > "$WORK/net.env"
+run 2 'MGMT_FIP unset' "precondition: MGMT_FIP unset -> exit 2"
+mkenv
+
+echo "=== assert DOCFIX-062: kubeconfig server rewritten tenant-IP -> FIP ==="
+rm -f "$WORK/kc"
+PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+  KUBECONFIG_OUT="$WORK/kc" PROBE_TRIES=2 PROBE_SLEEP=0 \
+  bash "$TARGET" >/dev/null 2>&1 || true
+if grep -qE "server:[[:space:]]*https://${FIP//./\\.}:6443" "$WORK/kc" \
+  && ! grep -q '10.20.0.207:6443' "$WORK/kc"; then
+  perm=$(stat -c '%a' "$WORK/kc" 2>/dev/null || echo '?')
+  if [ "$perm" = 600 ]; then
+    echo " [OK] server rewritten to FIP; tenant IP gone; mode 600"
+  else
+    echo " [XX] kubeconfig mode=$perm (want 600)"; rc_all=1
+  fi
+else
+  echo " [XX] server not rewritten to FIP (DOCFIX-062 regression)"; sed 's/^/ /' "$WORK/kc"; rc_all=1
+fi
+
+echo "=== assert as-run fidelity: no dynamic Keystone discovery ==="
+PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+  KUBECONFIG_OUT="$WORK/kc" PROBE_TRIES=2 PROBE_SLEEP=0 \
+  bash "$TARGET" >"$WORK/fid" 2>&1 || true
+if grep -qiE 'discovered|endpoint list' "$WORK/fid"; then
+  echo " [XX] performed discovery; must use as-run literal"; rc_all=1
+else
+  echo " [OK] no dynamic discovery (as-run literal 10.12.4.50:5000)"
+fi
+
+echo
+[ "$rc_all" -eq 0 ] && echo "ALL PASS" || echo "SOME FAILED"
+exit "$rc_all"
diff --git a/tests/phase-07-conductor-graft/fakebin/juju b/tests/phase-07-conductor-graft/fakebin/juju
new file mode 100644
index 0000000..679e2d5
--- /dev/null
+++ b/tests/phase-07-conductor-graft/fakebin/juju
@@ -0,0 +1,103 @@
+#!/usr/bin/env bash
+# fake juju for phase-07-conductor-graft.sh tests.
+# Logs every call to $JUJU_LOG; keeps decoded-file state in $JUJU_STATE so the
+# 7.2 sha256 both-sides gate passes legitimately. Dispatches `juju ssh` remote
+# commands by substring. Steered by env:
+# DOMAIN_SETUP_FAIL TCP_FAIL SHA_MISMATCH DRIVER_MISSING NOTACTIVE
+# SHOWARGS_NODIR PS_NODIR NODRIVER_ENABLED
+: "${JUJU_LOG:=/dev/null}"
+: "${JUJU_STATE:=/tmp}"
+printf 'juju %s\n' "$*" >> "$JUJU_LOG"
+
+sub="${1:-}"; shift || true
+
+if [ "$sub" = "run" ]; then
+  # juju run magnum/leader domain-setup
+  [ "${DOMAIN_SETUP_FAIL:-0}" = 1 ] && exit 1
+  echo "Running domain-setup on magnum/leader"; exit 0
+fi
+
+if [ "$sub" != "ssh" ]; then exit 0; fi
+
+# strip: -m MODEL UNIT ; the remainder is the remote command (1+ args)
+while [ "${1:-}" = "-m" ]; do shift 2; done
+shift || true # drop the UNIT
+REMOTE="$*"
+
+emit_kubeconfig_sha() {
+  if [ "${SHA_MISMATCH:-0}" = 1 ]; then
+    echo "0000000000000000000000000000000000000000000000000000000000000000 /etc/magnum/kubeconfig"
+  elif [ -f "$JUJU_STATE/kubeconfig" ]; then
+    sha256sum "$JUJU_STATE/kubeconfig" | awk '{print $1" /etc/magnum/kubeconfig"}'
+  else
+    echo "(no kubeconfig)"; return 1
+  fi
+}
+
+case "$REMOTE" in
+  *"base64 -d > /etc/magnum/kubeconfig"*) # 7.2 write -- stdin is base64 payload (must precede the getent match)
+    base64 -d > "$JUJU_STATE/kubeconfig" 2>/dev/null || true; exit 0 ;;
+  *"echo TCP-OK"*)
+    if [ "${TCP_FAIL:-0}" = 1 ]; then echo "TCP-FAIL"; else echo "TCP-OK"; fi ;;
+  *"systemctl show magnum-conductor -p User"*)
+    echo "magnum" ;;
+  *"[m]agnum-conductor"*) # ps fallback owner probe
+    echo "magnum" ;;
+  *"getent passwd magnum"*)
+    exit 0 ;;
+  *"sha256sum /etc/magnum/kubeconfig"*)
+    emit_kubeconfig_sha ;;
+  *"curl "*pypi*|*"pypi:"*)
+    echo "pypi:200"; echo "helm:200" ;;
+  *"WANT="*"get.helm.sh"*) # helm install block
+    echo "[OK] installed v3.17.3+ge4da497" ;;
+  *"command -v helm"*) # restricted-PATH gate
+    echo "/usr/bin/helm"; echo "v3.17.3+ge4da497" ;;
+  *"pip install"*"magnum-capi-helm"*)
+    echo "Successfully installed magnum-capi-helm-1.4.0" ;;
+  *"pip show magnum-capi-helm"*)
+    echo "Version: 1.4.0" ;;
+  *"entry_points"*)
+    if [ "${DRIVER_MISSING:-0}" = 1 ]; then echo "['k8s_fedora_coreos_v1']"; else echo "['k8s_capi_helm_v1', 'k8s_fedora_coreos_v1']"; fi ;;
+  *"helm --kubeconfig /etc/magnum/kubeconfig list -A"*) # auth proof
+    echo "NAME NAMESPACE REVISION STATUS CHART"
+    echo "cert-manager cert-manager 1 deployed cert-manager-v1.20.2"
+    echo "ck-network kube-system 1 deployed cilium-1.17.12" ;;
+  *"install -d"*"magnum.conf.d"*)
+    mkdir -p "$JUJU_STATE/conf.d"; exit 0 ;;
+  *"tee /etc/magnum/magnum.conf.d/00-capi-helm.conf"*) # 7.6 write -- stdin payload
+    mkdir -p "$JUJU_STATE/conf.d"; cat > "$JUJU_STATE/conf.d/00-capi-helm.conf"; exit 0 ;;
+  *"chmod 0644 /etc/magnum/magnum.conf.d/00-capi-helm.conf"*)
+    exit 0 ;;
+  *"grep -q '^default_helm_chart_version = "*)
+    grep -q '^default_helm_chart_version = 0.25.1$' "$JUJU_STATE/conf.d/00-capi-helm.conf"; exit $? ;;
+  *"grep -nP '[^\\x00-\\x7F]' /etc/magnum/magnum.conf.d/00-capi-helm.conf"*)
+    LC_ALL=C grep -nP '[^\x00-\x7F]' "$JUJU_STATE/conf.d/00-capi-helm.conf"; exit $? ;; # empty -> exit 1 (no non-ascii)
+  *"tee /etc/default/magnum-conductor"*)
+    cat > "$JUJU_STATE/default-conductor"; exit 0 ;;
+  *"show-args"*)
+    if [ "${SHOWARGS_NODIR:-0}" = 1 ]; then
+      echo "/usr/bin/magnum-conductor --config-file=/etc/magnum/magnum.conf --log-file=/var/log/magnum/magnum-conductor.log"
+    else
+      echo "/usr/bin/magnum-conductor --config-file=/etc/magnum/magnum.conf --config-dir /etc/magnum/magnum.conf.d --log-file=/var/log/magnum/magnum-conductor.log"
+    fi ;;
+  *"sudo bash -s"*) # 7.7b heredoc -- consume + simulate success
+    cat >/dev/null; exit 0 ;;
+  *"grep -c '^auth_version = v3"*)
+    echo "2" ;;
+  *"grep -c -- '--config-dir /etc/magnum/magnum.conf.d' /etc/default/magnum-api"*)
+    echo "1" ;;
+  *"systemctl restart magnum-conductor magnum-api"*)
+    if [ "${NOTACTIVE:-0}" = 1 ]; then echo "active"; echo "failed"; else echo "active"; echo "active"; fi ;;
+  *"ps -ww -C magnum-conductor"*|*"ps -ww -C magnum-api"*)
+    if [ "${PS_NODIR:-0}" = 1 ]; then
+      echo "/usr/bin/python3 /usr/bin/magnum-x --config-file=/etc/magnum/magnum.conf --log-file=/var/log/magnum/x.log"
+    else
+      echo "/usr/bin/python3 /usr/bin/magnum-x --config-file=/etc/magnum/magnum.conf --config-dir /etc/magnum/magnum.conf.d --log-file=/var/log/magnum/x.log"
+    fi ;;
+  *"magnum-driver-manage list-drivers"*)
+    if [ "${NODRIVER_ENABLED:-0}" = 1 ]; then echo "| k8s_fedora_coreos_v1 |"; else echo "| k8s_capi_helm_v1 |"; echo "| k8s_fedora_coreos_v1 |"; fi ;;
+  *)
+    exit 0 ;;
+esac
+exit 0
diff --git a/tests/phase-07-conductor-graft/fakebin/kubectl b/tests/phase-07-conductor-graft/fakebin/kubectl
new file mode 100644
index 0000000..29d0304
--- /dev/null
+++ b/tests/phase-07-conductor-graft/fakebin/kubectl
@@ -0,0 +1,25 @@
+#!/usr/bin/env bash
+# fake kubectl for phase-07-conductor-graft.sh tests.
+# Only `kubectl api-versions` is used (7.3 DOCFIX-063 probe). Emits the served
+# group/versions; steered by env NO_V1BETA1_CORE (drops cluster.x-k8s.io/v1beta1
+# so the 7.3 core-group gate fails).
+if [ "${1:-}" = "api-versions" ]; then
+  cat <<'AV'
+addons.cluster.x-k8s.io/v1beta1
+addons.cluster.x-k8s.io/v1beta2
+bootstrap.cluster.x-k8s.io/v1beta1
+bootstrap.cluster.x-k8s.io/v1beta2
+controlplane.cluster.x-k8s.io/v1beta1
+controlplane.cluster.x-k8s.io/v1beta2
+infrastructure.cluster.x-k8s.io/v1beta1
+infrastructure.cluster.x-k8s.io/v1beta2
+v1
+AV
+  if [ "${NO_V1BETA1_CORE:-0}" = 1 ]; then
+    echo "cluster.x-k8s.io/v1beta2"
+  else
+    echo "cluster.x-k8s.io/v1beta1"
+    echo "cluster.x-k8s.io/v1beta2"
+  fi
+fi
+exit 0
diff --git a/tests/phase-07-conductor-graft/fakebin/openstack b/tests/phase-07-conductor-graft/fakebin/openstack
new file mode 100644
index 0000000..c25c761
--- /dev/null
+++ b/tests/phase-07-conductor-graft/fakebin/openstack
@@ -0,0 +1,32 @@
+#!/usr/bin/env bash
+# fake openstack for phase-07-conductor-graft.sh tests.
+# Logs every call to $OS_LOG (so the harness can assert 7.1 verify-first NEVER
+# calls `security group rule create`). Steered by env:
+# DOMAIN_MISSING USER_MISSING PROJECT_MISSING COE_403
+: "${OS_LOG:=/dev/null}"
+printf 'openstack %s\n' "$*" >> "$OS_LOG"
+
+j=" $* "
+case "$j" in
+  *" domain show magnum "*)
+    [ "${DOMAIN_MISSING:-0}" = 1 ] && exit 1
+    echo "d9d0a4a8215d49f2aeb243b6aea4b0b0" ;;
+  *" user show magnum_domain_admin "*)
+    [ "${USER_MISSING:-0}" = 1 ] && exit 1
+    echo "0885dca38f8043ed85d5e72f14a54124" ;;
+  *" project show "*)
+    [ "${PROJECT_MISSING:-0}" = 1 ] && exit 1
+    echo "d5bc125c7c1841d389b76cd0a7b0a915" ;;
+  *" coe service list "*)
+    if [ "${COE_403:-0}" = 1 ]; then
+      echo "ERROR (Forbidden): Keystone client authentication failed" >&2; exit 1
+    fi
+    echo "| id | host | binary | state |"
+    echo "| 1 | None | magnum-conductor | up |" ;;
+  *" security group rule create "*)
+    # should NEVER run on the happy path (7.1 is verify-first)
+    echo "RULE-CREATE-CALLED" ; exit 0 ;;
+  *)
+    exit 0 ;;
+esac
+exit 0
diff --git a/tests/phase-07-conductor-graft/run-tests.sh b/tests/phase-07-conductor-graft/run-tests.sh
new file mode 100644
index 0000000..c491bbd
--- /dev/null
+++ b/tests/phase-07-conductor-graft/run-tests.sh
@@ -0,0 +1,126 @@
+#!/usr/bin/env bash
+# tests/phase-07-conductor-graft/run-tests.sh -- offline regression for
+# scripts/phase-07-conductor-graft.sh. Fake juju/openstack/kubectl; real
+# base64/sha256sum/bash. Focus: the DOCFIX-063 behaviors (7.1 verify-first is a
+# no-op when 6443 is open and NEVER creates a rule; 7.3 api-versions probe; the
+# sha256 both-sides gate; install -d before the tee) plus the 7.7b v3-URL
+# derivation and the phase gates/preconditions.
+set -euo pipefail
+IFS=$'\n\t'
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SCRIPTS="$(cd "$HERE/../../scripts" && pwd)"
+TARGET="$SCRIPTS/phase-07-conductor-graft.sh"
+BIN="$HERE/fakebin"
+[ -f "$TARGET" ] || { echo "FAIL: $TARGET missing" >&2; exit 1; }
+chmod +x "$BIN"/* 2>/dev/null || true
+WORK="$(mktemp -d)"; trap 'rm -rf "$WORK"' EXIT
+rc_all=0
+FIP=10.12.7.222
+
+# fixtures on the jumphost side
+printf 'export OS_AUTH_URL=https://10.12.8.50:35357/v3\nexport OS_USERNAME=admin\n' > "$WORK/admin-openrc"
+printf 'MGMT_FIP=%s\n' "$FIP" > "$WORK/net.env"
+cat > "$WORK/kubeconfig" < "$WORK/juju.log"; : > "$WORK/os.log"
+  set +e
+  PATH="$BIN:$PATH" HOME="$WORK" \
+    ADMIN_OPENRC="$WORK/admin-openrc" ENVFILE="$WORK/net.env" KUBECONFIG_SRC="$WORK/kubeconfig" \
+    JUJU_LOG="$WORK/juju.log" JUJU_STATE="$state" OS_LOG="$WORK/os.log" \
+    env "$@" bash "$TARGET" >"$WORK/out" 2>&1
+  rc=$?
+  set -e
+  if [ "$rc" -eq "$want" ] && grep -qE "$re" "$WORK/out"; then
+    printf ' [OK] %-52s exit %s\n' "$label" "$rc"
+  else
+    printf ' [XX] %-52s exit %s (want %s; /%s/)\n' "$label" "$rc" "$want" "$re"
+    sed 's/^/ /' "$WORK/out"; rc_all=1
+  fi
+}
+
+echo "=== phase-07-conductor-graft.sh ==="
+run 0 'PHASE-07 COMPLETE' "happy path (7.0-7.8)"
+run 0 'no rule added' "7.1 verify-first: reachable -> no SG mutation"
+run 0 'v1beta1 served for all core' "7.3 api-versions probe passes"
+run 0 'sha256 match' "7.2 kubeconfig sha256 both-sides gate"
+run 0 'k8s_capi_helm_v1 enabled' "7.8 driver enabled gate"
+
+# --- gate failures ---
+run 1 'cannot reach' "7.1 TCP-FAIL -> exit 1 (fallback msg)" TCP_FAIL=1
+run 1 'sha256 mismatch' "7.2 sha mismatch -> exit 1" SHA_MISMATCH=1
+run 1 'does NOT serve v1beta1' "7.3 core group missing v1beta1 -> exit 1" NO_V1BETA1_CORE=1
+run 1 'entry point missing' "7.4 driver entry point absent -> exit 1" DRIVER_MISSING=1
+run 1 'not active after restart' "7.8 service not active -> exit 1" NOTACTIVE=1
+run 1 'lacks --config-dir' "7.8 live cmdline missing --config-dir" PS_NODIR=1
+run 1 'not enabled in magnum-driver' "7.8 driver not enabled -> exit 1" NODRIVER_ENABLED=1
+run 1 'domain-setup action failed' "7.0 domain-setup fails -> exit 1" DOMAIN_SETUP_FAIL=1
+run 1 "domain 'magnum' absent" "7.0 domain missing -> exit 1" DOMAIN_MISSING=1
+
+# --- preconditions ---
+run 2 'not found' "precondition: no ENVFILE -> exit 2" ENVFILE="$WORK/nope.env"
+: > "$WORK/empty.env"
+run 2 'MGMT_FIP unset' "precondition: empty env (MGMT_FIP unset) -> exit 2" ENVFILE="$WORK/empty.env"
+cat > "$WORK/badkc" <<'BK'
+apiVersion: v1
+clusters:
+- cluster:
+    server: https://10.20.0.207:6443
+  name: x
+BK
+run 2 'server is not the FIP' "precondition: kubeconfig server not FIP -> exit 2" KUBECONFIG_SRC="$WORK/badkc"
+
+# --- dedicated happy-path capture for the structural assertions (NOT leftover
+# logs from the last precondition run, which never reaches 7.0+) ---
+echo "=== structural assertions (dedicated happy-path capture) ==="
+hstate="$WORK/hstate"; rm -rf "$hstate"; mkdir -p "$hstate"
+: > "$WORK/hjuju.log"; : > "$WORK/hos.log"
+PATH="$BIN:$PATH" HOME="$WORK" \
+  ADMIN_OPENRC="$WORK/admin-openrc" ENVFILE="$WORK/net.env" KUBECONFIG_SRC="$WORK/kubeconfig" \
+  JUJU_LOG="$WORK/hjuju.log" JUJU_STATE="$hstate" OS_LOG="$WORK/hos.log" \
+  bash "$TARGET" >/dev/null 2>&1 || true
+
+# 7.1 verify-first NEVER calls security group rule create
+if grep -q 'security group rule create' "$WORK/hos.log"; then
+  echo " [XX] 7.1 created an SG rule (must be verify-first no-op)"; rc_all=1
+else
+  echo " [OK] no 'security group rule create' in the openstack call log"
+fi
+
+# install -d /etc/magnum/magnum.conf.d precedes the 00-capi-helm.conf tee
+ln=$(grep -n 'install -d.*magnum.conf.d' "$WORK/hjuju.log" | head -1 | cut -d: -f1 || true)
+lt=$(grep -n 'tee /etc/magnum/magnum.conf.d/00-capi-helm.conf' "$WORK/hjuju.log" | head -1 | cut -d: -f1 || true)
+if [ -n "$ln" ] && [ -n "$lt" ] && [ "$ln" -lt "$lt" ]; then
+  echo " [OK] install -d (line $ln) precedes tee (line $lt)"
+else
+  echo " [XX] dir-create/tee ordering wrong (install-d=$ln tee=$lt)"; rc_all=1
+fi
+
+# --- assertion: 7.7b v3-URL derivation (mirrors the as-run block) ---
+echo "=== assert 7.7b keystone v3-URL derivation ==="
+v3() { # replicate the 7.7b derivation exactly
+  local in="$1" out
+  out=${in/\/v2.0//v3}
+  case "$out" in */v3) ;; *) out="${out%/}/v3";; esac
+  printf '%s' "$out"
+}
+derr=0
+[ "$(v3 https://10.12.4.50:5000)" = https://10.12.4.50:5000/v3 ] || { echo " [XX] unversioned -> /v3 failed"; derr=1; }
+[ "$(v3 https://10.12.4.50:5000/v2.0)" = https://10.12.4.50:5000/v3 ] || { echo " [XX] /v2.0 -> /v3 failed"; derr=1; }
+[ "$(v3 https://10.12.4.50:5000/v3)" = https://10.12.4.50:5000/v3 ] || { echo " [XX] /v3 -> /v3 (idempotent) failed"; derr=1; }
+[ "$(v3 https://10.12.8.50:35357/)" = https://10.12.8.50:35357/v3 ] || { echo " [XX] trailing-slash -> /v3 failed"; derr=1; }
+[ "$derr" -eq 0 ] && echo " [OK] v3 derivation: unversioned, /v2.0, /v3, trailing-slash all -> /v3"
+[ "$derr" -eq 0 ] || rc_all=1
+
+echo
+[ "$rc_all" -eq 0 ] && echo "ALL PASS" || echo "SOME FAILED"
+exit "$rc_all"