diff --git a/docs/v1-redeploy-changelog.md b/docs/v1-redeploy-changelog.md index a86b2ab..3543d10 100644 --- a/docs/v1-redeploy-changelog.md +++ b/docs/v1-redeploy-changelog.md @@ -1087,6 +1087,144 @@ its own `machines:` 8-11, so 8/9/10 are internally consistent. The live 0-3 numbering was a deploy-time artifact; the bundle correctly uses 8-11. +### 2026-06-30 -- Phase-04 executed (network carve + internal-cert SAN gate); DOCFIX-059/060 + +PHASE-04 (network carve) -- PASS: +- Step 4.1 provider-ext + provider-ext-fip created idempotently (phase-04-network-create.sh); EXIT gate + PASS via phase-04-network-verify.sh (external/flat/physnet1/not-shared; subnet 10.12.4.0/22, gw + 10.12.4.1, no-dhcp, FIP pool 10.12.5.0-10.12.7.254). As-built this deploy: provider-ext = a4e1a7fa-..., + provider-ext-fip = f66e5bc5-... (runbook as-built refreshed). + +DOCFIX-059 -- internal-cert SAN gate + the VANTAGE correction (the substantive finding): +- The prompt's phase-04 item "confirm internal certs carry 10.12.12.5x SANs" was NOT implemented by any + committed artifact. Added scripts/phase-04-internal-cert-san-verify.sh (+ tests/phase-04-internal-cert-san/) + and runbook Step 4.2. +- VANTAGE (load-bearing): metal-internal (10.12.12.0/22, VID 103) is an ISOLATED service plane (D-052); the + jumphost is NOT on it. An s_client from the jumphost to 10.12.12.x TIMES OUT / conn-errors, and an + un-hardened check mislabels that as "no IP-SAN" -- a FALSE negative (observed live this session). The gate + must probe FROM a unit ON the plane (keystone/leader) via juju exec. Confirmed live: all 11 internal https + endpoints carry their own 10.12.12.5x IP-SAN (keystone/glance/nova/neutron/cinderv3/placement/barbican/ + octavia/magnum/swift/s3). Internal TLS is correct; the earlier failures were purely vantage. 
+- HARDENING in the gate: (a) every probe is timeout-bounded (an unbounded s_client hangs ~127s on a filtered + VIP -- proven at 6.02s vs 127s), classified TIMEOUT/CONN-ERR distinctly from a real NO-SAN; (b) non-https + endpoints (the plain-HTTP glance-simplestreams image-stream) are SKIPPED (no cert). Test covers PASS / + SKIP-http / NO-SAN / NO-CERT (fake openstack/juju + real jq; run on the jumphost). + +DOCFIX-060 -- phase-04-network-carve.md drift (the script was right; the md lagged): +- Inline Step 4.1 used the hardcoded `maas admin subnet read 1` -- a post-D-052 landmine (subnet ids drift + across cutovers). Corrected to gateway-by-CIDR, matching phase-04-network-create.sh (DOCFIX-047). +- IPAM reference carried the pre-D-052 single "Metal 10.12.8.0/22 = internal/admin VIPs" model. Corrected to + the D-052/053 split: metal-admin 10.12.8 (admin VIPs .8.5x + PXE), metal-internal 10.12.12 (VID 103, + internal VIPs .12.5x + all service east-west). +- Added a CANONICAL EXECUTION note (D-056) pointing to the three phase-04 scripts; refreshed as-built IDs. + +PROCESS lessons (recorded; no code change): +- PASTE SAFETY: a delivered block whose BEGIN/END label lines (which contain parentheses) were left inside + the fenced code region broke on paste -- bash rejected the parenthesis as an unexpected token. Label lines + carrying parens are NOT comments. RULE: put only valid bash inside a fenced block; labels go as # comments + or prose -- and run bash -n on EVERY delivered block first (that was the miss). The runbooks' own + bold-label convention (labels OUTSIDE the fence) is correct and unaffected. +- NETWORK-PROBE TIMEOUT: any s_client/curl/nc in a runbook step must be timeout-bounded with an explicit + timeout branch -- an unbounded probe is not acceptable in a deterministic gate. 
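A minimal sketch of the timeout-bounded, distinctly classified probe described in the HARDENING and NETWORK-PROBE TIMEOUT notes above. It is illustrative only and is not the committed `scripts/phase-04-internal-cert-san-verify.sh`; the function names `classify_probe` and `probe_san` are hypothetical. It assumes GNU `timeout` (which exits 124 on expiry) and an `openssl` binary on the probing host; per the VANTAGE note, the probe would have to run from a unit on the metal-internal plane, not from the jumphost.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (assumptions, not the committed gate script).

# classify_probe RC CERT_TEXT WANT_IP
#   RC        exit status of the timeout-wrapped s_client call
#   CERT_TEXT decoded certificate text (may be empty)
#   WANT_IP   the IP that must appear as an exact IP-SAN
classify_probe() {
  local rc="$1" cert_text="$2" want_ip="$3"
  # GNU timeout exits 124 when the time limit expires: a filtered VIP, not a missing SAN.
  if [ "$rc" -eq 124 ]; then echo TIMEOUT; return 0; fi
  # Any other non-zero status is a connection/handshake error.
  if [ "$rc" -ne 0 ]; then echo CONN-ERR; return 0; fi
  # Connected but no certificate text was decoded.
  if [ -z "$cert_text" ]; then echo NO-CERT; return 0; fi
  # Extract every IP-SAN and require an exact whole-line match, so that
  # 10.12.12.5 does not match 10.12.12.50.
  if printf '%s\n' "$cert_text" \
      | grep -oE 'IP Address:[0-9A-Fa-f.:]+' \
      | sed 's/^IP Address://' \
      | grep -qxF -- "$want_ip"; then
    echo PASS
  else
    echo NO-SAN
  fi
}

# probe_san HOST PORT [TIMEOUT_SECONDS] -- bounded probe, then classify.
probe_san() {
  local host="$1" port="$2" t="${3:-6}" pem="" text="" rc=0
  pem=$(timeout "$t" openssl s_client -connect "$host:$port" </dev/null 2>/dev/null) || rc=$?
  if [ "$rc" -eq 0 ]; then
    text=$(printf '%s\n' "$pem" | openssl x509 -noout -text 2>/dev/null || true)
  fi
  classify_probe "$rc" "$text" "$host"
}
```

Keeping classification separate from the network call lets the TIMEOUT / CONN-ERR / NO-SAN branches be exercised offline with canned inputs, which is the property the false-negative finding above depends on.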
+ +PHASE-05 (octavia) -- IN PROGRESS at handoff: config gate clear (retrofit use-internal-endpoints=true, +image-format=raw, amp-image-tag=octavia-amphora on both sides); octavia blocked, charm-octavia resources +0/0/0; PRE gate PROCEED. Step 5.1 configure-resources running (--wait=20m; do NOT re-fire on wait-timeout). + +### 2026-06-30 -- Phase-05 executed (Octavia enablement, D-021) -- PASS; DOCFIX-061 + +PHASE-05 (octavia) -- PASS (scripts ran clean; no script defects): +- 5.1 configure-resources (op 35/task 36, --wait=20m) cleared octavia's blocked -> active; lb-mgmt-net / + lb-mgmt-subnetv6 / lb-mgmt-sec-grp created; o-hm0 UP with an fc00:: ULA (state=UNKNOWN is normal for an + OVS internal port). Benign in-progress noise confirmed harmless: `ovs-vsctl: no row o-hm0` (queried + before the action creates the port) and the systemd-networkd stop/socket warning. +- 5.2 amphora pipeline (phase-05-amphora-pipeline.sh): config gate clear; base seeded via STAGE-AND-VERIFY + (sha256 070de108...); retrofit op 39/task 40 built amphora-haproxy-x86_64-ubuntu-22.04-20260701 + (807e3f5b-...) ACTIVE, tag octavia-amphora, image-format raw. phase-05-octavia-verify.sh -> PASS. + +DOCFIX-061 -- phase-05 as-built reconciliation (runbook drift; no script change): +- Retrofit's internal glance target corrected .8.53 -> 10.12.12.53: under D-052 the INTERNAL glance VIP is + on metal-internal (.12.53, confirmed live this session); the doc's ".8.53" predates the metal-admin/ + metal-internal split and is now the ADMIN VIP. +- Seed-method note corrected: this rebuild used STAGE-AND-VERIFY (the canonical Step 5.2 script), not the + 06-16 web-download expedient. Object IDs / op numbers refreshed to 2026-06-30. noble is seeded in + phase-06 6.0-BOOT this rebuild (not pre-staged in phase-05). o-hm0 ULA not captured this run (regenerates). 
+ +DISCIPLINE (operator-directed 2026-06-30): reconcile scripts + commands + this changelog at the SUCCESSFUL +completion of EACH phase, before starting the next. Deliver the per-phase reconciliation as a repo-relative ZIP. + +### 2026-07-01 -- Phase-06 executed + swept (in-cloud CAPI mgmt cluster, D-035) -- PASS; DOCFIX-062 + +PHASE-06 progress this session (all steps executed clean; full sweep completed at phase-06 completion per the +per-phase discipline above): +- 6.0-BOOT (phase-06-bootstrap.sh) -- domain capi / project capi-mgmt / roles member+load-balancer_member+ + reader on admin@admin_domain (D-039) / 5 flavors / image ubuntu-24.04-noble active raw public. PASS. +- 6.0+6.1 (phase-06-net-setup.sh) -- keypair capi-mgmt-key, SG capi-mgmt-sg (22+6443), net capi-mgmt-net, + subnet capi-mgmt-subnet 10.20.0.0/24, router capi-mgmt-router ACTIVE gw set. PASS. +- 6.2 (phase-06-mgmt-vm.sh) -- VM capi-mgmt-v2 ACTIVE (gp.large/noble). FIP 10.12.7.222, TENANT 10.20.0.207 + -> ~/capi-mgmt-net.env (per-rebuild, non-deterministic; DOCFIX-038). PASS. +- 6.3 GATE 1 (inline ssh egress probe) -- VIP-OK (Keystone VIP 10.12.4.50:5000) + NET-OK (1.1.1.1:443). PASS. +- 6.4 k8s bootstrap (inline ssh nested-heredoc) -- k8s v1.32.13 (1.32-classic/stable); bootstrap-config.yaml + with cluster-config block (DOCFIX-024) + extra-sans 10.12.7.222 / 10.20.0.207; cluster ready, 1 voter node, + network+dns enabled. PASS. +- 6.5 GATE 2 THE D-035 GATE (inline) -- agnhost pod-egress probe -> Keystone VIP exitCode:0 (Succeeded). + Single-NIC pod egress proven (the exact test the dual-homed D-033 node failed). PASS. 
+- 6.6a-6.6d (inline ssh-to-VM) -- CAPI provider stack on the mgmt VM: tooling pinned from + capi-helm-charts@0.25.1 dependencies.json (D-034: CAPI v1.13.2 / CAPO v0.14.4 / CERT v1.20.2 / ORC v2.5.0 / + CAAPH 0.12.0 / JANITOR 0.11.0 / HELM v3.17.3); cert-manager v1.20.2 (crds.enabled=true, DOCFIX-025a); + ORC v2.5.0 server-side apply (images.openstack.k-orc.cloud CRD present) BEFORE clusterctl init; + clusterctl init core+kubeadm+CAPO all condition met (capo-system Available first pass -- ORC-first order + correct). PASS through 6.6d. +- 6.6e (inline) -- CAAPH (cluster-api-addon-provider 0.12.0) + janitor (cluster-api-janitor-openstack 0.11.0) + via azimuth helm charts; both Running (addon-provider took one benign first-boot restart while cert-manager + minted its webhook cert, then stable 1/1). PASS. +- 6.6f (inline) -- verify: clusterctl v1.13.2; all controllers 1/1 Running (cert-manager x3, capi core/ + bootstrap/control-plane, capo, orc, addon, janitor); all 4 key CRDs present (clusters / openstackclusters / + kubeadmcontrolplanes / images.openstack.k-orc.cloud). Phase-06 EXIT GATE green. PASS. + +SWEEP DONE (phase-06 reconciliation; delivered as a repo-relative ZIP, committed at the sweep): + +DOCFIX-062 -- 6.5 kubeconfig-pull defect (ASSIGNED this sweep; grep-before-assign confirmed next-free): +- `sudo k8s config server=` does NOT override the emitted apiserver URL on k8s-snap 1.32.13; it writes + the node tenant IP (10.20.0.207:6443), unroutable from the jumphost, so `kubectl get nodes` i/o-timed-out. +- FIX (applied live, now baked into the runbook + script): pull the RAW admin config (`sudo k8s config + reverted to the as-run literal 10.12.4.50:5000 (still KEYSTONE_HOSTPORT-overridable per site, +matching the runbook's ENV(keystone-vip) convention); (2) an untested ready-skip idempotency guard in 6.4 -> +removed (the as-run block ran the bootstrap unconditionally; retry = purge-and-re-run). 
DOCFIX-062 (6.5 +kubeconfig server-rewrite) is KEPT -- it was applied live and confirmed. The unused fake `openstack` test stubs +were dropped and a no-dynamic-discovery fidelity assertion added to both affected suites. capi-stack.sh was +already faithful (pins from dependencies.json; no discovery). All three suites re-pass. + ### Next-free numbers -Design decision: D-063. Doc fix: DOCFIX-059. (D-061 teardown, D-062 mysql; DOCFIX-057 old-teardown -deprecation, DOCFIX-058 phase-03 3.3 HTTP-upstream both recorded above.) +Design decision: D-063. Doc fix: DOCFIX-063. (DOCFIX-062 phase-06 kubeconfig-server-rewrite ASSIGNED above; +DOCFIX-061 phase-05 as-built, DOCFIX-060 phase-04 md drift, DOCFIX-059 internal-cert SAN gate recorded +earlier; D-061 teardown, D-062 mysql. D-063 still unused -- the phase-06 sweep produced doc/script fixes, +no new design decision.) diff --git a/runbooks/appendix-A-troubleshooting.md b/runbooks/appendix-A-troubleshooting.md index 9d43136..9e8fb9e 100644 --- a/runbooks/appendix-A-troubleshooting.md +++ b/runbooks/appendix-A-troubleshooting.md @@ -449,6 +449,22 @@ (FINDING-3), so it is unusable for kube images; (3) for ubuntu cloud-images it works on the hardened bundle (the 2026-06-08 403 was transient/pre-hardening). Use only as an expedient. +## Mgmt-cluster bootstrap (phase-06) + +### DOCFIX-062 -- `k8s config server=` is ignored on k8s-snap 1.32.13; rewrite the kubeconfig (phase-06) +- Symptom: `sudo k8s config server=https://:6443` still emits a kubeconfig whose + `server:` is the node's TENANT IP (10.20.0.x). From the jumphost (off the tenant plane) + every `kubectl` call then i/o-times-out -- Step 6.5 GATE 2 never runs. +- Cause: on this k8s-snap rev the `server=` key-value arg to `k8s config` is not honored; + the emitted apiserver URL is always the node's own address. +- Fix: pull the RAW config (`sudo k8s config :6443`. The FIP is in the + cert extra-sans written at bootstrap (6.4), so TLS validates against it. 
Gate the rewrite: + `grep -E '^\s*server:' ~/capi-mgmt.kubeconfig` must show the FIP, not the tenant IP. + Encapsulated in `scripts/phase-06-kubeconfig-gate.sh` (verifies the rewrite took before + proceeding to the egress gate). + ================================================================================ ## Notes ================================================================================ diff --git a/runbooks/phase-06-incloud-mgmt-cluster.md b/runbooks/phase-06-incloud-mgmt-cluster.md index 9563ad7..21b507d 100644 --- a/runbooks/phase-06-incloud-mgmt-cluster.md +++ b/runbooks/phase-06-incloud-mgmt-cluster.md @@ -8,7 +8,15 @@ Decisions: D-035 (in-cloud single-homed tenant VM; retires D-033/D-017), D-034 (CAPI versions sourced from the capi-helm-charts tag's dependencies.json, never hardcoded), D-031 (Magnum + magnum-capi-helm + capi-helm-charts engine). -Troubleshooting: appendix-A entries DOCFIX-021, DOCFIX-024, DOCFIX-025a, D-035. +Troubleshooting: appendix-A entries DOCFIX-021, DOCFIX-024, DOCFIX-025a, DOCFIX-062, D-035. + +Canonical scripts (D-056; the paste blocks below are the reference source-of-truth, +the scripts are the rehearsed executors -- prefer the script on rebuild): +- Steps 6.3 + 6.4 -> `scripts/phase-06-k8s-bootstrap.sh` (GATE 1 egress + k8s bootstrap) +- Step 6.5 -> `scripts/phase-06-kubeconfig-gate.sh` (kubeconfig pull+rewrite + GATE 2; DOCFIX-062) +- Step 6.6 (a-f) -> `scripts/phase-06-capi-stack.sh` (CAPI provider stack, ORC-before-init) +Each sources `~/capi-mgmt-net.env`. The 6.3-6.5 scripts use the as-run Keystone literal (KEYSTONE_HOSTPORT-overridable, +not discovered); 6.4 is one-shot (no idempotency guard). Steps 6.0-BOOT..6.2 already have their own scripts. --- @@ -345,12 +353,22 @@ **RUN -- jumphost -> mgmt VM** ```bash -# RUN: jumphost (ssh to the mgmt VM; the kubeconfig lands on the jumphost). server = the FIP, not tenant IP +# RUN: jumphost. 
DOCFIX-062: k8s-snap 1.32.13 IGNORES `k8s config server=` and +# still writes the node's TENANT IP (10.20.0.x, unroutable from the jumphost) -> +# kubectl i/o-times-out. Pull the RAW config, then rewrite the server to the FIP +# with `kubectl config set-cluster` (a local file op). The FIP is in the cert +# extra-sans written by 6.4, so TLS holds against it. source ~/capi-mgmt-net.env # MGMT_FIP +umask 077 ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \ -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@"$MGMT_FIP" \ - "sudo k8s config server=https://$MGMT_FIP:6443 ~/capi-mgmt.kubeconfig + 'sudo k8s config ~/capi-mgmt.kubeconfig +chmod 600 ~/capi-mgmt.kubeconfig # [SENSITIVE] ~/capi-mgmt.kubeconfig contains a cluster-admin credential. +export KUBECONFIG="$HOME/capi-mgmt.kubeconfig" +CLUSTER=$(kubectl config view -o jsonpath='{.clusters[0].name}') +kubectl config set-cluster "$CLUSTER" --server="https://$MGMT_FIP:6443" +grep -E '^[[:space:]]*server:' ~/capi-mgmt.kubeconfig # expect https://:6443, NOT the tenant IP wc -l ~/capi-mgmt.kubeconfig ; head -1 ~/capi-mgmt.kubeconfig # expect >0 lines, "apiVersion: v1" ``` diff --git a/scripts/phase-06-capi-stack.sh b/scripts/phase-06-capi-stack.sh new file mode 100644 index 0000000..e5f8e6e --- /dev/null +++ b/scripts/phase-06-capi-stack.sh @@ -0,0 +1,169 @@ +#!/usr/bin/env bash +# scripts/phase-06-capi-stack.sh +# +# Phase-06 Step 6.6 (a-f) encapsulated (D-056). Runs on the jumphost; installs the +# CAPI provider stack ON the mgmt VM (all helm/clusterctl/kubectl run VM-side +# against the local apiserver -- matched 1.32.13 kubectl, no jumphost skew). +# +# HARDENED ORDER (D-034 install-ordering): pins -> cert-manager -> ORC -> +# clusterctl init -> CAAPH -> janitor -> verify. ORC precedes `clusterctl init` +# because CAPO's openstackserver controller hard-depends on ORC's +# Image.openstack.k-orc.cloud CRD; installing CAPO first crash-loops until ORC lands. 
+# +# Versions are READ from the chart tag's dependencies.json at runtime (D-034; +# NEVER hardcoded). The as-built cross-check (CAPI v1.13.2 / CAPO v0.14.4 / +# CERT v1.20.2 / ORC v2.5.0 / CAAPH 0.12.0 / JANITOR 0.11.0 / HELM v3.17.3) is +# informational only. KUBECTL_VERSION tracks the cluster's k8s (the CHANNEL in +# phase-06-k8s-bootstrap.sh); keep them in step. +# +# Each sub-step is gated on the remote block's own exit status (its `--wait` / +# `wait` / `get crd` fail the remote, ssh propagates non-zero, we stop). DOCFIX-021: +# not needed here (no interactive `sudo`; blocks are non-interactive helm/kubectl). +# +# Tunables via env: ENVFILE SSH_KEY CHART_TAG KUBECTL_VERSION +# Requires: jumphost; ssh + the VM key. (jq/curl are installed VM-side by 6.6a.) +# Usage: bash scripts/phase-06-capi-stack.sh +# Exit: 0 stack up + verified | 1 a sub-step gate failed | 2 precondition +# ASCII + LF. + +set -euo pipefail +shopt -s inherit_errexit 2>/dev/null || true + +ENVFILE="${ENVFILE:-$HOME/capi-mgmt-net.env}" +SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519}" +CHART_TAG="${CHART_TAG:-0.25.1}" +KUBECTL_VERSION="${KUBECTL_VERSION:-v1.32.13}" + +command -v ssh >/dev/null 2>&1 || { echo "FAIL: ssh not found" >&2; exit 2; } +[ -f "$ENVFILE" ] || { echo "FAIL: $ENVFILE not found (run phase-06-mgmt-vm.sh first)" >&2; exit 2; } +# shellcheck disable=SC1090 +. "$ENVFILE" +[ -n "${MGMT_FIP:-}" ] || { echo "FAIL: MGMT_FIP unset in $ENVFILE" >&2; exit 2; } +[ -f "$SSH_KEY" ] || { echo "FAIL: ssh key $SSH_KEY not found" >&2; exit 2; } + +MGMT_VM="$MGMT_FIP" +SSH_OPTS=(-i "$SSH_KEY" -o BatchMode=yes -o StrictHostKeyChecking=no \ + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10) + +# run_step LABEL -- reads the remote block from stdin, tees indented output, +# gates on the REMOTE block's exit status (PIPESTATUS[0]); positional args after +# the label are passed to the remote `bash -s`. 
+run_step() { + local label="$1"; shift + echo "=== $label ===" + local rc=0 # captured via || so errexit+pipefail cannot exit before the gate message + ssh "${SSH_OPTS[@]}" ubuntu@"$MGMT_VM" bash -s "$@" 2>&1 | sed 's/^/ /' || rc=${PIPESTATUS[0]} + [ "$rc" -eq 0 ] || { echo "GATE FAIL: $label (remote rc=$rc)" >&2; exit 1; } + echo "[OK] $label" +} + +# --- 6.6a: tooling + pins (read dependencies.json @ CHART_TAG) --- +run_step "6.6a tooling + pins (chart $CHART_TAG, kubectl $KUBECTL_VERSION)" "$CHART_TAG" "$KUBECTL_VERSION" <<'REOF' +set -euo pipefail +TAG="$1"; KVER="$2" +sudo apt-get update -qq "$HOME/.kube/config"; chmod 600 "$HOME/.kube/config" + +# egress pre-check (informational; a 404 at a host root still proves reachability) +for h in https://raw.githubusercontent.com https://get.helm.sh https://github.com https://dl.k8s.io; do + printf '%s -> ' "$h"; curl -s -o /dev/null -w '%{http_code}\n' "$h" || echo FAIL +done + +# version constellation from the chart tag's dependencies.json (D-034) +curl -fsSL "https://raw.githubusercontent.com/azimuth-cloud/capi-helm-charts/${TAG}/dependencies.json" -o "$HOME/deps.json" +CAPI=$(jq -r '."cluster-api"' "$HOME/deps.json") +CAPO=$(jq -r '."cluster-api-provider-openstack"' "$HOME/deps.json") +CERT=$(jq -r '."cert-manager"' "$HOME/deps.json") +ORC=$(jq -r '."openstack-resource-controller"' "$HOME/deps.json") +CAAPH=$(jq -r '."addon-provider"' "$HOME/deps.json") +JANITOR=$(jq -r '."cluster-api-janitor-openstack"' "$HOME/deps.json") +HELM=$(jq -r '.helm' "$HOME/deps.json") +{ echo "CAPI=$CAPI"; echo "CAPO=$CAPO"; echo "CERT=$CERT"; echo "ORC=$ORC"; \ + echo "CAAPH=$CAAPH"; echo "JANITOR=$JANITOR"; echo "HELM=$HELM"; } > "$HOME/capi-pins.env" +echo "== pins (cross-check: CAPI v1.13.2 CAPO v0.14.4 CERT v1.20.2 ORC v2.5.0 CAAPH 0.12.0 JANITOR 0.11.0 HELM v3.17.3) ==" +cat "$HOME/capi-pins.env" +# gate: every pin resolved (non-empty, non-null) -- a moved/renamed key must fail loud +for k in CAPI CAPO CERT ORC CAAPH JANITOR HELM; do v="${!k}"; [ -n "$v" ] && [ "$v" != null ] || { echo "PIN-FAIL: $k=$v" >&2; 
exit 1; }; done + +curl -fsSL "https://get.helm.sh/helm-${HELM}-linux-amd64.tar.gz" -o /tmp/helm.tgz +sudo tar -xzf /tmp/helm.tgz -C /usr/local/bin --strip-components=1 linux-amd64/helm /dev/null | head -1 +REOF + +# --- 6.6b: cert-manager (DOCFIX-025a: crds.enabled=true) --- +run_step "6.6b cert-manager" <<'REOF' +set -euo pipefail +source "$HOME/capi-pins.env" +helm repo add jetstack https://charts.jetstack.io +helm repo update +helm upgrade --install cert-manager jetstack/cert-manager \ + --namespace cert-manager --create-namespace \ + --version "$CERT" --set crds.enabled=true --wait --timeout 5m +kubectl -n cert-manager wait --for=condition=Available deploy --all --timeout=180s +kubectl -n cert-manager get pods +REOF + +# --- 6.6c: ORC (BEFORE clusterctl init) --- +run_step "6.6c ORC (before clusterctl init)" <<'REOF' +set -euo pipefail +source "$HOME/capi-pins.env" +kubectl apply --server-side -f \ + "https://github.com/k-orc/openstack-resource-controller/releases/download/${ORC}/install.yaml" +kubectl -n orc-system wait --for=condition=Available deploy --all --timeout=180s +kubectl get crd images.openstack.k-orc.cloud +REOF + +# --- 6.6d: clusterctl init --- +run_step "6.6d clusterctl init" <<'REOF' +set -euo pipefail +source "$HOME/capi-pins.env" +clusterctl init \ + --core "cluster-api:${CAPI}" \ + --bootstrap "kubeadm:${CAPI}" \ + --control-plane "kubeadm:${CAPI}" \ + --infrastructure "openstack:${CAPO}" +for ns in capi-system capi-kubeadm-bootstrap-system capi-kubeadm-control-plane-system capo-system; do + echo "== $ns =="; kubectl -n "$ns" wait --for=condition=Available deploy --all --timeout=240s +done +REOF + +# --- 6.6e: CAAPH + janitor --- +run_step "6.6e CAAPH + janitor" <<'REOF' +set -euo pipefail +source "$HOME/capi-pins.env" +helm repo add capi-addon https://azimuth-cloud.github.io/cluster-api-addon-provider +helm repo add capi-janitor https://azimuth-cloud.github.io/cluster-api-janitor-openstack +helm repo update +helm upgrade --install 
cluster-api-addon-provider capi-addon/cluster-api-addon-provider \ + --namespace capi-addon-system --create-namespace --version "$CAAPH" --wait --timeout 5m +helm upgrade --install cluster-api-janitor-openstack capi-janitor/cluster-api-janitor-openstack \ + --namespace capi-janitor-system --create-namespace --version "$JANITOR" --wait --timeout 5m +kubectl -n capi-addon-system get pods +kubectl -n capi-janitor-system get pods +REOF + +# --- 6.6f: verify the stack (EXIT GATE) --- +run_step "6.6f verify stack (all controllers Running + key CRDs)" <<'REOF' +set -euo pipefail +clusterctl version +echo "== controllers ==" +kubectl get pods -A | grep -E 'capi-|capo-|cert-manager|orc-system|janitor|addon' || true +notready=$(kubectl get pods -A --no-headers 2>/dev/null \ + | grep -E 'capi-|capo-|cert-manager|orc-system|janitor|addon' \ + | awk '$4!="Running"{print $1"/"$2" "$4}') +if [ -n "$notready" ]; then echo "NOT-RUNNING:"; echo "$notready"; exit 1; fi +echo "== key CRDs ==" +kubectl get crd clusters.cluster.x-k8s.io \ + openstackclusters.infrastructure.cluster.x-k8s.io \ + kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \ + images.openstack.k-orc.cloud +echo "STACK: OK" +REOF + +echo "Summary: CAPI provider stack installed + verified on the mgmt VM (chart $CHART_TAG pins; ORC-before-init order). Phase-06 complete." diff --git a/scripts/phase-06-k8s-bootstrap.sh b/scripts/phase-06-k8s-bootstrap.sh new file mode 100644 index 0000000..2c0af71 --- /dev/null +++ b/scripts/phase-06-k8s-bootstrap.sh @@ -0,0 +1,119 @@ +#!/usr/bin/env bash +# scripts/phase-06-k8s-bootstrap.sh +# +# Phase-06 Steps 6.3 + 6.4 encapsulated (D-056). Runs on the jumphost; drives the +# in-cloud CAPI management VM over ssh. +# 6.3 GATE 1 -- prove the single-homed VM's egress: it can reach the OpenStack +# public API (the D-035 premise) and the internet (image pulls). 
The API +# target is the Keystone PUBLIC endpoint -- the as-run literal 10.12.4.50:5000 +# (6.3 tagged ENV(keystone-vip)); env-overridable per site via KEYSTONE_HOSTPORT. +# 6.4 -- install k8s-snap on the VM and bootstrap it. The bootstrap config MUST +# carry a cluster-config block (DOCFIX-024 -- without it network+dns are +# disabled and the node never goes Ready). extra-sans MUST be the real +# FIP + tenant IP (from ~/capi-mgmt-net.env, per-rebuild, DOCFIX-038). +# +# One-shot -- matches the as-run 6.4 block verbatim (NO idempotency guard): install + +# bootstrap run unconditionally. Re-run is not safe; purge on the VM first (retry hint +# below), exactly the runbook's documented retry path. +# DOCFIX-021: every remote `sudo` gets /dev/null || true + +ENVFILE="${ENVFILE:-$HOME/capi-mgmt-net.env}" +SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519}" +CHANNEL="${CHANNEL:-1.32-classic/stable}" +POD_CIDR="${POD_CIDR:-10.1.0.0/16}" +SVC_CIDR="${SVC_CIDR:-10.152.183.0/24}" +CLUSTER_NAME="${CLUSTER_NAME:-capi-mgmt-v2}" +INET_PROBE="${INET_PROBE:-1.1.1.1:443}" +PROBE_TIMEOUT="${PROBE_TIMEOUT:-6}" +BOOT_TIMEOUT="${BOOT_TIMEOUT:-10m}" +READY_TIMEOUT="${READY_TIMEOUT:-5m}" + +command -v ssh >/dev/null 2>&1 || { echo "FAIL: ssh not found" >&2; exit 2; } +[ -f "$ENVFILE" ] || { echo "FAIL: $ENVFILE not found (run phase-06-mgmt-vm.sh first)" >&2; exit 2; } +# shellcheck disable=SC1090 +. "$ENVFILE" +[ -n "${MGMT_FIP:-}" ] || { echo "FAIL: MGMT_FIP unset in $ENVFILE" >&2; exit 2; } +[ -n "${MGMT_TENANT_IP:-}" ] || { echo "FAIL: MGMT_TENANT_IP unset in $ENVFILE" >&2; exit 2; } +[ -f "$SSH_KEY" ] || { echo "FAIL: ssh key $SSH_KEY not found" >&2; exit 2; } + +MGMT_VM="$MGMT_FIP" +SSH_OPTS=(-i "$SSH_KEY" -o BatchMode=yes -o StrictHostKeyChecking=no \ + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10) + +# --- Keystone public host:port -- the as-run literal (6.3 tagged ENV(keystone-vip)); +# env-overridable per site. NOT discovered -- this is the value that ran verbatim. 
--- +KEYSTONE_HOSTPORT="${KEYSTONE_HOSTPORT:-10.12.4.50:5000}" +echo "[OK] Keystone public endpoint: $KEYSTONE_HOSTPORT" +KHOST="${KEYSTONE_HOSTPORT%%:*}"; KPORT="${KEYSTONE_HOSTPORT##*:}" +IHOST="${INET_PROBE%%:*}"; IPORT="${INET_PROBE##*:}" +if [ -z "$KHOST" ] || [ -z "$KPORT" ] || [ "$KHOST" = "$KPORT" ]; then + echo "FAIL: bad KEYSTONE_HOSTPORT '$KEYSTONE_HOSTPORT' (want host:port)" >&2; exit 2 +fi + +# --- 6.3 GATE 1: VM egress (API VIP + internet) --- +echo "=== 6.3 GATE 1: VM -> Keystone $KHOST:$KPORT + internet $IHOST:$IPORT ===" +g1=$(ssh "${SSH_OPTS[@]}" ubuntu@"$MGMT_VM" \ + bash -s "$KHOST" "$KPORT" "$IHOST" "$IPORT" "$PROBE_TIMEOUT" <<'REOF' 2>&1 || true +set -u +khost="$1"; kport="$2"; ihost="$3"; iport="$4"; t="$5"; ok=1 +if timeout "$t" bash -c "exec 3<>/dev/tcp/$khost/$kport" 2>/dev/null; then echo "VIP-OK $khost:$kport"; else echo "VIP-FAIL $khost:$kport"; ok=0; fi +if timeout "$t" bash -c "exec 3<>/dev/tcp/$ihost/$iport" 2>/dev/null; then echo "NET-OK $ihost:$iport"; else echo "NET-FAIL $ihost:$iport"; ok=0; fi +[ "$ok" = 1 ] && echo "GATE1: PASS" || echo "GATE1: FAIL" +REOF +) +printf '%s\n' "$g1" | sed 's/^/ /' +printf '%s\n' "$g1" | grep -q 'GATE1: PASS' || { echo "GATE FAIL: VM egress probe did not pass (see above)" >&2; exit 1; } +echo "[OK] GATE 1 passed -- single-NIC VM egress to the OpenStack public API works (D-035 premise)" + +# --- 6.4 k8s-snap install + bootstrap --- +echo "=== 6.4 k8s-snap install + bootstrap ($CHANNEL) ===" +b=$(ssh "${SSH_OPTS[@]}" ubuntu@"$MGMT_VM" \ + bash -s "$MGMT_FIP" "$MGMT_TENANT_IP" "$CHANNEL" "$POD_CIDR" "$SVC_CIDR" "$CLUSTER_NAME" "$BOOT_TIMEOUT" "$READY_TIMEOUT" <<'REOF' 2>&1 || true +set -euo pipefail +FIP="$1"; TENANT="$2"; CH="$3"; POD="$4"; SVC="$5"; NAME="$6"; BT="$7"; RT="$8" + +echo "=== install k8s snap $CH ===" +sudo snap install k8s --classic --channel="$CH" /dev/null <&2 + echo " Retry on the VM: sudo snap remove k8s --purge &2 + exit 1; } + +echo "Summary: GATE 1 PASS; k8s ($CHANNEL) 
bootstrapped and ready on $CLUSTER_NAME (FIP $MGMT_FIP / tenant $MGMT_TENANT_IP)." diff --git a/scripts/phase-06-kubeconfig-gate.sh b/scripts/phase-06-kubeconfig-gate.sh new file mode 100644 index 0000000..64ac72d --- /dev/null +++ b/scripts/phase-06-kubeconfig-gate.sh @@ -0,0 +1,116 @@ +#!/usr/bin/env bash +# scripts/phase-06-kubeconfig-gate.sh +# +# Phase-06 Step 6.5 encapsulated (D-056) with the DOCFIX-062 fix baked in. +# Runs on the jumphost. +# 1. Pull the mgmt cluster's admin kubeconfig to the jumphost. +# 2. DOCFIX-062: k8s-snap 1.32.13's `k8s config server=` does NOT override +# the emitted apiserver URL -- it writes the node's TENANT IP (unroutable from +# the jumphost), so kubectl i/o-times-out. Fix: pull the RAW config, then +# rewrite the server field to the FIP with `kubectl config set-cluster +# --server` (a local file op; the cluster name is read dynamically). The FIP +# is in the cert extra-sans (written by 6.4), so TLS holds against it. +# 3. Node check + GATE 2: the agnhost pod-egress probe to the Keystone PUBLIC +# endpoint -- the exact test the dual-homed D-033 node FAILED; on this +# single-NIC VM it must Complete with exitCode 0. Keystone host:port is the +# as-run literal 10.12.4.50:5000 (6.5 tagged it verbatim); env-overridable +# per site via KEYSTONE_HOSTPORT. +# +# [SENSITIVE] the kubeconfig it writes ($KUBECONFIG_OUT) holds a cluster-admin +# credential; it is created with mode 600 and kept on the jumphost. +# The throwaway probe pod is always cleaned up (even on gate failure). +# +# Tunables via env: ENVFILE SSH_KEY KUBECONFIG_OUT API_PORT KEYSTONE_HOSTPORT +# AGNHOST_IMAGE PROBE_TRIES PROBE_SLEEP +# Requires: jumphost; ssh + the VM key; kubectl; ~/capi-mgmt-net.env (from +# phase-06-mgmt-vm.sh). All tunables DEFAULT to the as-run values. +# Usage: bash scripts/phase-06-kubeconfig-gate.sh +# Exit: 0 GATE 2 pass (kubeconfig usable + pod egress works) | 1 gate fail | 2 precondition +# ASCII + LF. 
+ +set -euo pipefail +shopt -s inherit_errexit 2>/dev/null || true + +ENVFILE="${ENVFILE:-$HOME/capi-mgmt-net.env}" +SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519}" +KUBECONFIG_OUT="${KUBECONFIG_OUT:-$HOME/capi-mgmt.kubeconfig}" +API_PORT="${API_PORT:-6443}" +AGNHOST_IMAGE="${AGNHOST_IMAGE:-registry.k8s.io/e2e-test-images/agnhost:2.40}" +PROBE_TRIES="${PROBE_TRIES:-20}" +PROBE_SLEEP="${PROBE_SLEEP:-10}" + +for c in ssh kubectl; do command -v "$c" >/dev/null 2>&1 || { echo "FAIL: $c not found" >&2; exit 2; }; done +[ -f "$ENVFILE" ] || { echo "FAIL: $ENVFILE not found (run phase-06-mgmt-vm.sh first)" >&2; exit 2; } +# shellcheck disable=SC1090 +. "$ENVFILE" +[ -n "${MGMT_FIP:-}" ] || { echo "FAIL: MGMT_FIP unset in $ENVFILE" >&2; exit 2; } +[ -f "$SSH_KEY" ] || { echo "FAIL: ssh key $SSH_KEY not found" >&2; exit 2; } + +MGMT_VM="$MGMT_FIP" +SSH_OPTS=(-i "$SSH_KEY" -o BatchMode=yes -o StrictHostKeyChecking=no \ + -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10) + +# --- Keystone public host:port -- the as-run literal (6.5 tagged it verbatim); +# env-overridable per site. NOT discovered. --- +KEYSTONE_HOSTPORT="${KEYSTONE_HOSTPORT:-10.12.4.50:5000}" +echo "[OK] Keystone public endpoint: $KEYSTONE_HOSTPORT" + +# --- 1. pull the RAW admin kubeconfig (no server= arg; we rewrite locally) --- +echo "=== pull kubeconfig -> $KUBECONFIG_OUT ===" +umask 077 +if ! ssh "${SSH_OPTS[@]}" ubuntu@"$MGMT_VM" 'sudo k8s config "$KUBECONFIG_OUT" 2>/dev/null; then + echo "GATE FAIL: could not pull kubeconfig from the mgmt VM" >&2; exit 1 +fi +chmod 600 "$KUBECONFIG_OUT" +[ -s "$KUBECONFIG_OUT" ] || { echo "GATE FAIL: $KUBECONFIG_OUT is empty" >&2; exit 1; } +head -1 "$KUBECONFIG_OUT" | grep -q 'apiVersion: v1' || { echo "GATE FAIL: $KUBECONFIG_OUT does not look like a kubeconfig" >&2; exit 1; } +echo "[OK] kubeconfig pulled ($(wc -l < "$KUBECONFIG_OUT") lines)" + +# --- 2. 
DOCFIX-062: rewrite the server field to the FIP (routable; cert carries the FIP SAN) --- +export KUBECONFIG="$KUBECONFIG_OUT" +CLUSTER=$(kubectl config view -o jsonpath='{.clusters[0].name}' 2>/dev/null) +[ -n "$CLUSTER" ] || { echo "GATE FAIL: no cluster entry in $KUBECONFIG_OUT" >&2; exit 1; } +kubectl config set-cluster "$CLUSTER" --server="https://${MGMT_FIP}:${API_PORT}" >/dev/null +grep -qE "^[[:space:]]*server:[[:space:]]*https://${MGMT_FIP//./\\.}:${API_PORT}\$" "$KUBECONFIG_OUT" \ + || { echo "GATE FAIL: server rewrite to https://${MGMT_FIP}:${API_PORT} did not take (DOCFIX-062)" >&2; exit 1; } +echo "[OK] kubeconfig server rewritten to https://${MGMT_FIP}:${API_PORT} (cluster '$CLUSTER')" + +# --- 3a. node check --- +echo "=== node check ===" +if ! nodes=$(kubectl get nodes -o wide 2>&1); then + printf '%s\n' "$nodes" | sed 's/^/ /' + echo "GATE FAIL: kubectl cannot reach the apiserver via the FIP" >&2; exit 1 +fi +printf '%s\n' "$nodes" | sed 's/^/ /' +printf '%s\n' "$nodes" | awk 'NR>1 && $2!="Ready"{bad=1} END{exit bad?1:0}' \ + || { echo "GATE FAIL: a node is not Ready" >&2; exit 1; } +echo "[OK] node(s) Ready" + +# --- 3b. 
GATE 2: agnhost pod-egress probe to the Keystone public endpoint --- +echo "=== GATE 2: agnhost pod-egress probe -> $KEYSTONE_HOSTPORT ===" +cleanup() { kubectl delete pod egress-test --now --ignore-not-found >/dev/null 2>&1 || true; } +trap cleanup EXIT +kubectl delete pod egress-test --now --ignore-not-found >/dev/null 2>&1 || true +kubectl run egress-test --image="$AGNHOST_IMAGE" --restart=Never \ + --command -- /agnhost connect "$KEYSTONE_HOSTPORT" --timeout=5s >/dev/null + +phase=""; state="" +for i in $(seq 1 "$PROBE_TRIES"); do + phase=$(kubectl get pod egress-test -o jsonpath='{.status.phase}' 2>/dev/null || echo '?') + state=$(kubectl get pod egress-test -o jsonpath='{.status.containerStatuses[0].state}' 2>/dev/null || echo '') + echo " [$i] phase=$phase state=$state" + case "$phase" in + Succeeded) break ;; + Failed) echo "GATE FAIL: probe pod Failed (egress to $KEYSTONE_HOSTPORT blocked)" >&2; exit 1 ;; + esac + sleep "$PROBE_SLEEP" +done + +if [ "$phase" = Succeeded ] && printf '%s' "$state" | grep -q '"exitCode":0'; then + echo "[OK] GATE 2 passed -- pod egress to $KEYSTONE_HOSTPORT returned exitCode 0 (D-035 proof)" +else + echo "GATE FAIL: probe pod did not reach Succeeded/exitCode 0 in $((PROBE_TRIES*PROBE_SLEEP))s (last: phase=$phase state=$state)" >&2 + exit 1 +fi + +echo "Summary: kubeconfig usable via FIP; GATE 2 pod-egress proof passed. $KUBECONFIG_OUT ready for phase-07." diff --git a/tests/phase-06-capi-stack/fakebin/ssh b/tests/phase-06-capi-stack/fakebin/ssh new file mode 100644 index 0000000..cedb150 --- /dev/null +++ b/tests/phase-06-capi-stack/fakebin/ssh @@ -0,0 +1,35 @@ +#!/usr/bin/env bash +# fake ssh for phase-06-capi-stack.sh tests. Reads the remote block from stdin, +# identifies the sub-step by a distinctive token, appends it to $ORDER_FILE (so +# tests can assert ORC-before-init), and emits canned output + exit code. +# Steered by env: A_FAIL B_FAIL C_FAIL D_FAIL E_FAIL F_FAIL. 
+body="$(cat 2>/dev/null || true)"
+log() { [ -n "${ORDER_FILE:-}" ] && printf '%s\n' "$1" >> "$ORDER_FILE"; }
+
+if printf '%s' "$body" | grep -q 'dependencies.json'; then
+  log a
+  [ "${A_FAIL:-0}" = 1 ] && { echo "PIN-FAIL: CAPO=null"; exit 1; }
+  echo "== pins =="; echo "CAPI=v1.13.2"; echo "== tooling =="; echo "clusterctl v1.13.2"; exit 0
+elif printf '%s' "$body" | grep -q 'jetstack/cert-manager'; then
+  log b
+  [ "${B_FAIL:-0}" = 1 ] && { echo "Error: timed out waiting for the condition"; exit 1; }
+  echo "cert-manager deployed"; exit 0
+elif printf '%s' "$body" | grep -q 'server-side'; then
+  log c
+  [ "${C_FAIL:-0}" = 1 ] && { echo "error: no matches for kind Image"; exit 1; }
+  echo "images.openstack.k-orc.cloud"; exit 0
+elif printf '%s' "$body" | grep -q 'clusterctl init'; then
+  log d
+  [ "${D_FAIL:-0}" = 1 ] && { echo "capo-system deploy not Available"; exit 1; }
+  echo "Your management cluster has been initialized successfully!"; exit 0
+elif printf '%s' "$body" | grep -q 'cluster-api-addon-provider'; then
+  log e
+  [ "${E_FAIL:-0}" = 1 ] && { echo "Error: helm timeout"; exit 1; }
+  echo "addon + janitor Running"; exit 0
+elif printf '%s' "$body" | grep -q 'STACK: OK'; then
+  log f
+  [ "${F_FAIL:-0}" = 1 ] && { echo "NOT-RUNNING:"; echo "capo-system/capo-controller CrashLoopBackOff"; exit 1; }
+  echo "STACK: OK"; exit 0
+fi
+echo "fake-ssh: unrecognized block" >&2
+exit 0
diff --git a/tests/phase-06-capi-stack/run-tests.sh b/tests/phase-06-capi-stack/run-tests.sh
new file mode 100644
index 0000000..670e98b
--- /dev/null
+++ b/tests/phase-06-capi-stack/run-tests.sh
@@ -0,0 +1,75 @@
+#!/usr/bin/env bash
+# tests/phase-06-capi-stack/run-tests.sh -- offline regression for
+# phase-06-capi-stack.sh. Fake ssh; real bash.
+# Key assertions: (1) sub-steps run a,b,c,d,e,f in order; (2) ORC (c) precedes
+# clusterctl init (d); (3) if ORC fails, init must NOT run (the hardened order
+# exists precisely to stop CAPO crash-looping on a missing ORC CRD).
+set -euo pipefail
+IFS=$'\n\t'
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SCRIPTS="$(cd "$HERE/../../scripts" && pwd)"
+TARGET="$SCRIPTS/phase-06-capi-stack.sh"
+BIN="$HERE/fakebin"
+[ -f "$TARGET" ] || { echo "FAIL: $TARGET missing" >&2; exit 1; }
+chmod +x "$BIN"/* 2>/dev/null || true
+WORK="$(mktemp -d)"; trap 'rm -rf "$WORK"' EXIT
+rc_all=0
+
+mkenv() { printf 'MGMT_FIP=%s\n' '10.12.7.222' > "$WORK/net.env"; }
+: > "$WORK/id_key"
+ORDER="$WORK/order"
+
+run() { # want_rc out_regex want_order(comma or -) label [extra env...]
+  local want="$1" re="$2" order_want="$3" label="$4"; shift 4
+  local rc order_got
+  : > "$ORDER"
+  set +e
+  PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+    ORDER_FILE="$ORDER" env "$@" bash "$TARGET" >"$WORK/out" 2>&1
+  rc=$?
+  set -e
+  order_got=$(paste -sd, "$ORDER" 2>/dev/null || true)
+  local ok=1
+  [ "$rc" -eq "$want" ] || ok=0
+  grep -qE "$re" "$WORK/out" || ok=0
+  if [ "$order_want" != '-' ] && [ "$order_got" != "$order_want" ]; then ok=0; fi
+  if [ "$ok" = 1 ]; then
+    printf ' [OK] %-46s exit %s order=[%s]\n' "$label" "$rc" "$order_got"
+  else
+    printf ' [XX] %-46s exit %s (want %s; /%s/; order [%s] want [%s])\n' \
+      "$label" "$rc" "$want" "$re" "$order_got" "$order_want"
+    sed 's/^/ /' "$WORK/out"; rc_all=1
+  fi
+}
+
+echo "=== phase-06-capi-stack.sh ==="
+mkenv
+run 0 'Phase-06 complete' 'a,b,c,d,e,f' "happy path -> full stack, ordered"
+run 1 '6.6a' 'a' "6.6a pin fail -> stop at a" A_FAIL=1
+run 1 '6.6b' 'a,b' "6.6b cert-manager fail -> stop at b" B_FAIL=1
+run 1 '6.6c' 'a,b,c' "6.6c ORC fail -> init (d) NOT run" C_FAIL=1
+run 1 '6.6d' 'a,b,c,d' "6.6d init fail -> stop at d" D_FAIL=1
+run 1 '6.6e' 'a,b,c,d,e' "6.6e CAAPH/janitor fail -> stop at e" E_FAIL=1
+run 1 '6.6f' 'a,b,c,d,e,f' "6.6f verify fail -> stop at f" F_FAIL=1
+
+# preconditions
+run 2 'not found' '-' "precondition: no ENVFILE -> exit 2" ENVFILE="$WORK/nope.env"
+: > "$WORK/net.env"
+run 2 'MGMT_FIP unset' '-' "precondition: MGMT_FIP unset -> exit 2"
+mkenv
+
+echo "=== assert: ORC (c) strictly precedes clusterctl init (d) on happy path ==="
+: > "$ORDER"
+PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+  ORDER_FILE="$ORDER" bash "$TARGET" >/dev/null 2>&1 || true
+ci=$(grep -n '^c$' "$ORDER" | head -1 | cut -d: -f1)
+di=$(grep -n '^d$' "$ORDER" | head -1 | cut -d: -f1)
+if [ -n "$ci" ] && [ -n "$di" ] && [ "$ci" -lt "$di" ]; then
+  echo " [OK] ORC at step $ci precedes clusterctl init at step $di"
+else
+  echo " [XX] ORC/init ordering wrong (c=$ci d=$di)"; rc_all=1
+fi
+
+echo
+[ "$rc_all" -eq 0 ] && echo "ALL PASS" || echo "SOME FAILED"
+exit "$rc_all"
diff --git a/tests/phase-06-k8s-bootstrap/fakebin/ssh b/tests/phase-06-k8s-bootstrap/fakebin/ssh
new file mode 100644
index 0000000..7c0d69e
--- /dev/null
+++ b/tests/phase-06-k8s-bootstrap/fakebin/ssh
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+# fake ssh for phase-06-k8s-bootstrap.sh tests.
+# Collects positionals after 'bash -s' to tell GATE 1 (5 args) from 6.4 (8 args).
+# Steered by env: VIP_FAIL NET_FAIL BOOT_FAIL.
+pos=(); after=0
+for a in "$@"; do
+  if [ "$after" = 1 ]; then pos+=("$a"); continue; fi
+  [ "$a" = "-s" ] && after=1
+done
+cat >/dev/null 2>&1 || true  # discard the heredoc on stdin
+case "${#pos[@]}" in
+  5) # GATE 1 egress probe: khost kport ihost iport timeout
+    khost="${pos[0]}"; kport="${pos[1]}"; ihost="${pos[2]}"; iport="${pos[3]}"
+    if [ "${VIP_FAIL:-0}" = 1 ]; then echo "VIP-FAIL $khost:$kport"; else echo "VIP-OK $khost:$kport"; fi
+    if [ "${NET_FAIL:-0}" = 1 ]; then echo "NET-FAIL $ihost:$iport"; else echo "NET-OK $ihost:$iport"; fi
+    if [ "${VIP_FAIL:-0}" = 1 ] || [ "${NET_FAIL:-0}" = 1 ]; then echo "GATE1: FAIL"; else echo "GATE1: PASS"; fi
+    ;;
+  8) # 6.4 bootstrap
+    if [ "${BOOT_FAIL:-0}" = 1 ]; then
+      echo "=== bootstrap ==="; echo "Error: bootstrap failed"
+    else
+      echo "cluster status: ready"; echo "network: enabled"; echo "BOOT: READY"
+    fi
+    ;;
+  *) echo "fake-ssh: unexpected positional count ${#pos[@]}" >&2 ;;
+esac
+exit 0
diff --git a/tests/phase-06-k8s-bootstrap/run-tests.sh b/tests/phase-06-k8s-bootstrap/run-tests.sh
new file mode 100644
index 0000000..882f725
--- /dev/null
+++ b/tests/phase-06-k8s-bootstrap/run-tests.sh
@@ -0,0 +1,66 @@
+#!/usr/bin/env bash
+# tests/phase-06-k8s-bootstrap/run-tests.sh -- offline regression for
+# phase-06-k8s-bootstrap.sh. Fake ssh; real bash.
+set -euo pipefail
+IFS=$'\n\t'
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SCRIPTS="$(cd "$HERE/../../scripts" && pwd)"
+TARGET="$SCRIPTS/phase-06-k8s-bootstrap.sh"
+BIN="$HERE/fakebin"
+[ -f "$TARGET" ] || { echo "FAIL: $TARGET missing" >&2; exit 1; }
+chmod +x "$BIN"/* 2>/dev/null || true
+WORK="$(mktemp -d)"; trap 'rm -rf "$WORK"' EXIT
+rc_all=0
+
+# baseline fixtures: env file + fake ssh key
+mkenv() { printf 'MGMT_FIP=%s\nMGMT_TENANT_IP=%s\n' "${1:-10.12.7.222}" "${2:-10.20.0.207}" > "$WORK/net.env"; }
+: > "$WORK/id_key"
+
+run() { # want_rc regex label [extra env assignments...]
+  local want="$1" re="$2" label="$3"; shift 3
+  local rc
+  set +e
+  PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+    PROBE_TIMEOUT=1 BOOT_TIMEOUT=1m READY_TIMEOUT=1m \
+    env "$@" bash "$TARGET" >"$WORK/out" 2>&1
+  rc=$?
+  set -e
+  if [ "$rc" -eq "$want" ] && grep -qE "$re" "$WORK/out"; then
+    printf ' [OK] %-46s exit %s\n' "$label" "$rc"
+  else
+    printf ' [XX] %-46s exit %s (want %s; /%s/)\n' "$label" "$rc" "$want" "$re"
+    sed 's/^/ /' "$WORK/out"; rc_all=1
+  fi
+}
+
+echo "=== phase-06-k8s-bootstrap.sh ==="
+mkenv
+run 0 'GATE 1 passed' "happy path (literal keystone + bootstrap)"
+run 0 'Keystone public endpoint: 10.12.4.50:5000' "keystone = as-run literal default (no discovery)"
+run 0 'bootstrapped and ready' "6.4 reaches ready"
+run 0 'Keystone public endpoint: 1.2.3.4:5000' "KEYSTONE_HOSTPORT override honored" KEYSTONE_HOSTPORT=1.2.3.4:5000
+run 1 'VM egress probe did not pass' "GATE 1 VIP fail -> exit 1" VIP_FAIL=1
+run 1 'VM egress probe did not pass' "GATE 1 NET fail -> exit 1" NET_FAIL=1
+run 1 'did not reach ready' "6.4 bootstrap fail -> exit 1" BOOT_FAIL=1
+
+# preconditions
+run 2 'not found' "precondition: no ENVFILE -> exit 2" ENVFILE="$WORK/nope.env"
+mkenv "" ""; : > "$WORK/net.env" # empty env file (no MGMT_FIP)
+run 2 'MGMT_FIP unset' "precondition: MGMT_FIP unset -> exit 2"
+mkenv
+run 2 'ssh key' "precondition: missing ssh key -> exit 2" SSH_KEY="$WORK/nokey"
+
+# as-run fidelity: the script must NOT dynamically discover Keystone (uses the literal)
+set +e
+PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+  PROBE_TIMEOUT=1 BOOT_TIMEOUT=1m READY_TIMEOUT=1m bash "$TARGET" >"$WORK/fid" 2>&1
+set -e
+if grep -qiE 'discovered|endpoint list' "$WORK/fid"; then
+  printf ' [XX] %-46s (performed discovery; must use as-run literal)\n' "fidelity: no dynamic discovery"; rc_all=1
+else
+  printf ' [OK] %-46s\n' "fidelity: no dynamic discovery (as-run literal)"
+fi
+
+echo
+[ "$rc_all" -eq 0 ] && echo "ALL PASS" || echo "SOME FAILED"
+exit "$rc_all"
diff --git a/tests/phase-06-kubeconfig-gate/fakebin/kubectl b/tests/phase-06-kubeconfig-gate/fakebin/kubectl
new file mode 100644
index 0000000..dc5ec11
--- /dev/null
+++ b/tests/phase-06-kubeconfig-gate/fakebin/kubectl
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+# fake kubectl for phase-06-kubeconfig-gate.sh tests.
+# Steered by env: CLUSTER_NAME_OUT SET_CLUSTER_NOOP NODE_NOTREADY POD_PHASE POD_STATE.
+a1="${1:-}"; a2="${2:-}"; rest=" $* "
+case "$a1 $a2" in
+  "config view")
+    echo "${CLUSTER_NAME_OUT:-k8s}" ;;
+  "config set-cluster")
+    srv=""
+    for a in "$@"; do case "$a" in --server=*) srv="${a#--server=}";; esac; done
+    if [ "${SET_CLUSTER_NOOP:-0}" != 1 ] && [ -n "${KUBECONFIG:-}" ] && [ -f "${KUBECONFIG:-}" ]; then
+      sed -i -E "s#^([[:space:]]*server:).*#\1 $srv#" "$KUBECONFIG"
+    fi
+    echo "Cluster set." ;;
+  "get nodes")
+    st="Ready"; [ "${NODE_NOTREADY:-0}" = 1 ] && st="NotReady"
+    echo "NAME STATUS ROLES AGE VERSION"
+    echo "capi-mgmt-v2 $st control-plane,worker 12m v1.32.13" ;;
+  "get pod")
+    if printf '%s' "$rest" | grep -q 'status.phase'; then
+      echo "${POD_PHASE:-Succeeded}"
+    elif printf '%s' "$rest" | grep -q 'containerStatuses'; then
+      echo "${POD_STATE:-{\"terminated\":{\"reason\":\"Completed\",\"exitCode\":0}}}"
+    fi ;;
+  "delete pod") exit 0 ;;
+  "run egress-test") exit 0 ;;
+esac
+exit 0
diff --git a/tests/phase-06-kubeconfig-gate/fakebin/ssh b/tests/phase-06-kubeconfig-gate/fakebin/ssh
new file mode 100644
index 0000000..496f3a9
--- /dev/null
+++ b/tests/phase-06-kubeconfig-gate/fakebin/ssh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+# fake ssh for phase-06-kubeconfig-gate.sh: only the 'sudo k8s config' pull is used.
+# Emits a kubeconfig whose server is the TENANT IP (10.20.0.207) -- the exact
+# DOCFIX-062 defect the script must rewrite to the FIP.
+# Steered by env: PULL_FAIL PULL_EMPTY PULL_BADHEAD.
+want_config=0
+for a in "$@"; do case "$a" in *"k8s config"*) want_config=1;; esac; done
+if [ "$want_config" = 1 ]; then
+  [ "${PULL_FAIL:-0}" = 1 ] && exit 1
+  [ "${PULL_EMPTY:-0}" = 1 ] && exit 0
+  if [ "${PULL_BADHEAD:-0}" = 1 ]; then echo "not-a-kubeconfig"; exit 0; fi
+  cat <<'KC'
+apiVersion: v1
+clusters:
+- cluster:
+    server: https://10.20.0.207:6443
+  name: k8s
+contexts:
+- context:
+    cluster: k8s
+    user: admin
+  name: k8s
+current-context: k8s
+kind: Config
+users:
+- name: admin
+KC
+fi
+exit 0
diff --git a/tests/phase-06-kubeconfig-gate/run-tests.sh b/tests/phase-06-kubeconfig-gate/run-tests.sh
new file mode 100644
index 0000000..6cc0506
--- /dev/null
+++ b/tests/phase-06-kubeconfig-gate/run-tests.sh
@@ -0,0 +1,89 @@
+#!/usr/bin/env bash
+# tests/phase-06-kubeconfig-gate/run-tests.sh -- offline regression for
+# phase-06-kubeconfig-gate.sh. Fake ssh + kubectl; real bash.
+# Key assertion: DOCFIX-062 -- the emitted kubeconfig server (tenant IP) is
+# rewritten to the FIP before the gate runs.
+set -euo pipefail
+IFS=$'\n\t'
+HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SCRIPTS="$(cd "$HERE/../../scripts" && pwd)"
+TARGET="$SCRIPTS/phase-06-kubeconfig-gate.sh"
+BIN="$HERE/fakebin"
+[ -f "$TARGET" ] || { echo "FAIL: $TARGET missing" >&2; exit 1; }
+chmod +x "$BIN"/* 2>/dev/null || true
+WORK="$(mktemp -d)"; trap 'rm -rf "$WORK"' EXIT
+rc_all=0
+FIP=10.12.7.222
+
+mkenv() { printf 'MGMT_FIP=%s\n' "$FIP" > "$WORK/net.env"; }
+: > "$WORK/id_key"
+
+run() { # want_rc regex label [extra env...]
+  local want="$1" re="$2" label="$3"; shift 3
+  local rc
+  rm -f "$WORK/kc"
+  set +e
+  PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+    KUBECONFIG_OUT="$WORK/kc" PROBE_TRIES=2 PROBE_SLEEP=0 \
+    env "$@" bash "$TARGET" >"$WORK/out" 2>&1
+  rc=$?
+  set -e
+  if [ "$rc" -eq "$want" ] && grep -qE "$re" "$WORK/out"; then
+    printf ' [OK] %-48s exit %s\n' "$label" "$rc"
+  else
+    printf ' [XX] %-48s exit %s (want %s; /%s/)\n' "$label" "$rc" "$want" "$re"
+    sed 's/^/ /' "$WORK/out"; rc_all=1
+  fi
+}
+
+echo "=== phase-06-kubeconfig-gate.sh ==="
+mkenv
+run 0 'GATE 2 passed' "happy path (pull + rewrite + probe)"
+run 0 'Keystone public endpoint: 10.12.4.50:5000' "keystone = as-run literal default (no discovery)"
+run 0 'server rewritten to https' "DOCFIX-062 rewrite message"
+run 0 'GATE 2 passed' "KEYSTONE_HOSTPORT override" KEYSTONE_HOSTPORT=1.2.3.4:5000
+run 1 'could not pull kubeconfig' "pull fail -> exit 1" PULL_FAIL=1
+run 1 'is empty' "empty kubeconfig -> exit 1" PULL_EMPTY=1
+run 1 'does not look like' "bad head -> exit 1" PULL_BADHEAD=1
+run 1 'did not take' "set-cluster no-op -> exit 1 (DOCFIX-062 guard)" SET_CLUSTER_NOOP=1
+run 1 'node is not Ready' "node NotReady -> exit 1" NODE_NOTREADY=1
+run 1 'probe pod Failed' "GATE 2 pod Failed -> exit 1" POD_PHASE=Failed
+run 1 'did not reach Succeeded' "GATE 2 exitCode!=0 -> exit 1" POD_STATE='{"terminated":{"reason":"Error","exitCode":1}}'
+run 1 'did not reach Succeeded' "GATE 2 Pending timeout -> exit 1" POD_PHASE=Pending
+
+# preconditions
+run 2 'not found' "precondition: no ENVFILE -> exit 2" ENVFILE="$WORK/nope.env"
+: > "$WORK/net.env"
+run 2 'MGMT_FIP unset' "precondition: MGMT_FIP unset -> exit 2"
+mkenv
+
+echo "=== assert DOCFIX-062: kubeconfig server rewritten tenant-IP -> FIP ==="
+rm -f "$WORK/kc"
+PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+  KUBECONFIG_OUT="$WORK/kc" PROBE_TRIES=2 PROBE_SLEEP=0 \
+  bash "$TARGET" >/dev/null 2>&1 || true
+if grep -qE "server:[[:space:]]*https://${FIP//./\\.}:6443" "$WORK/kc" \
+  && ! grep -q '10.20.0.207:6443' "$WORK/kc"; then
+  perm=$(stat -c '%a' "$WORK/kc" 2>/dev/null || echo '?')
+  if [ "$perm" = 600 ]; then
+    echo " [OK] server rewritten to FIP; tenant IP gone; mode 600"
+  else
+    echo " [XX] kubeconfig mode=$perm (want 600)"; rc_all=1
+  fi
+else
+  echo " [XX] server not rewritten to FIP (DOCFIX-062 regression)"; sed 's/^/ /' "$WORK/kc"; rc_all=1
+fi
+
+echo "=== assert as-run fidelity: no dynamic Keystone discovery ==="
+PATH="$BIN:$PATH" HOME="$WORK" ENVFILE="$WORK/net.env" SSH_KEY="$WORK/id_key" \
+  KUBECONFIG_OUT="$WORK/kc" PROBE_TRIES=2 PROBE_SLEEP=0 \
+  bash "$TARGET" >"$WORK/fid" 2>&1 || true
+if grep -qiE 'discovered|endpoint list' "$WORK/fid"; then
+  echo " [XX] performed discovery; must use as-run literal"; rc_all=1
+else
+  echo " [OK] no dynamic discovery (as-run literal 10.12.4.50:5000)"
+fi
+
+echo
+[ "$rc_all" -eq 0 ] && echo "ALL PASS" || echo "SOME FAILED"
+exit "$rc_all"