
Phase 06 -- In-Cloud Management Cluster (D-035)

Stand up the CAPI/Magnum management cluster as a single-homed in-cloud tenant VM (capi-mgmt-v2), bootstrap k8s-snap on it, prove pod egress through the hard gate, and install the pinned CAPI provider stack. This is the persistent v1 management cluster -- there is NO clusterctl move/pivot.

Decisions: D-035 (in-cloud single-homed tenant VM; retires D-033/D-017), D-034 (CAPI versions sourced from the capi-helm-charts tag's dependencies.json, never hardcoded), D-031 (Magnum + magnum-capi-helm + capi-helm-charts engine). Troubleshooting: appendix-A entries DOCFIX-021, DOCFIX-024, DOCFIX-025a, D-035.


Prerequisites (must be true entering phase-06)

  • Charmed OpenStack live and verified (phase-03 done); Keystone reachable on the provider VIP.
  • The external provider network exists (phase-04 done) -- the mgmt FIP in Step 6.2 is allocated from it. Octavia is NOT required for the mgmt cluster itself (its apiserver is reached via the FIP directly); Octavia is a phase-08 prereq for workload clusters.
  • admin-openrc sourced on the jumphost; openstack, jq, kubectl available.
  • The capi-mgmt Keystone project exists. The Magnum trustee domain is auto-configured by the magnum charm via its keystone (identity-credentials) relation -- verify [trust] (trustee_domain_id / trustee_domain_admin_id / trustee_domain_admin_password) is populated in magnum.conf; no manual step.
  • No capi-mgmt-net tenant network yet (this phase creates it).

Constants and env-literals (TAG: regenerate/confirm per site on rebuild)

Literals below are tagged ENV(...) so the later generalization pass is mechanical. Discover everything else dynamically at run time.

  • ENV(project) capi-mgmt (id 674171fd28d446d3a37073b6a761e910)
  • ENV(ext-net) provider-ext (id 70b34bb2-3afb-4b43-96d3-f520dbcbf9a8)
  • ENV(image) ubuntu-24.04-noble (id c66342ce-f402-4e6e-a324-ae27032396d7)
  • ENV(flavor) gp.large (16384 MB / 4 vCPU / 80 GB)
  • ENV(mgmt-cidr) 10.20.0.0/24 (capi-mgmt-subnet; overlay, non-IPAM)
  • ENV(keystone-vip) 10.12.4.50:5000 (the gate target -- the deployed VIP)
  • ENV(mgmt-fip) 10.12.7.40 (assigned in 6.2; apiserver SAN)
  • ENV(pod-cidr) 10.1.0.0/16 ENV(svc-cidr) 10.152.183.0/24 (snap defaults; non-colliding)
  • ENV(capi-tag) 0.25.1 (capi-helm-charts release; dependencies.json source)
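The "non-colliding" claim for ENV(mgmt-cidr), ENV(pod-cidr) and ENV(svc-cidr) is worth re-checking whenever a site changes one of them. A minimal sketch in plain bash, assuming IPv4 only; ip2int and cidr_overlap are hypothetical helper names introduced here, not part of any tool this runbook uses. Two CIDRs overlap exactly when they agree on the shorter of the two prefixes.

```shell
# Hypothetical helpers: pure-bash IPv4 CIDR overlap check (no external tools).
ip2int() {            # dotted quad -> 32-bit integer
  local IFS=.
  set -- $1           # split the address on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}
cidr_overlap() {      # returns 0 if the two CIDRs share any address
  local a=${1%/*} alen=${1#*/} b=${2%/*} blen=${2#*/}
  local len=$(( alen < blen ? alen : blen ))      # compare on the shorter prefix
  local mask=$(( len == 0 ? 0 : (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip2int "$a") & mask )) -eq $(( $(ip2int "$b") & mask )) ]
}

# Check every pair of the three CIDRs listed above.
for pair in "10.20.0.0/24 10.1.0.0/16" \
            "10.20.0.0/24 10.152.183.0/24" \
            "10.1.0.0/16 10.152.183.0/24"; do
  set -- $pair
  if cidr_overlap "$1" "$2"; then echo "COLLIDE $1 $2"; else echo "ok $1 $2"; fi
done
```

Any COLLIDE line means the site values need to change before Step 6.1 creates the subnet.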

Run-location legend (every block states where it runs)

  • # RUN: jumphost -- on vopenstack-jesse as jessea123, admin-openrc sourced.
  • # RUN: mgmt VM -- shipped to the VM over SSH via the FIP (heredoc below).
  • VM SSH form (used verbatim throughout; DOCFIX-021 </dev/null on every sudo): ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 bash -s <<'REOF' ... REOF
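The reason for </dev/null (DOCFIX-021) can be reproduced locally without a VM. With bash -s the script itself arrives on stdin, so any command that reads stdin can consume the rest of the script. The sketch below uses cat as a stand-in for such a command; it is an illustration only and contacts no host.

```shell
# `cat` stands in for any stdin-reading command run inside the heredoc.
# Case 1: no redirect -> cat swallows the remainder of the script.
without_redirect=$(bash -s <<'EOF'
echo first
cat >/dev/null
echo second
EOF
)

# Case 2: </dev/null gives the command an empty stdin; the script survives.
with_redirect=$(bash -s <<'EOF'
echo first
cat >/dev/null </dev/null
echo second
EOF
)

printf 'without redirect:\n%s\n' "$without_redirect"
printf 'with redirect:\n%s\n' "$with_redirect"
```

In the first case the line after cat never runs, which is how a remote block can stop silently partway through.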

Step 6.0 -- Keypair + security group (capi-mgmt project)

# RUN: jumphost -- Safe, idempotent setup, consolidated into one block. (LIVE-REVIEW: exact SG rule syntax is standard openstack-client; confirm on the redeploy test.)

( {
  set -u
  PROJ=capi-mgmt                                   # ENV(project)
  echo "=== keypair (import the jumphost pubkey) ==="
  openstack keypair show capi-mgmt-key >/dev/null 2>&1 \
    || openstack keypair create --public-key ~/.ssh/id_ed25519.pub capi-mgmt-key
  echo "=== security group capi-mgmt-sg (ingress 22 + 6443; egress default-allow) ==="
  openstack security group show capi-mgmt-sg >/dev/null 2>&1 \
    || openstack security group create --project "$PROJ" capi-mgmt-sg
  SG=$(openstack security group show capi-mgmt-sg -f value -c id)
  # add rules only if absent (re-run safe)
  openstack security group rule list "$SG" -f value -c "Port Range" | grep -qx '22:22' \
    || openstack security group rule create --proto tcp --dst-port 22   "$SG"
  openstack security group rule list "$SG" -f value -c "Port Range" | grep -qx '6443:6443' \
    || openstack security group rule create --proto tcp --dst-port 6443 "$SG"
  echo "=== verify ==="
  openstack security group rule list "$SG" -f value -c "IP Protocol" -c "Port Range"
} )

Expect: capi-mgmt-key present; capi-mgmt-sg with tcp/22 and tcp/6443 ingress.

Step 6.1 -- Network, subnet, router (capi-mgmt project)

# RUN: jumphost -- Idempotent network plumbing, consolidated into one block. DNS nameservers 1.1.1.1/1.0.0.1 (D-019: public resolvers; image pulls need internet egress).

( {
  set -u
  PROJ=capi-mgmt                                   # ENV(project)
  EXT=provider-ext                                 # ENV(ext-net)
  echo "=== network capi-mgmt-net ==="
  openstack network show capi-mgmt-net >/dev/null 2>&1 \
    || openstack network create --project "$PROJ" capi-mgmt-net
  echo "=== subnet capi-mgmt-subnet 10.20.0.0/24 ==="   # ENV(mgmt-cidr)
  openstack subnet show capi-mgmt-subnet >/dev/null 2>&1 \
    || openstack subnet create --project "$PROJ" --network capi-mgmt-net \
         --subnet-range 10.20.0.0/24 \
         --dns-nameserver 1.1.1.1 --dns-nameserver 1.0.0.1 capi-mgmt-subnet
  echo "=== router capi-mgmt-router + ext-gw + subnet ==="
  openstack router show capi-mgmt-router >/dev/null 2>&1 \
    || openstack router create --project "$PROJ" capi-mgmt-router
  openstack router set --external-gateway "$EXT" capi-mgmt-router
  openstack router add subnet capi-mgmt-router capi-mgmt-subnet 2>/dev/null || true
  echo "=== verify ==="
  openstack router show capi-mgmt-router -f value -c external_gateway_info -c status
} )

Expect: subnet 10.20.0.0/24; router ACTIVE with an external gateway on provider-ext.

Step 6.2 -- VM + floating IP (MUTATION; not batched with the gate)

# RUN: jumphost -- Creates the VM and pins the management FIP. The FIP is the stable apiserver endpoint for both the jumphost and the Magnum conductor.

( {
  set -u
  PROJ=capi-mgmt                                   # ENV(project)
  EXT=provider-ext                                 # ENV(ext-net)
  echo "=== create capi-mgmt-v2 (gp.large / ubuntu-24.04-noble) ==="
  openstack server show capi-mgmt-v2 >/dev/null 2>&1 \
    || openstack server create --image ubuntu-24.04-noble --flavor gp.large \
         --network capi-mgmt-net --security-group capi-mgmt-sg \
         --key-name capi-mgmt-key capi-mgmt-v2
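  echo "=== wait ACTIVE (polls for up to 5 min) ==="
  for _ in $(seq 1 60); do
    ST=$(openstack server show capi-mgmt-v2 -f value -c status)
    [ "$ST" = ACTIVE ] && break
    sleep 5
  done
  [ "$ST" = ACTIVE ] || { echo "capi-mgmt-v2 is $ST, not ACTIVE -- stop"; exit 1; }
  openstack server show capi-mgmt-v2 -f value -c status -c addresses
  echo "=== floating ip on provider-ext, associate to the VM (reuse one already attached) ==="
  # re-run safe: only allocate a new FIP if the VM port has none yet
  PORT=$(openstack port list --server capi-mgmt-v2 -f value -c ID | head -n 1)
  FIP=$(openstack floating ip list --port "$PORT" -f value -c "Floating IP Address" | head -n 1)
  if [ -z "$FIP" ]; then
    FIP=$(openstack floating ip create "$EXT" -f value -c floating_ip_address)
    openstack server add floating ip capi-mgmt-v2 "$FIP"
  fi
  echo "FIP=$FIP   # expect this to be 10.12.7.40 on a clean run -- ENV(mgmt-fip)"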
  openstack server show capi-mgmt-v2 -f value -c addresses
} )

Note: on the as-built run the tenant IP was 10.20.0.45 and the FIP was 10.12.7.40. If either differs on rebuild, carry the new values into 6.4 (extra-sans), 6.5 (kubeconfig server), and phase-07 (conductor kubeconfig).

Step 6.3 -- GATE 1: OS-level egress (before any k8s investment)

# RUN: mgmt VM -- This is the premise of D-035. PROCEED ONLY IF VIP-OK.

ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \
    -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 bash -s <<'REOF'
set -u
echo "=== VM -> Keystone VIP 10.12.4.50:5000 ==="            # ENV(keystone-vip)
timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.4.50/5000' && echo VIP-OK || echo VIP-FAIL
echo "=== VM -> internet 1.1.1.1:443 (image pulls) ==="
timeout 6 bash -c 'exec 3<>/dev/tcp/1.1.1.1/443' && echo NET-OK || echo NET-FAIL
REOF

GATE: require VIP-OK. NET-FAIL means provider-ext internet egress (or a registry mirror) must be fixed before 6.6. Do NOT build k8s on a VM that fails VIP-OK. (appendix-A: D-035 -- single-NIC removes the dual-homed reverse-path bug.)
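The one-liner form `probe && echo OK || echo FAIL` always exits 0, so a wrapper script cannot stop on a failed gate by itself. A minimal sketch of a gate that also returns the probe status follows; tcp_probe and gate are hypothetical helper names, and the sketch assumes bash's /dev/tcp and coreutils timeout, both of which the step above already relies on.

```shell
# Hypothetical helpers; not part of the runbook's tooling.
tcp_probe() {         # tcp_probe HOST PORT [SECONDS] -> 0 if a TCP connect succeeds
  timeout "${3:-6}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
gate() {              # gate LABEL HOST PORT -> prints LABEL-OK or LABEL-FAIL, returns probe status
  if tcp_probe "$2" "$3"; then
    echo "$1-OK"
  else
    echo "$1-FAIL"
    return 1
  fi
}

# Intended use on the mgmt VM (commented out so the sketch has no side effects):
#   gate VIP 10.12.4.50 5000 || { echo "GATE 1 failed -- stop"; exit 1; }   # ENV(keystone-vip)
#   gate NET 1.1.1.1 443     || echo "fix internet egress before 6.6"
```

The labels match the VIP-OK / NET-FAIL strings used above, so existing log greps keep working.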

Step 6.4 -- k8s-snap install + bootstrap (MUTATION; secret-free)

# RUN: mgmt VM -- Channel is 1.32-classic/stable (NOT 1.32/stable: that is the charm-era track and does not exist for the snap). The bootstrap config MUST carry an explicit cluster-config block (appendix-A: DOCFIX-024 -- a config without it disables network+dns and the node never goes Ready). Every sudo gets </dev/null (appendix-A: DOCFIX-021 -- remote bash -s reads the script from stdin).

ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \
    -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 bash -s <<'REOF'
set -euo pipefail

echo "=== install k8s snap 1.32-classic/stable ==="
sudo snap install k8s --classic --channel=1.32-classic/stable </dev/null

echo "=== write bootstrap config (DOCFIX-024: cluster-config block REQUIRED) ==="
sudo tee /root/bootstrap-config.yaml >/dev/null <<'CFG'
cluster-config:
  network:
    enabled: true
  dns:
    enabled: true
pod-cidr: 10.1.0.0/16
service-cidr: 10.152.183.0/24
extra-sans:
- 10.12.7.40
- 10.20.0.45
CFG
sudo cat /root/bootstrap-config.yaml

echo "=== bootstrap (timeout 10m) ==="
sudo k8s bootstrap --name capi-mgmt-v2 --file /root/bootstrap-config.yaml --timeout 10m </dev/null

echo "=== status ==="
sudo k8s status --wait-ready --timeout 5m </dev/null
REOF

Expect: k8s status reports cluster ready, network+dns enabled, one node. Retry path: sudo snap remove k8s --purge </dev/null then re-run this block.
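Because the extra-sans entries are the only site-specific values in the bootstrap config, it can help to render the file from parameters so a changed FIP or tenant IP is entered once. A minimal sketch; render_bootstrap_config is a hypothetical helper name, and the keys and CIDRs are copied from the config above.

```shell
# Hypothetical helper: emit the Step 6.4 bootstrap config with the SANs as parameters.
render_bootstrap_config() {   # render_bootstrap_config FIP TENANT_IP
  cat <<CFG
cluster-config:
  network:
    enabled: true
  dns:
    enabled: true
pod-cidr: 10.1.0.0/16
service-cidr: 10.152.183.0/24
extra-sans:
- $1
- $2
CFG
}

# As-built values: ENV(mgmt-fip) and the tenant IP from Step 6.2.
render_bootstrap_config 10.12.7.40 10.20.0.45
```

On the VM the output would be piped into `sudo tee /root/bootstrap-config.yaml` in place of the literal heredoc.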

Step 6.5 -- GATE 2: kubeconfig to jumphost + pod-egress proof (THE D-035 GATE)

The agnhost pod-egress probe is the exact test the dual-homed D-033 node and the old k3s node FAILED. On this single-NIC VM it must reach Completed.

# RUN: mgmt VM -- emit a jumphost-facing kubeconfig (server = the FIP, not tenant IP)
ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \
    -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \
    "sudo k8s config server=https://10.12.7.40:6443 </dev/null" > ~/capi-mgmt.kubeconfig
# [SENSITIVE] ~/capi-mgmt.kubeconfig contains a cluster-admin credential.
wc -l ~/capi-mgmt.kubeconfig ; head -1 ~/capi-mgmt.kubeconfig   # expect >0 lines, "apiVersion: v1"
# RUN: jumphost -- node check + the hard gate
( {
  set -u
  export KUBECONFIG="$HOME/capi-mgmt.kubeconfig"
  echo "=== node ==="
  kubectl get nodes -o wide                          # expect capi-mgmt-v2 Ready, v1.32.13
  echo "=== agnhost pod-egress probe -> Keystone VIP 10.12.4.50:5000 ==="
  kubectl run egress-test --image=registry.k8s.io/e2e-test-images/agnhost:2.40 \
    --restart=Never --command -- /agnhost connect 10.12.4.50:5000 --timeout=5s
  echo "(poll the next line until phase=Succeeded; kubectl get pod shows this as STATUS Completed)"
  kubectl get pod egress-test -o jsonpath='{.status.phase} {.status.containerStatuses[0].state}{"\n"}'
} )

GATE: require the probe pod Completed / exitCode 0 (empty logs = clean TCP connect). That proves pod -> Cilium -> ens3 -> OVN -> router SNAT egress works. Then clean up the throwaway pod:

# RUN: jumphost
KUBECONFIG="$HOME/capi-mgmt.kubeconfig" kubectl delete pod egress-test --now
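Polling the probe by hand can be replaced with a small retry loop. A minimal sketch; poll_until and pod_succeeded are hypothetical helper names. Note that the pod's .status.phase reads Succeeded when kubectl's STATUS column shows Completed.

```shell
# Hypothetical helper: retry a command until it succeeds or the tries run out.
poll_until() {        # poll_until TRIES DELAY_SECONDS CMD [ARGS...]
  local tries=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= tries; i++ )); do
    "$@" && return 0  # success: stop polling
    sleep "$delay"
  done
  return 1            # never succeeded within TRIES attempts
}

# Intended use on the jumphost (commented out; needs the cluster from 6.4):
#   pod_succeeded() {
#     [ "$(kubectl get pod egress-test -o jsonpath='{.status.phase}')" = Succeeded ]
#   }
#   poll_until 24 5 pod_succeeded || echo "GATE 2 not passed -- inspect the pod"
```

With 24 tries at 5 seconds the loop gives the image pull and the 5s connect timeout about two minutes before reporting failure.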

Step 6.6 -- CAPI provider stack (pinned to dependencies.json; D-034)

# RUN: mgmt VM -- Run VM-side as the ubuntu user with the kubeconfig that 6.6a writes to ~/.kube/config (local apiserver 10.20.0.45:6443), so the matched 1.32.13 kubectl is used; this avoids the jumphost kubectl's +3-minor skew. Versions are READ from the tag's dependencies.json, never hardcoded (D-034). The as-built pins are in the reference block below as a known-good cross-check only.
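The skew concern can be checked mechanically: kubectl is supported within one minor version of the apiserver, so a +3 client is out of policy. A minimal sketch; minor_of and minor_skew are hypothetical helper names, and the v1.35.0 client version is an assumed example of a +3 jumphost client, not a value recorded in this runbook.

```shell
# Hypothetical helpers: compare Kubernetes minor versions.
minor_of() {          # v1.32.13 -> 32
  local v=${1#v}      # drop the leading "v"
  v=${v#*.}           # drop the major version
  echo "${v%%.*}"     # keep the minor version only
}
minor_skew() {        # minor_skew CLIENT SERVER -> signed difference in minor versions
  echo $(( $(minor_of "$1") - $(minor_of "$2") ))
}

# Assumed jumphost client (v1.35.0) against this cluster (v1.32.13).
skew=$(minor_skew v1.35.0 v1.32.13)
if [ "${skew#-}" -gt 1 ]; then
  echo "client skew ${skew} minors: use the VM-side kubectl"
fi
```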

HARDENED ORDER (appendix-A: D-034 install-ordering): cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor. ORC precedes clusterctl init because CAPO v0.14.4's openstackserver controller hard-depends on ORC's Image.openstack.k-orc.cloud CRD; installing CAPO first crash-loops until ORC lands. (The 2026-06-08 run installed ORC last and self-healed after 6 restarts; this runbook corrects the order.)

6.6a -- tooling + pins (install helm/clusterctl/kubectl VM-side; read dependencies.json @ 0.25.1)

# RUN: jumphost -- Installs the CAPI tooling on the mgmt VM at the dependencies.json pins and writes ~/capi-pins.env (sourced by 6.6b-6.6e). kubectl is pinned to the cluster's 1.32.13 (no apiserver skew). The SSH_OPTS/MGMT_VM vars set here are reused by 6.6b-6.6f, so run those steps in the same jumphost shell.

# define the mgmt-VM connection once (reused by 6.6b-6.6f)
MGMT_VM=10.12.7.40
SSH_OPTS="-i $HOME/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10"

ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF'
set -euo pipefail
sudo apt-get update -qq </dev/null && sudo apt-get install -y jq curl </dev/null

# kubeconfig for the local apiserver (10.20.0.45:6443), readable by ubuntu -> helm/clusterctl/kubectl need no sudo
mkdir -p "$HOME/.kube"; sudo k8s config </dev/null > "$HOME/.kube/config"; chmod 600 "$HOME/.kube/config"

# egress pre-check (the VM pulls charts/binaries/manifests from these)
for h in https://raw.githubusercontent.com https://get.helm.sh https://github.com https://dl.k8s.io; do
  printf '%s -> ' "$h"; curl -s --max-time 10 -o /dev/null -w '%{http_code}\n' "$h" || echo FAIL
done

# version constellation from the chart tag's dependencies.json (D-034; never hardcoded)
curl -fsSL https://raw.githubusercontent.com/azimuth-cloud/capi-helm-charts/0.25.1/dependencies.json -o "$HOME/deps.json"
CAPI=$(jq -r '."cluster-api"' "$HOME/deps.json")
CAPO=$(jq -r '."cluster-api-provider-openstack"' "$HOME/deps.json")
CERT=$(jq -r '."cert-manager"' "$HOME/deps.json")
ORC=$(jq -r '."openstack-resource-controller"' "$HOME/deps.json")
CAAPH=$(jq -r '."addon-provider"' "$HOME/deps.json")
JANITOR=$(jq -r '."cluster-api-janitor-openstack"' "$HOME/deps.json")
HELM=$(jq -r '.helm' "$HOME/deps.json")
{ echo "CAPI=$CAPI"; echo "CAPO=$CAPO"; echo "CERT=$CERT"; echo "ORC=$ORC"; \
  echo "CAAPH=$CAAPH"; echo "JANITOR=$JANITOR"; echo "HELM=$HELM"; } > "$HOME/capi-pins.env"
echo "== pins (cross-check: CAPI v1.13.2 CAPO v0.14.4 CERT v1.20.2 ORC v2.5.0 CAAPH 0.12.0 JANITOR 0.11.0 HELM v3.17.3) =="
cat "$HOME/capi-pins.env"

# install helm (pinned), clusterctl (= CAPI pin), kubectl (= cluster 1.32.13)
curl -fsSL "https://get.helm.sh/helm-${HELM}-linux-amd64.tar.gz" -o /tmp/helm.tgz
sudo tar -xzf /tmp/helm.tgz -C /usr/local/bin --strip-components=1 linux-amd64/helm </dev/null
curl -fsSL "https://github.com/kubernetes-sigs/cluster-api/releases/download/${CAPI}/clusterctl-linux-amd64" -o /tmp/clusterctl
sudo install -m 0755 /tmp/clusterctl /usr/local/bin/clusterctl </dev/null
curl -fsSL "https://dl.k8s.io/release/v1.32.13/bin/linux/amd64/kubectl" -o /tmp/kubectl
sudo install -m 0755 /tmp/kubectl /usr/local/bin/kubectl </dev/null

echo "== tooling =="; helm version --short; clusterctl version; kubectl version --client 2>/dev/null | head -1
REOF
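One hazard in the pins step is that `jq -r` prints the literal string null for a key that is absent from dependencies.json and still exits 0, so a renamed key would yield a pin such as CAPI=null without any error. A minimal sketch of a guard that could run before 6.6b; check_pins is a hypothetical helper name, and the sample values are the as-built cross-check pins quoted above.

```shell
# Hypothetical helper: fail if any expected pin is missing, empty, or "null".
check_pins() {        # check_pins FILE
  local file=$1 key val rc=0
  for key in CAPI CAPO CERT ORC CAAPH JANITOR HELM; do
    val=$(sed -n "s/^${key}=//p" "$file" | head -n 1)
    if [ -z "$val" ] || [ "$val" = null ]; then
      echo "pin $key missing or null in $file" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Demo against a throwaway file holding the as-built cross-check values.
demo=$(mktemp)
printf '%s\n' CAPI=v1.13.2 CAPO=v0.14.4 CERT=v1.20.2 ORC=v2.5.0 \
              CAAPH=0.12.0 JANITOR=0.11.0 HELM=v3.17.3 > "$demo"
check_pins "$demo" && echo "pins look complete"
rm -f "$demo"
```

On the VM the call would be `check_pins "$HOME/capi-pins.env"` at the top of each later block.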

6.6b -- cert-manager (DOCFIX-025a: crds.enabled=true, NOT installCRDs)

# RUN: jumphost

ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF'
set -euo pipefail
source "$HOME/capi-pins.env"
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version "$CERT" --set crds.enabled=true --wait --timeout 5m
kubectl -n cert-manager wait --for=condition=Available deploy --all --timeout=180s
kubectl -n cert-manager get pods
REOF

6.6c -- ORC (BEFORE clusterctl init; CAPO hard-depends on the ORC Image CRD)

# RUN: jumphost -- Server-side apply (large CRDs). Manifest is the k-orc release install.yaml (D-034).

ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF'
set -euo pipefail
source "$HOME/capi-pins.env"
kubectl apply --server-side -f \
  "https://github.com/k-orc/openstack-resource-controller/releases/download/${ORC}/install.yaml"
kubectl -n orc-system wait --for=condition=Available deploy --all --timeout=180s
kubectl get crd images.openstack.k-orc.cloud
REOF

6.6d -- clusterctl init (core + kubeadm bootstrap/control-plane + CAPO)

# RUN: jumphost -- cert-manager is already present, so clusterctl detects it and skips its own install.

ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF'
set -euo pipefail
source "$HOME/capi-pins.env"
clusterctl init \
  --core "cluster-api:${CAPI}" \
  --bootstrap "kubeadm:${CAPI}" \
  --control-plane "kubeadm:${CAPI}" \
  --infrastructure "openstack:${CAPO}"
for ns in capi-system capi-kubeadm-bootstrap-system capi-kubeadm-control-plane-system capo-system; do
  echo "== $ns =="; kubectl -n "$ns" wait --for=condition=Available deploy --all --timeout=240s
done
REOF

6.6e -- CAAPH + janitor (azimuth helm charts; chart names from each repo Chart.yaml)

# RUN: jumphost

ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF'
set -euo pipefail
source "$HOME/capi-pins.env"
helm repo add capi-addon   https://azimuth-cloud.github.io/cluster-api-addon-provider
helm repo add capi-janitor https://azimuth-cloud.github.io/cluster-api-janitor-openstack
helm repo update
helm upgrade --install cluster-api-addon-provider capi-addon/cluster-api-addon-provider \
  --namespace capi-addon-system --create-namespace --version "$CAAPH" --wait --timeout 5m
helm upgrade --install cluster-api-janitor-openstack capi-janitor/cluster-api-janitor-openstack \
  --namespace capi-janitor-system --create-namespace --version "$JANITOR" --wait --timeout 5m
kubectl -n capi-addon-system   get pods
kubectl -n capi-janitor-system get pods
REOF

6.6f -- verify the stack

# RUN: jumphost

ssh $SSH_OPTS ubuntu@"$MGMT_VM" bash -s <<'REOF'
set -euo pipefail
clusterctl version
echo "== all controllers Running =="
kubectl get pods -A | grep -E 'capi-|capo-|cert-manager|orc-system|janitor|addon' || true
echo "== key CRDs present =="
kubectl get crd clusters.cluster.x-k8s.io \
  openstackclusters.infrastructure.cluster.x-k8s.io \
  kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \
  images.openstack.k-orc.cloud
REOF
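The verify block prints the controller pods but leaves the Running check to the reader. A minimal sketch of a filter that turns that listing into a pass/fail result; unhealthy_pods is a hypothetical helper name, it assumes the standard `kubectl get pods -A --no-headers` column order (NAMESPACE NAME READY STATUS ...), and the pod names in the usage comment are whatever the cluster reports.

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` lines on stdin,
# print every pod that is neither Completed nor fully-ready Running,
# and exit non-zero if any such pod exists or if the input was empty.
unhealthy_pods() {
  awk '{
    split($3, ready, "/")                 # READY column, e.g. 1/1
    ok = ($4 == "Completed") || ($4 == "Running" && ready[1] == ready[2])
    if (!ok) { print $1 "/" $2, $4, $3; bad++ }
  }
  END { exit (bad > 0 || NR == 0) }'
}

# Intended use on the mgmt VM (commented out; needs the cluster):
#   kubectl get pods -A --no-headers \
#     | grep -E 'capi-|capo-|cert-manager|orc-system|janitor|addon' \
#     | unhealthy_pods || echo "EXIT GATE not met"
```

Treating empty input as a failure keeps a mistyped grep pattern from passing the gate vacuously.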

EXIT GATE (phase-06 complete)

  • GATE 1 VIP-OK and GATE 2 agnhost Completed both passed.
  • capi-mgmt-v2 Ready (v1.32.13); ~/capi-mgmt.kubeconfig (server = FIP) works from the jumphost.
  • All CAPI controllers Running; ORC Image CRD present; no crash-looping CAPO.
  • Proceed to phase-07 (conductor graft).

As-built reference (2026-06-08/09 run -- audit trail; values are run-specific)

  • VM capi-mgmt-v2: gp.large, ubuntu-24.04-noble; tenant IP 10.20.0.45 (ens3); FIP 10.12.7.40.
  • Net capi-mgmt-net / subnet capi-mgmt-subnet 10.20.0.0/24; router capi-mgmt-router.
  • k8s-snap: 1.32-classic/stable, rev 5326, v1.32.13 (classic confinement); CNI Cilium 1.17.12-ck0.
  • pod CIDR 10.1.0.0/16; svc CIDR 10.152.183.0/24; cluster DNS 10.152.183.31.
  • GATE 2: probe pod 10.1.0.150 -> 10.12.4.50:5000, exitCode 0 / Completed (agnhost:2.40, ~9s pull).
  • Pins (capi-helm-charts 0.25.1 dependencies.json): CAPI v1.13.2 | CAPO v0.14.4 | cert-manager v1.20.2 | CAAPH 0.12.0 | janitor 0.11.0 | ORC v2.5.0 | helm v3.17.3. CAAPH/janitor deploy SHA-pinned images: 62f7c00 / d527847.
  • Tooling VM-side: helm v3.17.3, clusterctl v1.13.2, matched kubectl 1.32.13 (KUBECONFIG=/root/kubeconfig).

Next

phase-07 -- conductor graft: place ~/capi-mgmt.kubeconfig at /etc/magnum/kubeconfig on magnum/0 and stage the [capi_helm] conf.d drop-in (D-037), pointing the conductor at the FIP.