
Phase 07 -- Magnum Conductor Graft (D-031 / D-037 / D-042)

Graft the magnum-capi-helm CAPI driver onto the charm-managed conductor (magnum/0), point it at the in-cloud management cluster (phase-06) via the FIP, and land on a CONTRACT-COHERENT driver so coe cluster health reports HEALTHY. The driver upgrade (D-042) is part of the v1 baseline here, not a follow-up -- the as-first-built 1.3.0 read the version-less v1beta2 infrastructureRef and reported a cosmetic UNHEALTHY; it is superseded by the RELEASED magnum-capi-helm==1.4.0, which is the v1 end state.

Decisions: D-031 (driver/engine/surface); D-037 (conf.d drop-in + config-dir via /etc/default, NOT a systemd ExecStart drop-in); D-042 (driver must be contract-coherent with the Layer-A core; amends D-034); D-036 (driver/engine/chart coherence). Troubleshooting: appendix-A DOCFIX-021, D-037, D-042, and lessons L-P6-1..4.


Prerequisites (must be true entering phase-07)

  • phase-06 EXIT GATE passed: capi-mgmt-v2 Ready, CAPI stack up (ORC Image CRD present, no crash-looping CAPO), ~/capi-mgmt.kubeconfig (server = FIP) works from the jumphost.
  • Magnum charm live (magnum/0); the Keystone trustee domain is auto-configured by the magnum charm via its keystone (identity-credentials) relation -- verify [trust] (trustee_domain_id / trustee_domain_admin_id / trustee_domain_admin_password) is populated in magnum.conf; no manual step.
  • admin-openrc on the jumphost; juju (model openstack); jq.

Constants and env-literals (TAG: confirm per site on rebuild)

  • ENV(conductor-unit) magnum/0 (LXD 1/lxd/2 on openstack1; addr 10.12.4.76)
  • ENV(conductor-src) 10.12.4.76/32 (the conductor's provider IP; SG source)
  • ENV(mgmt-fip) 10.12.7.40 (mgmt apiserver; kubeconfig server)
  • ENV(mgmt-sg) capi-mgmt-sg (in the capi-mgmt project)
  • ENV(project) capi-mgmt (id 674171fd28d446d3a37073b6a761e910)
  • ENV(magnum-ns) magnum-674171fd28d446d3a37073b6a761e910 (driver namespace per project)
  • ENV(chart-ver) 0.25.1 (capi-helm-charts; load-bearing -- driver default is 0.10.1)
  • ENV(helm-ver) v3.17.3
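The constants above can be held in one shell fragment so that later steps derive values instead of retyping them. A minimal sketch, assuming the per-project driver namespace is always magnum-<project id> (as the ENV(magnum-ns) entry suggests) and that a Keystone project id is 32 lowercase hex characters; the variable and function names are illustrative and not part of the runbook.

```shell
#!/usr/bin/env bash
# Site constants copied from the list above (confirm per site on rebuild).
CONDUCTOR_UNIT="magnum/0"
CONDUCTOR_SRC="10.12.4.76/32"
MGMT_FIP="10.12.7.40"
MGMT_SG="capi-mgmt-sg"
PROJECT_ID="674171fd28d446d3a37073b6a761e910"
CHART_VER="0.25.1"
HELM_VER="v3.17.3"

# Hypothetical helper: derive the driver namespace from a project id.
magnum_ns() { printf 'magnum-%s\n' "$1"; }

# Hypothetical helper: reject a project NAME passed where an id is required.
# Assumption: ids are 32 lowercase hex characters.
is_project_id() { [[ "$1" =~ ^[0-9a-f]{32}$ ]]; }
```

Usage: `is_project_id "$PROJECT_ID" && magnum_ns "$PROJECT_ID"` prints the namespace the driver is expected to use for that project.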

Run-location legend

  • # RUN: jumphost -- vopenstack-jesse as jessea123 (admin-openrc).
  • # RUN: jumphost -> magnum/0 -- shipped to the conductor via juju ssh -m openstack magnum/0 '...' </dev/null (DOCFIX-021: </dev/null on every juju ssh / sudo so the remote command does not eat the heredoc/pipe).
  • Conductor facts: DEB install (magnum 18.0.1, python3.10, container base ubuntu 22.04); conductor runs as user magnum; daemon launched by an LSB init script wrapped by systemd systemd-start (NOT a direct ExecStart) -- see Step 7.7.

Step 7.1 -- Authorize the conductor source on the mgmt-cluster SG

# RUN: jumphost (scoped to the capi-mgmt project). Idempotent.

( {
  set -u
  # scope openstack CLI to the capi-mgmt project (id form -- robust to name/domain)
  source ~/admin-openrc
  unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID
  export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910      # ENV(project)
  SG=$(openstack security group show capi-mgmt-sg -f value -c id)   # ENV(mgmt-sg)
  echo "SG=$SG"
  echo "=== add ingress tcp/6443 from the conductor 10.12.4.76/32 (if absent) ==="
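  # -F: match the CIDR literally (dots are not wildcards); -x: whole row only
  openstack security group rule list "$SG" -f value -c "IP Range" -c "Port Range" \
    | grep -qFx '10.12.4.76/32 6443:6443' \
    || openstack security group rule create --protocol tcp --dst-port 6443 \
         --remote-ip 10.12.4.76/32 "$SG"
  # the list column is named "IP Protocol"; an unrecognized -c name is dropped silently
  openstack security group rule list "$SG" -f value -c "IP Protocol" -c "Port Range" -c "IP Range"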
} )

Then prove conductor -> mgmt apiserver reachability:

# RUN: jumphost -> magnum/0
juju ssh -m openstack magnum/0 \
  "timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.7.40/6443' && echo TCP-OK || echo TCP-FAIL" </dev/null

GATE: require TCP-OK. (Pre-existing jumphost rules tcp/22+6443 from 10.12.4.1/32 remain.)
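The "if absent" test in this step hinges on matching one row of the rule listing exactly. A small sketch of that check as a reusable function, assuming the input is the two-column "IP Range" / "Port Range" output of openstack security group rule list -f value; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# has_sg_rule CIDR PORT
# Reads "<IP Range> <Port Range>" rows on stdin and succeeds only if an
# exact single-port rule for CIDR exists.
has_sg_rule() {
  local cidr=$1 port=$2
  # -F: fixed string, so "." in the CIDR is not a regex wildcard.
  # -x: whole-line match, so 210.12.4.76/32 cannot satisfy 10.12.4.76/32.
  grep -qFx -- "$cidr $port:$port"
}
```

Usage: `openstack security group rule list "$SG" -f value -c "IP Range" -c "Port Range" | has_sg_rule 10.12.4.76/32 6443 || echo "rule missing"`.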

Step 7.2 -- Place the mgmt kubeconfig on the conductor [SENSITIVE; not batched]

# RUN: jumphost -> magnum/0

The source ~/capi-mgmt.kubeconfig already has its server rewritten to the FIP (phase-06 6.5). Pipe it base64-encoded straight into a root-written 0600 file owned by the conductor user; never stage the admin kubeconfig in /tmp (appendix-A: L-P6-4).

# discover the conductor service user (expect: magnum)
juju ssh -m openstack magnum/0 'systemctl show magnum-conductor -p User --value' </dev/null

# transfer (umask 077; chown to the discovered user; 0600)
# NOTE: NO trailing </dev/null here -- stdin IS the payload. A </dev/null would
# override the pipe (SC2259) and silently write an EMPTY kubeconfig while the
# && chain still exits 0. DOCFIX-021 applies only to commands whose stdin is
# NOT in use; the discovery line above keeps it, this pipe must not.
base64 ~/capi-mgmt.kubeconfig | juju ssh -m openstack magnum/0 \
  "sudo bash -c 'umask 077; base64 -d > /etc/magnum/kubeconfig && \
   getent passwd magnum >/dev/null && chown magnum: /etc/magnum/kubeconfig && \
   chmod 0600 /etc/magnum/kubeconfig'"

# verify byte-exact (hashes must match before proceeding)
sha256sum ~/capi-mgmt.kubeconfig
juju ssh -m openstack magnum/0 'sudo sha256sum /etc/magnum/kubeconfig' </dev/null

GATE: the two sha256 hashes are identical (an empty or truncated transfer fails here, not three steps later as a confusing conductor auth error). End-to-end proof (the conductor user authenticates to the mgmt cluster via the FIP):

juju ssh -m openstack magnum/0 \
  'sudo -u magnum env HOME=/tmp helm --kubeconfig /etc/magnum/kubeconfig list -A' </dev/null

Expect: the mgmt-cluster helm releases listed (cert-manager, ck-dns, ck-network cilium, cluster-api-addon-provider, cluster-api-janitor-openstack, metrics-server). GATE: a populated list = reach + auth OK. (Hardening, Roosevelt: replace this cluster-admin kubeconfig with a scoped ServiceAccount kubeconfig.)
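The byte-exact gate in this step can be made mechanical instead of an eyeball comparison. A sketch, assuming both inputs are complete sha256sum output lines; it also flags the specific failure mode called out in the transfer note (an empty remote file) by comparing against the SHA-256 of zero bytes. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# SHA-256 of zero bytes: what an empty kubeconfig hashes to.
EMPTY_SHA256="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

# hashes_match LOCAL_LINE REMOTE_LINE
# Each argument is one `sha256sum` output line ("<hash>  <path>").
# Returns 0 on match, 1 on mismatch, 2 on a malformed local hash,
# 3 if the remote file is empty.
hashes_match() {
  local l r
  l=${1%%[[:space:]]*}   # keep only the hash field
  r=${2%%[[:space:]]*}
  [[ ${#l} -eq 64 ]] || { echo "malformed local hash" >&2; return 2; }
  [[ $r != "$EMPTY_SHA256" ]] || { echo "remote kubeconfig is EMPTY" >&2; return 3; }
  [[ $l == "$r" ]]
}
```

Usage: `hashes_match "$(sha256sum ~/capi-mgmt.kubeconfig)" "$(juju ssh -m openstack magnum/0 'sudo sha256sum /etc/magnum/kubeconfig' </dev/null)" && echo HASH-OK`.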

Step 7.3 -- Confirm the driver target + served CAPI versions (D-042)

# RUN: jumphost + jumphost kubectl

The fix is the RELEASED tag magnum-capi-helm==1.4.0 (the "generalize-api-resources" feature). 1.3.0 read the version-less v1beta2 infrastructureRef and failed the health GET; 1.4.0 resolves each resource query as api_resources.get(<Kind>,{}).get("api_version", <code-default>). The driver's CODE defaults are v1beta1 for every CAPI core kind:

  • Cluster / MachineDeployment / Machine -> cluster.x-k8s.io/v1beta1
  • OpenstackCluster -> infrastructure.cluster.x-k8s.io/v1beta1
  • K8sControlPlane -> controlplane.cluster.x-k8s.io/v1beta1

IMPORTANT: the api_resources OPTION itself defaults to an EMPTY map {} -- the v1beta1 values are code-level fallbacks, NOT option defaults. This cluster serves v1beta1 (CAPI v1.13 still serves it; unserved only in v1.16), so an empty api_resources yields v1beta1 lookups that match -- no per-kind override needed.
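The lookup order described above (explicit api_resources entry first, code-level v1beta1 fallback second) can be mimicked in a few lines of shell. This only illustrates the fallback order and is not the driver's code: overrides are passed as Kind=group/version pairs, a simplification of the real JSON map, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Code-level fallbacks as listed in the text (NOT option defaults).
code_default() {
  case $1 in
    Cluster|MachineDeployment|Machine) echo "cluster.x-k8s.io/v1beta1" ;;
    OpenstackCluster) echo "infrastructure.cluster.x-k8s.io/v1beta1" ;;
    K8sControlPlane)  echo "controlplane.cluster.x-k8s.io/v1beta1" ;;
    *) return 1 ;;   # kind not covered by this sketch
  esac
}

# resolve_api_version KIND [OVERRIDES]
# OVERRIDES: whitespace-separated "Kind=group/version" pairs standing in
# for the api_resources map; an empty value models api_resources = {}.
resolve_api_version() {
  local kind=$1 pair
  for pair in ${2:-}; do
    if [[ ${pair%%=*} == "$kind" ]]; then
      printf '%s\n' "${pair#*=}"   # explicit entry wins
      return 0
    fi
  done
  code_default "$kind"             # otherwise fall back to the code default
}
```

With no overrides every kind resolves to its v1beta1 default; passing "Cluster=cluster.x-k8s.io/v1beta2" changes only Cluster and leaves the other kinds on the fallback.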

Sanity-confirm v1beta1 is served per group before installing:

( {
  export KUBECONFIG="$HOME/capi-mgmt.kubeconfig"
  # api-versions lists EVERY served group/version. api-resources shows only the
  # PREFERRED version per kind, so it can hide a v1beta1 that is still served.
  served=$(kubectl api-versions 2>/dev/null)
  for g in cluster.x-k8s.io controlplane.cluster.x-k8s.io infrastructure.cluster.x-k8s.io \
           bootstrap.cluster.x-k8s.io addons.cluster.x-k8s.io; do
    echo "== $g =="; awk -F/ -v g="$g" '$1 == g' <<<"$served"
  done
} )
#   Expect a <group>/v1beta1 line for: cluster.x-k8s.io (Cluster/MachineDeployment/Machine),
#   controlplane.cluster.x-k8s.io (KubeadmControlPlane), infrastructure.cluster.x-k8s.io
#   (OpenStackCluster -- verified anchor). If a CORE group serves ONLY v1beta2, override
#   just its kinds via api_resources in Step 7.6; otherwise the defaults work as-is.
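When scripting this gate, note that kubectl api-resources reports only the preferred version of each resource, while kubectl api-versions lists every served group/version. A sketch of a per-group check built on the latter, assuming one group/version per input line; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# served_versions GROUP
# stdin: `kubectl api-versions` output; prints the versions served for GROUP.
served_versions() {
  # Split on "/" and compare the group exactly, so cluster.x-k8s.io does
  # not also match controlplane.cluster.x-k8s.io.
  awk -F/ -v g="$1" '$1 == g { print $2 }'
}

# serves GROUP VERSION
# stdin: `kubectl api-versions` output; succeeds if GROUP serves VERSION.
serves() {
  served_versions "$1" | grep -qxF -- "$2"
}
```

Usage: `kubectl api-versions | serves cluster.x-k8s.io v1beta1 || echo "v1beta1 NOT served: set an api_resources override"`.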

Step 7.4 -- Install the driver (1.4.0) + helm in the conductor container

# RUN: jumphost -> magnum/0

pip's --no-deps preserves the deb-managed oslo stack (no PEP 668 issue on the 22.04 container).

# egress pre-check
juju ssh -m openstack magnum/0 \
  'curl -s -o /dev/null -w "pypi:%{http_code}\n" https://pypi.org/simple/ ; \
   curl -s -o /dev/null -w "helm:%{http_code}\n" https://get.helm.sh/' </dev/null

# helm v3.17.3 (if not already present from a prior graft)
juju ssh -m openstack magnum/0 'command -v helm && helm version --short || echo "helm absent -- install v3.17.3 from get.helm.sh tarball to /usr/local/bin/helm"' </dev/null

# install the RELEASED contract-coherent driver (supersedes 1.3.0)
juju ssh -m openstack magnum/0 'sudo python3 -m pip install --no-deps --upgrade "magnum-capi-helm==1.4.0"' </dev/null

# verify the install + entry point
juju ssh -m openstack magnum/0 \
  'python3 -m pip show magnum-capi-helm | grep -E "^(Version|Location):"; \
   python3 -c "import importlib.metadata as m; print([e.name for e in m.entry_points(group=\"magnum.drivers\")])"' </dev/null

Expect: Version 1.4.0; k8s_capi_helm_v1 present in the entry points.
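If the helm check above reports "helm absent", the tarball install it mentions can be sketched as follows. Assumptions: the conductor container is linux/amd64, curl and tar are present, and the release tarball follows the usual get.helm.sh naming (helm-<version>-linux-<arch>.tar.gz, unpacking to linux-<arch>/helm). The sketch does no checksum verification, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# helm_tarball_url VERSION [ARCH]  (ARCH defaults to amd64 -- an assumption)
helm_tarball_url() {
  local ver=$1 arch=${2:-amd64}
  printf 'https://get.helm.sh/helm-%s-linux-%s.tar.gz\n' "$ver" "$arch"
}

# install_helm VERSION [ARCH]
# Run ON the conductor: downloads, unpacks to a temp dir, and installs the
# binary to /usr/local/bin/helm. No checksum verification in this sketch.
install_helm() {
  local ver=$1 arch=${2:-amd64} tmp rc=0
  tmp=$(mktemp -d) || return 1
  curl -fsSL "$(helm_tarball_url "$ver" "$arch")" | tar -xz -C "$tmp" \
    && sudo install -m 0755 "$tmp/linux-$arch/helm" /usr/local/bin/helm \
    || rc=$?
  rm -rf "$tmp"
  return "$rc"
}
```

Usage on the conductor: `install_helm v3.17.3 && helm version --short`.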

Step 7.5 -- api_resources (D-042; set EXPLICITLY to an empty map on this cluster)

1.4.0 exposes ONE [capi_helm] option for this -- api_resources, a JSON string mapping CAPI kinds (Cluster, OpenstackCluster, MachineDeployment, K8sControlPlane, Machine, Manifests, HelmRelease) to {api_version, plural_name}. The driver's CODE falls back to v1beta1 for every CAPI core kind when that kind is absent from the map (Step 7.3), and this cluster serves v1beta1 -- so the map's CONTENTS are empty here. But set it EXPLICITLY to {} in the drop-in (Step 7.6) rather than omit it: the option's registered default is a Python dict {} and the driver runs json.loads() on the value, so an explicit string {} avoids depending on how oslo coerces a non-string default (not empirically testable in the build environment -- explicit-set is the safe choice). Override a specific kind ONLY if Step 7.3 showed it serves ONLY v1beta2, e.g. api_resources = {"Cluster": {"api_version": "cluster.x-k8s.io/v1beta2"}}.

Step 7.6 -- Stage the [capi_helm] conf.d drop-in (D-037)

# RUN: jumphost -> magnum/0

The drop-in is 0644 root and carries NO secrets (it only points at the 0600 kubeconfig). The default_helm_chart_version = 0.25.1 line is LOAD-BEARING (the driver's built-in default is 0.10.1, the retired v1alpha6-era chart). api_resources is set to an explicit empty map {} (Step 7.5: the driver's code falls back to v1beta1 for every CAPI kind, which this cluster serves; an explicit {} avoids the dict-default json.loads question). Keep the file ASCII only.

juju ssh -m openstack magnum/0 "sudo tee /etc/magnum/magnum.conf.d/00-capi-helm.conf >/dev/null <<'CONF'
[capi_helm]
kubeconfig_file = /etc/magnum/kubeconfig
helm_chart_repo = https://azimuth-cloud.github.io/capi-helm-charts
helm_chart_name = openstack-cluster
default_helm_chart_version = 0.25.1
api_resources = {}
CONF" </dev/null

If (and only if) Step 7.3 showed a core kind is v1beta2-only, replace the api_resources = {} line with the override (do not leave two api_resources lines in the file) -- ONE line, a JSON value naming just the kinds that need it:

    # api_resources = {"Cluster": {"api_version": "cluster.x-k8s.io/v1beta2"}, ...}
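One way to keep a single api_resources line, with or without an override, is to generate the drop-in from a small function rather than editing it by hand. A sketch under that assumption: the function name is hypothetical, and the keys and values are the ones staged in this step.

```shell
#!/usr/bin/env bash
# render_capi_helm_conf CHART_VERSION [API_RESOURCES_JSON]
# Prints the [capi_helm] drop-in. The second argument defaults to the
# explicit empty map; the JSON is emitted verbatim on ONE line.
render_capi_helm_conf() {
  local chart_ver=$1 api_resources='{}'
  if [[ $# -ge 2 ]]; then
    api_resources=$2
  fi
  cat <<CONF
[capi_helm]
kubeconfig_file = /etc/magnum/kubeconfig
helm_chart_repo = https://azimuth-cloud.github.io/capi-helm-charts
helm_chart_name = openstack-cluster
default_helm_chart_version = ${chart_ver}
api_resources = ${api_resources}
CONF
}
```

Usage: `render_capi_helm_conf 0.25.1 '{"Cluster": {"api_version": "cluster.x-k8s.io/v1beta2"}}'` prints a drop-in whose only api_resources line carries the override; the output can then be shipped to the conductor the same way as the heredoc above.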

Re-check ASCII cleanliness:

juju ssh -m openstack magnum/0 \
  'LC_ALL=C grep -nP "[^\x00-\x7F]" /etc/magnum/magnum.conf.d/00-capi-helm.conf && echo NON-ASCII || echo "ASCII clean"' </dev/null
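The same ASCII gate can be expressed without grep -P, which is useful where grep lacks PCRE support. A sketch that counts bytes outside 0x00-0x7F with tr; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# non_ascii_count FILE
# Deletes every ASCII byte (octal 000-177) and counts what is left.
non_ascii_count() {
  LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d '[:space:]'
}

# is_ascii_clean FILE -- succeeds only if FILE has no non-ASCII bytes.
is_ascii_clean() {
  [ "$(non_ascii_count "$1")" -eq 0 ]
}
```

Usage on the conductor: `is_ascii_clean /etc/magnum/magnum.conf.d/00-capi-helm.conf && echo "ASCII clean" || echo NON-ASCII`.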

Step 7.7 -- Wire config-dir injection via /etc/default (D-037 REVISED; NOT a systemd drop-in)

# RUN: jumphost -> magnum/0

These OpenStack debs run the daemon through an LSB init script wrapped by systemd systemd-start, so a systemd ExecStart drop-in is INERT (appendix-A: D-037, L-P6-1/L-P6-2). The sanctioned extension point is /etc/default/magnum-conductor, which the init script sources AFTER it assembles the base --config-file argument. The charm does not manage that file.

# confirm the daemon currently has NO --config-dir (the problem we are fixing)
juju ssh -m openstack magnum/0 'ps -ww -C magnum-conductor -o args=' </dev/null

# create the per-service extension (literal $DAEMON_ARGS -- it expands at source time)
juju ssh -m openstack magnum/0 \
  "echo 'DAEMON_ARGS=\"\$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d\"' \
   | sudo tee /etc/default/magnum-conductor >/dev/null && \
   sudo chmod 0644 /etc/default/magnum-conductor" </dev/null

# DRY-RUN verify WITHOUT restarting: the init script's own show-args echoes the assembled cmdline
juju ssh -m openstack magnum/0 '/etc/init.d/magnum-conductor show-args' </dev/null

GATE: show-args must show BOTH --config-file=/etc/magnum/magnum.conf AND --config-dir /etc/magnum/magnum.conf.d. Do not restart until this passes. RESIDUAL (logged): if a future charm hook ever writes /etc/default/magnum-conductor, the append is lost and [capi_helm] silently stops being read -- detect via show-args/ps.
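The mechanism and the gate can both be exercised off-box. The sketch below mimics what the text says the init script does (set the base arguments, then source the defaults file) and then checks an assembled command line for both flags. It is a model of that behavior, not the packaged init script, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# assemble_args DEFAULTS_FILE
# Model of the init script: base args first, then source the defaults
# file so a literal $DAEMON_ARGS inside it expands at source time.
# Runs in a subshell so nothing leaks into the caller.
assemble_args() (
  DAEMON_ARGS="--config-file=/etc/magnum/magnum.conf"
  [ -r "$1" ] && . "$1"
  printf '%s\n' "$DAEMON_ARGS"
)

# args_have_config CMDLINE
# Succeeds only if BOTH flags are present; accepts "--flag=value" and
# "--flag value" spellings.
args_have_config() {
  local args=" $1 " sep='[[:space:]=]' end='[[:space:]]'
  local re_file="--config-file${sep}/etc/magnum/magnum\.conf${end}"
  local re_dir="--config-dir${sep}/etc/magnum/magnum\.conf\.d${end}"
  [[ $args =~ $re_file && $args =~ $re_dir ]]
}
```

Usage against the real unit: `args_have_config "$(juju ssh -m openstack magnum/0 '/etc/init.d/magnum-conductor show-args' </dev/null)" && echo GATE-OK`. The same check works on the ps output after the restart.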

Step 7.8 -- Restart conductor + verify driver + HEALTHY (P6e + D-042 Stage 6)

# RUN: jumphost -> magnum/0, then jumphost health poll.

juju ssh -m openstack magnum/0 \
  'sudo systemctl restart magnum-conductor && sleep 3 && systemctl is-active magnum-conductor && \
   ps -ww -C magnum-conductor -o args=' </dev/null
# expect: active; live cmdline carries --config-dir.

# braces group the fallback: without them the full list prints even when grep matches
juju ssh -m openstack magnum/0 'sudo magnum-driver-manage list-drivers 2>/dev/null | grep capi || \
   { echo "driver list (full):"; sudo magnum-driver-manage list-drivers; }' </dev/null
# expect: k8s_capi_helm_v1 listed.

Health poll (the D-042 fix target -- this is what 1.3.0 reported UNHEALTHY):

FRESH DEPLOY ROUTING: on a clean redeploy NO cluster exists yet, so there is nothing to poll -- SKIP this poll; the gate is discharged in phase-08 step 8.2 (capi-test-1 reaching health_status = HEALTHY). The poll below applies when grafting onto a cloud that already has a CAPI-driver cluster: substitute that cluster's name and the current ENV(project) id (both are run-specific).

( {
  source ~/admin-openrc
  unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID
  export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910       # ENV(project)
  for i in $(seq 1 10); do
    echo "[$i] health=$(openstack coe cluster show capi-test-1 -f value -c health_status 2>/dev/null)"
    echo "    reason=$(openstack coe cluster show capi-test-1 -f value -c health_status_reason 2>/dev/null)"
    sleep 20
  done
} )

GATE (existing-cluster graft only): health_status -> HEALTHY, with the infrastructure sub-check now Ready (it was the only failing axis under 1.3.0). On a FRESH DEPLOY this gate is deferred to phase-08 step 8.2 -- do not block here. If it does not clear on an existing-cluster graft, go to Rollback.
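For an existing-cluster graft, the poll can stop as soon as the gate clears and return a usable exit status. A sketch of such a loop, assuming the wrapped command prints the current health_status on stdout; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# wait_healthy TRIES INTERVAL CMD...
# Runs CMD up to TRIES times, INTERVAL seconds apart, and returns 0 as
# soon as it prints HEALTHY; returns 1 if it never does.
wait_healthy() {
  local tries=$1 interval=$2 status i
  shift 2
  for ((i = 1; i <= tries; i++)); do
    status=$("$@" 2>/dev/null) || true
    echo "[$i] health=${status:-<none>}" >&2   # progress goes to stderr
    if [[ $status == HEALTHY ]]; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}
```

Usage (same scope as the poll above, cluster name substituted per run): `wait_healthy 10 20 openstack coe cluster show capi-test-1 -f value -c health_status || echo "gate not cleared -- see Rollback"`.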

Step 7.9 -- Regression check (confirm create/manage path intact)

# RUN: jumphost (capi-mgmt scope). Prove the upgraded driver still creates+deletes.

FRESH DEPLOY ROUTING: SKIP this step -- the capi-k8s-v1-32 template does not exist yet (phase-08 step 8.0 creates it), and phase-08 itself (create capi-test-1 to CREATE_COMPLETE, full acceptance, then 8.5 delete) is a superset of this check. Run 7.9 as written only when grafting onto an existing cloud where the template is present.

openstack coe cluster create capi-fix-check --cluster-template capi-k8s-v1-32 \
  --keypair capi-mgmt-key --master-count 1 --node-count 1
# watch to CREATE_COMPLETE, then:
openstack coe cluster delete capi-fix-check    # watch to gone

Rollback (TEMPORARY holding state only -- if 7.8 health does not clear or 7.9 regresses)

# RUN: jumphost -> magnum/0

Reverts to the as-first-built functional (cosmetic-UNHEALTHY) state on 1.3.0. This is a TEMPORARY holding state that keeps the conductor serving while the 1.4.0 issue is diagnosed, NOT a v1 end state: v1 is NOT complete until magnum-capi-helm==1.4.0 is installed and health_status = HEALTHY (D-011). Re-attempt 7.3-7.9 after diagnosis.

juju ssh -m openstack magnum/0 'sudo python3 -m pip install --no-deps --force-reinstall "magnum-capi-helm==1.3.0"' </dev/null
# restore the config backup if you snapshotted one, then:
juju ssh -m openstack magnum/0 'sudo systemctl restart magnum-conductor' </dev/null
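The rollback mentions restoring a config backup "if you snapshotted one", but the steps above do not define a snapshot. A hypothetical helper pair that could be run on the conductor before Steps 7.6 and 7.7 touch any file; the names and the backup naming scheme are assumptions, not part of the runbook.

```shell
#!/usr/bin/env bash
# snapshot_file PATH
# Copies PATH to PATH.bak-<UTC timestamp>, preserving mode and ownership
# where permitted, and prints the backup path.
snapshot_file() {
  local src=$1 dst
  [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
  dst="$src.bak-$(date -u +%Y%m%dT%H%M%SZ)"
  cp -p -- "$src" "$dst" && printf '%s\n' "$dst"
}

# restore_file BACKUP PATH -- puts a snapshot back in place.
restore_file() {
  cp -p -- "$1" "$2"
}
```

Usage: `bak=$(sudo bash -c "$(declare -f snapshot_file); snapshot_file /etc/default/magnum-conductor")` before editing; during rollback, copy "$bak" back over the original and then restart the conductor.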

EXIT GATE (phase-07 complete)

  • Conductor reaches the mgmt apiserver via the FIP (TCP-OK); kubeconfig 0600/magnum; helm list OK.
  • magnum-capi-helm 1.4.0 installed (contract-coherent, RELEASED); k8s_capi_helm_v1 enumerated.
  • [capi_helm] drop-in read by the conductor (--config-dir present in the live cmdline).
  • health_status = HEALTHY (infrastructure Ready) on a CAPI-driver cluster -- D-042 issue eliminated. FRESH DEPLOY: no cluster exists yet; this item is DEFERRED to phase-08 step 8.2 (existing-cluster graft: verify here on that cluster).
  • Regression create/delete passed (FRESH DEPLOY: deferred -- phase-08 8.1-8.5 is the superset proof).
  • Proceed to phase-08 (workload-cluster acceptance + D-011).

As-built reference (2026-06-08/09 graft -- audit trail)

  • magnum/0: LXD 1/lxd/2 on openstack1, addr 10.12.4.76, charm magnum 2024.1/stable rev 70, DEB magnum 18.0.1, python3.10, container ubuntu 22.04; conductor user magnum.
  • As-FIRST-built driver: 1.3.0 (pip --no-deps) -> read the version-less v1beta2 ref -> health UNHEALTHY (D-042). PHASE-07 BASELINE supersedes this with the RELEASED magnum-capi-helm==1.4.0 (api_resources option; code-level fallback v1beta1).
  • kubeconfig: /etc/magnum/kubeconfig, -rw------- magnum, ~5657 bytes, server = FIP 10.12.7.40:6443.
  • conf.d drop-in /etc/magnum/magnum.conf.d/00-capi-helm.conf: kubeconfig_file, helm_chart_repo (azimuth), helm_chart_name openstack-cluster, default_helm_chart_version 0.25.1 (api_resources left default -- v1beta1 served by CAPI v1.13.2 / CAPO v0.14.4).
  • config-dir injection: /etc/default/magnum-conductor DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"; verified live via ps and the init script show-args.
  • helm v3.17.3 at /usr/local/bin/helm.
  • Driver internals (reference, from installed source): routes on (server_type vm, os ubuntu, coe kubernetes); k8s version comes from the IMAGE kube_version property (NOT a template label), os_distro=ubuntu; flavor floor 2048 MB / 2 vCPU; auto-mints an app credential (workload nodes use the PUBLIC keystone interface); apiServer ALWAYS provisions an Octavia LB (+FIP default).

Next

phase-08 -- workload-cluster acceptance: create a tenant cluster from template capi-k8s-v1-32, confirm CREATE_COMPLETE + Ready nodes + Calico + LB, and run the D-011 (amended per D-019) acceptance criteria.