Graft the magnum-capi-helm CAPI driver onto the charm-managed conductor (magnum/0), point it at the in-cloud management cluster (phase-06) via the FIP, and land on a CONTRACT-COHERENT driver so coe cluster health reports HEALTHY. The driver upgrade (D-042) is part of the v1 baseline here, not a follow-up -- the as-first-built 1.3.0 read the version-less v1beta2 infrastructureRef and reported a cosmetic UNHEALTHY; it is superseded by the RELEASED magnum-capi-helm==1.4.0, which is the v1 end state.
Decisions: D-031 (driver/engine/surface), D-037 (conf.d drop-in + config-dir via /etc/default, NOT a systemd ExecStart drop-in), D-042 (driver must be contract-coherent with the Layer-A core; amends D-034), D-036 (driver/engine/chart coherence). Troubleshooting: appendix-A DOCFIX-021, D-037, D-042, and lessons L-P6-1..4.
capi-mgmt-v2 Ready, CAPI stack up (ORC Image CRD present, no crash-looping CAPO), ~/capi-mgmt.kubeconfig (server = FIP) works from the jumphost.
magnum/0); the Keystone trustee domain is auto-configured by the magnum charm via its keystone (identity-credentials) relation -- verify [trust] (trustee_domain_id / trustee_domain_admin_id / trustee_domain_admin_password) is populated in magnum.conf; no manual step.
admin-openrc on the jumphost; juju (model openstack); jq.
ENV(conductor-unit) magnum/0 (LXD 1/lxd/2 on openstack1; addr 10.12.4.76)
ENV(conductor-src) 10.12.4.76/32 (the conductor's provider IP; SG source)
ENV(mgmt-fip) 10.12.7.40 (mgmt apiserver; kubeconfig server)
ENV(mgmt-sg) capi-mgmt-sg (in the capi-mgmt project)
ENV(project) capi-mgmt (id 674171fd28d446d3a37073b6a761e910)
ENV(magnum-ns) magnum-674171fd28d446d3a37073b6a761e910 (driver namespace per project)
ENV(chart-ver) 0.25.1 (capi-helm-charts; load-bearing -- driver default is 0.10.1)
ENV(helm-ver) v3.17.3
# RUN: jumphost -- vopenstack-jesse as jessea123 (admin-openrc).
# RUN: jumphost -> magnum/0 -- shipped to the conductor via juju ssh -m openstack magnum/0 '...' </dev/null (DOCFIX-021: </dev/null on every juju ssh / sudo so the remote command does not eat the heredoc/pipe).
magnum; daemon launched by an LSB init script wrapped by systemd systemd-start (NOT a direct ExecStart) -- see Step 7.7.
# RUN: jumphost (scoped to the capi-mgmt project). Idempotent.
( {
set -u
# scope openstack CLI to the capi-mgmt project (id form -- robust to name/domain)
source ~/admin-openrc
unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID
export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 # ENV(project)
SG=$(openstack security group show capi-mgmt-sg -f value -c id) # ENV(mgmt-sg)
echo "SG=$SG"
echo "=== add ingress tcp/6443 from the conductor 10.12.4.76/32 (if absent) ==="
openstack security group rule list "$SG" -f value -c "IP Range" -c "Port Range" \
| grep -q '10.12.4.76/32 6443:6443' \
|| openstack security group rule create --proto tcp --dst-port 6443 \
--remote-ip 10.12.4.76/32 "$SG"
openstack security group rule list "$SG" -f value -c Protocol -c "Port Range" -c "IP Range"
} )
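The grep -q guard above treats the dots in the CIDR as regex wildcards, so it could in principle match an unrelated rule. A minimal sketch of a stricter check, assuming the rule list is emitted one rule per line as "IP-Range Port-Range" (as the -f value output above is expected to be); sg_rule_present is a hypothetical helper name, not part of the openstack CLI:

```shell
# Hypothetical helper: read "CIDR PORT:PORT" lines on stdin and succeed only
# if the exact rule is present. -F matches the CIDR literally (no regex dots),
# -x requires the whole line to match.
sg_rule_present() {
  grep -qxF -- "$1 $2:$2"
}

# Example wiring (assumes $SG is set as in the block above):
#   openstack security group rule list "$SG" -f value -c "IP Range" -c "Port Range" \
#     | sg_rule_present 10.12.4.76/32 6443 \
#     || openstack security group rule create --proto tcp --dst-port 6443 \
#          --remote-ip 10.12.4.76/32 "$SG"
```

If the CLI pads or reorders the columns, the whole-line match fails safe: the rule create runs again and reports the existing rule as a conflict rather than skipping silently.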
Then prove conductor -> mgmt apiserver reachability:
# RUN: jumphost -> magnum/0
juju ssh -m openstack magnum/0 \
"timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.7.40/6443' && echo TCP-OK || echo TCP-FAIL" </dev/null
GATE: require TCP-OK. (Pre-existing jumphost rules tcp/22+6443 from 10.12.4.1/32 remain.)
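When the probe is scripted, the TCP-OK / TCP-FAIL marker can be turned into an exit status so the run stops at the gate. A minimal sketch; gate_tcp is a hypothetical helper name and the messages are illustrative:

```shell
# Hypothetical gate: read the probe output on stdin and fail unless one line
# is exactly TCP-OK. tr strips carriage returns in case the session used a pty.
gate_tcp() {
  if tr -d '\r' | grep -qx 'TCP-OK'; then
    echo "GATE PASS: conductor reaches the mgmt apiserver"
  else
    echo "GATE FAIL: no TCP-OK in probe output" >&2
    return 1
  fi
}

# Example wiring:
#   juju ssh -m openstack magnum/0 \
#     "timeout 6 bash -c 'exec 3<>/dev/tcp/10.12.7.40/6443' && echo TCP-OK || echo TCP-FAIL" \
#     </dev/null | gate_tcp || exit 1
```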
# RUN: jumphost -> magnum/0
The source ~/capi-mgmt.kubeconfig already has its server rewritten to the FIP (phase-06 6.5). Transfer base64-piped straight into a root-written 0600 file owned by the conductor user -- never stage the admin kubeconfig in /tmp (appendix-A: L-P6-4).
# discover the conductor service user (expect: magnum)
juju ssh -m openstack magnum/0 'systemctl show magnum-conductor -p User --value' </dev/null
# transfer (umask 077; chown to the discovered user; 0600)
# NOTE: NO trailing </dev/null here -- stdin IS the payload. A </dev/null would
# override the pipe (SC2259) and silently write an EMPTY kubeconfig while the
# && chain still exits 0. DOCFIX-021 applies only to commands whose stdin is
# NOT in use; the discovery line above keeps it, this pipe must not.
base64 ~/capi-mgmt.kubeconfig | juju ssh -m openstack magnum/0 \
"sudo bash -c 'umask 077; base64 -d > /etc/magnum/kubeconfig && \
getent passwd magnum >/dev/null && chown magnum: /etc/magnum/kubeconfig && \
chmod 0600 /etc/magnum/kubeconfig'"
# verify byte-exact (hashes must match before proceeding)
sha256sum ~/capi-mgmt.kubeconfig
juju ssh -m openstack magnum/0 'sudo sha256sum /etc/magnum/kubeconfig' </dev/null
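The byte-exact comparison can be scripted rather than eyeballed. A minimal sketch, assuming both inputs are sha256sum output lines of the form "HASH  PATH"; same_sha256 is a hypothetical helper name. It also rejects the well-known SHA-256 of zero bytes, which is what an empty transfer would hash to:

```shell
# sha256 of zero bytes: a transfer that wrote an empty file hashes to this.
EMPTY_SHA256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

# Hypothetical helper: takes two `sha256sum` output lines ("HASH  PATH") and
# succeeds only if both hashes are equal, non-empty, and not the empty-file hash.
same_sha256() {
  a=${1%% *}   # keep everything before the first space (the hash)
  b=${2%% *}
  [ -n "$a" ] && [ "$a" = "$b" ] && [ "$a" != "$EMPTY_SHA256" ]
}

# Example wiring:
#   local_sum=$(sha256sum ~/capi-mgmt.kubeconfig)
#   remote_sum=$(juju ssh -m openstack magnum/0 'sudo sha256sum /etc/magnum/kubeconfig' </dev/null)
#   same_sha256 "$local_sum" "$remote_sum" && echo "GATE PASS" || echo "GATE FAIL"
```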
GATE: the two sha256 hashes are identical (an empty or truncated transfer fails here, not three steps later as a confusing conductor auth error). End-to-end proof (the conductor user authenticates to the mgmt cluster via the FIP):
juju ssh -m openstack magnum/0 \
'sudo -u magnum env HOME=/tmp helm --kubeconfig /etc/magnum/kubeconfig list -A' </dev/null
Expect: the mgmt-cluster helm releases listed (cert-manager, ck-dns, ck-network cilium, cluster-api-addon-provider, cluster-api-janitor-openstack, metrics-server). GATE: a populated list = reach + auth OK. (Hardening, Roosevelt: replace this cluster-admin kubeconfig with a scoped ServiceAccount kubeconfig.)
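The "populated list" gate can be checked mechanically by counting release rows. A minimal sketch, assuming helm list -A prints one header line followed by one row per release; helm_list_populated is a hypothetical helper name:

```shell
# Hypothetical gate: read `helm list -A` output on stdin, print the number of
# release rows after the header, and fail if there are none.
helm_list_populated() {
  awk 'NR > 1 && NF { n++ } END { print n + 0; exit (n == 0) }'
}

# Example wiring:
#   juju ssh -m openstack magnum/0 \
#     'sudo -u magnum env HOME=/tmp helm --kubeconfig /etc/magnum/kubeconfig list -A' \
#     </dev/null | helm_list_populated || echo "GATE FAIL: no releases listed"
```

A header-only listing means the conductor reached the cluster but no releases were visible, so it is treated as a failure just like an unreachable or unauthenticated call.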
# RUN: jumphost + jumphost kubectl.
The fix is the RELEASED tag magnum-capi-helm==1.4.0 (the "generalize-api-resources" feature). 1.3.0 read the version-less v1beta2 infrastructureRef and failed the health GET; 1.4.0 resolves each resource query as api_resources.get(<Kind>,{}).get("api_version", <code-default>), where the driver's CODE defaults are v1beta1 for every CAPI core kind (Cluster / MachineDeployment / Machine -> cluster.x-k8s.io/v1beta1; OpenstackCluster -> infrastructure.cluster.x-k8s.io/v1beta1; K8sControlPlane -> controlplane.cluster.x-k8s.io/v1beta1).
IMPORTANT: the api_resources OPTION itself defaults to an EMPTY map {} -- the v1beta1 values are code-level fallbacks, NOT option defaults. This cluster serves v1beta1 (CAPI v1.13 still serves it; unserved only in v1.16), so an empty api_resources yields v1beta1 lookups that match -- no per-kind override needed.
Sanity-confirm v1beta1 is served per group before installing:
( {
export KUBECONFIG="$HOME/capi-mgmt.kubeconfig"
for g in cluster.x-k8s.io controlplane.cluster.x-k8s.io infrastructure.cluster.x-k8s.io \
bootstrap.cluster.x-k8s.io addons.cluster.x-k8s.io; do
echo "== $g =="; kubectl api-resources --api-group="$g" 2>/dev/null | awk 'NR==1 || /v1beta1/'
done
} )
# Expect v1beta1 for: cluster.x-k8s.io (Cluster/MachineDeployment/Machine),
# controlplane.cluster.x-k8s.io (KubeadmControlPlane), infrastructure.cluster.x-k8s.io
# (OpenStackCluster -- verified anchor). If a CORE kind serves ONLY v1beta2, override
# just that kind via api_resources in Step 7.6; otherwise the defaults work as-is.
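kubectl api-resources reports each resource at its group's preferred version, so when a group serves both v1beta1 and v1beta2 the v1beta1 filter above may print only the header row even though v1beta1 is still served. A minimal cross-check sketch, assuming kubectl api-versions (which lists every served group/version) is run against the same kubeconfig; missing_v1beta1 is a hypothetical helper name:

```shell
# Hypothetical helper: read `kubectl api-versions` output on stdin and print
# every group given as an argument that does NOT list GROUP/v1beta1.
missing_v1beta1() {
  served=$(cat)
  for g in "$@"; do
    printf '%s\n' "$served" | grep -qxF -- "$g/v1beta1" || echo "$g"
  done
}

# Example wiring (empty output means every listed group serves v1beta1):
#   KUBECONFIG="$HOME/capi-mgmt.kubeconfig" kubectl api-versions \
#     | missing_v1beta1 cluster.x-k8s.io controlplane.cluster.x-k8s.io \
#         infrastructure.cluster.x-k8s.io
```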
# RUN: jumphost -> magnum/0
--no-deps preserves the deb-managed oslo stack (no PEP668 issue on the 22.04 container).
# egress pre-check
juju ssh -m openstack magnum/0 \
'curl -s -o /dev/null -w "pypi:%{http_code}\n" https://pypi.org/simple/ ; \
curl -s -o /dev/null -w "helm:%{http_code}\n" https://get.helm.sh/' </dev/null
# helm v3.17.3 (if not already present from a prior graft)
juju ssh -m openstack magnum/0 'command -v helm && helm version --short || echo "helm absent -- install v3.17.3 from get.helm.sh tarball to /usr/local/bin/helm"' </dev/null
# install the RELEASED contract-coherent driver (supersedes 1.3.0)
juju ssh -m openstack magnum/0 'sudo python3 -m pip install --no-deps --upgrade "magnum-capi-helm==1.4.0"' </dev/null
# verify the install + entry point
juju ssh -m openstack magnum/0 \
'pip show magnum-capi-helm | grep -E "Version|Location"; \
python3 -c "import importlib.metadata as m; print([e.name for e in m.entry_points(group=\"magnum.drivers\")])"' </dev/null
Expect: Version 1.4.0; k8s_capi_helm_v1 present in the entry points.
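The version expectation can be enforced as a gate instead of read off the screen. A minimal sketch, assuming pip show prints a "Version: X" line; pip_version_is is a hypothetical helper name:

```shell
# Hypothetical gate: read `pip show` output on stdin and succeed only if the
# "Version:" field equals the wanted version (compared as strings).
pip_version_is() {
  awk -v want="$1" '$1 == "Version:" { ok = (($2 "") == (want "")) } END { exit !ok }'
}

# Example wiring:
#   juju ssh -m openstack magnum/0 'pip show magnum-capi-helm' </dev/null \
#     | pip_version_is 1.4.0 || echo "GATE FAIL: magnum-capi-helm is not 1.4.0"
```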
1.4.0 exposes ONE [capi_helm] option for this -- api_resources, a JSON string mapping CAPI kinds (Cluster, OpenstackCluster, MachineDeployment, K8sControlPlane, Machine, Manifests, HelmRelease) to {api_version, plural_name}. The driver's CODE falls back to v1beta1 for every CAPI core kind when that kind is absent from the map (Step 7.3), and this cluster serves v1beta1 -- so the map's CONTENTS are empty here. But set it EXPLICITLY to {} in the drop-in (Step 7.6) rather than omit it: the option's registered default is a Python dict {} and the driver runs json.loads() on the value, so an explicit string {} avoids depending on how oslo coerces a non-string default (not empirically testable in the build environment -- explicit-set is the safe choice). Override a specific kind ONLY if Step 7.3 showed it serves ONLY v1beta2, e.g. api_resources = {"Cluster": {"api_version": "cluster.x-k8s.io/v1beta2"}}.
# RUN: jumphost -> magnum/0
0644 root, NO secrets (it points at the 0600 kubeconfig). The default_helm_chart_version = 0.25.1 line is LOAD-BEARING (driver built-in default is 0.10.1, the retired v1alpha6-era chart). api_resources is set to an explicit empty map {} (Step 7.5 -- the driver's code falls back to v1beta1 for every CAPI kind, which this cluster serves; explicit {} avoids the dict-default json.loads question). ASCII only.
juju ssh -m openstack magnum/0 "sudo tee /etc/magnum/magnum.conf.d/00-capi-helm.conf >/dev/null <<'CONF'
[capi_helm]
kubeconfig_file = /etc/magnum/kubeconfig
helm_chart_repo = https://azimuth-cloud.github.io/capi-helm-charts
helm_chart_name = openstack-cluster
default_helm_chart_version = 0.25.1
api_resources = {}
CONF" </dev/null
If (and only if) Step 7.3 showed a core kind is v1beta2-only, append the override -- ONE line, a JSON value naming just the kinds that need it:
# api_resources = {"Cluster": {"api_version": "cluster.x-k8s.io/v1beta2"}, ...}
Re-check ASCII cleanliness:
juju ssh -m openstack magnum/0 \
'LC_ALL=C grep -nP "[^\x00-\x7F]" /etc/magnum/magnum.conf.d/00-capi-helm.conf && echo NON-ASCII || echo "ASCII clean"' </dev/null
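Because the chart version line is load-bearing, it may be worth reading the value back from the written drop-in. A minimal sketch of an INI lookup in awk; ini_get is a hypothetical helper name, and this simple parser is only an approximation that does not reproduce oslo.config semantics (no interpolation, no multi-line values):

```shell
# Hypothetical helper: ini_get SECTION KEY < file
# Prints the first value of KEY inside [SECTION]; prints nothing if absent.
ini_get() {
  awk -v sec="[$1]" -v key="$2" '
    /^\[/ { in_sec = ($0 == sec); next }      # track the current section
    in_sec {
      eq = index($0, "=")
      if (eq == 0) next
      k = substr($0, 1, eq - 1)
      v = substr($0, eq + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)          # trim key
      gsub(/^[ \t]+|[ \t]+$/, "", v)          # trim value
      if (k == key) { print v; exit }
    }'
}

# Example wiring:
#   juju ssh -m openstack magnum/0 'cat /etc/magnum/magnum.conf.d/00-capi-helm.conf' \
#     </dev/null | ini_get capi_helm default_helm_chart_version
```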
# RUN: jumphost -> magnum/0
These OpenStack debs run the daemon through an LSB init script wrapped by systemd systemd-start; a systemd ExecStart drop-in is INERT (appendix-A: D-037, L-P6-1/L-P6-2). The sanctioned extension point is /etc/default/magnum-conductor, sourced inside the init script AFTER the base --config-file is assembled. The charm does not manage that file.
# confirm the daemon currently has NO --config-dir (the problem we are fixing)
juju ssh -m openstack magnum/0 'ps -ww -C magnum-conductor -o args=' </dev/null
# create the per-service extension (literal $DAEMON_ARGS -- it expands at source time)
juju ssh -m openstack magnum/0 \
"echo 'DAEMON_ARGS=\"\$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d\"' \
| sudo tee /etc/default/magnum-conductor >/dev/null && \
sudo chmod 0644 /etc/default/magnum-conductor" </dev/null
# DRY-RUN verify WITHOUT restarting: the init script's own show-args echoes the assembled cmdline
juju ssh -m openstack magnum/0 '/etc/init.d/magnum-conductor show-args' </dev/null
GATE: show-args must show BOTH --config-file=/etc/magnum/magnum.conf AND --config-dir /etc/magnum/magnum.conf.d. Do not restart until this passes. RESIDUAL (logged): if a future charm hook ever writes /etc/default/magnum-conductor, the append is lost and [capi_helm] silently stops being read -- detect via show-args/ps.
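The same check serves both the pre-restart gate and later detection of the residual risk. A minimal sketch, assuming the assembled command line (from show-args or ps) arrives on stdin; has_config_flags is a hypothetical helper name:

```shell
# Hypothetical gate: read an assembled command line on stdin and succeed only
# if it carries both the base --config-file and the added --config-dir.
has_config_flags() {
  args=$(cat)
  printf '%s\n' "$args" | grep -qF -e '--config-file=/etc/magnum/magnum.conf' &&
    printf '%s\n' "$args" | grep -qF -e '--config-dir /etc/magnum/magnum.conf.d'
}

# Example wiring:
#   juju ssh -m openstack magnum/0 '/etc/init.d/magnum-conductor show-args' </dev/null \
#     | has_config_flags || echo "GATE FAIL: do not restart"
```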
# RUN: jumphost -> magnum/0, then jumphost health poll.
juju ssh -m openstack magnum/0 \
'sudo systemctl restart magnum-conductor && sleep 3 && systemctl is-active magnum-conductor && \
ps -ww -C magnum-conductor -o args=' </dev/null
# expect: active; live cmdline carries --config-dir.
# the braces make the full listing a fallback that runs only when the grep finds nothing
juju ssh -m openstack magnum/0 'sudo magnum-driver-manage list-drivers 2>/dev/null | grep capi || \
{ echo "driver list (full):"; sudo magnum-driver-manage list-drivers; }' </dev/null
# expect: k8s_capi_helm_v1 listed.
Health poll (the D-042 fix target -- this is what 1.3.0 reported UNHEALTHY):
FRESH DEPLOY ROUTING: on a clean redeploy NO cluster exists yet, so there is nothing to poll -- SKIP this poll; the gate is discharged in phase-08 step 8.2 (capi-test-1 reaching health_status = HEALTHY). The poll below applies when grafting onto a cloud that already has a CAPI-driver cluster: substitute that cluster's name and the current ENV(project) id (both are run-specific).
( {
source ~/admin-openrc
unset OS_PROJECT_NAME OS_PROJECT_ID OS_TENANT_NAME OS_TENANT_ID
export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910 # ENV(project)
for i in $(seq 1 10); do
echo "[$i] health=$(openstack coe cluster show capi-test-1 -f value -c health_status 2>/dev/null)"
echo " reason=$(openstack coe cluster show capi-test-1 -f value -c health_status_reason 2>/dev/null)"
sleep 20
done
} )
GATE (existing-cluster graft only): health_status -> HEALTHY, with the infrastructure sub-check now Ready (it was the only failing axis under 1.3.0). On a FRESH DEPLOY this gate is deferred to phase-08 step 8.2 -- do not block here. If it does not clear on an existing-cluster graft, go to Rollback.
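The fixed ten-iteration loop above keeps polling after the cluster turns HEALTHY and never signals failure. A minimal sketch of a poll that stops early and returns a status the caller can gate on; poll_health is a hypothetical helper name, and the status command is passed in as arguments so the function itself makes no assumption about the CLI:

```shell
# Hypothetical helper: poll_health TRIES PAUSE_SECS CMD [ARGS...]
# Runs CMD up to TRIES times, PAUSE_SECS apart; returns 0 as soon as CMD
# prints HEALTHY, or 1 if it never does.
poll_health() {
  tries=$1
  pause=$2
  shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    status=$("$@" 2>/dev/null)
    echo "[$i] health=$status"
    if [ "$status" = "HEALTHY" ]; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$tries" ]; then
      sleep "$pause"
    fi
  done
  return 1
}

# Example wiring (same scoping as the block above):
#   poll_health 10 20 openstack coe cluster show capi-test-1 -f value -c health_status \
#     || echo "GATE FAIL: go to Rollback"
```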
# RUN: jumphost (capi-mgmt scope).
Prove the upgraded driver still creates+deletes.
FRESH DEPLOY ROUTING: SKIP this step -- the capi-k8s-v1-32 template does not exist yet (phase-08 step 8.0 creates it), and phase-08 itself (create capi-test-1 to CREATE_COMPLETE, full acceptance, then 8.5 delete) is a superset of this check. Run 7.9 as written only when grafting onto an existing cloud where the template is present.
openstack coe cluster create capi-fix-check --cluster-template capi-k8s-v1-32 \
--keypair capi-mgmt-key --master-count 1 --node-count 1
# watch to CREATE_COMPLETE, then:
openstack coe cluster delete capi-fix-check
# watch to gone
# RUN: jumphost -> magnum/0
Reverts to the as-first-built functional (cosmetic-UNHEALTHY) state on 1.3.0 -- a TEMPORARY holding state to keep the conductor serving while the 1.4.0 issue is diagnosed, NOT a v1 end state. v1 is NOT complete until magnum-capi-helm==1.4.0 is installed and health_status = HEALTHY (D-011). Re-attempt 7.3-7.9 after diagnosis.
juju ssh -m openstack magnum/0 'sudo python3 -m pip install --no-deps --force-reinstall "magnum-capi-helm==1.3.0"' </dev/null
# restore the config backup if you snapshotted one, then:
juju ssh -m openstack magnum/0 'sudo systemctl restart magnum-conductor' </dev/null
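The rollback comment above mentions restoring a config backup, which only helps if one was taken before the drop-in and /etc/default/magnum-conductor were changed. A minimal sketch of a snapshot helper; snapshot_file and the .bak.<timestamp> naming are hypothetical, not an existing convention of this runbook:

```shell
# Hypothetical helper: copy FILE to FILE.bak.<UTC timestamp>, preserving mode
# and timestamps (and ownership where permitted), and print the backup path.
snapshot_file() {
  bak="$1.bak.$(date -u +%Y%m%dT%H%M%SZ)"
  cp -p -- "$1" "$bak" && echo "$bak"
}

# Example (on the conductor, as root, before editing):
#   snapshot_file /etc/default/magnum-conductor
#   snapshot_file /etc/magnum/magnum.conf.d/00-capi-helm.conf
# Restore by copying the printed backup path back over the original.
```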
k8s_capi_helm_v1 enumerated.
--config-dir present in the live cmdline).
health_status = HEALTHY (infrastructure Ready) on a CAPI-driver cluster -- D-042 issue eliminated. FRESH DEPLOY: no cluster exists yet; this item is DEFERRED to phase-08 step 8.2 (existing-cluster graft: verify here on that cluster).
magnum.
DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"; verified live via ps and the init script show-args.
kube_version property (NOT a template label), os_distro=ubuntu; flavor floor 2048 MB / 2 vCPU; auto-mints an app credential (workload nodes use the PUBLIC keystone interface); apiServer ALWAYS provisions an Octavia LB (+FIP default).
phase-08 -- workload-cluster acceptance: create a tenant cluster from template capi-k8s-v1-32, confirm CREATE_COMPLETE + Ready nodes + Calico + LB, and run the D-011 (amended per D-019) acceptance criteria.