
Runbook 05 — Magnum CAPI Helm driver install

Status: Executes after 04-magnum-domain.md (Keystone wiring) and 04a-capi-bootstrap-cluster.md (workload cluster + kubeconfig staged). Final post-deploy step to make Magnum capable of creating CAPI-managed tenant K8s clusters.

Cross-references:

  • D-007 Layer B (Magnum two-layer install)
  • D-017 (CAPI bootstrap cluster lifecycle)
  • Runbook 04a §19 (workload kubeconfig handoff)
  • Workstream 3c decision (2026-05-22): magnum-capi-helm 1.1.0 from PyPI; workload-cluster kubeconfig (NOT bootstrap k3s)

Known doc inconsistency (tracked for cleanup): D-007's Layer B currently states the kubeconfig points at "capi-mgmt.maas bootstrap k3s". That language is correct for Bobcat (no pivot) but obsolete post-workstream-3b (pivot mandatory). This runbook uses the workload cluster kubeconfig as the canonical target. D-007 patch to follow in a workstream-3 cleanup commit.


1. Purpose & scope

Graft the CAPI Helm driver onto the Charmed Magnum deployment so that openstack coe cluster create provisions tenant K8s clusters via CAPI (in the workload cluster) instead of via the deprecated Heat driver.

Output of this runbook:

  • magnum-capi-helm==1.1.0 installed on the magnum unit's system Python.
  • /etc/magnum/kubeconfig populated with the workload cluster's kubeconfig (the post-pivot home of the CAPI controllers).
  • /etc/magnum/magnum.conf.d/99-capi.conf configured with enabled_drivers = k8s_capi_helm_v1 and [capi_helm] kubeconfig_file = /etc/magnum/kubeconfig.
  • Systemd overrides on magnum-api and magnum-conductor that replace the init.d wrapper's ExecStart with explicit --config-dir invocation.
  • Both services running cleanly with the CAPI driver loaded.

Scope: v1 testcloud. Roosevelt deltas in §12.

Out of scope:

  • Magnum domain setup (runbook 04)
  • Workload cluster lifecycle (runbook 04a)
  • Smoketest tenant cluster creation is OPTIONAL (§11) — full validation framework belongs in runbook 08.

2. Decisions captured

| Decision | Choice | Reason |
| --- | --- | --- |
| Driver pin | magnum-capi-helm==1.1.0 from PyPI | D-007 correction (stackhpc fork archived Dec 2024; canonical project on opendev/PyPI; 1.1.0 is last Caracal-cycle release) |
| Install method | pip3 install --break-system-packages | PEP 668: Ubuntu 23.04 and later (including 24.04 LTS) mark the system Python as externally managed, so a system-site-packages install needs the explicit override |
| Install scope | System Python on magnum unit (not venv) | Magnum charm uses system-packaged python at /usr/lib/python3/dist-packages/magnum/; driver must be importable by that same interpreter |
| Kubeconfig target | Workload cluster (post-pivot) | Workstream 3b: bootstrap k3s is empty post-pivot; CAPI controllers live in workload |
| Kubeconfig source | $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig (staged by 04a §19) | Documented handoff |
| Driver entry-point name | k8s_capi_helm_v1 | Per upstream magnum-capi-helm 1.1.0; verify in §10 |
| Conf.d filename | 99-capi.conf | Numeric prefix ensures it loads AFTER any charm-managed conf, so enabled_drivers override wins |
| File encoding | ASCII-only | Non-ASCII in conf.d causes silent magnum daemon failures (handoff lesson; cf. Horizon local_settings.d issue) |
| Trustee credential | Existing magnum-shared user (charm-managed) | Roosevelt will use app-credential pattern |

3. Prerequisites

| Prereq | Verification |
| --- | --- |
| Magnum charm active/idle | `juju status magnum \| grep magnum/0` shows active/idle |
| Magnum domain setup completed (runbook 04) | `openstack domain show magnum \| grep enabled` returns True |
| Workload cluster reachable from jumphost | `kubectl --kubeconfig $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig get nodes` returns Ready nodes |
| CAPI controllers running in workload cluster | `kubectl --kubeconfig $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig get pods -n capi-system \| grep -v Running \| grep -v NAME` returns empty |
| Workload kubeconfig staged at expected path | `test -r $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig && stat -c %a $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig` shows 600 |
| juju exec works to magnum/leader (use exec, NOT ssh, for non-interactive; handoff lesson) | `juju exec --unit magnum/leader -- hostname` returns the unit hostname |
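
The kubeconfig prerequisite (readable, mode 600) can be checked in one call. The following is a minimal sketch, not part of the handoff tooling; the function name check_kubeconfig is hypothetical, and it assumes GNU stat as shipped on the Ubuntu jumphost.

```shell
# Hypothetical helper: succeed only if the file is readable and has mode 600.
check_kubeconfig() {
  local path="$1" mode
  [ -r "$path" ] || { echo "not readable: $path" >&2; return 1; }
  # stat -c %a prints the octal permission bits (GNU coreutils).
  mode=$(stat -c %a "$path") || return 1
  [ "$mode" = "600" ] || { echo "mode is $mode, expected 600: $path" >&2; return 1; }
}
```

Usage: check_kubeconfig "$HOME/magnum-capi/capi-mgmt-cluster.kubeconfig" && echo "kubeconfig OK".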

Set shell context:

export WORK=$HOME/magnum-capi
export WORKLOAD_KUBECONFIG=$WORK/capi-mgmt-cluster.kubeconfig
export DRIVER_VERSION=magnum-capi-helm==1.1.0   # per D-007 correction
cd "$WORK"

juju ssh vs juju exec choice: the handoff lessons explicitly call out that juju ssh hangs when stdout is redirected (PTY allocation issue). This runbook uses juju exec for all non-interactive command execution and reserves juju ssh only for cases where you actually want an interactive shell.
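
The juju exec convention can be wrapped in a small helper so that every non-interactive step goes through the same path. This is a sketch only: the function name on_magnum and the MAGNUM_UNIT variable are hypothetical and do not appear in the handoff material.

```shell
# Hypothetical helper: run one command string on the magnum unit via juju exec.
# MAGNUM_UNIT is an assumed override; it defaults to the unit used in this runbook.
on_magnum() {
  juju exec --unit "${MAGNUM_UNIT:-magnum/leader}" -- "$@"
}
```

Usage: on_magnum 'sudo systemctl is-active magnum-api magnum-conductor' > "$HOME/magnum-capi/service-state.txt". Because juju exec allocates no PTY, redirecting stdout this way does not hang.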


4. Pre-flight: capture current state

Capture the magnum unit's state BEFORE making changes. Useful for diagnosis if anything goes wrong, and as a record of what was changed.

mkdir -p "$WORK/pre-state"

# Service unit files (as managed by charm)
juju exec --unit magnum/leader -- \
  'sudo systemctl cat magnum-api magnum-conductor 2>&1' \
  > "$WORK/pre-state/systemd-units.txt"

# Currently-enabled drivers
juju exec --unit magnum/leader -- \
  'sudo grep -r enabled_drivers /etc/magnum/ 2>/dev/null || echo "(no enabled_drivers found — charm default applies)"' \
  > "$WORK/pre-state/drivers-pre.txt"

# Python site-packages — see what's already installed
juju exec --unit magnum/leader -- \
  'sudo pip3 list 2>/dev/null | grep -iE "magnum|cluster|helm|kubernetes" || true' \
  > "$WORK/pre-state/pip-pre.txt"

# conf.d state
juju exec --unit magnum/leader -- \
  'sudo ls -la /etc/magnum/magnum.conf.d/ 2>/dev/null || echo "(no conf.d directory)"' \
  > "$WORK/pre-state/confd-pre.txt"

# Service running state
juju exec --unit magnum/leader -- \
  'sudo systemctl is-active magnum-api magnum-conductor' \
  > "$WORK/pre-state/service-state-pre.txt"

# Display the captured state
cat "$WORK/pre-state/"*.txt

What to look for in pre-state: the charm-managed enabled_drivers value probably includes Heat-based drivers (heat_kubernetes, etc.). The 99-capi.conf override in §7 replaces this with the single CAPI driver. The pre-state capture documents what was active before the override took effect.


5. Install magnum-capi-helm 1.1.0

juju exec --unit magnum/leader -- \
  "sudo pip3 install $DRIVER_VERSION --break-system-packages"

Verify install:

juju exec --unit magnum/leader -- \
  'sudo pip3 show magnum-capi-helm | head -10'
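# Expect: Name: magnum-capi-helm
#         Version: 1.1.0
#         Location: a dist-packages directory on the system Python path
#         (pip on Ubuntu normally installs to /usr/local/lib/python3.X/dist-packages,
#         not /usr/lib/python3/dist-packages)

juju exec --unit magnum/leader -- \
  'sudo python3 -c "import magnum_capi_helm; print(magnum_capi_helm.__file__)"'
# Expect: a path ending in magnum_capi_helm/__init__.py under the Location shown above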

Check that the driver entry point is registered:

juju exec --unit magnum/leader -- \
  'sudo python3 -c "
from stevedore import driver
mgr = driver.DriverManager(
    namespace=\"magnum.drivers\",
    name=\"k8s_capi_helm_v1\",
    invoke_on_load=False
)
print(\"Driver class:\", mgr.driver)
"'
# Expect: Driver class: <class 'magnum_capi_helm.driver.Driver'>
# (or similar — the actual class path is package-version-dependent)

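If the entry point check fails with an error such as "No 'magnum.drivers' driver found, looking for 'k8s_capi_helm_v1'", the driver name in 1.1.0 may differ from what D-007 documented. Inspect the installed package's entry_points.txt:

# Read entry_points.txt through importlib.metadata so the check works wherever pip placed the package
juju exec --unit magnum/leader -- \
  'sudo python3 -c "from importlib.metadata import distribution; print(distribution(\"magnum-capi-helm\").read_text(\"entry_points.txt\"))"'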

Find the entry under [magnum.drivers] — use that exact name in §7.
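
Picking the name out of entry_points.txt by eye is error-prone when the file lists several sections. A small filter, sketched here under the assumption that the file uses the usual INI-style "name = module:attr" layout, prints only the names registered under [magnum.drivers]. The function name list_magnum_drivers is hypothetical.

```shell
# Hypothetical helper: read entry_points.txt content on stdin and print the
# names registered under the [magnum.drivers] section, one per line.
list_magnum_drivers() {
  awk '
    # A line starting with "[" opens a new section; remember whether it is ours.
    /^\[/ { in_section = ($0 == "[magnum.drivers]"); next }
    # Inside our section, print the left-hand side of each "name = target" line.
    in_section && /=/ { split($0, kv, "="); gsub(/[ \t]/, "", kv[1]); print kv[1] }
  '
}
```

Usage: pipe the entry_points.txt content printed by the command above into list_magnum_drivers, then copy the resulting name into §7.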


6. Stage workload kubeconfig on magnum unit

# Transfer kubeconfig from jumphost to magnum unit
juju scp "$WORKLOAD_KUBECONFIG" magnum/leader:/tmp/kubeconfig

# Install with correct ownership/mode in one atomic step
juju exec --unit magnum/leader -- \
  'sudo install -m 0640 -o root -g magnum /tmp/kubeconfig /etc/magnum/kubeconfig && sudo rm /tmp/kubeconfig'

Verify:

juju exec --unit magnum/leader -- \
  'sudo ls -la /etc/magnum/kubeconfig'
# Expect: -rw-r----- 1 root magnum ... /etc/magnum/kubeconfig

# Confirm magnum user can read it
juju exec --unit magnum/leader -- \
  'sudo -u magnum cat /etc/magnum/kubeconfig | head -3'
# Expect: apiVersion: v1 / clusters: / - cluster:

# Confirm kubectl can use it from the magnum unit (sanity check on API reachability)
juju exec --unit magnum/leader -- \
  'sudo -u magnum kubectl --kubeconfig /etc/magnum/kubeconfig get nodes 2>&1 | head -10'
# Expect: NAME ... STATUS=Ready for control plane + workers
# OR: kubectl not installed (acceptable — magnum-capi-helm uses Python client, not kubectl)

Why mode 0640 and group magnum: the kubeconfig contains auth tokens. With root as owner, mode 0600 (owner-only) would not let the magnum system user (which runs magnum-api and magnum-conductor) read it. Mode 0640 with group magnum is the minimum-permission setup that works. Do NOT use 0644, which would make the credentials readable by every other user on the unit.


7. Configure /etc/magnum/magnum.conf.d/99-capi.conf

Generate the conf locally first (snap confinement does not apply to plain bash on jumphost, but we keep paths under $HOME for consistency), then transfer.

ASCII-only verification is critical — the handoff documents non-ASCII characters in conf.d files causing silent daemon failures (cf. Horizon local_settings.d). Use plain straight quotes, ASCII dashes, no smart typography.

# Write locally
cat > "$WORK/99-capi.conf" <<'EOF'
[DEFAULT]
enabled_drivers = k8s_capi_helm_v1

[capi_helm]
kubeconfig_file = /etc/magnum/kubeconfig
EOF

# Verify it is pure ASCII (no stray UTF-8 bytes)
file "$WORK/99-capi.conf"
# Expect: ASCII text
# If it reports UTF-8 or Unicode text of any kind, STOP and rewrite by hand; even
# one stray em-dash or smart quote will silently break magnum

# Byte-level check (paranoid mode): print any line containing a byte above 0x7F
LC_ALL=C grep -nP '[^\x00-\x7F]' "$WORK/99-capi.conf"
# Expect: empty output (every byte is 7-bit ASCII)
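
Since the same ASCII check recurs for the systemd overrides in §8, it can be folded into one reusable function. This is a minimal sketch with a hypothetical name (assert_ascii); it counts bytes outside 0x00-0x7F with tr and wc, so it does not depend on the output wording of file.

```shell
# Hypothetical helper: fail if any given file contains a byte outside 0x00-0x7F.
assert_ascii() {
  local f n rc=0
  for f in "$@"; do
    # Delete every 7-bit ASCII byte; whatever remains is non-ASCII.
    n=$(LC_ALL=C tr -d '\000-\177' < "$f" | wc -c)
    if [ "$n" -ne 0 ]; then
      echo "non-ASCII bytes found in $f ($n bytes)" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: assert_ascii "$WORK/99-capi.conf" || echo "STOP: regenerate the file by hand".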

Stage and install:

juju scp "$WORK/99-capi.conf" magnum/leader:/tmp/99-capi.conf

juju exec --unit magnum/leader -- \
  'sudo mkdir -p /etc/magnum/magnum.conf.d && sudo install -m 0644 -o root -g root /tmp/99-capi.conf /etc/magnum/magnum.conf.d/99-capi.conf && sudo rm /tmp/99-capi.conf'

# Verify
juju exec --unit magnum/leader -- \
  'sudo ls -la /etc/magnum/magnum.conf.d/ && sudo cat /etc/magnum/magnum.conf.d/99-capi.conf'
# Expect: file listed; content matches what was written

8. Systemd override on magnum-api + magnum-conductor

The Charmed Magnum unit files use a wrapper pattern:

ExecStart=/etc/init.d/magnum-api systemd-start

The wrapper does NOT pass --config-dir to magnum-api, so /etc/magnum/magnum.conf.d/ is never loaded. The 99-capi.conf would have no effect.

Override with explicit --config-file + --config-dir invocation.

Generate override files locally:

cat > "$WORK/magnum-api-override.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/magnum-api --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d
EOF

cat > "$WORK/magnum-conductor-override.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/magnum-conductor --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d
EOF

# ASCII check
file "$WORK/magnum-api-override.conf" "$WORK/magnum-conductor-override.conf"
# Expect: ASCII text x2

The empty ExecStart= line is critical. In a drop-in, ExecStart= assignments are appended to the list inherited from the base unit; an empty assignment is required to CLEAR that list before setting the replacement. Without the empty line, the unit would carry BOTH the init.d wrapper AND the new direct invocation, which systemd rejects for any service that is not Type=oneshot, so the service would fail to start.
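
A quick structural check on the generated drop-in catches a missing reset line before the file ever reaches the unit. The sketch below uses a hypothetical function name (check_execstart_reset) and assumes the drop-in contains exactly two ExecStart= lines, as generated above.

```shell
# Hypothetical helper: succeed only if the file has exactly two ExecStart= lines,
# the first empty (the reset) and the second carrying the real command.
check_execstart_reset() {
  awk '
    /^ExecStart=/ {
      n++
      if (n == 1 && $0 != "ExecStart=") bad = 1   # first assignment must be the empty reset
      if (n == 2 && $0 == "ExecStart=") bad = 1   # second must not be empty
    }
    END { exit (bad || n != 2) ? 1 : 0 }
  ' "$1"
}
```

Usage: check_execstart_reset "$WORK/magnum-api-override.conf" && check_execstart_reset "$WORK/magnum-conductor-override.conf".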

Install on the unit:

juju scp "$WORK/magnum-api-override.conf" magnum/leader:/tmp/magnum-api-override.conf
juju scp "$WORK/magnum-conductor-override.conf" magnum/leader:/tmp/magnum-conductor-override.conf

juju exec --unit magnum/leader -- \
  'sudo mkdir -p /etc/systemd/system/magnum-api.service.d /etc/systemd/system/magnum-conductor.service.d && \
   sudo install -m 0644 -o root -g root /tmp/magnum-api-override.conf /etc/systemd/system/magnum-api.service.d/override.conf && \
   sudo install -m 0644 -o root -g root /tmp/magnum-conductor-override.conf /etc/systemd/system/magnum-conductor.service.d/override.conf && \
   sudo rm /tmp/magnum-api-override.conf /tmp/magnum-conductor-override.conf'

# Reload systemd to pick up the overrides
juju exec --unit magnum/leader -- 'sudo systemctl daemon-reload'

# Verify the overrides are effective (systemctl cat shows the charm's unit file followed by the drop-in)
juju exec --unit magnum/leader -- 'sudo systemctl cat magnum-api | grep ExecStart'
# Expect: THREE ExecStart= lines: the original init.d wrapper from the base unit, then the
# drop-in's empty clear-line and the new /usr/bin/magnum-api invocation
juju exec --unit magnum/leader -- 'sudo systemctl cat magnum-conductor | grep ExecStart'
# Expect: the same three lines for magnum-conductor

Charm reconciliation note: the Magnum charm may rewrite its own systemd units on config changes or upgrades. The drop-in override at /etc/systemd/system/magnum-api.service.d/override.conf is OUTSIDE the charm's writable zone and should survive. Verify after any juju refresh or juju config magnum command by re-running the systemctl cat check above.
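
For the re-check after a juju refresh or juju config, the effective command line reported by systemctl show is a more direct signal than the concatenated file view. A minimal sketch, with a hypothetical function name (has_config_dir), that reads that output on stdin:

```shell
# Hypothetical helper: read "systemctl show <service> -p ExecStart" output on
# stdin and succeed only if the effective command passes the conf.d directory.
has_config_dir() {
  # "--" stops option parsing so the pattern's leading dashes are not read as flags.
  grep -q -- '--config-dir=/etc/magnum/magnum.conf.d'
}
```

Usage: juju exec --unit magnum/leader -- 'sudo systemctl show magnum-api -p ExecStart' | has_config_dir || echo "override NOT active for magnum-api". Repeat for magnum-conductor.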


9. Restart services + verify health

juju exec --unit magnum/leader -- \
  'sudo systemctl restart magnum-api magnum-conductor'

# Wait briefly for services to initialize
sleep 5

# Check active state
juju exec --unit magnum/leader -- \
  'sudo systemctl is-active magnum-api magnum-conductor'
# Expect: active (x2)

# Examine recent journal for errors (the critical step — magnum's silent failure
# mode means we must read logs, not just trust is-active)
juju exec --unit magnum/leader -- \
  'sudo journalctl -u magnum-api --since "2 minutes ago" --no-pager | tail -50'
juju exec --unit magnum/leader -- \
  'sudo journalctl -u magnum-conductor --since "2 minutes ago" --no-pager | tail -50'

Look for these red flags in the logs:

| Symptom | Likely cause | Remediation |
| --- | --- | --- |
| ModuleNotFoundError: No module named 'magnum_capi_helm' | §5 pip install failed | Re-run §5; check pip3 output |
| EntryPointError: No 'k8s_capi_helm_v1' driver | Driver entry-point name mismatch | Verify the name with the entry_points.txt check in §5; update §7 |
| Service repeatedly restarts (look for "Started" appearing twice in 10s) | Likely a config error in 99-capi.conf | Re-check ASCII-only; check magnum.conf.d permissions |
| kubeconfig_file not honored | --config-dir not being passed | §8 override not active; re-run systemctl daemon-reload and restart both services |
| Silent: no error but driver also not loading | Non-ASCII char snuck into a conf file | Run file on /etc/magnum/magnum.conf.d/99-capi.conf on the unit; if it says UTF-8, regenerate |
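
Reading fifty journal lines per service by eye is easy to get wrong. The filter below is a sketch that flags the textual symptoms from the table; the function name scan_magnum_log is hypothetical, and the patterns are assumptions drawn from the table plus the generic Traceback, ERROR and CRITICAL markers of Python service logs. It cannot detect the silent non-ASCII case.

```shell
# Hypothetical helper: read journal text on stdin, print suspicious lines with
# their line numbers, and return non-zero if any were found.
scan_magnum_log() {
  if grep -nE "No module named|No '[^']*' driver|Traceback|ERROR|CRITICAL"; then
    return 1   # at least one red-flag line was printed
  fi
  return 0     # nothing matched
}
```

Usage: juju exec --unit magnum/leader -- 'sudo journalctl -u magnum-conductor --since "2 minutes ago" --no-pager' | scan_magnum_log && echo "no red flags in conductor log".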

10. CAPI driver enablement check

Verify the driver is actually loaded by Magnum and reachable via the API.

source $HOME/admin-openrc

# Confirm the Magnum API endpoint responds by listing cluster templates
openstack coe cluster template list -f json
# (empty templates list is fine; we are only checking that the endpoint responds)

# Direct check on the unit: scan the service's loaded drivers
juju exec --unit magnum/leader -- \
  'sudo journalctl -u magnum-conductor --since "5 minutes ago" --no-pager | grep -iE "driver|enabled" | head -20'
# Expect: a line mentioning k8s_capi_helm_v1 having been loaded
# (Magnum logs the loaded drivers at startup)

# Definitive check: try creating a cluster template that requires the CAPI driver
openstack coe cluster template create magnum-capi-driver-check \
  --image noble-amd64 \
  --keypair capi-workload-key \
  --external-network ext_net \
  --master-flavor capi-mgmt-node \
  --flavor capi-mgmt-node \
  --coe kubernetes \
  --network-driver calico \
  --labels kube_tag=v1.31.4

openstack coe cluster template show magnum-capi-driver-check -c name -c coe -c labels

If template create fails with "driver not enabled" or similar: the Magnum API process is not loading the conf.d. Verify the systemd override took effect — sudo systemctl show magnum-api -p ExecStart on the unit should show the explicit --config-dir invocation. If it still shows the init.d wrapper, the daemon-reload + restart did not pick up the override.

Cleanup the driver-check template:

openstack coe cluster template delete magnum-capi-driver-check

11. Optional smoketest — create a tenant CAPI cluster

This step is optional. Full validation belongs in runbook 08. Use this smoketest only if you want immediate confirmation that the entire chain (Magnum API -> conductor -> magnum-capi-helm -> CAPI controllers in workload cluster -> tenant K8s cluster on tenant VMs) works end-to-end.

# Create a cluster template tuned for testcloud smoketest
openstack coe cluster template create magnum-smoketest-template \
  --image noble-amd64 \
  --keypair capi-workload-key \
  --external-network ext_net \
  --master-flavor capi-mgmt-node \
  --flavor capi-mgmt-node \
  --coe kubernetes \
  --network-driver calico \
  --labels boot_volume_size=20,kube_tag=v1.31.4,octavia_provider=ovn

# Create a 1+1 cluster (minimum for smoketest)
openstack coe cluster create magnum-smoketest \
  --cluster-template magnum-smoketest-template \
  --master-count 1 \
  --node-count 1

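# Poll for status (15-20 min typical; CAPI provisions tenant VMs end-to-end)
STATUS=UNKNOWN
for i in $(seq 1 60); do
  STATUS=$(openstack coe cluster show magnum-smoketest -c status -f value)
  echo "$(date -Is) status=$STATUS"
  case "$STATUS" in
    CREATE_COMPLETE) echo "Smoketest passed"; break ;;
    # break rather than exit, so a pasted loop does not close the interactive shell
    CREATE_FAILED)   echo "Smoketest FAILED"; openstack coe cluster show magnum-smoketest; break ;;
  esac
  sleep 30
done
# The loop gives up after 30 minutes (60 x 30 s); continue only if the cluster completed
[ "$STATUS" = CREATE_COMPLETE ] || echo "Cluster did not reach CREATE_COMPLETE (last status: $STATUS)"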

# Retrieve the smoketest cluster's kubeconfig (create the target directory first)
mkdir -p "$WORK/smoketest-kubeconfig"
openstack coe cluster config magnum-smoketest --dir "$WORK/smoketest-kubeconfig"

# Sanity-check the smoketest cluster
KUBECONFIG="$WORK/smoketest-kubeconfig/config" kubectl get nodes
KUBECONFIG="$WORK/smoketest-kubeconfig/config" kubectl get pods -A | head -20

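# Cleanup the smoketest cluster. Deletion is asynchronous: wait until
# "openstack coe cluster show magnum-smoketest" no longer finds the cluster before
# deleting the template, because Magnum refuses to delete a template that is still in use.
openstack coe cluster delete magnum-smoketest
openstack coe cluster template delete magnum-smoketest-template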
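
Because cluster deletion is asynchronous, the template delete can fail if it runs while the cluster still exists. A bounded wait can be sketched as follows; the function name wait_until_gone is hypothetical, and the sketch assumes that openstack coe cluster show exits non-zero once the cluster no longer exists. A transient API error would also be read as "gone", so confirm the result before relying on it.

```shell
# Hypothetical helper: re-run a probe command until it FAILS (resource gone)
# or the attempts run out. Usage: wait_until_gone <tries> <delay_seconds> <command...>
wait_until_gone() {
  local tries="$1" delay="$2"
  shift 2
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # A failing probe means the resource can no longer be found.
    "$@" >/dev/null 2>&1 || return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1   # still present after all attempts
}
```

Usage: after openstack coe cluster delete magnum-smoketest, run wait_until_gone 40 30 openstack coe cluster show magnum-smoketest && openstack coe cluster template delete magnum-smoketest-template.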

What success looks like: the CAPI controllers in the workload cluster receive the new Cluster CR (created by magnum-capi-helm in response to the Magnum API call), CAPO talks to OpenStack to provision tenant VMs, the tenant VMs join the new K8s cluster, and the new cluster has 1 control plane + 1 worker Ready. Octavia provides the API server LB (visible as a Floating IP in the tenant project).


12. Roosevelt deltas (forward-look)

| Aspect | Testcloud (v1) | Roosevelt |
| --- | --- | --- |
| Driver pin source | PyPI magnum-capi-helm==1.1.0 | Internal mirror with checksum verification |
| Driver pin record | Implicit in this runbook | Captured in Vault as audit artifact alongside CAPI pins |
| Kubeconfig source | Workload cluster (post-pivot per 04a §17) | Same |
| Kubeconfig rotation | Manual on capi-mgmt rebuild | Automated when workload cluster cert rotates |
| Trustee credential | Charm-default magnum-shared user | Per-tenant app credentials via Vault auth method |
| Magnum HA | num_units=1 (per D-009 testcloud) | num_units=3 with hacluster + provider VIP |
| Driver upgrade discipline | Manual re-run of §5 | Tracked maintenance window; Vault audit log |
| Systemd override | Drop-in at /etc/systemd/system/magnum-*.service.d/override.conf | Same, but provided via a charm overlay package, not manual file install |
| ASCII-only enforcement | Manual check (§7, §8) | Pre-flight lint in scripts/pre-flight-checks.sh |

13. Documented runtime gotchas (carry-forward from handoff)

These gotchas burned cycles during the Bobcat Magnum CAPI work. Each is explicitly handled in this runbook; collecting them here for visibility:

  1. PEP 668 --break-system-packages (§5). Ubuntu 23.04 and later (including 24.04 LTS) mark the system Python as externally managed, so pip refuses to install into it by default. The flag is required for the magnum-capi-helm install path used by Charmed Magnum.
  2. juju ssh hangs on stdout redirect. PTY allocation issue. This runbook uses juju exec for all non-interactive command execution.
  3. Heredoc nesting in juju ssh is fragile. This runbook writes conf files locally first and uses juju scp + juju exec install to transfer — single-level only.
  4. Non-ASCII characters in conf.d files cause silent daemon failures. §7 and §8 both include file <path> ASCII verification before transfer.
  5. openstack -f value -c X -c Y outputs in alphabetical field order, not flag order. This runbook uses single-column queries or -f json | jq throughout.
  6. Charm-managed enabled_drivers is overridden, not appended. The enabled_drivers = k8s_capi_helm_v1 line in 99-capi.conf REPLACES the charm-default value (which would include the deprecated Heat drivers).
  7. The systemd override empty ExecStart= line is required to clear the inherited ExecStart before setting the replacement (§8).
  8. Snap-confined openstack CLI cannot read /tmp. This runbook stages files under $WORK=$HOME/magnum-capi. The smoketest in §11 also writes to $WORK/smoketest-kubeconfig.

14. Change log

| Date | Change | Reference |
| --- | --- | --- |
| 2026-05-22 | Document created. magnum-capi-helm 1.1.0 from PyPI; workload-cluster kubeconfig (post-pivot per workstream 3b); systemd override pattern; ASCII-only conf.d. | Workstream 3c |