Runbook 05 — Magnum CAPI Helm Driver Graft

Reference: D-007 Layer B (rescoped per D-017). Runs after 04a-capi-bootstrap-cluster.md.

Purpose: Install the stackhpc/magnum-capi-helm driver into the Magnum charm's Python environment, configure Magnum to load and use it, and verify end-to-end cluster creation via the driver against the bootstrap k3s management cluster on capi-mgmt.maas.

Prerequisites:

  • Runbook 04 complete (Magnum trustee domain created)
  • Runbook 04a complete (capi-mgmt bootstrap k3s + CAPI controllers + ORC running; workload cluster Available; kubeconfig at ~/magnum-capi/capi-mgmt-cluster.kubeconfig on jumphost)
  • Authenticated Juju session active

Key constraint: Charm-magnum's systemd units invoke /etc/init.d/magnum-{api,conductor} systemd-start (SysV-wrapped). Drop-in config dirs are NOT consumed by the init.d script as shipped. Phase 4 graft must REPLACE the systemd ExecStart entirely with a wrapper that adds --config-dir /etc/magnum/magnum.conf.d/. This pattern was validated on Bobcat and is expected to persist on Caracal — verify with juju exec at the start of execution.

Step 1 — Investigation block (D-017 rehearsal)

Before any grafting, inspect the live charm state. The init.d/systemd wrapping shape may have shifted between Bobcat and Caracal:

juju exec --unit magnum/leader -- 'cat /lib/systemd/system/magnum-api.service'
juju exec --unit magnum/leader -- 'cat /lib/systemd/system/magnum-conductor.service'
juju exec --unit magnum/leader -- 'ls /etc/init.d/ | grep magnum'
juju exec --unit magnum/leader -- 'cat /etc/init.d/magnum-api 2>/dev/null | head -40'
juju exec --unit magnum/leader -- 'ls /etc/default/magnum-* 2>/dev/null'
juju exec --unit magnum/leader -- 'python3 -c "import magnum; print(magnum.__file__)"'

Record results in execution notes. The Python import path tells us where to pip-install the driver (Bobcat: /usr/lib/python3/dist-packages/magnum/).

Step 2 — Pre-flight: confirm kubeconfig reachability

The Magnum charm unit must be able to reach the k3s API on capi-mgmt.maas:6443. The charm runs in an LXD container on the metal network, so it should reach capi-mgmt.maas directly over L2 with no routing in between.

juju exec --unit magnum/leader -- "curl -sk --max-time 5 https://$(awk '/server:/ {print $2}' ~/magnum-capi/capi-mgmt-cluster.kubeconfig | head -1 | sed 's|https://||')/healthz"
# Expect: "ok"
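The one-liner above relies on the jumphost shell expanding the $(awk ...) substitution before juju exec runs, because the kubeconfig lives on the jumphost and not on the unit. A minimal sketch that makes this split explicit follows. The function names kubeconfig_server and probe_capi_api are hypothetical helpers introduced here for illustration; they are not part of Juju or Magnum.

```shell
# Print the first "server:" URL found in a kubeconfig file (runs on the jumphost).
kubeconfig_server() {
  awk '$1 == "server:" {print $2; exit}' "$1"
}

# Probe /healthz from the Magnum unit. The URL is resolved locally first;
# only the finished curl command is sent to the unit.
probe_capi_api() {
  local url
  url=$(kubeconfig_server "$1") || return 1
  [ -n "$url" ] || { echo "no server: entry in $1" >&2; return 1; }
  juju exec --unit magnum/leader -- "curl -sk --max-time 5 ${url}/healthz"
}
```

Usage would be probe_capi_api ~/magnum-capi/capi-mgmt-cluster.kubeconfig, with the same "ok" expectation as above. Failing early on an empty URL avoids a confusing curl error from the unit when the kubeconfig path is wrong.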

Step 3 — Install the driver into the charm Python environment

juju ssh magnum/leader -- "sudo pip install --break-system-packages \
  'git+https://github.com/stackhpc/magnum-capi-helm@v0.13.0'"

# Verify
juju exec --unit magnum/leader -- 'python3 -c "import magnum_capi_helm; print(magnum_capi_helm.__file__)"'

Pin to a specific tag rather than main — the driver should not move under our feet between deploys. Version v0.13.0 was validated on Bobcat; verify it remains the chosen tag at Caracal execution time.

Step 4 — Deploy the kubeconfig to the charm unit

# Copy from jumphost to magnum/leader
juju scp ~/magnum-capi/capi-mgmt-cluster.kubeconfig magnum/leader:/tmp/capi-kubeconfig
juju ssh magnum/leader -- "sudo install -o root -g magnum -m 0640 /tmp/capi-kubeconfig /etc/magnum/kubeconfig && sudo rm /tmp/capi-kubeconfig"
juju ssh magnum/leader -- "ls -la /etc/magnum/kubeconfig"

Step 5 — Configure Magnum to use the CAPI Helm driver

Create the conf.d directory and drop-in:

juju ssh magnum/leader -- "sudo mkdir -p /etc/magnum/magnum.conf.d && sudo chown root:magnum /etc/magnum/magnum.conf.d && sudo chmod 0750 /etc/magnum/magnum.conf.d"

juju ssh magnum/leader -- "sudo tee /etc/magnum/magnum.conf.d/99-capi.conf > /dev/null" << 'EOC'
[DEFAULT]
enabled_drivers = k8s_capi_helm_v1

[capi_helm]
kubeconfig_file = /etc/magnum/kubeconfig
EOC

juju ssh magnum/leader -- "sudo chown root:magnum /etc/magnum/magnum.conf.d/99-capi.conf && sudo chmod 0640 /etc/magnum/magnum.conf.d/99-capi.conf"
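Because non-ASCII bytes in conf.d files cause silent daemon failures (see Recurring pitfalls), it may be worth checking the drop-in before restarting the services. The following is a sketch only: is_ascii_only and check_remote_conf are hypothetical helper names, and the check also flags control characters other than whitespace, which is stricter than "ASCII only".

```shell
# Return 0 if the file holds only printable ASCII and whitespace,
# 1 if any other byte is present, 2 if the file cannot be read.
is_ascii_only() {
  [ -r "$1" ] || return 2
  if LC_ALL=C grep -q '[^[:print:][:space:]]' "$1"; then
    return 1
  fi
  return 0
}

# Same check against a file on the unit; prints offending lines with numbers.
check_remote_conf() {
  juju exec --unit magnum/leader -- \
    "LC_ALL=C grep -n '[^[:print:][:space:]]' $1 && exit 1 || exit 0"
}
```

For example, check_remote_conf /etc/magnum/magnum.conf.d/99-capi.conf should print nothing and exit 0 when the file is clean. Forcing LC_ALL=C makes the character classes byte-oriented, so any multi-byte UTF-8 sequence is reported regardless of the unit's locale.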

Step 6 — Install the systemd ExecStart override

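Because the charm's systemd units invoke an init.d wrapper that does NOT honor --config-dir, the override must replace ExecStart entirely with a direct invocation of the Magnum binaries, passing both the default config file and our config dir. The empty ExecStart= line in each drop-in is required: it clears the ExecStart inherited from the packaged unit, and without it systemd rejects the service for having more than one ExecStart.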

juju ssh magnum/leader -- "sudo mkdir -p /etc/systemd/system/magnum-api.service.d"
juju ssh magnum/leader -- "sudo tee /etc/systemd/system/magnum-api.service.d/override.conf > /dev/null" << 'EOC'
[Service]
ExecStart=
ExecStart=/usr/bin/magnum-api --config-file /etc/magnum/magnum.conf --config-dir /etc/magnum/magnum.conf.d
EOC

juju ssh magnum/leader -- "sudo mkdir -p /etc/systemd/system/magnum-conductor.service.d"
juju ssh magnum/leader -- "sudo tee /etc/systemd/system/magnum-conductor.service.d/override.conf > /dev/null" << 'EOC'
[Service]
ExecStart=
ExecStart=/usr/bin/magnum-conductor --config-file /etc/magnum/magnum.conf --config-dir /etc/magnum/magnum.conf.d
EOC

juju ssh magnum/leader -- "sudo systemctl daemon-reload"
juju ssh magnum/leader -- "sudo systemctl restart magnum-api magnum-conductor"
juju ssh magnum/leader -- "sudo systemctl status magnum-api magnum-conductor --no-pager"

Verify the override took effect:

juju ssh magnum/leader -- "sudo systemctl cat magnum-api | grep ExecStart"
juju ssh magnum/leader -- "ps -ef | grep magnum-api | grep -v grep"
# Expect: /usr/bin/magnum-api with --config-dir flag
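Grepping ps output by eye is easy to misread, and a pattern match can pick up unrelated processes. As a sketch, the check can instead read the command line of the PID that systemd reports as the service's main process. The helper names has_config_dir and check_magnum_service are hypothetical; systemctl show -p MainPID --value is standard systemd.

```shell
# Succeed if a command line carries "--config-dir <dir>" or "--config-dir=<dir>".
# $1 = command line text, $2 = expected directory
has_config_dir() {
  case " $1 " in
    *" --config-dir $2 "*|*" --config-dir=$2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch the live command line of a service's main PID from the unit and check it.
# $1 = service name, e.g. magnum-api or magnum-conductor
check_magnum_service() {
  local cmdline
  cmdline=$(juju exec --unit magnum/leader -- \
    "tr '\\0' ' ' < /proc/\$(systemctl show -p MainPID --value $1)/cmdline")
  has_config_dir "$cmdline" /etc/magnum/magnum.conf.d
}
```

Running check_magnum_service magnum-api and check_magnum_service magnum-conductor should both exit 0 once the override is active. If a service is not running, MainPID is 0, the read fails, and the check exits non-zero.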

Step 7 — Verify driver loaded

juju ssh magnum/leader -- "sudo tail -100 /var/log/magnum/magnum-conductor.log | grep -i -E 'driver|capi'"
# Expect: log lines mentioning k8s_capi_helm_v1 driver loaded
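The log grep depends on what the conductor chooses to log at startup. A complementary check, sketched below, lists the Python entry points that Magnum can discover on the unit. It assumes drivers are registered under the magnum.drivers entry-point group and that the unit runs Python 3.10 or newer (for entry_points(group=...)); confirm both during the Step 1 investigation. list_magnum_drivers and has_driver are hypothetical helper names.

```shell
# Print the driver names registered on the unit, one per line.
# Assumption: the entry-point group is "magnum.drivers".
list_magnum_drivers() {
  juju exec --unit magnum/leader -- \
    "python3 -c 'from importlib.metadata import entry_points; [print(e.name) for e in entry_points(group=\"magnum.drivers\")]'"
}

# Read driver names on stdin; succeed if $1 appears as an exact line.
has_driver() {
  grep -qxF -- "$1"
}
```

A typical call is list_magnum_drivers | has_driver k8s_capi_helm_v1. A missing entry point here points at the pip install in Step 3 rather than at the configuration in Step 5, which narrows down where to look when the log grep comes back empty.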

Step 8 — Smoke test

Create a cluster template and small cluster to validate end-to-end:

source ~/admin-openrc

# Cluster template
openstack coe cluster template create \
  --name k8s-capi-test \
  --image noble-amd64 \
  --keypair capi-mgmt-key \
  --external-network <ext-net> \
  --master-flavor m1.medium \
  --flavor m1.medium \
  --coe kubernetes \
  --network-driver calico \
  --labels driver=k8s_capi_helm_v1,kube_tag=v1.32.2

# Cluster
openstack coe cluster create \
  --cluster-template k8s-capi-test \
  --master-count 1 \
  --node-count 1 \
  --keypair capi-mgmt-key \
  k8s-capi-smoke

# Poll
watch -n 30 'openstack coe cluster show k8s-capi-smoke -c status -c status_reason'
# Expect CREATE_COMPLETE within 15-20 min
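For unattended runs, the watch command can be replaced by a loop that stops on a terminal status or a timeout. This is a sketch under stated assumptions: wait_for_cluster is a hypothetical helper, the 1200-second default mirrors the 15-20 minute expectation above, and any status ending in _FAILED is treated as terminal.

```shell
# Poll a Magnum cluster until CREATE_COMPLETE, a *_FAILED status, or timeout.
# $1 = cluster name, $2 = timeout in seconds (default 1200), $3 = poll interval (default 30)
# Exit codes: 0 complete, 1 failed status, 2 CLI error, 3 timeout.
wait_for_cluster() {
  local name=$1 timeout=${2:-1200} interval=${3:-30} elapsed=0 status=""
  while [ "$elapsed" -le "$timeout" ]; do
    status=$(openstack coe cluster show "$name" -f value -c status) || return 2
    case "$status" in
      CREATE_COMPLETE) echo "$status"; return 0 ;;
      *_FAILED)        echo "$status" >&2; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s (last status: ${status})" >&2
  return 3
}
```

Calling wait_for_cluster k8s-capi-smoke uses the defaults. On a failed status, openstack coe cluster show k8s-capi-smoke -c status_reason gives the detail that the loop does not print.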

Tear down the smoke cluster after validation:

openstack coe cluster delete k8s-capi-smoke
# Wait for DELETE_COMPLETE
openstack coe cluster template delete k8s-capi-test

Exit criteria

  • Magnum services running with --config-dir /etc/magnum/magnum.conf.d visible in the live process
  • k8s_capi_helm_v1 driver logged at conductor startup
  • Smoke-test cluster reached CREATE_COMPLETE and was torn down cleanly

Idempotency and recovery notes

  • The systemd override survives charm config-changed (charm rewrites magnum.conf but doesn't touch the conf.d dir or systemd drop-ins)
  • The pip-installed driver may NOT survive a charm upgrade-charm. If the unit's Python environment is rebuilt or the import check in Step 3 fails afterwards, re-run Step 3
  • The kubeconfig at /etc/magnum/kubeconfig is operator-managed. It survives charm hooks, but if Magnum is redeployed it must be restored by re-running Step 4

Recurring pitfalls

  • juju ssh HANGS when stdout is redirected — use juju exec --unit X -- 'cmd'
  • Python magnum at /usr/lib/python3/dist-packages/magnum/ needs --break-system-packages for PEP 668
  • Heredoc nesting in juju ssh is fragile — keep heredocs simple, single level
  • Non-ASCII characters in conf.d files cause silent daemon failures — ensure ASCII only