# Runbook 05 — Magnum CAPI Helm driver install

**Status:** Executes after `04-magnum-domain.md` (Keystone wiring) and
`04a-capi-bootstrap-cluster.md` (workload cluster + kubeconfig staged).
Final post-deploy step to make Magnum capable of creating CAPI-managed
tenant K8s clusters.

**Cross-references:**
- D-007 Layer B (Magnum two-layer install)
- D-017 (CAPI bootstrap cluster lifecycle)
- Runbook 04a §19 (workload kubeconfig handoff)
- Workstream 3c decision (2026-05-22): magnum-capi-helm 1.1.0 from PyPI; workload-cluster kubeconfig (NOT bootstrap k3s)

**Known doc inconsistency (tracked for cleanup):**
D-007's Layer B currently states the kubeconfig points at "capi-mgmt.maas
bootstrap k3s". That language is correct for Bobcat (no pivot) but obsolete
post-workstream-3b (pivot mandatory). This runbook uses the workload cluster
kubeconfig as the canonical target. D-007 patch to follow in a workstream-3
cleanup commit.

---

## 1. Purpose & scope

Graft the CAPI Helm driver onto the Charmed Magnum deployment so that
`openstack coe cluster create` provisions tenant K8s clusters via CAPI (in
the workload cluster) instead of via the deprecated Heat driver.

**Output of this runbook:**

- `magnum-capi-helm==1.1.0` installed on the magnum unit's system Python.
- `/etc/magnum/kubeconfig` populated with the workload cluster's
  kubeconfig (the post-pivot home of the CAPI controllers).
- `/etc/magnum/magnum.conf.d/99-capi.conf` configured with
  `enabled_drivers = k8s_capi_helm_v1` and `[capi_helm] kubeconfig_file=`.
- Systemd overrides on `magnum-api` and `magnum-conductor` that replace
  the init.d wrapper's ExecStart with explicit `--config-dir` invocation.
- Both services running cleanly with the CAPI driver loaded.

**Scope:** v1 testcloud. Roosevelt deltas in §12.

**Out of scope:**
- Magnum domain setup (runbook 04)
- Workload cluster lifecycle (runbook 04a)
- Smoketest tenant cluster creation is OPTIONAL (§11) — full validation
  framework belongs in runbook 08.

---

## 2. Decisions captured

| Decision | Choice | Reason |
|---|---|---|
| Driver pin | `magnum-capi-helm==1.1.0` from PyPI | D-007 correction (stackhpc fork archived Dec 2024; canonical project on opendev/PyPI; 1.1.0 is last Caracal-cycle release) |
| Install method | `pip3 install --break-system-packages` | PEP 668: Ubuntu 23.04+ (including 24.04 LTS) marks the system Python as externally managed, so a system-site install needs the explicit override |
| Install scope | System Python on magnum unit (not venv) | Magnum charm uses system-packaged python at `/usr/lib/python3/dist-packages/magnum/`; driver must import from same site |
| Kubeconfig target | Workload cluster (post-pivot) | Workstream 3b — bootstrap k3s is empty post-pivot; CAPI controllers live in workload |
| Kubeconfig source | `$HOME/magnum-capi/capi-mgmt-cluster.kubeconfig` (staged by 04a §19) | Documented handoff |
| Driver entry-point name | `k8s_capi_helm_v1` | Per upstream magnum-capi-helm 1.1.0; entry point verified in §5, enablement in §10 |
| Conf.d filename | `99-capi.conf` | Numeric prefix ensures it loads AFTER any charm-managed conf, so `enabled_drivers` override wins |
| File encoding | ASCII-only | Non-ASCII in conf.d causes silent magnum daemon failures (handoff lesson; cf. Horizon `local_settings.d` issue) |
| Trustee credential | Existing magnum-shared user (charm-managed) | Roosevelt will use app-credential pattern |

---

## 3. Prerequisites

| Prereq | Verification |
|---|---|
| Magnum charm active/idle | `juju status magnum \| grep magnum/0` shows `active idle` |
| Magnum domain setup completed (runbook 04) | `openstack domain show magnum \| grep enabled` returns `True` |
| Workload cluster reachable from jumphost | `kubectl --kubeconfig $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig get nodes` returns Ready nodes |
| CAPI controllers running in workload cluster | `kubectl --kubeconfig $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig get pods -n capi-system \| grep -v Running \| grep -v NAME` empty |
| Workload kubeconfig staged at expected path | `test -r $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig && stat -c %a $HOME/magnum-capi/capi-mgmt-cluster.kubeconfig` shows `600` |
| `juju exec` works to magnum/leader (use exec, NOT ssh, for non-interactive — handoff lesson) | `juju exec --unit magnum/leader -- hostname` returns the unit hostname |

**Set shell context:**

```bash
export WORK=$HOME/magnum-capi
export WORKLOAD_KUBECONFIG=$WORK/capi-mgmt-cluster.kubeconfig
export DRIVER_VERSION=magnum-capi-helm==1.1.0   # per D-007 correction
cd "$WORK"
```

> **`juju ssh` vs `juju exec` choice:** the handoff lessons explicitly call
> out that `juju ssh` hangs when stdout is redirected (PTY allocation issue).
> This runbook uses `juju exec` for all non-interactive command execution and
> reserves `juju ssh` only for cases where you actually want an interactive
> shell.
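
The `juju exec` convention above can be wrapped in a small function so that every step targets the same unit. This is a minimal sketch, not part of the runbook's tooling: `on_magnum` and `MAGNUM_UNIT` are hypothetical names introduced here for illustration.

```bash
# Hypothetical convenience wrapper (assumption: not shipped with this runbook).
# Runs one non-interactive command string on the Magnum leader via `juju exec`.
MAGNUM_UNIT=${MAGNUM_UNIT:-magnum/leader}

on_magnum() {
  # $1 is a single command string, evaluated by the shell on the unit
  juju exec --unit "$MAGNUM_UNIT" -- "$1"
}

# Usage (output can be redirected safely, unlike `juju ssh`):
#   on_magnum 'sudo systemctl is-active magnum-api magnum-conductor' > "$WORK/state.txt"
```

The later sections keep the explicit `juju exec --unit magnum/leader --` form so each command can be copied on its own.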

---

## 4. Pre-flight: capture current state

Capture the magnum unit's state BEFORE making changes. Useful for diagnosis
if anything goes wrong, and as a record of what was changed.

```bash
mkdir -p "$WORK/pre-state"

# Service unit files (as managed by charm)
juju exec --unit magnum/leader -- \
  'sudo systemctl cat magnum-api magnum-conductor 2>&1' \
  > "$WORK/pre-state/systemd-units.txt"

# Currently-enabled drivers
juju exec --unit magnum/leader -- \
  'sudo grep -r enabled_drivers /etc/magnum/ 2>/dev/null || echo "(no enabled_drivers found — charm default applies)"' \
  > "$WORK/pre-state/drivers-pre.txt"

# Python site-packages — see what's already installed
juju exec --unit magnum/leader -- \
  'sudo pip3 list 2>/dev/null | grep -iE "magnum|cluster|helm|kubernetes" || true' \
  > "$WORK/pre-state/pip-pre.txt"

# conf.d state
juju exec --unit magnum/leader -- \
  'sudo ls -la /etc/magnum/magnum.conf.d/ 2>/dev/null || echo "(no conf.d directory)"' \
  > "$WORK/pre-state/confd-pre.txt"

# Service running state
juju exec --unit magnum/leader -- \
  'sudo systemctl is-active magnum-api magnum-conductor' \
  > "$WORK/pre-state/service-state-pre.txt"

# Display the captured state
cat "$WORK/pre-state/"*.txt
```

> **What to look for in pre-state:** the charm-managed `enabled_drivers` value
> probably includes Heat-based drivers (e.g. `k8s_fedora_coreos_v1`). The 99-capi.conf
> override in §7 replaces this with the single CAPI driver. The pre-state
> capture documents what was active before the override took effect.

---

## 5. Install magnum-capi-helm 1.1.0

```bash
juju exec --unit magnum/leader -- \
  "sudo pip3 install $DRIVER_VERSION --break-system-packages"
```

**Verify install:**

```bash
juju exec --unit magnum/leader -- \
  'sudo pip3 show magnum-capi-helm | head -10'
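# Expect: Name: magnum-capi-helm
#         Version: 1.1.0
#         Location: /usr/local/lib/python3.X/dist-packages
#         (pip run as root on Ubuntu installs under /usr/local; that directory
#         is on the system interpreter's sys.path, alongside
#         /usr/lib/python3/dist-packages where the packaged magnum lives)

juju exec --unit magnum/leader -- \
  'sudo python3 -c "import magnum_capi_helm; print(magnum_capi_helm.__file__)"'
# Expect: a path ending in dist-packages/magnum_capi_helm/__init__.py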
```

**Check that the driver entry point is registered:**

```bash
juju exec --unit magnum/leader -- \
  'sudo python3 -c "
from stevedore import driver
mgr = driver.DriverManager(
    namespace=\"magnum.drivers\",
    name=\"k8s_capi_helm_v1\",
    invoke_on_load=False
)
print(\"Driver class:\", mgr.driver)
"'
# Expect: Driver class: <class 'magnum_capi_helm.driver.Driver'>
# (or similar — the actual class path is package-version-dependent)
```

> If the entry point check fails with stevedore's `NoMatches` error ("No
> 'magnum.drivers' driver found, looking for 'k8s_capi_helm_v1'"), the driver
> name in 1.1.0 may differ from what D-007 documented. Inspect the installed
> package's `entry_points.txt`:
>
> ```bash
> juju exec --unit magnum/leader -- \
>   'sudo cat /usr/local/lib/python3*/dist-packages/magnum_capi_helm*.dist-info/entry_points.txt /usr/lib/python3/dist-packages/magnum_capi_helm*.dist-info/entry_points.txt 2>/dev/null'
> ```
>
> Find the entry under `[magnum.drivers]` — use that exact name in §7.

---

## 6. Stage workload kubeconfig on magnum unit

```bash
# Transfer kubeconfig from jumphost to magnum unit
juju scp "$WORKLOAD_KUBECONFIG" magnum/leader:/tmp/kubeconfig

# Install with correct ownership/mode in one atomic step
juju exec --unit magnum/leader -- \
  'sudo install -m 0640 -o root -g magnum /tmp/kubeconfig /etc/magnum/kubeconfig && sudo rm /tmp/kubeconfig'
```

**Verify:**

```bash
juju exec --unit magnum/leader -- \
  'sudo ls -la /etc/magnum/kubeconfig'
# Expect: -rw-r----- 1 root magnum ... /etc/magnum/kubeconfig

# Confirm magnum user can read it
juju exec --unit magnum/leader -- \
  'sudo -u magnum cat /etc/magnum/kubeconfig | head -3'
# Expect: apiVersion: v1 / clusters: / - cluster:

# Confirm kubectl can use it from the magnum unit (sanity check on API reachability)
juju exec --unit magnum/leader -- \
  'sudo -u magnum kubectl --kubeconfig /etc/magnum/kubeconfig get nodes 2>&1 | head -10'
# Expect: NAME ... STATUS=Ready for control plane + workers
# OR: kubectl not installed (acceptable — magnum-capi-helm uses Python client, not kubectl)
```

> **Why mode 0640 and group magnum:** the kubeconfig contains credentials for
> the workload cluster. With owner `root`, mode 0600 would lock out the
> `magnum` system user (which runs magnum-api/conductor). Mode 0640 with
> group `magnum` is the least-permissive setup that works. Do NOT use 0644:
> that would expose the credentials to every other user on the unit.
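
The ownership and mode expectation can also be asserted mechanically instead of by reading `ls -la` output. The following is a sketch under the assumption that GNU `stat` is available on the unit; `check_perms` is a hypothetical helper name.

```bash
# Hypothetical helper: compare a file's mode/owner/group against expectations.
# Usage: check_perms <path> <octal-mode> <owner> <group>
check_perms() {
  local path=$1 want_mode=$2 want_owner=$3 want_group=$4 actual
  # GNU stat format: %a = octal mode, %U = owner name, %G = group name
  actual=$(stat -c '%a %U %G' "$path") || return 2
  if [ "$actual" = "$want_mode $want_owner $want_group" ]; then
    echo "OK: $path ($actual)"
  else
    echo "MISMATCH: $path is '$actual', want '$want_mode $want_owner $want_group'" >&2
    return 1
  fi
}

# On the magnum unit (needs root to stat files under /etc/magnum):
#   check_perms /etc/magnum/kubeconfig 640 root magnum
```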

---

## 7. Configure `/etc/magnum/magnum.conf.d/99-capi.conf`

Generate the conf locally first, then transfer. Snap confinement does not
apply to plain bash on the jumphost, but paths stay under `$HOME` for
consistency with the snap-confined CLI steps (see §13, item 8).

**ASCII-only verification is critical** — the handoff documents non-ASCII
characters in `conf.d` files causing silent daemon failures (cf. Horizon
`local_settings.d`). Use plain straight quotes, ASCII dashes, no smart
typography.

```bash
# Write locally
cat > "$WORK/99-capi.conf" <<'EOF'
[DEFAULT]
enabled_drivers = k8s_capi_helm_v1

[capi_helm]
kubeconfig_file = /etc/magnum/kubeconfig
EOF

# Verify it is pure ASCII (no UTF-8 sneakers)
file "$WORK/99-capi.conf"
# Expect: ASCII text
# If it says anything else (e.g. "UTF-8 Unicode text" or, with newer file(1)
# releases, "Unicode text, UTF-8 text"), STOP and rewrite by hand. Even one
# stray em-dash or smart quote will silently break magnum.

# Byte-level check (paranoid mode): print any line holding a byte outside
# printable ASCII (tabs and carriage returns are flagged too)
LC_ALL=C grep -n '[^ -~]' "$WORK/99-capi.conf"
# Expect: empty output
```
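
Because the same ASCII rule applies to every file this runbook transfers, the check can be folded into one reusable function. This is a minimal sketch; `assert_ascii_only` is a hypothetical name, and it assumes a grep that honors `LC_ALL=C` byte semantics (GNU grep does).

```bash
# Hypothetical helper: fail if any given file contains a byte outside
# printable ASCII (0x20..0x7E). Newlines are line separators and are fine;
# tabs and carriage returns are reported as offenders.
assert_ascii_only() {
  local f rc=0
  for f in "$@"; do
    # In the C locale each byte is one character, so [^ -~] matches any
    # byte that is not between space and tilde.
    if LC_ALL=C grep -n '[^ -~]' "$f"; then
      echo "NON-ASCII content in: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage on the jumphost, before any `juju scp`:
#   assert_ascii_only "$WORK/99-capi.conf" || echo "fix the file before transfer"
```

The function prints each offending line with its line number, which makes a stray smart quote easier to locate than the `file` verdict alone.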

**Stage and install:**

```bash
juju scp "$WORK/99-capi.conf" magnum/leader:/tmp/99-capi.conf

juju exec --unit magnum/leader -- \
  'sudo mkdir -p /etc/magnum/magnum.conf.d && sudo install -m 0644 -o root -g root /tmp/99-capi.conf /etc/magnum/magnum.conf.d/99-capi.conf && sudo rm /tmp/99-capi.conf'

# Verify
juju exec --unit magnum/leader -- \
  'sudo ls -la /etc/magnum/magnum.conf.d/ && sudo cat /etc/magnum/magnum.conf.d/99-capi.conf'
# Expect: file listed; content matches what was written
```
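
The `99-` prefix only helps if nothing else in the directory sorts after it. To my understanding, oslo.config reads the `*.conf` files in a `--config-dir` in sorted filename order, with later files overriding earlier ones; treat that as an assumption to confirm against the oslo.config documentation. The sketch below, with the hypothetical helper `last_conf_file`, prints the file that sorts last in byte order.

```bash
# Hypothetical helper: print the *.conf file in a directory that sorts last
# in byte order. Assumes GNU find and file names without newlines.
last_conf_file() {
  local dir=$1
  find "$dir" -maxdepth 1 -type f -name '*.conf' -printf '%f\n' \
    | LC_ALL=C sort \
    | tail -n 1
}

# On the magnum unit:
#   last_conf_file /etc/magnum/magnum.conf.d
# If the result is not 99-capi.conf, a later file may override enabled_drivers.
```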

---

## 8. Systemd override on magnum-api + magnum-conductor

The Charmed Magnum unit files use a wrapper pattern:

```
ExecStart=/etc/init.d/magnum-api systemd-start
```

The wrapper does NOT pass `--config-dir` to magnum-api, so `/etc/magnum/magnum.conf.d/`
is never loaded. The 99-capi.conf would have no effect.

Override with explicit `--config-file` + `--config-dir` invocation.

**Generate override files locally:**

```bash
cat > "$WORK/magnum-api-override.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/magnum-api --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d
EOF

cat > "$WORK/magnum-conductor-override.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/magnum-conductor --config-file=/etc/magnum/magnum.conf --config-dir=/etc/magnum/magnum.conf.d
EOF

# ASCII check
file "$WORK/magnum-api-override.conf" "$WORK/magnum-conductor-override.conf"
# Expect: ASCII text x2
```

> **The empty `ExecStart=` line is critical.** Systemd appends each `ExecStart=`
> assignment to a list rather than overwriting it, so an empty assignment is
> required to CLEAR the inherited directive before setting the replacement.
> Without the empty line, the unit would carry BOTH the init.d wrapper AND the
> new direct invocation, and systemd refuses to start a service with more than
> one `ExecStart=` unless it is `Type=oneshot`.
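
The ordering rule above is easy to break when a drop-in is edited by hand, so it can be linted locally before transfer. A minimal sketch follows; `check_override` is a hypothetical helper, and it only inspects line order, not whether the command itself is valid.

```bash
# Hypothetical helper: verify that a drop-in clears ExecStart before
# redefining it. Prints "ok" on success, a reason on failure.
check_override() {
  local file=$1
  awk '
    /^ExecStart=[[:space:]]*$/  { if (!cleared) cleared = NR }      # empty reset assignment
    /^ExecStart=.*[^[:space:]]/ { if (!first_set) first_set = NR }  # first real command
    END {
      if (!cleared)            { print "missing empty ExecStart= line"; exit 1 }
      if (!first_set)          { print "missing replacement ExecStart= line"; exit 1 }
      if (cleared > first_set) { print "empty ExecStart= must come first"; exit 1 }
      print "ok"
    }
  ' "$file"
}

# Usage on the jumphost:
#   check_override "$WORK/magnum-api-override.conf"
#   check_override "$WORK/magnum-conductor-override.conf"
```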

**Install on the unit:**

```bash
juju scp "$WORK/magnum-api-override.conf" magnum/leader:/tmp/magnum-api-override.conf
juju scp "$WORK/magnum-conductor-override.conf" magnum/leader:/tmp/magnum-conductor-override.conf

juju exec --unit magnum/leader -- \
  'sudo mkdir -p /etc/systemd/system/magnum-api.service.d /etc/systemd/system/magnum-conductor.service.d && \
   sudo install -m 0644 -o root -g root /tmp/magnum-api-override.conf /etc/systemd/system/magnum-api.service.d/override.conf && \
   sudo install -m 0644 -o root -g root /tmp/magnum-conductor-override.conf /etc/systemd/system/magnum-conductor.service.d/override.conf && \
   sudo rm /tmp/magnum-api-override.conf /tmp/magnum-conductor-override.conf'

# Reload systemd to pick up the overrides
juju exec --unit magnum/leader -- 'sudo systemctl daemon-reload'

# Verify the overrides are in place (systemctl cat prints the base unit followed by its drop-ins)
juju exec --unit magnum/leader -- 'sudo systemctl cat magnum-api | grep -A1 ExecStart'
# Expect: THREE ExecStart= lines: the original init.d wrapper from the base unit,
# then the empty clear-line and the new /usr/bin/magnum-api invocation from override.conf
juju exec --unit magnum/leader -- 'sudo systemctl cat magnum-conductor | grep -A1 ExecStart'
# Expect: the same three-line pattern for magnum-conductor
```

> **Charm reconciliation note:** the Magnum charm may rewrite its own systemd
> units on config changes or upgrades. The drop-in override at
> `/etc/systemd/system/magnum-api.service.d/override.conf` is OUTSIDE the
> charm's writable zone and should survive. Verify after any `juju refresh` or
> `juju config magnum` command by re-running the `systemctl cat` check above.

---

## 9. Restart services + verify health

```bash
juju exec --unit magnum/leader -- \
  'sudo systemctl restart magnum-api magnum-conductor'

# Wait briefly for services to initialize
sleep 5

# Check active state
juju exec --unit magnum/leader -- \
  'sudo systemctl is-active magnum-api magnum-conductor'
# Expect: active (x2)

# Examine recent journal for errors (the critical step — magnum's silent failure
# mode means we must read logs, not just trust is-active)
juju exec --unit magnum/leader -- \
  'sudo journalctl -u magnum-api --since "2 minutes ago" --no-pager | tail -50'
juju exec --unit magnum/leader -- \
  'sudo journalctl -u magnum-conductor --since "2 minutes ago" --no-pager | tail -50'
```

**Look for these red flags in the logs:**

| Symptom | Likely cause | Remediation |
|---|---|---|
| `ModuleNotFoundError: No module named 'magnum_capi_helm'` | §5 pip install failed | Re-run §5; check pip3 output |
| `EntryPointError: No 'k8s_capi_helm_v1' driver` | Driver entry-point name mismatch | Verify name per §5 footnote; update §7 |
| Service repeatedly restarts (look for "Started" appearing twice in 10s) | Likely a config error in 99-capi.conf | Re-check ASCII-only; check magnum.conf.d permissions |
| `kubeconfig_file` not honored | `--config-dir` not being passed | §8 override not active; re-run `systemctl daemon-reload`, then restart both services |
| Silent: no error but driver also not loading | Non-ASCII char snuck into a conf | `file /etc/magnum/magnum.conf.d/99-capi.conf` — if it says UTF-8, regenerate |
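
Reading fifty journal lines by eye is error-prone, so the textual red flags from the table can be pre-filtered. The patterns below are illustrative only (taken from the table plus the generic Python traceback header) and are an assumption about what Magnum logs, not an exhaustive list; `scan_red_flags` is a hypothetical helper.

```bash
# Hypothetical helper: read journal text on stdin, print lines matching known
# red-flag patterns, and return non-zero if any were found.
scan_red_flags() {
  # grep exits 0 when something matched; invert so a match counts as failure
  if grep -nE 'No module named|EntryPointError|Traceback \(most recent call last\)'; then
    return 1
  fi
  return 0
}

# Usage from the jumphost:
#   juju exec --unit magnum/leader -- \
#     'sudo journalctl -u magnum-conductor --since "2 minutes ago" --no-pager' \
#     | scan_red_flags && echo "no textual red flags"
```

A clean result does not rule out the silent failure mode in the last table row, which leaves no error text to match; the `file` check remains the test for that case.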

---

## 10. CAPI driver enablement check

Verify the driver is actually loaded by Magnum and reachable via the API.

```bash
source $HOME/admin-openrc

# Confirm the Magnum API endpoint responds
openstack coe cluster template list -f json
# (an empty template list is fine; this only checks that the endpoint answers)

# Direct check on the unit: scan the service's loaded drivers
juju exec --unit magnum/leader -- \
  'sudo journalctl -u magnum-conductor --since "5 minutes ago" --no-pager | grep -iE "driver|enabled" | head -20'
# Expect: a line mentioning k8s_capi_helm_v1 having been loaded
# (Magnum logs the loaded drivers at startup)

# Definitive check: try creating a cluster template that requires the CAPI driver
openstack coe cluster template create magnum-capi-driver-check \
  --image noble-amd64 \
  --keypair capi-workload-key \
  --external-network ext_net \
  --master-flavor capi-mgmt-node \
  --flavor capi-mgmt-node \
  --coe kubernetes \
  --network-driver calico \
  --labels kube_tag=v1.31.4

openstack coe cluster template show magnum-capi-driver-check -c name -c coe -c labels
```

> **If template create fails with "driver not enabled" or similar:** the
> Magnum API process is not loading the conf.d. Verify the systemd override
> took effect — `sudo systemctl show magnum-api -p ExecStart` on the unit
> should show the explicit `--config-dir` invocation. If it still shows the
> init.d wrapper, the daemon-reload + restart did not pick up the override.

**Cleanup the driver-check template:**

```bash
openstack coe cluster template delete magnum-capi-driver-check
```

---

## 11. Optional smoketest — create a tenant CAPI cluster

This step is **optional**. Full validation belongs in runbook 08. Use this
smoketest only if you want immediate confirmation that the entire chain
(Magnum API -> conductor -> magnum-capi-helm -> CAPI controllers in workload
cluster -> tenant K8s cluster on tenant VMs) works end-to-end.

```bash
# Create a cluster template tuned for testcloud smoketest
openstack coe cluster template create magnum-smoketest-template \
  --image noble-amd64 \
  --keypair capi-workload-key \
  --external-network ext_net \
  --master-flavor capi-mgmt-node \
  --flavor capi-mgmt-node \
  --coe kubernetes \
  --network-driver calico \
  --labels boot_volume_size=20,kube_tag=v1.31.4,octavia_provider=ovn

# Create a 1+1 cluster (minimum for smoketest)
openstack coe cluster create magnum-smoketest \
  --cluster-template magnum-smoketest-template \
  --master-count 1 \
  --node-count 1

# Poll for status (15-20 min typical; CAPI provisions tenant VMs end-to-end)
for i in $(seq 1 60); do
  STATUS=$(openstack coe cluster show magnum-smoketest -c status -f value)
  echo "$(date -Is) status=$STATUS"
  case "$STATUS" in
    CREATE_COMPLETE) echo "Smoketest passed"; break ;;
    CREATE_FAILED)   echo "Smoketest FAILED"; openstack coe cluster show magnum-smoketest; exit 1 ;;
  esac
  sleep 30
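done
# The loop also ends after 30 min without a terminal status; do not continue in that case
[ "$STATUS" = "CREATE_COMPLETE" ] || { echo "Smoketest did not reach CREATE_COMPLETE (last status: $STATUS)"; exit 1; }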

# Retrieve the smoketest cluster's kubeconfig (create the target directory first)
mkdir -p "$WORK/smoketest-kubeconfig"
openstack coe cluster config magnum-smoketest --dir "$WORK/smoketest-kubeconfig"

# Sanity-check the smoketest cluster
KUBECONFIG="$WORK/smoketest-kubeconfig/config" kubectl get nodes
KUBECONFIG="$WORK/smoketest-kubeconfig/config" kubectl get pods -A | head -20

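# Cleanup the smoketest cluster. Deletion is asynchronous, and the template
# cannot be deleted while the cluster still references it, so wait (up to
# 20 min) for the cluster to disappear first.
openstack coe cluster delete magnum-smoketest
for i in $(seq 1 40); do
  openstack coe cluster show magnum-smoketest >/dev/null 2>&1 || break
  sleep 30
done
openstack coe cluster template delete magnum-smoketest-template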
```
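
The polling pattern used in the smoketest can be generalized so that success, failure, and timeout each produce a distinct exit code. This is a sketch rather than part of the runbook's tooling; `wait_for_status` and its argument order are hypothetical.

```bash
# Hypothetical helper: run a status command until it prints the success value,
# the failure value, or the attempt budget runs out.
# Usage: wait_for_status <ok> <fail> <attempts> <sleep_seconds> <command...>
# Returns 0 on success, 1 on failure status, 2 on timeout.
wait_for_status() {
  local ok=$1 fail=$2 attempts=$3 pause=$4 status i
  shift 4
  for i in $(seq 1 "$attempts"); do
    # A failed query is reported but does not abort the wait
    status=$("$@" 2>/dev/null) || status="(query failed)"
    echo "attempt $i/$attempts status=$status"
    case "$status" in
      "$ok")   return 0 ;;
      "$fail") return 1 ;;
    esac
    sleep "$pause"
  done
  echo "timed out after $attempts attempts (last status: $status)"
  return 2
}

# Example (same budget as the smoketest loop: 60 attempts, 30 s apart):
#   wait_for_status CREATE_COMPLETE CREATE_FAILED 60 30 \
#     openstack coe cluster show magnum-smoketest -c status -f value
```

Returning instead of calling `exit` keeps the helper safe to use in an interactive shell, where `exit 1` would close the session.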

> **What success looks like:** the CAPI controllers in the workload cluster
> receive the new Cluster CR (created by magnum-capi-helm in response to the
> Magnum API call), CAPO talks to OpenStack to provision tenant VMs, the
> tenant VMs join the new K8s cluster, and the new cluster has 1 control
> plane + 1 worker Ready. Octavia provides the API server LB (visible as a
> Floating IP in the tenant project).

---

## 12. Roosevelt deltas (forward-look)

| Aspect | Testcloud (v1) | Roosevelt |
|---|---|---|
| Driver pin source | PyPI `magnum-capi-helm==1.1.0` | Internal mirror with checksum verification |
| Driver pin record | Implicit in this runbook | Captured in Vault as audit artifact alongside CAPI pins |
| Kubeconfig source | Workload cluster (post-pivot per 04a §17) | Same |
| Kubeconfig rotation | Manual on capi-mgmt rebuild | Automated when workload cluster cert rotates |
| Trustee credential | Charm-default magnum-shared user | Per-tenant app credentials via Vault auth method |
| Magnum HA | num_units=1 (per D-009 testcloud) | num_units=3 with hacluster + provider VIP |
| Driver upgrade discipline | Manual re-run of §5 | Tracked maintenance window; Vault audit log |
| Systemd override | Drop-in at `/etc/systemd/system/magnum-*.service.d/override.conf` | Same — but provided via a charm overlay package, not manual file install |
| ASCII-only enforcement | Manual check (§7, §8) | Pre-flight lint in `scripts/pre-flight-checks.sh` |

---

## 13. Documented runtime gotchas (carry-forward from handoff)

These gotchas burned cycles during the Bobcat Magnum CAPI work. Each is
explicitly handled in this runbook; collecting them here for visibility:

1. **PEP 668 `--break-system-packages`** (§5). Ubuntu 23.04+ (including
   24.04 LTS) refuses `pip install` against system Python by default. The flag
   is required for the magnum-capi-helm install path used by Charmed Magnum.
2. **`juju ssh` hangs on stdout redirect.** PTY allocation issue.
   This runbook uses `juju exec` for all non-interactive command execution.
3. **Heredoc nesting in `juju ssh` is fragile.** This runbook writes
   conf files locally first and uses `juju scp` + `juju exec install` to
   transfer — single-level only.
4. **Non-ASCII characters in `conf.d` files cause silent daemon failures.**
   §7 and §8 both include `file <path>` ASCII verification before transfer.
5. **`openstack -f value -c X -c Y` outputs in alphabetical field order,
   not flag order.** This runbook uses single-column queries or `-f json |
   jq` throughout.
6. **Charm-managed `enabled_drivers` is overridden, not appended.** The
   `enabled_drivers = k8s_capi_helm_v1` line in 99-capi.conf REPLACES the
   charm-default value (which would include the deprecated Heat drivers).
7. **The systemd override empty `ExecStart=` line is required** to clear
   the inherited ExecStart before setting the replacement (§8).
8. **Snap-confined `openstack` CLI cannot read `/tmp`.** This runbook stages
   files under `$WORK=$HOME/magnum-capi`. The smoketest in §11 also writes
   to `$WORK/smoketest-kubeconfig`.

---

## 14. Change log

| Date | Change | Reference |
|---|---|---|
| 2026-05-22 | Document created. magnum-capi-helm 1.1.0 from PyPI; workload-cluster kubeconfig (post-pivot per workstream 3b); systemd override pattern; ASCII-only conf.d. | Workstream 3c |