# Appendix A -- Troubleshooting / Known-Issues Index

Keyed by the same identifiers used inline in the phase runbooks (`D-NNN`, `DOCFIX-NNN`,
`L-P6-N`, `L-N`, `R-N`, `KI-P3-NNN`, or a named symptom). This is an OPERATIONAL index
(symptom -> cause -> fix), NOT the decision log: full rationale lives in
`design-decisions.md` and the per-decision files (`D-0NN-*.md`); the driver fix has its
own `magnum-capi-helm-driver-fix-runbook`. Each entry notes the phase(s) that reference
it. ASCII-only.

================================================================================
## Remote execution / scripting
================================================================================

### DOCFIX-021 -- heredoc / stdin consumption  (phase-06, phase-07)
- Symptom: a multi-line `juju ssh`/`ssh ... bash -s` or remote `sudo` block dies
  early or behaves as if truncated; later commands in the heredoc never run.
- Cause: an inner `ssh`/`sudo`/`juju ssh` (or any stdin reader) consumes the rest
  of the heredoc/pipe that was feeding the outer command.
- Fix: append `</dev/null` to every inner `ssh`/`sudo`/`juju ssh` invocation
  (use `</dev/tty` instead only when the call genuinely needs an interactive prompt).
- Also: wrap multi-statement pasteable jumphost blocks in `( { ...; } )` so a stray
  `exit` cannot kill the interactive shell.
- SECOND MANIFESTATION (phase-03): a charm ACTION's human output silently corrupts a
  captured artifact. `juju run vault/leader get-root-ca` wraps the PEM in an INDENTED
  YAML `output: |-` block; `sed`-by-marker preserves the indent and an indented
  `-----BEGIN CERTIFICATE-----` is not valid PEM -> openssl "Unable to load
  certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND. Fix: pull from the action JSON
  (real newlines, no indent): `juju run vault/leader get-root-ca -m openstack
  --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'`.
  (Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)
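
The stdin-consumption failure reproduces locally, with no remote host involved. In this
minimal sketch `cat` stands in for the inner `ssh`/`sudo`/`juju ssh` (an assumption made
for illustration; any command that reads stdin behaves the same way):

```shell
# Unguarded: the inner reader swallows the remaining heredoc lines,
# so the outer bash never sees "echo second".
broken=$(bash -s <<'EOF'
echo first
cat >/dev/null
echo second
EOF
)

# Guarded: the inner reader is fed /dev/null, so the heredoc stays intact.
fixed=$(bash -s <<'EOF'
echo first
cat >/dev/null </dev/null
echo second
EOF
)

printf 'unguarded ran: %s\n' "$(printf '%s' "$broken" | tr '\n' ' ')"
printf 'guarded ran:   %s\n' "$(printf '%s' "$fixed" | tr '\n' ' ')"
```

The unguarded run stops after `first`; the guarded run reaches `second`.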

### L-P6-4 -- admin-kubeconfig / secret transfer  (phase-07)
- Risk: staging the cluster-admin kubeconfig (or any secret) in `/tmp`, or letting a
  PTY mangle it in transit.
- Fix: pipe base64 straight into a root-written file with `umask 077`, then `chown`
  to the service user and `chmod 0600` -- never touch `/tmp`. (Pattern in phase-07 7.2.)
- Hardening (Roosevelt): replace the cluster-admin kubeconfig with a scoped
  ServiceAccount kubeconfig carrying only the RBAC the driver needs.
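
A minimal sketch of the receiving side of that transfer. The helper name
`write_secret_b64` is hypothetical, and the unit name and destination path in the usage
comment are placeholders; the authoritative pattern is the one in phase-07 7.2.

```shell
# write_secret_b64 DEST: decode base64 from stdin straight into DEST.
# umask 077 makes the file owner-only from the moment it is created;
# the chmod covers the case where DEST already existed with wider bits.
write_secret_b64() {
  dest=$1
  ( umask 077 && base64 -d > "$dest" ) && chmod 0600 "$dest"
}

# Hypothetical remote use (placeholder unit and path; stdin carries the
# payload here, so this is the one inner call that must NOT get </dev/null):
#   base64 -w0 admin.kubeconfig |
#     juju ssh --pty=false magnum/0 'sudo sh -c "umask 077; base64 -d > /etc/magnum/kubeconfig"'
```

The `chown` to the service user and the final `chmod 0600` then run as separate remote
commands, as the entry describes.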

================================================================================
## k8s-snap bootstrap (mgmt cluster)
================================================================================

### DOCFIX-024 -- bootstrap config missing the cluster-config block  (phase-06)
- Symptom: `k8s bootstrap` "succeeds" but the node never reaches Ready; network and
  DNS are silently disabled; CoreDNS/Cilium absent.
- Cause: a bootstrap `--file` whose top level lacks a `cluster-config:` block leaves
  ALL features (network, dns, ...) at disabled defaults. Setting only `pod-cidr` /
  `service-cidr` / `extra-sans` does NOT enable them.
- Fix: include an explicit block:
      cluster-config:
        network: { enabled: true }
        dns:     { enabled: true }
  (See phase-06 6.4 for the full config.) Retry: `snap remove k8s --purge` then re-bootstrap.
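
A pre-flight check can catch the missing block before `k8s bootstrap` runs. This is a
sketch only: the helper name is hypothetical, and it tests for the presence of the
top-level key, not for which features are enabled inside it.

```shell
# has_cluster_config FILE: succeed only if FILE has a top-level
# "cluster-config:" key (column 0, optional trailing comment).
has_cluster_config() {
  grep -Eq '^cluster-config:[[:space:]]*(#.*)?$' "$1"
}

# Example gate (the file name is a placeholder):
#   has_cluster_config bootstrap.yaml ||
#     { echo "bootstrap config lacks cluster-config block" >&2; exit 1; }
```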

================================================================================
## CAPI provider install (mgmt cluster)
================================================================================

### DOCFIX-025a -- cert-manager Helm flag  (phase-06)
- Symptom: cert-manager install fails / CRDs absent when using `--set installCRDs=true`.
- Cause: `installCRDs` was removed from the cert-manager chart (~v1.18). The current
  flag is `crds.enabled=true`.
- Fix: `helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true`.

### D-034 -- CAPI install ordering (ORC before clusterctl init)  (phase-06)
- Symptom: after `clusterctl init`, `capo-controller-manager` CrashLoopBackOff
  (observed ~6 restarts / ~15 min) before self-healing.
- Cause: CAPO v0.14.4's `openstackserver` controller hard-depends on ORC's
  `Image.openstack.k-orc.cloud` CRD at startup. `clusterctl init` installs CAPO; if
  ORC is not yet present, CAPO crash-loops until it appears.
- Fix: install ORC (its manifest provides the `Image` CRD) BEFORE `clusterctl init`.
  Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor.
- Related rule: source every provider version from the chosen `capi-helm-charts`
  tag's `dependencies.json` (read live with `jq`); do not hardcode semver.
  (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.)
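
The two rules above can be sketched as small helpers. Assumptions, stated loudly: the
helper names are hypothetical; `dependencies.json` is treated as a flat object of
name -> version and the key you pass in must be checked against the real file; and the
CRD resource name `images.openstack.k-orc.cloud` is inferred from the `Image` kind in the
`openstack.k-orc.cloud` group.

```shell
# dep_version FILE KEY: print the version pinned for KEY in dependencies.json.
# jq -e exits non-zero when KEY is missing (null), so callers can gate on it.
dep_version() {
  jq -er --arg k "$2" '.[$k]' "$1"
}

# orc_image_crd_ready: block until the ORC Image CRD is established.
# Intended to run after the ORC install and BEFORE clusterctl init.
orc_image_crd_ready() {
  kubectl wait --for=condition=Established \
    crd/images.openstack.k-orc.cloud --timeout=120s
}

# Example (key name is an assumption; confirm it exists in the chosen tag):
#   CAPI_VERSION=$(dep_version dependencies.json cluster-api) || exit 1
```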

================================================================================
## Networking / pod egress
================================================================================

### D-035 -- dual-homed mgmt node pod-egress reverse-path failure  (phase-06)
- Symptom (the prior D-033 architecture): a pod's egress TCP connect to an external
  VIP hangs; the agnhost probe never reaches Completed. SYN leaves the correct NIC and
  the SYN-ACK arrives, but the reply is emitted back out the NIC instead of being
  redirected into the pod via `cilium_host` -- silent, asymmetric breakage. (The
  "do-07 pattern.")
- Cause: Cilium reverse-path handling on a node with multiple NICs.
- Fix (chosen): D-035 single-homed in-cloud tenant VM avoids it entirely; phase-06
  GATE 2 (agnhost pod -> Keystone VIP, must Complete) is the explicit proof. (The
  transferable alternative -- Cilium device pinning -- is a Roosevelt note, not v1.)

================================================================================
## Magnum conductor
================================================================================

### D-037 -- conductor config-dir injection (NOT a systemd ExecStart drop-in)  (phase-07)
- Symptom: the `[capi_helm]` conf.d drop-in is ignored; the conductor behaves as if the
  file had never been written, even though a systemd drop-in "looks" applied.
- Cause: these OpenStack debs (openstack-pkg-tools) run the daemon through an LSB init
  script that systemd invokes with `systemd-start`, NOT through an `ExecStart=` that
  launches the daemon directly. A systemd drop-in appending `--config-dir` passes it as
  a positional arg to the init script, which ignores it -- the flag never reaches the
  daemon. The args are assembled inside the init script from `DAEMON_ARGS` (base
  `--config-file` first), extensible only via `/etc/default/<service>`.
- Fix: create `/etc/default/magnum-conductor` (0644; the charm does not manage it):
      DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"
  Verify with the init script's own `show-args` (dry-run) AND `ps -ww -C
  magnum-conductor -o args` on the live process -- behavioral, not string-presence.
- Residual: if a future charm hook ever writes `/etc/default/magnum-conductor`, the
  append is lost and `[capi_helm]` silently stops being read. Re-check via show-args/ps.
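
The behavioral check can be scripted as a filter over the launched cmdline. A minimal
sketch, assuming the hypothetical helper name `cmdline_has_config_dir`; it matches whole
argv tokens, so a directory that merely shares a prefix does not pass.

```shell
# cmdline_has_config_dir DIR: read one or more cmdlines on stdin and succeed
# only if some cmdline passes "--config-dir DIR" or "--config-dir=DIR".
cmdline_has_config_dir() {
  awk -v d="$1" '
    { for (i = 1; i <= NF; i++)
        if (($i == "--config-dir" && $(i + 1) == d) || $i == ("--config-dir=" d))
          found = 1 }
    END { exit !found }'
}

# Gate on the live process, not on the unit text:
#   ps -ww -C magnum-conductor -o args= |
#     cmdline_has_config_dir /etc/magnum/magnum.conf.d ||
#     { echo "conductor is NOT reading magnum.conf.d" >&2; exit 1; }
```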

### L-P6-1 / L-P6-2 -- verify the launched cmdline, not the unit text  (phase-07)
- Rule: never assume the systemd `ExecStart` shape for OpenStack debs, and never treat
  "string present in the unit file" as "the daemon received the flag." Gate on the
  assembled/launched cmdline (`show-args`, then `ps` on the live process).

### L-P6-3 -- k8s version comes from the IMAGE, not a template label  (phase-08)
- Symptom: cluster create fails in the driver before provisioning.
- Cause: the magnum-capi-helm driver reads `kube_version` from the Glance image
  properties and routes on `os_distro`; it does NOT take k8s version from a template
  label.
- Fix: the workload image (e.g. `ubuntu-jammy-kube-v1.32.13`) MUST carry
  `kube_version` (e.g. v1.32.13) and `os_distro=ubuntu`. Verify before create (phase-08 8.0).
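
A sketch of that pre-create verification. The helper name is hypothetical, and the match
is deliberately format-tolerant (it scans the text of `openstack image show` output
rather than assuming a particular JSON layout), so treat a pass as a sanity check, not
as proof of the exact property values.

```shell
# image_props_ok: read "openstack image show" output on stdin and succeed only
# if it carries a kube_version that looks like a semver and os_distro=ubuntu.
image_props_ok() {
  props=$(cat)
  printf '%s\n' "$props" |
    grep -Eq 'kube_version[^A-Za-z0-9]*v?[0-9]+\.[0-9]+\.[0-9]+' &&
  printf '%s\n' "$props" |
    grep -Eq 'os_distro[^A-Za-z0-9]*ubuntu'
}

# Example:
#   openstack image show ubuntu-jammy-kube-v1.32.13 -f json | image_props_ok ||
#     { echo "image lacks kube_version / os_distro=ubuntu" >&2; exit 1; }
```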

================================================================================
## Driver / cluster health
================================================================================

### D-042 -- driver contract-coherence; health "infrastructure: not found"  (phase-07, phase-08, appendix-B)
- Symptom: `coe cluster show` reports `health_status = UNHEALTHY` deterministically
  (survives a conductor restart); only the `infrastructure` sub-check fails
  ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready.
- Cause: driver 1.3.0 reads `apiVersion` off `spec.infrastructureRef` to build its
  health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with
  NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the
  resource versions); only the driver's direct health query breaks.
- Fix: upgrade to the RELEASED `magnum-capi-helm==1.4.0` (the "generalize-api-resources"
  feature). 1.4.0 builds each health GET from an explicit api_version via its
  `[capi_helm] api_resources` option, which DEFAULTS to v1beta1 for every CAPI kind --
  and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override
  needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only.
  Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the
  Layer-A CAPI core.
- Operational caveat while unfixed: do NOT wire magnum auto-healing to `health_status`
  (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently.

================================================================================
## Cluster lifecycle / Octavia
================================================================================

### D-039 -- app-cred roles (load-balancer_member) / Octavia 403  (phase-08)
- Symptom: cluster create or delete wedges; CAPO gets 403 querying the Octavia LB.
- Cause: the Magnum-minted application credential lacks `load-balancer_member`
  (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state).
- Fix: ensure the service path mints app-creds carrying `load-balancer_member`
  (+ member, reader). Verify before acceptance (phase-08 prereqs).

### stuck-delete -- wedged CAPI cluster delete  (phase-08)
- Symptom: cluster stuck `DELETE_IN_PROGRESS`; helm release already gone; `Cluster`
  and `OpenStackCluster` CRs stuck Deleting (often on an Octavia 403, see D-039).
- Recovery: clear the `OpenStackCluster` finalizer on the mgmt cluster --
  `kubectl -n <magnum-ns> patch openstackcluster <cluster>-<suffix> --type=merge
  -p '{"metadata":{"finalizers":[]}}'`. The `Cluster` finalizer was only waiting on it,
  so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron
  resources in dependency order: router remove subnet -> router unset external-gateway
  -> router delete -> subnet delete -> network delete -> security group delete.
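
The neutron cleanup order above can be sketched as one function that stops at the first
failure. The function names and the `DRY_RUN` switch are hypothetical; the resource
names or IDs are whatever the orphaned cluster left behind.

```shell
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# cleanup_orphans ROUTER SUBNET NETWORK SECGROUP: delete in dependency order.
# The && chain stops at the first failure so nothing is removed out of order.
cleanup_orphans() {
  run openstack router remove subnet "$1" "$2" &&
  run openstack router unset --external-gateway "$1" &&
  run openstack router delete "$1" &&
  run openstack subnet delete "$2" &&
  run openstack network delete "$3" &&
  run openstack security group delete "$4"
}

# Review the plan first:
#   DRY_RUN=1 cleanup_orphans <router> <subnet> <network> <secgroup>
```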

### LB-failover -- LB stuck provisioning_status=ERROR after a host event  (phase-08)
- Symptom: the kube-api Octavia LB shows `operating_status ONLINE` but
  `provisioning_status ERROR` after a host outage/OOM.
- Cause: a control-plane op on the amphora failed during the outage.
- Fix: `openstack loadbalancer failover <lb-id>` in ADMIN-project scope (amphora /
  failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE
  (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE.
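
Watching the ERROR -> PENDING_UPDATE -> ACTIVE transition can be scripted as a bounded
poll. A sketch under assumptions: the function names and the `LB_STATUS_CMD` hook are
hypothetical (the hook exists only so the loop can be exercised without a cloud), and
the default 30 x 10s budget is an arbitrary margin over the ~100s observed.

```shell
# lb_status LB_ID: print the LB provisioning_status (admin-project scope).
lb_status() {
  openstack loadbalancer show "$1" -f value -c provisioning_status
}

# wait_lb_active LB_ID [TRIES] [PAUSE_SECONDS]: poll until ACTIVE or give up.
wait_lb_active() {
  lb=$1 tries=${2:-30} pause=${3:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(${LB_STATUS_CMD:-lb_status} "$lb")
    echo "provisioning_status=$status"
    [ "$status" = "ACTIVE" ] && return 0
    i=$((i + 1))
    sleep "$pause"
  done
  return 1
}

# Example:
#   openstack loadbalancer failover <lb-id> && wait_lb_active <lb-id>
```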

### uninitialized-taint -- workload addons Pending  (phase-08)
- Symptom: new workload nodes are kubelet-Ready but addon pods (metrics-server,
  node-feature-discovery, etc.) stay Pending; nodes carry
  `node.cluster.x-k8s.io/uninitialized`.
- Cause: that taint is removed by the CAPI machine controller on the MANAGEMENT
  cluster. If the mgmt cluster is down (see D-041), the taint persists.
- Fix: restore the mgmt cluster API; CAPI then removes the taint and addons schedule.

### CNI-label -- network_driver vs the chart-default Calico (1.4.0)  (phase-08)
- Note: under the as-FIRST-built driver 1.3.0 the legacy Magnum `network_driver` label
  was IGNORED and the capi-helm `openstack-cluster` chart's default CNI (Calico) always
  ran. Under the RELEASED 1.4.0 driver the `network_driver` template option IS honored
  (it maps through to the chart). To keep the as-built CNI (Calico), the `capi-k8s-v1-32`
  template OMITS `--network-driver` (phase-08); set `flannel` there only to intentionally
  switch the CNI. (Mgmt cluster CNI is separately Cilium, via k8s-snap.)

================================================================================
## Hyperconverged host / mgmt-VM resilience
================================================================================

### D-040 -- host OOM from low reserved-host-memory  (phase-08)
- Symptom: guests OOM-killed; a compute host may even present in Juju as
  `State=down` (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent).
- Cause: `reserved-host-memory` default 512 MB does not cover the co-located
  LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM.
- Fix: `reserved-host-memory = 8192` on all compute units (baked into the hardened
  bundle). Diagnose a suspected OOM-vs-reboot with `who -b` / `uptime` (no recent boot)
  and `journalctl -k | grep -i oom`; the ovsdb "no response to inactivity probe ...
  disconnecting" storm is the swap-thrash signature.

### D-041 -- single-node mgmt cluster does not self-heal  (phase-08)
- Symptom: after a host event the mgmt VM (`capi-mgmt-v2`) is SHUTOFF; FIP
  unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see
  uninitialized-taint).
- Cause: the D-035 single-node mgmt cluster is a SPOF with no MachineHealthCheck
  (unlike the workload cluster).
- Fix: `openstack server start capi-mgmt-v2` (API serves ~40s later; a brief TLS
  handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for
  Roosevelt.

### juju-macaroon -- "cannot get discharge ... EOF"  (phase-07, phase-08)
- Symptom: `juju ssh` (or another juju call) fails mid-session with a discharge/EOF error.
- Cause: the juju macaroon expired during a long session.
- Fix: re-run `juju login`, then retry.

================================================================================
## Teardown / MAAS reset (phase-00)
================================================================================

### DOCFIX-016 -- never `maas list` (API-key leak)  (phase-00, phase-01, phase-04)
- Risk: `maas list` prints the stored API key to stdout (and into any transcript/log).
- Fix: the profile name is known (`admin`); call `maas admin ...` directly. Never run
  `maas list` in a runbook or paste block.

### DOCFIX-017 -- no `maas whoami`; hardcode the eyeballed system_ids  (phase-00)
- Risk: scripting machine selection via `maas <profile> whoami` + owner filters is
  fragile and, in this lab, unnecessary.
- Fix: the four host system_ids are fixed and eyeball-verified
  (openstack0=4na83t, openstack1=qdbqd6, openstack2=h8frng, openstack3=tmsafc) --
  iterate those literals. (The older 01-destroy-model.md used `maas list`/`whoami` and
  released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.)
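
Iterating the literals can look like the sketch below. It assumes `release` is the
operation wanted at this step (the entry says the rebuild "releases 4"); the function
name and the `DRY_RUN` switch are hypothetical. Note that it never calls `maas list` or
`whoami`.

```shell
# The four eyeball-verified system_ids from this entry, as literals.
SYSTEM_IDS="4na83t qdbqd6 h8frng tmsafc"

# release_hosts: release each host through the known "admin" profile.
# DRY_RUN=1 prints the commands instead of running them.
release_hosts() {
  for id in $SYSTEM_IDS; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "maas admin machine release $id"
    else
      maas admin machine release "$id" </dev/null
    fi
  done
}
```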

### R7 -- sudo for libvirt / qemu-img  (phase-00, phase-01)
- The OSD qcow2 files (`/var/lib/libvirt/images/<host>-1.qcow2`) are root:root / 600;
  `qemu-img info|create`, `virsh domstate`, `stat`, `rm` against them all need `sudo`.

### KI-P3-001 -- VIP / primary collision  (phase-00, phase-04)
- Symptom: a charm `vip:` address equals a MAAS-auto-assigned machine/container
  primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary).
- Cause: the VIP block was not excluded from MAAS auto-static allocation (provider had
  NO VIP reservation), so MAAS handed primaries .225/.226/.227 onto the .224-.236 VIPs.
- Fix (durable): on EVERY space carrying VIPs (provider AND metal) reserve the
  front-loaded VIP /26 in MAAS, distinct from the primary range and any neutron
  allocation_pool (phase-00 Phase 4). A reserved range stops future auto-assign onto
  a configured VIP. Negative test post-deploy: no service vip == any unit primary.
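
The post-deploy negative test reduces to a set intersection. A minimal sketch, assuming
two plain-text inputs with one IP per line (how those lists are produced from the bundle
and from `juju status` is left to the operator; the helper name is hypothetical):

```shell
# vip_collisions VIPS_FILE PRIMARIES_FILE: print every address present in both
# files. Exit 0 when the lists are disjoint, 1 on any collision.
vip_collisions() {
  hits=$(grep -Fxf "$1" "$2" || true)   # zero matches is a valid answer
  [ -z "$hits" ] && return 0
  printf 'COLLISION: %s\n' $hits
  return 1
}
```

With the observed case (VIP .226 equal to the `1/lxd/3` primary) the function prints one
`COLLISION:` line and returns 1.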

================================================================================
## Deploy-time (phase-01)
================================================================================

### R14 -- VIP relocation .224-.236 -> .50-.60  (phase-01)
- The public + internal API VIPs were front-loaded out of the old high-end .224-.236
  block into .50-.60 (inside the reserved .2-.63 /26). Every bundle `vip:` is a dual
  provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider
  VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud
  consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed.
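
A sketch of such a guard, not the phase-01 1.1 block itself. It assumes each bundle line
has the form `vip: "10.12.4.5x 10.12.8.5x"` with no trailing comment; the function names
are hypothetical. Field 1 is the provider side; passing field 2 with the metal prefix
gives a scripted cross-check of the second token.

```shell
# vip_tokens BUNDLE FIELD: print the FIELD-th address of every vip: line
# (1 = provider side, 2 = metal side), with quotes stripped.
vip_tokens() {
  awk -v f="$2" '/^[ \t]*vip:/ {
    sub(/^[ \t]*vip:[ \t]*/, "")
    gsub(/[^0-9. ]/, "")
    print $f
  }' "$1"
}

# vip_guard BUNDLE FIELD PREFIX: expect exactly 11 VIPs, all PREFIX.50-PREFIX.60.
# No set -e, and the count grep ends "|| true" (a zero count is an answer).
vip_guard() {
  total=$(vip_tokens "$1" "$2" | grep -c . || true)
  in_range=$(vip_tokens "$1" "$2" | awk -F. -v p="$3" '
    ($1 "." $2 "." $3) == p && $4 >= 50 && $4 <= 60 { n++ }
    END { print n + 0 }')
  echo "total=$total in_range=$in_range"
  [ "$total" -eq 11 ] && [ "$in_range" -eq 11 ]
}

# Example (bundle file name is a placeholder):
#   vip_guard bundle.yaml 1 10.12.4 && vip_guard bundle.yaml 2 10.12.8
```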

### R15 -- the .10 phantom resolver  (phase-01)
- Symptom: an unreachable region resolver `10.12.8.10` appears in a node's resolver
  list (sometimes as Current DNS Server) despite the subnet dns_servers override.
- Cause: MAAS advertises its region/rack controller as a DNS server on the
  MAAS-managed metal VLAN, independent of the subnet field; the override does not purge it.
- Impact: NON-BLOCKING -- systemd-resolved deprioritizes .10 and falls through to .1.
  Latent fragility if .1 ever drops. Understand/eliminate for Roosevelt (no libvirt split there).

### L1 -- no `set -e` on count-gate blocks; guard greps `|| true`  (phase-01)
- A count of 0 from a guard `grep -c` is a VALID answer, not a failure, but grep exits
  with status 1 when nothing matches, so under `set -e` a zero-count grep aborts the
  block. Pre-deploy verify blocks run WITHOUT `set -e`, and every count grep ends
  `|| true`. (`bash -n` would not catch this -- it is runtime behavior, not syntax.)
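
The difference is easy to reproduce in isolation. In this sketch each variant runs in
its own `bash -e` child so the abort cannot take down the calling shell:

```shell
# Unguarded: grep -c prints 0 but exits 1, so set -e kills the child
# before it can report the count.
unguarded=$(bash -ec 'n=$(printf "a\n" | grep -c zzz); echo "count=$n"' \
  || echo "aborted rc=$?")

# Guarded: "|| true" turns the zero-match exit status into success.
guarded=$(bash -ec 'n=$(printf "a\n" | grep -c zzz || true); echo "count=$n"')

echo "unguarded: $unguarded"
echo "guarded:   $guarded"
```

The unguarded child reports `aborted rc=1`; the guarded one reports `count=0`.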

### L3 -- metal-side dual-VIP eyeball check  (phase-01)
- The provider-side VIP guard greps only the first token of each dual `vip:`. The metal
  side (second token, `10.12.8.5x`) must be eyeballed to confirm all 11 sit in .8.50-.60,
  clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju).

================================================================================
## Vault / secrets (phase-02)
================================================================================

### DOCFIX-006 -- vault init is one-shot; stdout-only redirect loses the keys  (phase-02)
- Symptom: `vault operator init ... > file` captures stdout only; if the key block went
  to stderr (or the run is interrupted) you are left with an unusable/empty file and the
  5 shares + root token are GONE -- init runs exactly once and cannot be replayed.
- Fix: `vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt`
  VERBATIM; gate on `grep -c '^Unseal Key' ~/vault-init/init.txt` printing 5 and on
  `Initial Root Token` being present; then save the file OFF-HOST before anything else.
  Never improvise this command.
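
The post-init gate (only the gate; the init command itself stays verbatim) can be
sketched as a function over the captured file. The helper name is hypothetical.

```shell
# vault_init_ok FILE: succeed only if FILE holds exactly 5 unseal-key lines
# and an Initial Root Token line. A missing file fails the gate.
vault_init_ok() {
  keys=$(grep -c '^Unseal Key' "$1" 2>/dev/null || true)
  [ "${keys:-0}" -eq 5 ] && grep -q 'Initial Root Token' "$1"
}

# Example:
#   vault_init_ok ~/vault-init/init.txt ||
#     { echo "init capture incomplete -- STOP, do not proceed" >&2; exit 1; }
```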

### DOCFIX-011 -- authorize-charm parameter is `token`  (phase-02)
- The vault `authorize-charm` action takes `token` (a direct token string); there is no
  `token-secret-id` variant in this charm rev. Confirm via `juju actions vault --schema`.
  Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log).

### DOCFIX-014 -- generate-root-ca is required  (phase-02)
- Symptom: after authorize-charm, vault stays BLOCKED "Missing CA cert".
- Fix: run `juju run vault/leader generate-root-ca` -- it mints the charm-pki-local
  root and clears the block straight to active. (Omitting it leaves vault hung.)

### L4 -- vault unseal via hidden prompt, not key-on-argv  (phase-02)
- Use Vault's own `vault operator unseal` (no argument) so it prompts hidden; the key is
  never on the command line / in a var / in `ps` / in scrollback. Do NOT use
  `vault operator unseal $KEY` (visible in `ps` on the unit). Unseal is re-runnable, so
  the verbatim-reference rule is looser here, but the security gain is real.

### R3 -- "HA Enabled false" is correct for vault-on-mysql  (phase-02)
- Expected post-unseal: Initialized true / Sealed false / Storage Type mysql /
  **HA Enabled false**. Single-unit vault on the mysql backend is non-HA by design; any
  reference to "HA Enabled true (etcd backend)" is STALE (etcd was dropped).

================================================================================
## Identity / openrc (phase-03)
================================================================================

### DOCFIX-018 -- IP-only OS_AUTH_URL  (phase-03)
- This cloud is IP-only (no FQDN, no cloud DNS). The admin openrc must point at the
  keystone PUBLIC endpoint by IP: `OS_AUTH_URL=https://10.12.4.50:5000/v3`, with the
  vault root CA in `OS_CACERT` (B5 IP-SAN certs validate). No /etc/hosts, no FQDN.

### DOCFIX-022 -- discover the admin project; do not hardcode it  (phase-03)
- Symptom: with TLS working, keystone returns HTTP 401.
- Cause: wrong project scope. The scoping project name varies by charm rev (here it is
  `admin`, living in domain `admin_domain`; an older doc's `OS_PROJECT_NAME=admin_domain`
  401s). Credential good, scope wrong.
- Fix: a candidate loop -- try each of "admin admin_domain"; the first that issues a
  SCOPED token wins (phase-03 3.2). Costs 2 extra token requests; self-corrects across
  revs instead of re-introducing the 401-by-hardcode.
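
A minimal sketch of such a candidate loop; the authoritative version is phase-03 3.2.
It assumes the rest of the openrc (auth URL, CA, user, domains) is already exported, and
that a scoped token is recognizable by a non-empty `project_id`. The function names and
the `SCOPE_PROBE` hook are hypothetical.

```shell
# try_scope PROJECT: succeed if keystone issues a token scoped to PROJECT.
try_scope() {
  [ -n "$(OS_PROJECT_NAME=$1 openstack token issue -f value -c project_id 2>/dev/null)" ]
}

# discover_admin_project: print the first candidate that yields a scoped token.
discover_admin_project() {
  for candidate in admin admin_domain; do
    if ${SCOPE_PROBE:-try_scope} "$candidate"; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Example:
#   OS_PROJECT_NAME=$(discover_admin_project) && export OS_PROJECT_NAME
```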

================================================================================
## Octavia enablement (phase-05)
================================================================================

### L7 -- the openstack snap cannot read /tmp  (phase-05, also phase-01 PKI sanity)
- Symptom: `openstack image create --file /tmp/...` -> "[Errno 2] No such file or
  directory" even though `sha256sum` just read the same path.
- Cause: the openstack CLI snap is confined and cannot read `/tmp`; it CAN read `$HOME`
  (home interface).
- Fix: stage any file the snap must read under `$HOME` (e.g. `$HOME/amphora-base/...`),
  never `/tmp`.

### octavia-configure-resources -- long-running action; o-hm0 transient is normal  (phase-05)
- `configure-resources` is long-running: juju's default action wait may time out
  ("timed out waiting for results") while the hook KEEPS RUNNING -- do NOT treat the
  wait-timeout as failure or re-fire blindly. Use a bound `--wait` and confirm completion
  via `juju show-operation <N>` (authoritative), not the streamed log.
- NORMAL (not faults) during/after: lb-mgmt-net is IPv6-ULA (fc00::/..) by design; a
  "Virtual network for access to Amphorae is down" transient self-heals as o-hm0 comes
  up; the lb-mgmt `network:distributed` port shows DOWN (logical OVN port, never chassis-bound).

### amp-image-tag-mismatch -- LP#1937003  (phase-05)
- Octavia looks up the amphora image by `octavia amp-image-tag`; it MUST equal the tag
  the retrofit stamps (`octavia-diskimage-retrofit amp-image-tag`), both `octavia-amphora`.
  A mismatch means octavia cannot find the image even though it is built and ACTIVE.
  The amphora pipeline gate asserts the two are equal before building (phase-05 5.2).
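
The equality assertion itself is small. A sketch with a hypothetical helper name; the
two `juju config` reads in the usage comment take the application names from this entry
and may need `-m <model>` depending on the current model.

```shell
# amp_tags_match OCTAVIA_TAG RETROFIT_TAG: succeed only if both tags are
# identical AND equal to the expected value octavia-amphora.
amp_tags_match() {
  [ "$1" = "$2" ] && [ "$1" = "octavia-amphora" ]
}

# Example gate before building the amphora image:
#   amp_tags_match "$(juju config octavia amp-image-tag)" \
#                  "$(juju config octavia-diskimage-retrofit amp-image-tag)" ||
#     { echo "amp-image-tag mismatch" >&2; exit 1; }
```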

================================================================================
## Notes
================================================================================
- This index covers phases 00-08. It grows the same way for any future phase: keyed by
  D-NNN / DOCFIX-NNN / L-N / R-N / named-symptom, each entry symptom -> cause -> fix
  with a "phase NN" back-reference, and decision rationale left to design-decisions.md.
- memcached track drift is recorded in appendix-B (B.1), not here (it is a
  version-lock note, not a troubleshooting entry).
