# Appendix A -- Troubleshooting / Known-Issues Index

Keyed by the same `D-NNN` / `DOCFIX-NNN` / `L-P6-N` identifiers used inline in the
phase runbooks. This is an OPERATIONAL index (symptom -> cause -> fix), NOT the
decision log: full rationale lives in `design-decisions.md` and the per-decision
files (`D-0NN-*.md`); the driver fix has its own `magnum-capi-helm-driver-fix-runbook`.
Each entry notes the phase(s) that reference it. ASCII-only.

================================================================================
## Remote execution / scripting
================================================================================

### DOCFIX-021 -- heredoc / stdin consumption  (phase-06, phase-07)
- Symptom: a multi-line `juju ssh`/`ssh ... bash -s` or remote `sudo` block dies
  early or behaves as if truncated; later commands in the heredoc never run.
- Cause: an inner `ssh`/`sudo`/`juju ssh` (or any stdin reader) consumes the rest
  of the heredoc/pipe that was feeding the outer command.
- Fix: append `</dev/null` to every inner `ssh`/`sudo`/`juju ssh` invocation
  (use `</dev/tty` instead only when the call genuinely needs an interactive prompt).
- Also: wrap multi-statement pasteable jumphost blocks in `( { ...; } )` so a stray
  `exit` cannot kill the interactive shell.
- SECOND MANIFESTATION (phase-03): a charm ACTION's human output silently corrupts a
  captured artifact. `juju run vault/leader get-root-ca` wraps the PEM in an INDENTED
  YAML `output: |-` block; `sed`-by-marker preserves the indent and an indented
  `-----BEGIN CERTIFICATE-----` is not valid PEM -> openssl "Unable to load
  certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND. Fix: pull from the action JSON
  (real newlines, no indent): `juju run vault/leader get-root-ca -m openstack
  --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'`.
  (Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)
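
The stdin-consumption failure at the top of this entry can be reproduced without any remote host. A minimal local sketch, assuming `cat` as a stand-in for the inner `ssh`/`sudo`/`juju ssh` (any stdin reader behaves the same way); the variable names are illustrative only:

```shell
# `cat` stands in for an inner ssh/sudo/juju ssh that reads stdin.

# Broken: the inner reader swallows the rest of the script fed to `bash -s`.
broken=$(bash -s <<'EOF'
echo first
cat >/dev/null
echo second
EOF
)

# Fixed: the inner reader gets /dev/null, so the outer script keeps its stdin.
fixed=$(bash -s <<'EOF'
echo first
cat </dev/null >/dev/null
echo second
EOF
)

printf 'broken run printed:\n%s\n' "$broken"
printf 'fixed run printed:\n%s\n' "$fixed"
```

In the broken run the second `echo` is consumed as input by the inner reader and never executes, which is the truncated-heredoc symptom; with `</dev/null` both lines run.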

### L-P6-4 -- admin-kubeconfig / secret transfer  (phase-07)
- Risk: staging the cluster-admin kubeconfig (or any secret) in `/tmp`, or letting a
  PTY mangle it in transit.
- Fix: pipe base64 straight into a root-written file with `umask 077`, then `chown`
  to the service user and `chmod 0600` -- never touch `/tmp`. (Pattern in phase-07 7.2.)
- Hardening (Roosevelt): replace the cluster-admin kubeconfig with a scoped
  ServiceAccount kubeconfig carrying only the RBAC the driver needs.
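
A minimal sketch of the write side of that pattern, not the phase-07 7.2 block itself. `install_secret` is a hypothetical helper name; it expects the base64 stream on stdin and is meant to run as root on the target.

```shell
# install_secret DEST: decode base64 from stdin into DEST.
# The file is created under umask 077, so it is never group/world readable,
# and nothing is staged in /tmp on the way.
install_secret() {
    dest=$1
    (
        umask 077
        base64 -d > "$dest"
    ) || return 1
    # Also covers the case where DEST already existed with wider permissions.
    chmod 0600 "$dest"
}
```

After the write, `chown` the file to the service user as described in the fix; the destination path and the user are deployment-specific and deliberately not shown here.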

================================================================================
## k8s-snap bootstrap (mgmt cluster)
================================================================================

### DOCFIX-024 -- bootstrap config missing the cluster-config block  (phase-06)
- Symptom: `k8s bootstrap` "succeeds" but the node never reaches Ready; network and
  DNS are silently disabled; CoreDNS/Cilium absent.
- Cause: a bootstrap `--file` whose top level lacks a `cluster-config:` block leaves
  ALL features (network, dns, ...) at disabled defaults. Setting only `pod-cidr` /
  `service-cidr` / `extra-sans` does NOT enable them.
- Fix: include an explicit block:
      cluster-config:
        network: { enabled: true }
        dns:     { enabled: true }
  (See phase-06 6.4 for the full config.) Retry: `snap remove k8s --purge` then re-bootstrap.
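
A pre-flight sketch for the cause above, assuming the bootstrap file keeps `cluster-config:` unindented at column 0 (as in the fix). `has_cluster_config` and `check_bootstrap_file` are hypothetical helper names. The check only proves the block exists, not that `network` and `dns` are enabled inside it.

```shell
# has_cluster_config FILE: succeed only if FILE has a top-level `cluster-config:` key.
has_cluster_config() {
    grep -q '^cluster-config:' "$1"
}

# check_bootstrap_file FILE: human-readable wrapper, suitable as a gate
# before `k8s bootstrap --file FILE`.
check_bootstrap_file() {
    if has_cluster_config "$1"; then
        echo "ok: cluster-config block present in $1"
    else
        echo "FAIL: no top-level cluster-config block in $1" >&2
        return 1
    fi
}
```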

================================================================================
## CAPI provider install (mgmt cluster)
================================================================================

### DOCFIX-025a -- cert-manager Helm flag  (phase-06)
- Symptom: cert-manager install fails / CRDs absent when using `--set installCRDs=true`.
- Cause: `installCRDs` was removed from the cert-manager chart (~v1.18). The current
  flag is `crds.enabled=true`.
- Fix: `helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true`.

### D-034 -- CAPI install ordering (ORC before clusterctl init)  (phase-06)
- Symptom: after `clusterctl init`, `capo-controller-manager` CrashLoopBackOff
  (observed ~6 restarts / ~15 min) before self-healing.
- Cause: CAPO v0.14.4's `openstackserver` controller hard-depends on ORC's
  `Image.openstack.k-orc.cloud` CRD at startup. `clusterctl init` installs CAPO; if
  ORC is not yet present, CAPO crash-loops until it appears.
- Fix: install ORC (its manifest provides the `Image` CRD) BEFORE `clusterctl init`.
  Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor.
- Related rule: source every provider version from the chosen `capi-helm-charts`
  tag's `dependencies.json` (read live with `jq`); do not hardcode semver.
  (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.)

================================================================================
## Networking / pod egress
================================================================================

### D-035 -- dual-homed mgmt node pod-egress reverse-path failure  (phase-06)
- Symptom (the prior D-033 architecture): a pod's egress TCP connect to an external
  VIP hangs; the agnhost probe never reaches Completed. SYN leaves the correct NIC and
  the SYN-ACK arrives, but the reply is emitted back out the NIC instead of being
  redirected into the pod via `cilium_host` -- silent, asymmetric breakage. (The
  "do-07 pattern.")
- Cause: Cilium reverse-path handling on a node with multiple NICs.
- Fix (chosen): D-035 single-homed in-cloud tenant VM avoids it entirely; phase-06
  GATE 2 (agnhost pod -> Keystone VIP, must Complete) is the explicit proof. (The
  transferable alternative -- Cilium device pinning -- is a Roosevelt note, not v1.)

================================================================================
## Magnum conductor
================================================================================

### D-037 -- conductor config-dir injection (NOT a systemd ExecStart drop-in)  (phase-07)
- Symptom: the `[capi_helm]` conf.d drop-in is ignored; the conductor behaves as if it
  was never written, even though a systemd drop-in "looks" applied.
- Cause: these OpenStack debs (openstack-pkg-tools) run the daemon through an LSB init
  script that systemd calls with the `systemd-start` action, NOT a direct daemon
  `ExecStart=`. A systemd drop-in appending `--config-dir` passes it as a positional arg
  to the init script, which ignores it -- the flag never reaches the daemon. The args are
  assembled inside the init script from `DAEMON_ARGS` (base `--config-file` first),
  extensible only via `/etc/default/<service>`.
- Fix: create `/etc/default/magnum-conductor` (0644; the charm does not manage it):
      DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"
  Verify with the init script's own `show-args` (dry-run) AND `ps -ww -C
  magnum-conductor -o args` on the live process -- behavioral, not string-presence.
- Residual: if a future charm hook ever writes `/etc/default/magnum-conductor`, the
  append is lost and `[capi_helm]` silently stops being read. Re-check via show-args/ps.
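
One way to keep the live-process check behavioral rather than string-presence is to require `--config-dir` as a real argument of the launched command line. A sketch only: `cmdline_has_config_dir` is a hypothetical helper name, and accepting the `--config-dir=DIR` spelling is an assumption.

```shell
# cmdline_has_config_dir DIR: read process command lines on stdin and succeed
# if any of them carries `--config-dir DIR` (or `--config-dir=DIR`) as arguments.
cmdline_has_config_dir() {
    awk -v dir="$1" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "--config-dir=" dir) found = 1
                if ($i == "--config-dir" && $(i + 1) == dir) found = 1
            }
        }
        END { exit !found }
    '
}

# Intended use on the magnum unit (not executed here):
#   ps -ww -C magnum-conductor -o args= \
#     | cmdline_has_config_dir /etc/magnum/magnum.conf.d \
#     || echo "FAIL: conductor is running without the conf.d directory" >&2
```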

### L-P6-1 / L-P6-2 -- verify the launched cmdline, not the unit text  (phase-07)
- Rule: never assume the systemd `ExecStart` shape for OpenStack debs, and never treat
  "string present in the unit file" as "the daemon received the flag." Gate on the
  assembled/launched cmdline (`show-args`, then `ps` on the live process).

### DOCFIX-035 -- helm not on the conductor's PATH  (phase-07)
- Symptom: the magnum-capi-helm driver fails shelling out to `helm` (cluster create errors on a
  helm invocation), yet `command -v helm` in an interactive `juju ssh magnum/0` shell finds it.
- Cause: the conductor runs via an LSB init script (systemd `systemd-start`) with the restricted
  init PATH (e.g. `/usr/sbin:/usr/bin:/sbin:/bin`), which EXCLUDES `/usr/local/bin` -- where a
  get.helm.sh tarball install lands. An interactive login shell has `/usr/local/bin` on PATH, so
  it masks the problem (the classic green-in-the-shell, broken-in-the-daemon trap).
- Fix: install the binary to `/usr/local/bin/helm` AND symlink `/usr/bin/helm -> it` (`/usr/bin`
  IS on the restricted PATH). Checksum-verify the tarball (sha256 vs get.helm.sh `.sha256sum`)
  before install. VERIFY against the restricted PATH, not a login shell:
  `env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm && helm version --short'`
  must print `/usr/bin/helm` (phase-07 7.4).

### L-P6-3 -- k8s version comes from the IMAGE, not a template label  (phase-08)
- Symptom: cluster create fails in the driver before provisioning.
- Cause: the magnum-capi-helm driver reads `kube_version` from the Glance image
  properties and routes on `os_distro`; it does NOT take k8s version from a template
  label.
- Fix: the workload image (e.g. `ubuntu-jammy-kube-v1.34.8`) MUST carry
  `kube_version` (e.g. v1.34.8, matching the image) and `os_distro=ubuntu`. Verify
  before create (phase-08 8.0).

================================================================================
## Driver / cluster health
================================================================================

### D-042 -- driver contract-coherence; health "infrastructure: not found"  (phase-07, phase-08, appendix-B)
- Symptom: `coe cluster show` reports `health_status = UNHEALTHY` deterministically
  (survives a conductor restart); only the `infrastructure` sub-check fails
  ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready.
- Cause: driver 1.3.0 reads `apiVersion` off `spec.infrastructureRef` to build its
  health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with
  NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the
  resource versions); only the driver's direct health query breaks.
- Fix: upgrade to the RELEASED `magnum-capi-helm==1.4.0` (the "generalize-api-resources"
  feature). 1.4.0 builds each health GET from an explicit api_version via its
  `[capi_helm] api_resources` option, which DEFAULTS to v1beta1 for every CAPI kind --
  and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override
  needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only.
  Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the
  Layer-A CAPI core.
- Operational caveat while unfixed: do NOT wire magnum auto-healing to `health_status`
  (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently.

================================================================================
## Cluster lifecycle / Octavia
================================================================================

### D-039 -- app-cred roles (load-balancer_member) / Octavia 403  (phase-08)
- Symptom: cluster create or delete wedges; CAPO gets 403 querying the Octavia LB.
- Cause: the Magnum-minted application credential lacks `load-balancer_member`
  (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state).
- Fix: ensure the service path mints app-creds carrying `load-balancer_member`
  (+ member, reader). Verify before acceptance (phase-08 prereqs).

### stuck-delete -- wedged CAPI cluster delete  (phase-08)
- Symptom: cluster stuck `DELETE_IN_PROGRESS`; helm release already gone; `Cluster`
  and `OpenStackCluster` CRs stuck Deleting (often on an Octavia 403, see D-039).
- Recovery: clear the `OpenStackCluster` finalizer on the mgmt cluster --
  `kubectl -n <magnum-ns> patch openstackcluster <cluster>-<suffix> --type=merge
  -p '{"metadata":{"finalizers":[]}}'`. The `Cluster` finalizer was only waiting on it,
  so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron
  resources in dependency order: router remove subnet -> router unset external-gateway
  -> router delete -> subnet delete -> network delete -> security group delete.
- Name-guard (FINDING-4): NEVER patch/delete a CR by an inferred name. The OpenStackCluster is
  named `<cluster>-<CAPI-suffix>` where the suffix is random per create (NOT the Magnum cluster
  name). LIST first -- `kubectl -n <magnum-ns> get openstackcluster` -- and operate on the EXACT
  name returned. The magnum-ns is `magnum-<project-id>` (resolve the project id; never hardcode).
  A wrong-name patch silently no-ops and the delete stays wedged.
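
The name-guard can be made mechanical: list, then refuse to act unless exactly one CR matches. A sketch under assumptions: `resolve_exact_cr` is a hypothetical helper, and it treats any name starting with `<cluster>-` as a candidate, so two clusters sharing a prefix are rejected rather than guessed at.

```shell
# resolve_exact_cr PREFIX: read `kubectl get ... -o name` output on stdin and
# print the single resource name that starts with "PREFIX-".
# Refuses (non-zero exit, nothing printed) on zero or multiple candidates.
resolve_exact_cr() {
    prefix=$1
    matches=$(sed 's|^.*/||' | awk -v p="${prefix}-" 'index($0, p) == 1')
    count=$(printf '%s\n' "$matches" | grep -c . || true)
    if [ "$count" -ne 1 ]; then
        echo "refusing: $count candidates for '${prefix}-*'; list and pick by hand" >&2
        return 1
    fi
    printf '%s\n' "$matches"
}

# Intended use on the mgmt cluster (not executed here):
#   name=$(kubectl -n "$ns" get openstackcluster -o name | resolve_exact_cr "$cluster") \
#     && kubectl -n "$ns" patch openstackcluster "$name" --type=merge \
#          -p '{"metadata":{"finalizers":[]}}'
```

`$ns` and `$cluster` are placeholders; resolve the namespace from the project id as described above.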

### LB-failover -- LB stuck provisioning_status=ERROR after a host event  (phase-08)
- Symptom: the kube-api Octavia LB shows `operating_status ONLINE` but
  `provisioning_status ERROR` after a host outage/OOM.
- Cause: a control-plane op on the amphora failed during the outage.
- Fix: `openstack loadbalancer failover <lb-id>` in ADMIN-project scope (amphora /
  failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE
  (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE.

### uninitialized-taint -- workload addons Pending  (phase-08)
- Symptom: new workload nodes are kubelet-Ready but addon pods (metrics-server,
  node-feature-discovery, etc.) stay Pending; nodes carry
  `node.cluster.x-k8s.io/uninitialized`.
- Cause: that taint is removed by the CAPI machine controller on the MANAGEMENT
  cluster. If the mgmt cluster is down (see D-041), the taint persists.
- Fix: restore the mgmt cluster API; CAPI then removes the taint and addons schedule.

### CNI-label / DOCFIX-032 -- network_driver under driver 1.4.0; pin calico explicitly  (phase-08)
- Note: under the as-FIRST-built driver 1.3.0 the legacy Magnum `network_driver` label was
  IGNORED and the capi-helm `openstack-cluster` chart's default CNI (Calico) always ran. Under
  the RELEASED 1.4.0 driver the `network_driver` template option IS honored (it maps through to
  the chart `network_driver`).
- DOCFIX-032: pin `--network-driver calico` EXPLICITLY on the `capi-k8s-v1-34` template
  (phase-08) rather than relying on the default staying Calico. Chart 0.25.1 ships ONLY Calico
  (flannel is not packaged), so `flannel` there would fail to converge -- do not set it. (Mgmt
  cluster CNI is separately Cilium, via k8s-snap.)

================================================================================
## Hyperconverged host / mgmt-VM resilience
================================================================================

### D-040 -- host OOM from low reserved-host-memory  (phase-08)
- Symptom: guests OOM-killed; a compute host may even present in Juju as
  `State=down` (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent).
- Cause: `reserved-host-memory` default 512 MB does not cover the co-located
  LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM.
- Fix: `reserved-host-memory = 8192` on all compute units (baked into the hardened
  bundle). Diagnose a suspected OOM-vs-reboot with `who -b` / `uptime` (no recent boot)
  and `journalctl -k | grep -i oom`; the ovsdb "no response to inactivity probe ...
  disconnecting" storm is the swap-thrash signature.

### D-041 -- single-node mgmt cluster does not self-heal  (phase-08)
- Symptom: after a host event the mgmt VM (`capi-mgmt-v2`) is SHUTOFF; FIP
  unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see
  uninitialized-taint).
- Cause: the D-035 single-node mgmt cluster is a SPOF with no MachineHealthCheck
  (unlike the workload cluster).
- Fix: `openstack server start capi-mgmt-v2` (API serves ~40s later; a brief TLS
  handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for
  Roosevelt.

### juju-macaroon -- "cannot get discharge ... EOF"  (phase-07, phase-08)
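- Symptom: `juju ssh` (or another juju call) fails mid-session with a discharge/EOF error.
- Cause: the juju macaroon expired during a long session.
- Fix: re-run `juju login`, then retry. (When the failing call carries DOCFIX-021's
  `</dev/null`, see the 2026-06-10 addendum entry on the expired-macaroon interaction.)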

================================================================================
## Teardown / MAAS reset (phase-00)
================================================================================

### DOCFIX-016 -- never `maas list` (API-key leak)  (phase-00, phase-01, phase-04)
- Risk: `maas list` prints the stored API key to stdout (and into any transcript/log).
- Fix: the profile name is known (`admin`); call `maas admin ...` directly. Never run
  `maas list` in a runbook or paste block.

### DOCFIX-017 -- no `maas whoami`; hardcode the eyeballed system_ids  (phase-00)
- Risk: scripting machine selection via `maas <profile> whoami` + owner filters is
  fragile and, in this lab, unnecessary.
- Fix: the four host system_ids are fixed and eyeball-verified
  (openstack0=4na83t, openstack1=qdbqd6, openstack2=h8frng, openstack3=tmsafc) --
  iterate those literals. (The older 01-destroy-model.md used `maas list`/`whoami` and
  released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.)

### R7 -- sudo for libvirt / qemu-img  (phase-00, phase-01)
- The OSD qcow2 files (`/var/lib/libvirt/images/<host>-1.qcow2`) are root:root / 600;
  `qemu-img info|create`, `virsh domstate`, `stat`, `rm` against them all need `sudo`.

### KI-P3-001 -- VIP / primary collision  (phase-00, phase-04)
- Symptom: a charm `vip:` address equals a MAAS-auto-assigned machine/container
  primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary).
- Cause: the VIP block was not excluded from MAAS auto-static allocation (provider had
  NO VIP reservation), so MAAS handed primaries .225/.226/.227 onto the .224-.236 VIPs.
- Fix (durable): on EVERY space carrying VIPs (provider AND metal) reserve the
  front-loaded VIP /26 in MAAS, distinct from the primary range and any neutron
  allocation_pool (phase-00 Phase 4). A reserved range stops future auto-assign onto
  a configured VIP. Negative test post-deploy: no service vip == any unit primary.
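
The post-deploy negative test reduces to a set intersection. A sketch, assuming the service VIPs and the unit primaries have already been extracted into two files with one IPv4 address per line (how they are extracted is left open); both function names are hypothetical.

```shell
# vip_collisions VIPS_FILE PRIMARIES_FILE: print every address present in both.
# -F and -x force whole-line literal matches, so 10.12.4.22 never matches 10.12.4.226.
vip_collisions() {
    grep -Fx -f "$1" "$2" | sort -u
}

# assert_no_vip_collision VIPS_FILE PRIMARIES_FILE: gate form of the same check.
assert_no_vip_collision() {
    hits=$(vip_collisions "$1" "$2")
    if [ -n "$hits" ]; then
        echo "FAIL: service VIP equals a unit primary:" >&2
        printf '%s\n' "$hits" >&2
        return 1
    fi
    echo "ok: no service VIP equals any unit primary"
}
```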

### DEVIATION-2 -- raise a KVM host's RAM, then MAAS-recommission to Ready  (phase-00)
- Context (2026-06-11): the openstack0-3 KVM guests were bumped 16384 -> 32768 MiB on the 196 GB
  hypervisor to relieve memory pressure. Pattern: with the guest SHUT OFF (and after the OSD
  wipe), `virsh setmaxmem <dom> 32G --config` then `virsh setmem <dom> 32G --config`; boot; then
  MAAS RECOMMISSION the node so MAAS re-reads hardware and lands it back at Ready at the new size
  (4x Ready at 32768 in ~3 min). Do the maxmem change while shut off -- a live setmaxmem is rejected.
- D-040 `reserved-host-memory 8192` is RETAINED (correctness floor, independent of host size).
  Re-measure the per-host container/service footprint against the 32 GiB envelope before the
  Roosevelt node-role split -- 16 GiB-era pressure numbers do not map 1:1.

================================================================================
## Deploy-time (phase-01)
================================================================================

### R14 -- VIP relocation .224-.236 -> .50-.60  (phase-01)
- The public + internal API VIPs were front-loaded out of the old high-end .224-.236
  block into .50-.60 (inside the reserved .2-.63 /26). Every bundle `vip:` is a dual
  provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider
  VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud
  consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed.

### R15 -- the .10 phantom resolver  (phase-01)
- Symptom: an unreachable region resolver `10.12.8.10` appears in a node's resolver
  list (sometimes as Current DNS Server) despite the subnet dns_servers override.
- Cause: MAAS advertises its region/rack controller as a DNS server on the
  MAAS-managed metal VLAN, independent of the subnet field; the override does not purge it.
- Impact: NON-BLOCKING -- systemd-resolved deprioritizes .10 and falls through to .1.
  Latent fragility if .1 ever drops. Understand/eliminate for Roosevelt (no libvirt split there).

### L1 -- no `set -e` on count-gate blocks; guard greps `|| true`  (phase-01)
- A guarded `grep -c` returning 0 is a VALID answer, not a failure. Under `set -e` a
  zero-count grep aborts the block. Pre-deploy verify blocks run WITHOUT `set -e`, and
  every count grep ends `|| true`. (`bash -n` would not catch this -- it is behavior.)

### L3 -- metal-side dual-VIP eyeball check  (phase-01)
- The provider-side VIP guard greps only the first token of each dual `vip:`. The metal
  side (second token, `10.12.8.5x`) must be eyeballed to confirm all 11 sit in .8.50-.60,
  clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju).
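
As an aid to the eyeball check (not a replacement for it), the second token can be range-checked mechanically. A sketch, assuming each bundle line has the shape `vip: "<provider-ip> <metal-ip>"` with optional quotes; both function names are hypothetical.

```shell
# metal_vip_count FILE: number of `vip:` lines in the bundle (expected: 11).
metal_vip_count() {
    awk '$1 == "vip:" { c++ } END { print c + 0 }' "$1"
}

# metal_vips_out_of_range FILE: print file:line and the metal-side token for
# every `vip:` line whose second address is NOT inside 10.12.8.50-.60.
# A missing second token is reported too (as an empty address).
metal_vips_out_of_range() {
    awk '
        $1 == "vip:" {
            metal = $3
            gsub(/[^0-9.]/, "", metal)        # strip quotes and stray punctuation
            n = split(metal, o, ".")
            if (n != 4 || o[1] != 10 || o[2] != 12 || o[3] != 8 || o[4] < 50 || o[4] > 60)
                print FILENAME ":" FNR ": " metal
        }
    ' "$1"
}
```

Like the other count gates, run it without `set -e`: empty output plus a count of 11 is the pass condition.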

================================================================================
## Vault / secrets (phase-02)
================================================================================

### DOCFIX-006 -- vault init is one-shot; stdout-only redirect loses the keys  (phase-02)
- Symptom: `vault operator init ... > file` captures stdout only; if the key block went
  to stderr (or the run is interrupted) you are left with an unusable/empty file and the
  5 shares + root token are GONE -- init runs exactly once and cannot be replayed.
- Fix: `vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt`
  VERBATIM; gate on `grep -c '^Unseal Key' == 5` and `Initial Root Token` present; then
  save the file OFF-HOST before anything else. Never improvise this command.
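
The capture gate can be written as a small function that runs against the `tee` output. A sketch only: `gate_vault_init` is a hypothetical name, and it checks only the two conditions named in the fix.

```shell
# gate_vault_init FILE: pass only if FILE holds exactly 5 unseal-key lines
# and an "Initial Root Token" line. Prints nothing secret.
gate_vault_init() {
    f=$1
    keys=$(grep -c '^Unseal Key' "$f" || true)
    if [ "$keys" != "5" ]; then
        echo "FAIL: expected 5 unseal keys in $f, found ${keys:-none}" >&2
        return 1
    fi
    if ! grep -q 'Initial Root Token' "$f"; then
        echo "FAIL: no Initial Root Token line in $f" >&2
        return 1
    fi
    echo "ok: 5 unseal keys and a root token captured in $f"
}
```

A pass means the capture is complete, not that it is safe: copying the file off-host is still the next step.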

### DOCFIX-011 -- authorize-charm parameter is `token`  (phase-02)
- The vault `authorize-charm` action takes `token` (a direct token string); there is no
  `token-secret-id` variant in this charm rev. Confirm via `juju actions vault --schema`.
  Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log).

### DOCFIX-014 -- generate-root-ca is required  (phase-02)
- Symptom: after authorize-charm, vault stays BLOCKED "Missing CA cert".
- Fix: run `juju run vault/leader generate-root-ca` -- it mints the charm-pki-local
  root and clears the block straight to active. (Omitting it leaves vault hung.)

### L4 -- vault unseal via hidden prompt, not key-on-argv  (phase-02)
- Use Vault's own `vault operator unseal` (no argument) so it prompts hidden; the key is
  never on the command line / in a var / in `ps` / in scrollback. Do NOT use
  `vault operator unseal $KEY` (visible in `ps` on the unit). Unseal is re-runnable, so
  the verbatim-reference rule is looser here, but the security gain is real.

### R3 -- "HA Enabled false" is correct for vault-on-mysql  (phase-02)
- Expected post-unseal: Initialized true / Sealed false / Storage Type mysql /
  **HA Enabled false**. Single-unit vault on the mysql backend is non-HA by design; any
  reference to "HA Enabled true (etcd backend)" is STALE (etcd was dropped).

================================================================================
## Identity / openrc (phase-03)
================================================================================

### DOCFIX-018 -- IP-only OS_AUTH_URL  (phase-03)
- This cloud is IP-only (no FQDN, no cloud DNS). The admin openrc must point at the
  keystone PUBLIC endpoint by IP: `OS_AUTH_URL=https://10.12.4.50:5000/v3`, with the
  vault root CA in `OS_CACERT` (B5 IP-SAN certs validate). No /etc/hosts, no FQDN.

### DOCFIX-022 -- discover the admin project; do not hardcode it  (phase-03)
- Symptom: with TLS working, keystone returns HTTP 401.
- Cause: wrong project scope. The scoping project name varies by charm rev (here it is
  `admin`, living in domain `admin_domain`; an older doc's `OS_PROJECT_NAME=admin_domain`
  401s). Credential good, scope wrong.
- Fix: a candidate loop -- try each of "admin admin_domain"; the first that issues a
  SCOPED token wins (phase-03 3.2). Costs 2 extra token requests; self-corrects across
  revs instead of re-introducing the 401-by-hardcode.
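
The candidate loop has roughly this shape. It is a sketch, not the phase-03 3.2 block: both function names are hypothetical, the rest of the openrc (auth URL, user, domains, `OS_CACERT`) is assumed to be exported already, and a successful `openstack token issue` is taken as the signal that the scope is valid.

```shell
# try_scoped_token PROJECT: succeed if keystone issues a token with
# OS_PROJECT_NAME set to PROJECT for this one call.
try_scoped_token() {
    OS_PROJECT_NAME=$1 openstack token issue >/dev/null 2>&1
}

# pick_admin_project CANDIDATE...: print the first candidate that yields a token.
pick_admin_project() {
    for candidate in "$@"; do
        if try_scoped_token "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no candidate project issued a scoped token: $*" >&2
    return 1
}

# Intended use (not executed here):
#   OS_PROJECT_NAME=$(pick_admin_project admin admin_domain) && export OS_PROJECT_NAME
```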

================================================================================
## Core services: HAProxy + reverse-proxy (phase-03)
================================================================================

### D-045 / DOCFIX-031 -- juju "active/idle" but an haproxy backend is DOWN  (phase-03)
- Symptom: `juju status` is all active/idle, yet a service VIP intermittently 503s or a unit's
  API is unreachable. juju health is BLIND to per-backend haproxy state.
- Cause: a charm-rendered haproxy backend can be silently DOWN without the charm going non-idle
  -- e.g. (D-045) haproxy was NOT reloaded after the TLS/cert cascade, so its health checks ran
  plaintext against an SSL backend and marked it DOWN. juju-green is necessary, not sufficient.
- Fix: sweep haproxy's OWN verdict on every unit via its admin socket, then remediate+reload.
  Per unit, read `/var/run/haproxy/admin.sock` (`show stat`) and `grep ',DOWN,'` (excluding the
  FRONTEND/BACKEND summary rows). For any flagged unit: `sudo haproxy -c -f
  /etc/haproxy/haproxy.cfg` (must say valid) then `sudo systemctl reload haproxy` (graceful
  master-worker; reload, not restart). Phase-03 3.1 gates on a zero-DOWN sweep cloud-wide --
  it closes the juju-green-but-backend-DOWN hole.
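
The per-unit sweep can be sketched as a filter over the `show stat` CSV. It mirrors the `grep ',DOWN,'` approach but finds the `status` column from the CSV header, falling back to field 18 if no header is seen. `haproxy_down_servers` is a hypothetical name, and the use of `socat` to reach the socket in the usage comment is an assumption.

```shell
# haproxy_down_servers: read `show stat` CSV on stdin and print
# "<proxy>/<server>" for every real server row whose status is DOWN.
# FRONTEND/BACKEND summary rows are skipped.
haproxy_down_servers() {
    awk -F, '
        BEGIN { col = 18 }
        /^#/ {
            for (i = 1; i <= NF; i++) if ($i == "status") col = i
            next
        }
        NF < col { next }
        $2 == "FRONTEND" || $2 == "BACKEND" { next }
        $col == "DOWN" { print $1 "/" $2 }
    '
}

# Intended use on a unit (not executed here; socat is an assumption):
#   echo "show stat" | sudo socat stdio /var/run/haproxy/admin.sock | haproxy_down_servers
```

Empty output on every unit is the zero-DOWN condition; any line printed names the backend server to remediate before the reload.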

### nginx-reverse-proxy -- jumphost -> internal-VIP proxy gotchas  (phase-03)
- Context: the jumphost reaches internal-only dashboards/APIs via an nginx reverse proxy
  (phase-03 3.3). Four traps, each with the as-built fix:
- reload race: a `systemctl reload nginx` right after editing the vhost can be served by a
  still-draining old worker (a curl ~2s later hits stale behavior; the co-hosted MAAS proxy
  blips too). `nginx -t` FIRST; prefer `restart` for a definitive cutover when the listen/upstream
  set changed, reload only for content-equivalent edits.
- proxy_ssl_name / SNI: the upstream presents a DNS-SAN cert (a juju-internal name, e.g.
  `juju-ffe3b8-2-lxd-2`); set `proxy_ssl_name` to that SAN, `proxy_ssl_verify on`, and the vault
  CA in `proxy_ssl_trusted_certificate`, or verification fails on the IP-only connect.
- sed no-op: a `sed -i` that does not match silently changes nothing and the proxy keeps the old
  behavior -- assert the post-edit content, do not trust sed's exit code.
- scheme-mismatch redirect loop: the backend issues `https://` Location headers while the proxy
  listens `http`; without `proxy_redirect https:// http://` (or a matching listen scheme) the
  browser loops. Match the scheme end-to-end or rewrite the redirect.
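
For the sed no-op trap, the edit and the assertion can be paired so a non-matching expression fails loudly. A sketch: `edit_and_assert` is a hypothetical helper, `sed -i` is the GNU form already used above, and the expected text is matched as a fixed string.

```shell
# edit_and_assert FILE SED_EXPR EXPECTED: apply SED_EXPR in place, then
# require that EXPECTED (a literal string) is present in FILE.
edit_and_assert() {
    file=$1 expr=$2 expected=$3
    sed -i "$expr" "$file" || return 1
    if ! grep -qF -- "$expected" "$file"; then
        echo "FAIL: '$expected' not in $file after edit (sed matched nothing?)" >&2
        return 1
    fi
}
```

Follow a successful edit with `nginx -t` before any reload or restart, as in the reload-race item.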

================================================================================
## Octavia enablement (phase-05)
================================================================================

### L7 -- the openstack snap cannot read /tmp  (phase-05, also phase-01 PKI sanity)
- Symptom: `openstack image create --file /tmp/...` -> "[Errno 2] No such file or
  directory" even though `sha256sum` just read the same path.
- Cause: the openstack CLI snap is confined and cannot read `/tmp`; it CAN read `$HOME`
  (home interface).
- Fix: stage any file the snap must read under `$HOME` (e.g. `$HOME/amphora-base/...`),
  never `/tmp`.

### octavia-configure-resources -- long-running action; o-hm0 transient is normal  (phase-05)
- `configure-resources` is long-running: juju's default action wait may time out
  ("timed out waiting for results") while the hook KEEPS RUNNING -- do NOT treat the
  wait-timeout as failure or re-fire blindly. Use a bounded `--wait` and confirm completion
  via `juju show-operation <N>` (authoritative), not the streamed log.
- NORMAL (not faults) during/after: lb-mgmt-net is IPv6-ULA (fc00::/..) by design; a
  "Virtual network for access to Amphorae is down" transient self-heals as o-hm0 comes
  up; the lb-mgmt `network:distributed` port shows DOWN (logical OVN port, never chassis-bound).

### amp-image-tag-mismatch -- LP#1937003  (phase-05)
- Octavia looks up the amphora image by `octavia amp-image-tag`; it MUST equal the tag
  the retrofit stamps (`octavia-diskimage-retrofit amp-image-tag`), both `octavia-amphora`.
  A mismatch means octavia cannot find the image even though it is built and ACTIVE.
  The amphora pipeline gate asserts the two are equal before building (phase-05 5.2).

================================================================================
## Image seeding (phase-05/06/08)
================================================================================

### FINDING-3 -- azimuth CDN 403s glance web-download; stage-and-verify is canonical  (phase-06, phase-08)
- Symptom: a glance web-download import (`--import-method web-download`) 202-accepts, then the
  image hangs in `queued` forever and never reaches `active`.
- Cause: glance's web-download plugin fetches with urllib (User-Agent `Python-urllib/3.x`); the
  azimuth-images CDN (`azimuth-images.stackhpc.cloud`) returns HTTP 403 to that UA. A curl/HEAD
  probe with a different UA passes -- which is why an earlier probe false-passed while the real
  import failed.
- Fix (canonical): STAGE-AND-VERIFY. curl the qcow2 to `$HOME` (snap-readable, NOT /tmp -- L7;
  curl's UA is not blocked), verify the checksum against the published manifest (azimuth-images
  manifest.json -- sha512 for kube images; the ubuntu cloud-images SHA256SUMS for noble), then
  `openstack image create --file --import` (the openstack snap's `--import` == glance-direct;
  image-conversion lands it `raw`). CORRECTION-1: a plain `--file` PUT (no `--import`) stores
  qcow2 -- fine for boot, but `--import` gives the raw Ceph fast-clone alignment.
- Clear a stuck record before retry: gated `openstack image delete <id>` on the `queued` remnant
  (verify the EXACT id first -- FINDING-4 name-guard discipline).
- Roosevelt: unify ALL image seeding (amphora base, noble mgmt, kube) on stage-and-verify for one
  provenance-verified path cloud-wide.
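
The verify step of stage-and-verify, sketched as a function. `verify_digest` is a hypothetical helper; pulling the expected digest out of manifest.json or SHA256SUMS is left to the caller, and only sha256 and sha512 are handled because those are the two named above.

```shell
# verify_digest ALGO FILE EXPECTED_HEX: recompute the digest of FILE and
# compare it with the published value. ALGO is sha256 or sha512.
verify_digest() {
    algo=$1 file=$2
    expected=$(printf '%s' "$3" | tr 'A-F' 'a-f')   # tolerate upper-case manifests
    case $algo in
        sha256) tool=sha256sum ;;
        sha512) tool=sha512sum ;;
        *) echo "unsupported algorithm: $algo" >&2; return 2 ;;
    esac
    actual=$("$tool" "$file" 2>/dev/null | awk '{ print $1 }')
    if [ -z "$actual" ] || [ "$actual" != "$expected" ]; then
        echo "FAIL: $algo mismatch for $file" >&2
        return 1
    fi
    echo "ok: $algo matches for $file"
}
```

Gate the `openstack image create` on this returning 0, with the staged file under `$HOME` per L7.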

### web-download -- tested ALTERNATIVE to stage-and-verify  (phase-05/06/08)
- Web-download (`openstack image create --import --import-method web-download --uri <url>`) is
  retained as a tested ALTERNATIVE, not the canonical path (superseded 2026-06-17; see
  design-decisions). Caveats: (1) it cannot checksum-verify the fetched file against a published
  digest (the CDN redirect strips it) -- weaker provenance; (2) it 403s on the azimuth CDN
  (FINDING-3), so it is unusable for kube images; (3) for ubuntu cloud-images it works on the
  hardened bundle (the 2026-06-08 403 was transient/pre-hardening). Use only as an expedient.

================================================================================
## Notes
================================================================================
- This index covers phases 00-08. It grows the same way for any future phase: keyed by
  D-NNN / DOCFIX-NNN / L-N / R-N / named-symptom, each entry symptom -> cause -> fix
  with a "phase NN" back-reference, and decision rationale left to design-decisions.md.
- memcached track drift is recorded in appendix-B (B.1), not here (it is a
  version-lock note, not a troubleshooting entry).

<!-- patchset-20260610-appendix-addendum -->

---

## Addendum 2026-06-10 -- CAPI/Magnum operations findings

Five entries from the 2026-06-10 recovery session. Full procedures with
verified blocks: runbooks/ops-capi-recovery.md.

### Parked-state signatures (mgmt VM deliberately stopped)
While capi-mgmt-v2 is stopped: Magnum reports UNHEALTHY with an EMPTY
health_status_reason (distinct from the D-042 cosmetic, which has a populated
reason); the Horizon Container Infra panel may 504 through the jumphost nginx
proxy and `coe` CLI calls may stall; the workload cluster keeps serving (no
runtime dependency on the mgmt cluster). If jumphost secrets were filed during
parking, the convention is ~/sweep-YYYYMMDD/secrets/. See ops-capi-recovery
Section 0 (expectations table) and Section 1 (parking block).

### Amphora orphan/zombie sweep after host-pressure events
Causal chain (traced live 2026-06-10): host CPU/memory pressure -> amphora
heartbeats go stale -> Octavia health-manager marks amphorae ERROR and launches
auto-failovers -> failovers fail NoValidHost (no placement headroom) -> amphora
servers accumulate with NO Octavia DB row. Two variants: an ERROR server
(failed spawn) and an ACTIVE heartbeating zombie (health-manager logs "missing
from the DB ... An operator must manually delete it" every 10 s). Remedy:
verify-then-delete by SERVER UUID under admin scope -- the
`loadbalancer amphora list` output is the DB truth; Nova name lookup is
project-scoped (amphorae live in the Octavia services project). Procedure:
ops-capi-recovery 5a. Do NOT retry failover against the same blocker; each
attempt mints another zombie.

### Octavia failover requires +1 amphora placement headroom
STANDALONE failover builds the replacement amphora BEFORE reaping the old one,
so it transiently needs one extra amphora slot (charm-octavia: 1024 MB / 1 vCPU
/ 8 GB). Scheduler ceiling per host =
physical_MB * ram_allocation_ratio (1.5) - reserved_host_memory (8192 per D-040).
A cloud allocated to that ceiling
cannot heal its own load balancers: the failover fast-fails to ERROR in
~15 seconds on NoValidHost. Verified to the megabyte 2026-06-10. Roosevelt
sizing requirement: reserve at least one amphora slot per concurrent failover
on top of workload allocation (feeds the node-role/rebalancing recommendation).

### juju ssh `</dev/null` vs an expired macaroon (DOCFIX-021 interaction)
DOCFIX-021's `</dev/null` on juju ssh assumes valid macaroon auth. When the
jumphost macaroon goes stale, juju falls back to an interactive password
prompt; `</dev/null` feeds that prompt EOF and the symptom is the misleading
"cannot get discharge from https://<controller>:17070/auth: EOF". Triage: run
`juju status` interactively -- if it succeeds after a password prompt, the
controller is healthy and only the credential cache is stale. Workaround for
the session: stdin from `</dev/tty`. Fix at a calm moment: `juju logout` then
`juju login`.

### Horizon visibility of CAPI instances, LBs, and amphorae
CAPI/Magnum VMs are owned by the capi-mgmt project; an empty Project ->
Compute -> Instances page under admin scope is correct, not a defect. Map:
tenant VMs -> Instances in the OWNING project's scope (use the header project
switcher; admin holds member on capi-mgmt per phase-06 6.0-BOOT); LB objects ->
Project -> Network -> Load Balancers in the owning project's scope; amphora
VMs -> Admin -> Compute -> Instances ONLY (they belong to the Octavia services
project); everything at once -> CLI `openstack server list --all-projects`.
Warning about the asymmetry: the Container Infra panel lists clusters
cross-project under admin policy, which makes the strictly-scoped Nova panel
look broken when it is not.

### link-subnet fails "IP address is already in use" (MAAS DISCOVERED record)
- Symptom: `link-subnet` fails "IP address is already in use" but the IP is in no
  visible table (ipaddresses read empty, discovery cleared, not in the DHCP dynamic
  range, interface on the correct VLAN).
- Cause: a freshly re-enrolled host PXE-leases its own metal IP (10.12.8.4N) at
  commission; MAAS keeps it as a StaticIPAddress of alloc_type 6 (DISCOVERED), tied to
  the node. Distinct from the network-discovery table AND from user allocations --
  neither `discoveries clear-by-mac-and-ip` nor a plain `ipaddresses release` clears it.
- Authoritative read (use FIRST, before guessing): `maas admin subnet ip-addresses
  <SUBNET_ID>` lists every in-use IP with `.alloc_type` and `.node_summary`; alloc_type 6
  = DISCOVERED. This is the definitive "who holds this IP and why".
- Fix: `maas admin ipaddresses release ip=<IP> force=true discovered=true` (BOTH flags;
  force alone -> "does not exist"). Only release when the discovered record's node is the
  SAME host -- a different node means a real address conflict; stop and investigate.
- Now automated: `scripts/carve-host-interfaces.sh` `release_self_discovered()` does
  this, gated to self-owned records only.