# Appendix A -- Troubleshooting / Known-Issues Index

Keyed by the same `D-NNN` / `DOCFIX-NNN` / `L-P6-N` identifiers used inline in the
phase runbooks. This is an OPERATIONAL index (symptom -> cause -> fix), NOT the
decision log: full rationale lives in `design-decisions.md` and the per-decision
files (`D-0NN-*.md`); the driver fix has its own `magnum-capi-helm-driver-fix-runbook`.
Each entry notes the phase(s) that reference it. ASCII-only.

================================================================================
## Remote execution / scripting
================================================================================

### DOCFIX-021 -- heredoc / stdin consumption  (phase-06, phase-07)
- Symptom: a multi-line `juju ssh`/`ssh ... bash -s` or remote `sudo` block dies
  early or behaves as if truncated; later commands in the heredoc never run.
- Cause: an inner `ssh`/`sudo`/`juju ssh` (or any stdin reader) consumes the rest
  of the heredoc/pipe that was feeding the outer command.
- Fix: append `</dev/null` to every inner `ssh`/`sudo`/`juju ssh` invocation
  (use `</dev/tty` instead only when the call genuinely needs an interactive prompt).
- Also: wrap multi-statement pasteable jumphost blocks in `( { ...; } )` so a stray
  `exit` cannot kill the interactive shell.
- SECOND MANIFESTATION (phase-03): a charm ACTION's human output silently corrupts a
  captured artifact. `juju run vault/leader get-root-ca` wraps the PEM in an INDENTED
  YAML `output: |-` block; `sed`-by-marker preserves the indent and an indented
  `-----BEGIN CERTIFICATE-----` is not valid PEM -> openssl "Unable to load
  certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND. Fix: pull from the action JSON
  (real newlines, no indent): `juju run vault/leader get-root-ca -m openstack
  --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'`.
  (Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)
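
The stdin-consumption cause and the `</dev/null` fix above can be reproduced locally
without any remote host. This is a minimal sketch, not the runbook's own block: `swallow`
is a hypothetical stand-in for an inner `ssh`/`sudo`/`juju ssh` (any command that reads
stdin), and `bash -s` plays the outer remote shell.

```shell
# Local reproduction of DOCFIX-021. `swallow` is a hypothetical stand-in for an
# inner ssh/sudo/juju ssh: it simply reads stdin until EOF.

# Unfixed: the inner reader eats the rest of the script that bash -s is reading.
broken=$(bash -s <<'EOF'
swallow() { cat >/dev/null; }
echo step1
swallow
echo step2
EOF
)

# Fixed: the inner reader gets /dev/null, so the outer script keeps its stdin.
fixed=$(bash -s <<'EOF'
swallow() { cat >/dev/null; }
echo step1
swallow </dev/null
echo step2
EOF
)

printf 'unfixed ran: %s\n' "$(printf '%s' "$broken" | tr '\n' ' ')"
printf 'fixed ran:   %s\n' "$(printf '%s' "$fixed" | tr '\n' ' ')"
```

In the unfixed run `step2` is never executed, which is the "later commands never run"
symptom; the only difference in the fixed run is the `</dev/null` on the inner call.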

### L-P6-4 -- admin-kubeconfig / secret transfer  (phase-07)
- Risk: staging the cluster-admin kubeconfig (or any secret) in `/tmp`, or letting a
  PTY mangle it in transit.
- Fix: pipe base64 straight into a root-written file with `umask 077`, then `chown`
  to the service user and `chmod 0600` -- never touch `/tmp`. (Pattern in phase-07 7.2.)
- Hardening (Roosevelt): replace the cluster-admin kubeconfig with a scoped
  ServiceAccount kubeconfig carrying only the RBAC the driver needs.
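
The base64-into-a-private-file pattern can be sketched as below. This is an illustration
under assumptions, not the phase-07 7.2 block itself: `write_secret` is a hypothetical
helper name, and the `chown` step is left as a comment because it needs root and the
service user is site-specific.

```shell
# write_secret DEST: decode base64 from stdin into DEST, created private.
# Hypothetical helper name; the runbook's real block is phase-07 7.2.
write_secret() {
  dest=$1
  # umask 077 in a subshell: the file is born 0600 and the caller's umask is untouched.
  ( umask 077 && base64 -d >"$dest" ) || return 1
  chmod 0600 "$dest"
  # As root, follow with: chown <service-user>: "$dest"   (user is site-specific)
}
```

Usage (paths are placeholders): `base64 <kubeconfig> | write_secret <dest-outside-/tmp>`.
Base64 on the wire is what keeps a PTY from mangling the payload.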

================================================================================
## k8s-snap bootstrap (mgmt cluster)
================================================================================

### DOCFIX-024 -- bootstrap config missing the cluster-config block  (phase-06)
- Symptom: `k8s bootstrap` "succeeds" but the node never reaches Ready; network and
  DNS are silently disabled; CoreDNS/Cilium absent.
- Cause: a bootstrap `--file` whose top level lacks a `cluster-config:` block leaves
  ALL features (network, dns, ...) at disabled defaults. Setting only `pod-cidr` /
  `service-cidr` / `extra-sans` does NOT enable them.
- Fix: include an explicit block:
      cluster-config:
        network: { enabled: true }
        dns:     { enabled: true }
  (See phase-06 6.4 for the full config.) Retry: `snap remove k8s --purge`, reinstall the
  snap, then re-bootstrap.
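
A cheap pre-flight guard against this failure is to refuse to bootstrap when the file has
no top-level `cluster-config:` key. A minimal sketch; `has_cluster_config` is a
hypothetical helper, and it checks only that the block is present, not that `network` and
`dns` are enabled inside it.

```shell
# has_cluster_config FILE: succeed only if FILE has a top-level cluster-config key.
# Hypothetical pre-flight helper; presence check only (it does not parse YAML).
has_cluster_config() {
  grep -q '^cluster-config:' "$1"
}
```

Usage (file name is a placeholder):
`has_cluster_config bootstrap.yaml || echo "no cluster-config block; do not bootstrap" >&2`.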

================================================================================
## CAPI provider install (mgmt cluster)
================================================================================

### DOCFIX-025a -- cert-manager Helm flag  (phase-06)
- Symptom: cert-manager install fails / CRDs absent when using `--set installCRDs=true`.
- Cause: `installCRDs` was removed from the cert-manager chart (~v1.18). The current
  flag is `crds.enabled=true`.
- Fix: `helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true`.

### D-034 -- CAPI install ordering (ORC before clusterctl init)  (phase-06)
- Symptom: after `clusterctl init`, `capo-controller-manager` CrashLoopBackOff
  (observed ~6 restarts / ~15 min) before self-healing.
- Cause: CAPO v0.14.4's `openstackserver` controller hard-depends on ORC's
  `Image.openstack.k-orc.cloud` CRD at startup. `clusterctl init` installs CAPO; if
  ORC is not yet present, CAPO crash-loops until it appears.
- Fix: install ORC (its manifest provides the `Image` CRD) BEFORE `clusterctl init`.
  Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor.
- Related rule: source every provider version from the chosen `capi-helm-charts`
  tag's `dependencies.json` (read live with `jq`); do not hardcode semver.
  (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.)

================================================================================
## Networking / pod egress
================================================================================

### D-035 -- dual-homed mgmt node pod-egress reverse-path failure  (phase-06)
- Symptom (the prior D-033 architecture): a pod's egress TCP connect to an external
  VIP hangs; the agnhost probe never reaches Completed. SYN leaves the correct NIC and
  the SYN-ACK arrives, but the reply is emitted back out the NIC instead of being
  redirected into the pod via `cilium_host` -- silent, asymmetric breakage. (The
  "do-07 pattern.")
- Cause: Cilium reverse-path handling on a node with multiple NICs.
- Fix (chosen): D-035 single-homed in-cloud tenant VM avoids it entirely; phase-06
  GATE 2 (agnhost pod -> Keystone VIP, must Complete) is the explicit proof. (The
  transferable alternative -- Cilium device pinning -- is a Roosevelt note, not v1.)

================================================================================
## Magnum conductor
================================================================================

### D-037 -- conductor config-dir injection (NOT a systemd ExecStart drop-in)  (phase-07)
- Symptom: the `[capi_helm]` conf.d drop-in is ignored; the conductor behaves as if the
  file was never written, even though a systemd drop-in "looks" applied.
- Cause: these OpenStack debs (openstack-pkg-tools) run the daemon through an LSB init
  script wrapped by systemd `systemd-start`, NOT a direct `ExecStart=`. A systemd
  drop-in appending `--config-dir` passes it as a positional arg to the init script,
  which ignores it -- the flag never reaches the daemon. The args are assembled inside
  the init script from `DAEMON_ARGS` (base `--config-file` first), extensible only via
  `/etc/default/<service>`.
- Fix: create `/etc/default/magnum-conductor` (0644; the charm does not manage it):
      DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"
  Verify with the init script's own `show-args` (dry-run) AND `ps -ww -C
  magnum-conductor -o args` on the live process -- behavioral, not string-presence.
- Residual: if a future charm hook ever writes `/etc/default/magnum-conductor`, the
  append is lost and `[capi_helm]` silently stops being read. Re-check via show-args/ps.
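
The behavioral check can be wrapped so that it gates on the launched cmdline rather than
on file contents. A minimal sketch, assuming the conf.d path used above;
`has_config_dir` is a hypothetical helper that takes the args string as its argument.

```shell
# has_config_dir ARGS: succeed if a launched cmdline carries the conf.d flag,
# written either as "--config-dir PATH" or "--config-dir=PATH".
has_config_dir() {
  printf '%s\n' "$1" |
    grep -Eq -- '(^|[[:space:]])--config-dir[ =]/etc/magnum/magnum\.conf\.d([[:space:]]|$)'
}

# Live check on the unit (needs a running conductor):
#   has_config_dir "$(ps -ww -C magnum-conductor -o args=)" || echo "flag not launched" >&2
```

Feeding it the `ps` output of the live process (not the text of
`/etc/default/magnum-conductor`) is what makes the check behavioral.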

### L-P6-1 / L-P6-2 -- verify the launched cmdline, not the unit text  (phase-07)
- Rule: never assume the systemd `ExecStart` shape for OpenStack debs, and never treat
  "string present in the unit file" as "the daemon received the flag." Gate on the
  assembled/launched cmdline (`show-args`, then `ps` on the live process).

### DOCFIX-035 -- helm not on the conductor's PATH  (phase-07)
- Symptom: the magnum-capi-helm driver fails shelling out to `helm` (cluster create errors on a
  helm invocation), yet `command -v helm` in an interactive `juju ssh magnum/0` shell finds it.
- Cause: the conductor runs via an LSB init script (systemd `systemd-start`) with the restricted
  init PATH (e.g. `/usr/sbin:/usr/bin:/sbin:/bin`), which EXCLUDES `/usr/local/bin` -- where a
  get.helm.sh tarball install lands. An interactive login shell has `/usr/local/bin` on PATH, so
  it masks the problem (the classic green-in-the-shell, broken-in-the-daemon trap).
- Fix: install the binary to `/usr/local/bin/helm` AND symlink `/usr/bin/helm -> it` (`/usr/bin`
  IS on the restricted PATH). Checksum-verify the tarball (sha256 vs get.helm.sh `.sha256sum`)
  before install. VERIFY against the restricted PATH, not a login shell:
  `env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm && helm version --short'`
  must print `/usr/bin/helm` (phase-07 7.4).
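
The same restricted-PATH probe generalizes to any binary the daemon shells out to. A
minimal sketch; `on_init_path` is a hypothetical helper, and the PATH value is the example
quoted above, so confirm the real init PATH on the unit.

```shell
# on_init_path TOOL: resolve TOOL with ONLY the restricted init PATH in effect.
# env -i drops the login environment, so /usr/local/bin cannot mask a miss.
on_init_path() {
  env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v "$1"' sh "$1"
}
```

`on_init_path helm` prints the resolved path and exits non-zero when the tool is not
reachable from the restricted PATH, whatever the interactive shell says.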

### L-P6-3 -- k8s version comes from the IMAGE, not a template label  (phase-08)
- Symptom: cluster create fails in the driver before provisioning.
- Cause: the magnum-capi-helm driver reads `kube_version` from the Glance image
  properties and routes on `os_distro`; it does NOT take k8s version from a template
  label.
- Fix: the workload image (e.g. `ubuntu-jammy-kube-v1.34.8`) MUST carry
  `kube_version` (e.g. `v1.34.8`, matching the image) and `os_distro=ubuntu`. Verify
  before create (phase-08 8.0).

================================================================================
## Driver / cluster health
================================================================================

### D-042 -- driver contract-coherence; health "infrastructure: not found"  (phase-07, phase-08, appendix-B)
- Symptom: `coe cluster show` reports `health_status = UNHEALTHY` deterministically
  (survives a conductor restart); only the `infrastructure` sub-check fails
  ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready.
- Cause: driver 1.3.0 reads `apiVersion` off `spec.infrastructureRef` to build its
  health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with
  NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the
  resource versions); only the driver's direct health query breaks.
- Fix: upgrade to the RELEASED `magnum-capi-helm==1.4.0` (the "generalize-api-resources"
  feature). 1.4.0 builds each health GET from an explicit api_version via its
  `[capi_helm] api_resources` option, which DEFAULTS to v1beta1 for every CAPI kind --
  and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override
  needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only.
  Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the
  Layer-A CAPI core.
- Operational caveat while unfixed: do NOT wire magnum auto-healing to `health_status`
  (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently.

================================================================================
## Cluster lifecycle / Octavia
================================================================================

### D-039 -- app-cred roles (load-balancer_member) / Octavia 403  (phase-08)
- Symptom: cluster create or delete wedges; CAPO gets 403 querying the Octavia LB.
- Cause: the Magnum-minted application credential lacks `load-balancer_member`
  (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state).
- Fix: ensure the service path mints app-creds carrying `load-balancer_member`
  (+ member, reader). Verify before acceptance (phase-08 prereqs).

### stuck-delete -- wedged CAPI cluster delete  (phase-08)
- Symptom: cluster stuck `DELETE_IN_PROGRESS`; helm release already gone; `Cluster`
  and `OpenStackCluster` CRs stuck Deleting (often on an Octavia 403, see D-039).
- Recovery: clear the `OpenStackCluster` finalizer on the mgmt cluster --
  `kubectl -n <magnum-ns> patch openstackcluster <cluster>-<suffix> --type=merge
  -p '{"metadata":{"finalizers":[]}}'`. The `Cluster` finalizer was only waiting on it,
  so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron
  resources in dependency order: router remove subnet -> router unset external-gateway
  -> router delete -> subnet delete -> network delete -> security group delete.
- Name-guard (FINDING-4): NEVER patch/delete a CR by an inferred name. The OpenStackCluster is
  named `<cluster>-<CAPI-suffix>` where the suffix is random per create (NOT the Magnum cluster
  name). LIST first -- `kubectl -n <magnum-ns> get openstackcluster` -- and operate on the EXACT
  name returned. The magnum-ns is `magnum-<project-id>` (resolve the project id; never hardcode).
  A wrong-name patch silently no-ops and the delete stays wedged.
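
The list-first name-guard can be made mechanical: take the listed names, demand exactly
one match, and only then patch. A minimal sketch; `pick_one` is a hypothetical helper,
and `$ns` / `mycluster` in the usage comment are placeholders.

```shell
# pick_one PREFIX: read resource names on stdin and print the single one that
# starts with "PREFIX-". Zero or several matches is an error: never guess.
pick_one() {
  matches=$(awk -v p="$1-" 'index($0, p) == 1')
  count=$(printf '%s\n' "$matches" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly 1 name starting with $1-, found $count" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}

# Usage on the mgmt cluster (placeholders: $ns, mycluster):
#   name=$(kubectl -n "$ns" get openstackcluster -o name | sed 's|.*/||' | pick_one mycluster) &&
#     kubectl -n "$ns" patch openstackcluster "$name" --type=merge -p '{"metadata":{"finalizers":[]}}'
```

The helper fails closed: an ambiguous prefix (two clusters sharing it) stops the patch
instead of picking one.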

### LB-failover -- LB stuck provisioning_status=ERROR after a host event  (phase-08)
- Symptom: the kube-api Octavia LB shows `operating_status ONLINE` but
  `provisioning_status ERROR` after a host outage/OOM.
- Cause: a control-plane op on the amphora failed during the outage.
- Fix: `openstack loadbalancer failover <lb-id>` in ADMIN-project scope (amphora /
  failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE
  (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE.

### uninitialized-taint -- workload addons Pending  (phase-08)
- Symptom: new workload nodes are kubelet-Ready but addon pods (metrics-server,
  node-feature-discovery, etc.) stay Pending; nodes carry
  `node.cluster.x-k8s.io/uninitialized`.
- Cause: that taint is removed by the CAPI machine controller on the MANAGEMENT
  cluster. If the mgmt cluster is down (see D-041), the taint persists.
- Fix: restore the mgmt cluster API; CAPI then removes the taint and addons schedule.

### CNI-label / DOCFIX-032 -- network_driver under driver 1.4.0; pin calico explicitly  (phase-08)
- Note: under the as-FIRST-built driver 1.3.0 the legacy Magnum `network_driver` label was
  IGNORED and the capi-helm `openstack-cluster` chart's default CNI (Calico) always ran. Under
  the RELEASED 1.4.0 driver the `network_driver` template option IS honored (it maps through to
  the chart `network_driver`).
- DOCFIX-032: pin `--network-driver calico` EXPLICITLY on the `capi-k8s-v1-34` template
  (phase-08) rather than relying on the default staying Calico. Chart 0.25.1 ships ONLY Calico
  (flannel is not packaged), so `flannel` there would fail to converge -- do not set it. (Mgmt
  cluster CNI is separately Cilium, via k8s-snap.)

================================================================================
## Hyperconverged host / mgmt-VM resilience
================================================================================

### D-040 -- host OOM from low reserved-host-memory  (phase-08)
- Symptom: guests OOM-killed; a compute host may even present in Juju as
  `State=down` (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent).
- Cause: `reserved-host-memory` default 512 MB does not cover the co-located
  LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM.
- Fix: `reserved-host-memory = 8192` on all compute units (baked into the hardened
  bundle). Diagnose a suspected OOM-vs-reboot with `who -b` / `uptime` (no recent boot)
  and `journalctl -k | grep -i oom`; the ovsdb "no response to inactivity probe ...
  disconnecting" storm is the swap-thrash signature.

### D-041 -- single-node mgmt cluster does not self-heal  (phase-08)
- Symptom: after a host event the mgmt VM (`capi-mgmt-v2`) is SHUTOFF; FIP
  unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see
  uninitialized-taint).
- Cause: the D-035 single-node mgmt cluster is a SPOF with no MachineHealthCheck
  (unlike the workload cluster).
- Fix: `openstack server start capi-mgmt-v2` (API serves ~40s later; a brief TLS
  handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for
  Roosevelt.

### juju-macaroon -- "cannot get discharge ... EOF"  (phase-07, phase-08)
- Symptom: `juju ssh` (or another juju call) fails mid-session with a discharge/EOF error.
- Cause: the juju macaroon expired during a long session.
- Fix: re-run `juju login`, then retry.

================================================================================
## Teardown / MAAS reset (phase-00)
================================================================================

### DOCFIX-016 -- never `maas list` (API-key leak)  (phase-00, phase-01, phase-04)
- Risk: `maas list` prints the stored API key to stdout (and into any transcript/log).
- Fix: the profile name is known (`admin`); call `maas admin ...` directly. Never run
  `maas list` in a runbook or paste block.

### DOCFIX-017 -- no `maas whoami`; hardcode the eyeballed system_ids  (phase-00)
- Risk: scripting machine selection via `maas <profile> whoami` + owner filters is
  fragile and, in this lab, unnecessary.
- Fix: the four host system_ids are fixed and eyeball-verified
  (openstack0=4na83t, openstack1=qdbqd6, openstack2=h8frng, openstack3=tmsafc) --
  iterate those literals. (The older 01-destroy-model.md used `maas list`/`whoami` and
  released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.)

### R7 -- sudo for libvirt / qemu-img  (phase-00, phase-01)
- The OSD qcow2 files (`/var/lib/libvirt/images/<host>-1.qcow2`) are root:root / 600;
  `qemu-img info|create`, `virsh domstate`, `stat`, `rm` against them all need `sudo`.

### KI-P3-001 -- VIP / primary collision  (phase-00, phase-04)
- Symptom: a charm `vip:` address equals a MAAS-auto-assigned machine/container
  primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary).
- Cause: the VIP block was not excluded from MAAS auto-static allocation (provider had
  NO VIP reservation), so MAAS handed primaries .225/.226/.227 onto the .224-.236 VIPs.
- Fix (durable): on EVERY space carrying VIPs (provider AND metal) reserve the
  front-loaded VIP /26 in MAAS, distinct from the primary range and any neutron
  allocation_pool (phase-00 Phase 4). A reserved range stops future auto-assign onto
  a configured VIP. Negative test post-deploy: no service vip == any unit primary.
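
The post-deploy negative test reduces to a set intersection. A minimal sketch, assuming
the VIPs and the unit primaries have already been collected into two files, one address
per line (how they are collected is left to the phase runbook); `vip_collisions` is a
hypothetical helper.

```shell
# vip_collisions VIPS PRIMARIES: print every address present in both files.
# -F fixed strings, -x whole-line match (so .5 never matches .50).
# Empty output means the negative test passes.
vip_collisions() {
  grep -Fx -f "$1" "$2" || true
}
```

Gate on empty output, e.g. `[ -z "$(vip_collisions vips.txt primaries.txt)" ]` (file
names are placeholders).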

### DEVIATION-2 -- raise a KVM host's RAM, then MAAS-recommission to Ready  (phase-00)
- Context (2026-06-11): the openstack0-3 KVM guests were bumped 16384 -> 32768 MiB on the 196 GB
  hypervisor to relieve memory pressure. Pattern: with the guest SHUT OFF (and after the OSD
  wipe), `virsh setmaxmem <dom> 32G --config` then `virsh setmem <dom> 32G --config`; boot; then
  MAAS RECOMMISSION the node so MAAS re-reads hardware and lands it back at Ready at the new size
  (4x Ready at 32768 in ~3 min). Do the maxmem change while shut off -- a live setmaxmem is rejected.
- D-040 `reserved-host-memory 8192` is RETAINED (correctness floor, independent of host size).
  Re-measure the per-host container/service footprint against the 32 GiB envelope before the
  Roosevelt node-role split -- 16 GiB-era pressure numbers do not map 1:1.

================================================================================
## Deploy-time (phase-01)
================================================================================

### R14 -- VIP relocation .224-.236 -> .50-.60  (phase-01)
- The public + internal API VIPs were front-loaded out of the old high-end .224-.236
  block into .50-.60 (inside the reserved .2-.63 /26). Every bundle `vip:` is a dual
  provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider
  VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud
  consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed.

### R15 -- the .10 phantom resolver  (phase-01)
- Symptom: an unreachable region resolver `10.12.8.10` appears in a node's resolver
  list (sometimes as Current DNS Server) despite the subnet dns_servers override.
- Cause: MAAS advertises its region/rack controller as a DNS server on the
  MAAS-managed metal VLAN, independent of the subnet field; the override does not purge it.
- Impact: NON-BLOCKING -- systemd-resolved deprioritizes .10 and falls through to .1.
  Latent fragility if .1 ever drops. Understand/eliminate for Roosevelt (no libvirt split there).

### L1 -- no `set -e` on count-gate blocks; guard greps `|| true`  (phase-01)
- A guarded `grep -c` returning 0 is a VALID answer, not a failure. Under `set -e` a
  zero-count grep aborts the block. Pre-deploy verify blocks run WITHOUT `set -e`, and
  every count grep ends `|| true`. (`bash -n` would not catch this -- it is behavior.)
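
The behavior is easy to see in isolation: `grep -c` prints `0` AND exits 1 when nothing
matches. A minimal sketch; `count_lines` is a hypothetical helper, and the two `bash -c`
runs enable `set -e` only to demonstrate the abort.

```shell
# count_lines PATTERN FILE: print the number of matching lines; zero is a
# valid answer. grep -c exits 1 on a zero count, so the guard keeps going.
count_lines() {
  grep -c -- "$1" "$2" || true
}

# Demonstration: under set -e the unguarded form aborts before "reached".
unguarded=$(bash -c 'set -e; n=$(grep -c zzz /dev/null); echo "reached n=$n"' || true)
guarded=$(bash -c 'set -e; n=$(grep -c zzz /dev/null || true); echo "reached n=$n"')
printf 'unguarded: [%s]\nguarded:   [%s]\n' "$unguarded" "$guarded"
```

The unguarded run prints nothing after the grep; the guarded run reaches the gate with a
count of 0.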

### L3 -- metal-side dual-VIP eyeball check  (phase-01)
- The provider-side VIP guard greps only the first token of each dual `vip:`. The metal
  side (second token, `10.12.8.5x`) must be eyeballed to confirm all 11 sit in .8.50-.60,
  clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju).

================================================================================
## Vault / secrets (phase-02)
================================================================================

### DOCFIX-006 -- vault init is one-shot; stdout-only redirect loses the keys  (phase-02)
- Symptom: `vault operator init ... > file` captures stdout only; if the key block went
  to stderr (or the run is interrupted) you are left with an unusable/empty file and the
  5 shares + root token are GONE -- init runs exactly once and cannot be replayed.
- Fix: `vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt`
  VERBATIM; gate on `grep -c '^Unseal Key' == 5` and `Initial Root Token` present; then
  save the file OFF-HOST before anything else. Never improvise this command.
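
The two gates can be bundled into one check that is run on the captured file immediately
after init. A minimal sketch; `gate_vault_init` is a hypothetical helper and does not
replace the verbatim init command above.

```shell
# gate_vault_init FILE: succeed only if the capture holds exactly 5 unseal-key
# lines and an Initial Root Token line.
gate_vault_init() {
  keys=$(grep -c '^Unseal Key' "$1" || true)
  [ "${keys:-0}" -eq 5 ] || { echo "found ${keys:-0} unseal keys, want 5" >&2; return 1; }
  grep -q '^Initial Root Token' "$1" || { echo "no Initial Root Token line" >&2; return 1; }
}
```

Usage: `gate_vault_init ~/vault-init/init.txt` and, only on success, copy the file
off-host.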

### DOCFIX-011 -- authorize-charm parameter is `token`  (phase-02)
- The vault `authorize-charm` action takes `token` (a direct token string); there is no
  `token-secret-id` variant in this charm rev. Confirm via `juju actions vault --schema`.
  Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log).

### DOCFIX-014 -- generate-root-ca is required  (phase-02)
- Symptom: after authorize-charm, vault stays BLOCKED "Missing CA cert".
- Fix: run `juju run vault/leader generate-root-ca` -- it mints the charm-pki-local
  root and clears the block straight to active. (Omitting it leaves vault hung.)

### L4 -- vault unseal via hidden prompt, not key-on-argv  (phase-02)
- Use Vault's own `vault operator unseal` (no argument) so it prompts hidden; the key is
  never on the command line / in a var / in `ps` / in scrollback. Do NOT use
  `vault operator unseal $KEY` (visible in `ps` on the unit). Unseal is re-runnable, so
  the verbatim-reference rule is looser here, but the security gain is real.

### R3 -- "HA Enabled false" is correct for vault-on-mysql  (phase-02)
- Expected post-unseal: Initialized true / Sealed false / Storage Type mysql /
  **HA Enabled false**. Single-unit vault on the mysql backend is non-HA by design; any
  reference to "HA Enabled true (etcd backend)" is STALE (etcd was dropped).

================================================================================
## Identity / openrc (phase-03)
================================================================================

### DOCFIX-018 -- IP-only OS_AUTH_URL  (phase-03)
- This cloud is IP-only (no FQDN, no cloud DNS). The admin openrc must point at the
  keystone PUBLIC endpoint by IP: `OS_AUTH_URL=https://10.12.4.50:5000/v3`, with the
  vault root CA in `OS_CACERT` (B5 IP-SAN certs validate). No /etc/hosts, no FQDN.

### DOCFIX-022 -- discover the admin project; do not hardcode it  (phase-03)
- Symptom: with TLS working, keystone returns HTTP 401.
- Cause: wrong project scope. The scoping project name varies by charm rev (here it is
  `admin`, living in domain `admin_domain`; an older doc's `OS_PROJECT_NAME=admin_domain`
  401s). Credential good, scope wrong.
- Fix: a candidate loop -- try each of "admin admin_domain"; the first that issues a
  SCOPED token wins (phase-03 3.2). Costs 2 extra token requests; self-corrects across
  revs instead of re-introducing the 401-by-hardcode.
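
The candidate loop has roughly this shape. It is a sketch under assumptions, not the
phase-03 3.2 block: `first_scoped` and `scoped_token_probe` are hypothetical names, and
the probe assumes a sourced openrc and that only a scoped token reports a `project_id`.

```shell
# first_scoped PROBE CANDIDATE...: print the first candidate for which PROBE succeeds.
first_scoped() {
  probe=$1; shift
  for candidate in "$@"; do
    if "$probe" "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no candidate project issued a scoped token" >&2
  return 1
}

# Assumed probe: a scoped token carries a project_id. Requires a sourced openrc.
scoped_token_probe() {
  [ -n "$(OS_PROJECT_NAME=$1 openstack token issue -f value -c project_id)" ]
}
```

Usage: `OS_PROJECT_NAME=$(first_scoped scoped_token_probe admin admin_domain) && export OS_PROJECT_NAME`.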

================================================================================
## Core services: HAProxy + reverse-proxy (phase-03)
================================================================================

### D-045 / DOCFIX-031 -- juju "active/idle" but an haproxy backend is DOWN  (phase-03)
- Symptom: `juju status` is all active/idle, yet a service VIP intermittently 503s or a unit's
  API is unreachable. juju health is BLIND to per-backend haproxy state.
- Cause: a charm-rendered haproxy backend can be silently DOWN without the charm going non-idle
  -- e.g. (D-045) haproxy was NOT reloaded after the TLS/cert cascade, so its health checks ran
  plaintext against an SSL backend and marked it DOWN. juju-green is necessary, not sufficient.
- Fix: sweep haproxy's OWN verdict on every unit via its admin socket, then remediate+reload.
  Per unit, read `/var/run/haproxy/admin.sock` (`show stat`) and `grep ',DOWN,'` (excluding the
  FRONTEND/BACKEND summary rows). For any flagged unit: `sudo haproxy -c -f
  /etc/haproxy/haproxy.cfg` (must say valid) then `sudo systemctl reload haproxy` (graceful
  master-worker; reload, not restart). Phase-03 3.1 gates on a zero-DOWN sweep cloud-wide --
  it closes the juju-green-but-backend-DOWN hole.
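
Parsing the `show stat` CSV by column is a little stricter than a bare `grep ',DOWN,'`. A
minimal sketch; `down_servers` is a hypothetical helper, and reaching the socket with
`socat` in the usage comment is an assumption (use whatever the phase-03 3.1 sweep uses).

```shell
# down_servers: read haproxy "show stat" CSV on stdin and print proxy/server
# for every server row whose status starts with DOWN. Column 18 is `status`;
# the FRONTEND/BACKEND summary rows and the "# pxname,..." header are skipped.
down_servers() {
  awk -F, '$1 !~ /^#/ && $2 != "FRONTEND" && $2 != "BACKEND" && $18 ~ /^DOWN/ { print $1 "/" $2 }'
}

# Usage on a unit (socat is an assumption):
#   echo "show stat" | sudo socat stdio unix-connect:/var/run/haproxy/admin.sock | down_servers
```

Matching on the status prefix also catches transitional values such as `DOWN 1/2`, which
a literal `,DOWN,` pattern misses.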

### nginx-reverse-proxy -- jumphost -> internal-VIP proxy gotchas  (phase-03)
- Context: the jumphost reaches internal-only dashboards/APIs via an nginx reverse proxy
  (phase-03 3.3). Four traps, each with the as-built fix:
- reload race: right after a `systemctl reload nginx`, a request can still be served by a
  still-draining old worker (a curl ~2s later hits stale behavior; the co-hosted MAAS proxy
  blips too). Run `nginx -t` FIRST; prefer `restart` for a definitive cutover when the
  listen/upstream set changed, and `reload` only for content-equivalent edits.
- proxy_ssl_name / SNI: the upstream presents a DNS-SAN cert (a juju-internal name, e.g.
  `juju-ffe3b8-2-lxd-2`); set `proxy_ssl_name` to that SAN, `proxy_ssl_verify on`, and the vault
  CA in `proxy_ssl_trusted_certificate`, or verification fails on the IP-only connect.
- sed no-op: a `sed -i` that does not match silently changes nothing and the proxy keeps the old
  behavior -- assert the post-edit content, do not trust sed's exit code.
- scheme-mismatch redirect loop: the backend issues `https://` Location headers while the proxy
  listens `http`; without `proxy_redirect https:// http://` (or a matching listen scheme) the
  browser loops. Match the scheme end-to-end or rewrite the redirect.
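
For the sed no-op trap, the assertion can travel with the edit. A minimal sketch assuming
GNU `sed -i`; `edit_and_assert` is a hypothetical helper and the expected string is
matched literally.

```shell
# edit_and_assert FILE SED_EXPR EXPECTED: apply the edit, then require that
# EXPECTED (a fixed string) is present afterwards. sed exits 0 even on no match.
edit_and_assert() {
  sed -i -e "$2" "$1" || return 1
  grep -qF -- "$3" "$1" || { echo "edit did not land in $1: $3" >&2; return 1; }
}
```

Run `nginx -t` and the reload/restart only after this returns success.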

================================================================================
## Octavia enablement (phase-05)
================================================================================

### L7 -- the openstack snap cannot read /tmp  (phase-05, also phase-01 PKI sanity)
- Symptom: `openstack image create --file /tmp/...` -> "[Errno 2] No such file or
  directory" even though `sha256sum` just read the same path.
- Cause: the openstack CLI snap is confined and cannot read `/tmp`; it CAN read `$HOME`
  (home interface).
- Fix: stage any file the snap must read under `$HOME` (e.g. `$HOME/amphora-base/...`),
  never `/tmp`.

### octavia-configure-resources -- long-running action; o-hm0 transient is normal  (phase-05)
- `configure-resources` is long-running: juju's default action wait may time out
  ("timed out waiting for results") while the hook KEEPS RUNNING -- do NOT treat the
  wait-timeout as failure or re-fire blindly. Use a bounded `--wait` and confirm completion
  via `juju show-operation <N>` (authoritative), not the streamed log.
- NORMAL (not faults) during/after: lb-mgmt-net is IPv6-ULA (fc00::/..) by design; a
  "Virtual network for access to Amphorae is down" transient self-heals as o-hm0 comes
  up; the lb-mgmt `network:distributed` port shows DOWN (logical OVN port, never chassis-bound).

### amp-image-tag-mismatch -- LP#1937003  (phase-05)
- Octavia looks up the amphora image by `octavia amp-image-tag`; it MUST equal the tag
  the retrofit stamps (`octavia-diskimage-retrofit amp-image-tag`), both `octavia-amphora`.
  A mismatch means octavia cannot find the image even though it is built and ACTIVE.
  The amphora pipeline gate asserts the two are equal before building (phase-05 5.2).

================================================================================
## Image seeding (phase-05/06/08)
================================================================================

### FINDING-3 -- azimuth CDN 403s glance web-download; stage-and-verify is canonical  (phase-06, phase-08)
- Symptom: a glance web-download import (`--import-method web-download`) 202-accepts, then the
  image hangs in `queued` forever and never reaches `active`.
- Cause: glance's web-download plugin fetches with urllib (User-Agent `Python-urllib/3.x`); the
  azimuth-images CDN (`azimuth-images.stackhpc.cloud`) returns HTTP 403 to that UA. A curl/HEAD
  probe with a different UA passes -- which is why an earlier probe false-passed while the real
  import failed.
- Fix (canonical): STAGE-AND-VERIFY. curl the qcow2 to `$HOME` (snap-readable, NOT /tmp -- L7;
  curl's UA is not blocked), verify the checksum against the published manifest (azimuth-images
  manifest.json -- sha512 for kube images; the ubuntu cloud-images SHA256SUMS for noble), then
  `openstack image create --file --import` (the openstack snap's `--import` == glance-direct;
  image-conversion lands it `raw`). CORRECTION-1: a plain `--file` PUT (no `--import`) stores
  qcow2 -- fine for boot, but `--import` gives the raw Ceph fast-clone alignment.
- Clear a stuck record before retry: gated `openstack image delete <id>` on the `queued` remnant
  (verify the EXACT id first -- FINDING-4 name-guard discipline).
- Roosevelt: unify ALL image seeding (amphora base, noble mgmt, kube) on stage-and-verify for one
  provenance-verified path cloud-wide.
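
The verify step of stage-and-verify for a sha512-published image can be sketched as below.
`verify_sha512` is a hypothetical helper; extracting the expected digest from the
published manifest is left out because the manifest layout is not described here.

```shell
# verify_sha512 FILE EXPECTED_HEX: succeed only if FILE hashes to EXPECTED_HEX.
# A missing file yields an empty digest and therefore fails the comparison.
verify_sha512() {
  actual=$(sha512sum "$1" | awk '{ print $1 }') || return 1
  [ "$actual" = "$2" ] || { echo "sha512 mismatch for $1" >&2; return 1; }
}
```

Gate the `openstack image create --file ... --import` on this returning success, with the
staged file under `$HOME` per L7.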

### web-download -- tested ALTERNATIVE to stage-and-verify  (phase-05/06/08)
- Web-download (`openstack image create --import --import-method web-download --uri <url>`) is
  retained as a tested ALTERNATIVE, not the canonical path (superseded 2026-06-17; see
  design-decisions). Caveats: (1) it cannot checksum-verify the fetched file against a published
  digest (the CDN redirect strips it) -- weaker provenance; (2) it 403s on the azimuth CDN
  (FINDING-3), so it is unusable for kube images; (3) for ubuntu cloud-images it works on the
  hardened bundle (the 2026-06-08 403 was transient/pre-hardening). Use only as an expedient.

================================================================================
## Notes
================================================================================
- This index covers phases 00-08. It grows the same way for any future phase: keyed by
  D-NNN / DOCFIX-NNN / L-N / R-N / named-symptom, each entry symptom -> cause -> fix
  with a "phase NN" back-reference, and decision rationale left to design-decisions.md.
- memcached track drift is recorded in appendix-B (B.1), not here (it is a
  version-lock note, not a troubleshooting entry).

<!-- patchset-20260610-appendix-addendum -->

---

## Addendum 2026-06-10 -- CAPI/Magnum operations findings

Five entries from the 2026-06-10 recovery session. Full procedures with
verified blocks: runbooks/ops-capi-recovery.md.

### Parked-state signatures (mgmt VM deliberately stopped)
While capi-mgmt-v2 is stopped: Magnum reports UNHEALTHY with an EMPTY
health_status_reason (distinct from the D-042 cosmetic, which has a populated
reason); the Horizon Container Infra panel may 504 through the jumphost nginx
proxy and `coe` CLI calls may stall; the workload cluster keeps serving (no
runtime dependency on the mgmt cluster). If jumphost secrets were filed during
parking, the convention is ~/sweep-YYYYMMDD/secrets/. See ops-capi-recovery
Section 0 (expectations table) and Section 1 (parking block).

### Amphora orphan/zombie sweep after host-pressure events
Causal chain (traced live 2026-06-10): host CPU/memory pressure -> amphora
heartbeats go stale -> Octavia health-manager marks amphorae ERROR and launches
auto-failovers -> failovers fail NoValidHost (no placement headroom) -> amphora
servers accumulate with NO Octavia DB row. Two variants: an ERROR server
(failed spawn) and an ACTIVE heartbeating zombie (health-manager logs "missing
from the DB ... An operator must manually delete it" every 10 s). Remedy:
verify-then-delete by SERVER UUID under admin scope -- the
`loadbalancer amphora list` output is the DB truth; Nova name lookup is
project-scoped (amphorae live in the Octavia services project). Procedure:
ops-capi-recovery 5a. Do NOT retry failover against the same blocker; each
attempt mints another zombie.

### Octavia failover requires +1 amphora placement headroom
STANDALONE failover builds the replacement amphora BEFORE reaping the old one,
so it transiently needs one extra amphora slot (charm-octavia: 1024 MB / 1 vCPU
/ 8 GB). Scheduler ceiling per host =
physical_MB * ram_allocation_ratio (1.5) - reserved_host_memory (8192 per D-040).
A cloud allocated to that ceiling
cannot heal its own load balancers: the failover fast-fails to ERROR in
~15 seconds on NoValidHost. Verified to the megabyte 2026-06-10. Roosevelt
sizing requirement: reserve at least one amphora slot per concurrent failover
on top of workload allocation (feeds the node-role/rebalancing recommendation).

### juju ssh `</dev/null` vs an expired macaroon (DOCFIX-021 interaction)
DOCFIX-021's `</dev/null` on juju ssh assumes valid macaroon auth. When the
jumphost macaroon goes stale, juju falls back to an interactive password
prompt; `</dev/null` feeds that prompt EOF and the symptom is the misleading
"cannot get discharge from https://<controller>:17070/auth: EOF". Triage: run
`juju status` interactively -- if it succeeds after a password prompt, the
controller is healthy and only the credential cache is stale. Workaround for
the session: stdin from `</dev/tty`. Fix at a calm moment: `juju logout` then
`juju login`.

### Horizon visibility of CAPI instances, LBs, and amphorae
CAPI/Magnum VMs are owned by the capi-mgmt project; an empty Project ->
Compute -> Instances page under admin scope is correct, not a defect. Map:
tenant VMs -> Instances in the OWNING project's scope (use the header project
switcher; admin holds member on capi-mgmt per phase-06 6.0-BOOT); LB objects ->
Project -> Network -> Load Balancers in the owning project's scope; amphora
VMs -> Admin -> Compute -> Instances ONLY (they belong to the Octavia services
project); everything at once -> CLI `openstack server list --all-projects`.
Warning about the asymmetry: the Container Infra panel lists clusters
cross-project under admin policy, which makes the strictly-scoped Nova panel
look broken when it is not.
