Appendix A -- Troubleshooting / Known-Issues Index

Keyed by the same D-NNN / DOCFIX-NNN / L-P6-N identifiers used inline in the phase runbooks. This is an OPERATIONAL index (symptom -> cause -> fix), NOT the decision log: full rationale lives in design-decisions.md and the per-decision files (D-0NN-*.md); the driver fix has its own magnum-capi-helm-driver-fix-runbook. Each entry notes the phase(s) that reference it. ASCII-only.

================================================================================

Remote execution / scripting

================================================================================

DOCFIX-021 -- heredoc / stdin consumption (phase-06, phase-07)

Symptom: a multi-line block fed over stdin (juju ssh / ssh ... bash -s, or a remote sudo block) dies early or behaves as if truncated; later commands in the heredoc never run.
Cause: an inner ssh/sudo/juju ssh (or any stdin reader) consumes the rest of the heredoc/pipe that was feeding the outer command.
Fix: append </dev/null to every inner ssh/sudo/juju ssh invocation (use </dev/tty instead only when the call genuinely needs an interactive prompt).
Also: wrap multi-statement pasteable jumphost blocks in ( { ...; } ) so a stray exit cannot kill the interactive shell.
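The stdin-consumption failure can be reproduced locally, with no remote host involved. The sketch below is illustrative only: cat stands in for an inner ssh/sudo/juju ssh (a demo assumption; any stdin reader behaves the same way), and run_block is a hypothetical helper name.

```shell
# run_block feeds a three-line script to "bash -s" on stdin.
# $1 is the redirection applied to the inner stdin reader ("" or "</dev/null").
# "cat" stands in for an inner ssh/sudo/juju ssh (demo assumption).
run_block() {
  printf '%s\n' 'echo step1' "cat $1 >/dev/null" 'echo step2' | bash -s
}

unguarded=$(run_block "")            # inner reader drains the rest of the script
guarded=$(run_block "</dev/null")    # inner reader gets EOF immediately

echo "unguarded ran: $(echo $unguarded)"
echo "guarded ran:   $(echo $guarded)"
```

In the unguarded run the second echo never executes, because the inner reader has already consumed that line from the shared stdin.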
SECOND MANIFESTATION (phase-03): a charm ACTION's human output silently corrupts a captured artifact.
Symptom: openssl "Unable to load certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND.
Cause: juju run vault/leader get-root-ca wraps the PEM in an INDENTED YAML output: |- block; sed-by-marker preserves the indent, and an indented -----BEGIN CERTIFICATE----- is not valid PEM.
Fix: pull from the action JSON (real newlines, no indent): juju run vault/leader get-root-ca -m openstack --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'. (Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)
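A cheap structural gate on the captured file catches the indented-PEM case before openssl or keystone does. This is a sketch, not the runbook's own check: pem_is_clean is a hypothetical helper, and it tests only for an exact first-line marker and the absence of leading whitespace. It does not validate the certificate itself.

```shell
# pem_is_clean FILE
# Succeeds only if FILE starts with an unindented BEGIN marker and no line
# carries leading whitespace (the signature of a copied YAML "|-" block).
pem_is_clean() {
  [ "$(head -n 1 "$1")" = "-----BEGIN CERTIFICATE-----" ] || return 1
  ! grep -q '^[[:space:]]' "$1"
}
```

Run it on the captured file right after extraction, then let openssl x509 -in <file> -noout confirm the file actually parses.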

L-P6-4 -- admin-kubeconfig / secret transfer (phase-07)

Risk: staging the cluster-admin kubeconfig (or any secret) in /tmp, or letting a PTY mangle it in transit.
Fix: pipe base64 straight into a root-written file with umask 077, then chown to the service user and chmod 0600 -- never touch /tmp. (Pattern in phase-07 7.2.)
Hardening (Roosevelt): replace the cluster-admin kubeconfig with a scoped ServiceAccount kubeconfig carrying only the RBAC the driver needs.
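The write side of the L-P6-4 pattern can be sketched as a small function. This is an illustration under assumptions, not the phase-07 7.2 block itself: write_secret is a hypothetical name, the payload is assumed to arrive base64-encoded on stdin, and the chown to the service user (which needs root) is left as a comment.

```shell
# write_secret DEST
# Decodes base64 from stdin straight into DEST. The file is created under
# umask 077, so it is never group- or world-readable, and nothing is staged
# in an intermediate location.
write_secret() {
  ( umask 077 && base64 -d > "$1" ) && chmod 0600 "$1"
  # As root on the target, follow with a chown to the service user, e.g.
  #   chown magnum:magnum "$1"     (user/group name is an assumption)
}
```

Base64 on the wire is what keeps a PTY from mangling the payload; the umask in the subshell means the permissions are right from the moment the file exists, rather than fixed up afterwards.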

================================================================================

k8s-snap bootstrap (mgmt cluster)

================================================================================

DOCFIX-024 -- bootstrap config missing the cluster-config block (phase-06)

Symptom: k8s bootstrap "succeeds" but the node never reaches Ready; network and DNS are silently disabled; CoreDNS/Cilium absent.
Cause: a bootstrap --file whose top level lacks a cluster-config: block leaves ALL features (network, dns, ...) at disabled defaults. Setting only pod-cidr / service-cidr / extra-sans does NOT enable them.
Fix: include an explicit block:
```
cluster-config:
  network: { enabled: true }
  dns:     { enabled: true }
```
(See phase-06 6.4 for the full config.) Retry: snap remove k8s --purge then re-bootstrap.
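A pre-flight check before bootstrapping can catch the missing block. The sketch below is deliberately crude and rests on assumptions: has_cluster_config is a hypothetical helper, and it only looks for a top-level cluster-config: key by text match. It does not parse YAML or confirm that network and dns are actually enabled.

```shell
# has_cluster_config FILE
# Succeeds if FILE has an unindented "cluster-config:" key on its own line.
has_cluster_config() {
  grep -Eq '^cluster-config:[[:space:]]*$' "$1"
}
```

Treat a failure as a hard stop before k8s bootstrap; a pass still needs the eyeball check that network and dns are set to enabled.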

================================================================================

CAPI provider install (mgmt cluster)

================================================================================

DOCFIX-025a -- cert-manager Helm flag (phase-06)

Symptom: cert-manager install fails / CRDs absent when using --set installCRDs=true.
Cause: installCRDs was removed from the cert-manager chart (~v1.18). The current flag is crds.enabled=true.
Fix: helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true.

D-034 -- CAPI install ordering (ORC before clusterctl init) (phase-06)

Symptom: after clusterctl init, capo-controller-manager CrashLoopBackOff (observed ~6 restarts / ~15 min) before self-healing.
Cause: CAPO v0.14.4's openstackserver controller hard-depends on ORC's Image.openstack.k-orc.cloud CRD at startup. clusterctl init installs CAPO; if ORC is not yet present, CAPO crash-loops until it appears.
Fix: install ORC (its manifest provides the Image CRD) BEFORE clusterctl init. Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor.
Related rule: source every provider version from the chosen capi-helm-charts tag's dependencies.json (read live with jq); do not hardcode semver. (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.)

================================================================================

Networking / pod egress

================================================================================

D-035 -- dual-homed mgmt node pod-egress reverse-path failure (phase-06)

Symptom (the prior D-033 architecture): a pod's egress TCP connect to an external VIP hangs; the agnhost probe never reaches Completed. SYN leaves the correct NIC and the SYN-ACK arrives, but the reply is emitted back out the NIC instead of being redirected into the pod via cilium_host -- silent, asymmetric breakage. (The "do-07 pattern.")
Cause: Cilium reverse-path handling on a node with multiple NICs.
Fix (chosen): D-035 single-homed in-cloud tenant VM avoids it entirely; phase-06 GATE 2 (agnhost pod -> Keystone VIP, must Complete) is the explicit proof. (The transferable alternative -- Cilium device pinning -- is a Roosevelt note, not v1.)

================================================================================

Magnum conductor

================================================================================

D-037 -- conductor config-dir injection (NOT a systemd ExecStart drop-in) (phase-07)

Symptom: the [capi_helm] conf.d drop-in is ignored; the conductor behaves as if it was never written, even though a systemd drop-in "looks" applied.
Cause: these OpenStack debs (openstack-pkg-tools) run the daemon through an LSB init script that the systemd unit invokes via its systemd-start action; ExecStart= does NOT launch the daemon directly. A systemd drop-in appending --config-dir therefore passes it as a positional arg to the init script, which ignores it -- the flag never reaches the daemon. The args are assembled inside the init script from DAEMON_ARGS (base --config-file first) and are extensible only via /etc/default/<service>.
Fix: create /etc/default/magnum-conductor (0644; the charm does not manage it):
```
DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"
```
Verify with the init script's own show-args (dry-run) AND ps -ww -C magnum-conductor -o args on the live process -- behavioral, not string-presence.
Residual: if a future charm hook ever writes /etc/default/magnum-conductor, the append is lost and [capi_helm] silently stops being read. Re-check via show-args/ps.

L-P6-1 / L-P6-2 -- verify the launched cmdline, not the unit text (phase-07)

Rule: never assume the systemd ExecStart shape for OpenStack debs, and never treat "string present in the unit file" as "the daemon received the flag." Gate on the assembled/launched cmdline (show-args, then ps on the live process).
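The behavioral gate can be expressed as a check on the launched argument string rather than on any file. A minimal sketch, assuming the argument string is captured from the live process (for example with ps -ww -C magnum-conductor -o args=); cmdline_has_config_dir is a hypothetical helper name.

```shell
# cmdline_has_config_dir "ARGS"
# ARGS is the full argument string of the running daemon.
# Succeeds only if the conf.d directory was actually passed to the process.
cmdline_has_config_dir() {
  case " $1 " in
    *" --config-dir /etc/magnum/magnum.conf.d "*) return 0 ;;
    *" --config-dir=/etc/magnum/magnum.conf.d "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Padding the string with spaces on both sides keeps the match anchored to whole arguments, so a longer path that merely starts with the same prefix does not pass.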

DOCFIX-035 -- helm not on the conductor's PATH (phase-07)

Symptom: the magnum-capi-helm driver fails shelling out to helm (cluster create errors on a helm invocation), yet command -v helm in an interactive juju ssh magnum/0 shell finds it.
Cause: the conductor runs via an LSB init script (invoked by systemd as systemd-start) with the restricted init PATH (e.g. /usr/sbin:/usr/bin:/sbin:/bin), which EXCLUDES /usr/local/bin -- where a get.helm.sh tarball install lands. An interactive login shell has /usr/local/bin on PATH, so it masks the problem (the classic green-in-the-shell, broken-in-the-daemon trap).
Fix: install the binary to /usr/local/bin/helm AND symlink /usr/bin/helm -> it (/usr/bin IS on the restricted PATH). Checksum-verify the tarball (sha256 vs get.helm.sh .sha256sum) before install. VERIFY against the restricted PATH, not a login shell: env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm && helm version --short' must print /usr/bin/helm (phase-07 7.4).
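The restricted-PATH verification generalizes to any binary the daemon shells out to. A small sketch, assuming the restricted PATH value quoted above; on_restricted_path is a hypothetical helper name.

```shell
RESTRICTED_PATH=/usr/sbin:/usr/bin:/sbin:/bin

# on_restricted_path CMD
# Resolves CMD in an emptied environment that carries only the restricted
# PATH, and prints where it was found. Fails if the daemon could not see it.
on_restricted_path() {
  env -i PATH="$RESTRICTED_PATH" sh -c 'command -v "$1"' sh "$1"
}
```

Because env -i discards the caller's environment, a binary that is only reachable through the login shell's PATH fails here even though command -v finds it interactively.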

L-P6-3 -- k8s version comes from the IMAGE, not a template label (phase-08)

Symptom: cluster create fails in the driver before provisioning.
Cause: the magnum-capi-helm driver reads kube_version from the Glance image properties and routes on os_distro; it does NOT take k8s version from a template label.
Fix: the workload image (e.g. ubuntu-jammy-kube-v1.34.8) MUST carry kube_version matching the Kubernetes version baked into it (e.g. v1.34.8) and os_distro=ubuntu. Verify before create (phase-08 8.0).

================================================================================

Driver / cluster health

================================================================================

D-042 -- driver contract-coherence; health "infrastructure: not found" (phase-07, phase-08, appendix-B)

Symptom: coe cluster show reports health_status = UNHEALTHY deterministically (survives a conductor restart); only the infrastructure sub-check fails ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready.
Cause: driver 1.3.0 reads apiVersion off spec.infrastructureRef to build its health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the resource versions); only the driver's direct health query breaks.
Fix: upgrade to the RELEASED magnum-capi-helm==1.4.0 (the "generalize-api-resources" feature). 1.4.0 builds each health GET from an explicit api_version via its [capi_helm] api_resources option, which DEFAULTS to v1beta1 for every CAPI kind -- and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only. Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the Layer-A CAPI core.
Operational caveat while unfixed: do NOT wire magnum auto-healing to health_status (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently.

================================================================================

Cluster lifecycle / Octavia

================================================================================

D-039 -- app-cred roles (load-balancer_member) / Octavia 403 (phase-08)

Symptom: cluster create or delete wedges; CAPO gets 403 querying the Octavia LB.
Cause: the Magnum-minted application credential lacks load-balancer_member (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state).
Fix: ensure the service path mints app-creds carrying load-balancer_member (+ member, reader). Verify before acceptance (phase-08 prereqs).

stuck-delete -- wedged CAPI cluster delete (phase-08)

Symptom: cluster stuck DELETE_IN_PROGRESS; helm release already gone; Cluster and OpenStackCluster CRs stuck Deleting (often on an Octavia 403, see D-039).
Recovery: clear the OpenStackCluster finalizer on the mgmt cluster -- kubectl -n <magnum-ns> patch openstackcluster <cluster>-<suffix> --type=merge -p '{"metadata":{"finalizers":[]}}'. The Cluster finalizer was only waiting on it, so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron resources in dependency order: router remove subnet -> router unset external-gateway -> router delete -> subnet delete -> network delete -> security group delete.
Name-guard (FINDING-4): NEVER patch/delete a CR by an inferred name. The OpenStackCluster is named <cluster>-<CAPI-suffix> where the suffix is random per create (NOT the Magnum cluster name). LIST first -- kubectl -n <magnum-ns> get openstackcluster -- and operate on the EXACT name returned. The magnum-ns is magnum-<project-id> (resolve the project id; never hardcode). A wrong-name patch silently no-ops and the delete stays wedged.
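The list-first discipline can be enforced mechanically. The sketch below is an assumption-laden illustration: pick_exact_name is a hypothetical helper that reads candidate names on stdin (for example from kubectl -n <magnum-ns> get openstackcluster -o custom-columns=NAME:.metadata.name --no-headers) and refuses to proceed unless exactly one name starts with <cluster>-.

```shell
# pick_exact_name CLUSTER
# stdin: resource names, one per line.
# Prints the single name beginning with "CLUSTER-"; fails on zero or many,
# so a patch is never issued against a guessed or ambiguous name.
pick_exact_name() {
  matches=$(awk -v p="$1-" 'index($0, p) == 1')
  count=$(printf '%s\n' "$matches" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly 1 name starting with $1-, found $count" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}
```

An ambiguous result (two clusters sharing a prefix) fails on purpose: that is the case where an operator should read the list and choose by hand.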

LB-failover -- LB stuck provisioning_status=ERROR after a host event (phase-08)

Symptom: the kube-api Octavia LB shows operating_status ONLINE but provisioning_status ERROR after a host outage/OOM.
Cause: a control-plane op on the amphora failed during the outage.
Fix: openstack loadbalancer failover <lb-id> in ADMIN-project scope (amphora / failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE.

uninitialized-taint -- workload addons Pending (phase-08)

Symptom: new workload nodes are kubelet-Ready but addon pods (metrics-server, node-feature-discovery, etc.) stay Pending; nodes carry node.cluster.x-k8s.io/uninitialized.
Cause: that taint is removed by the CAPI machine controller on the MANAGEMENT cluster. If the mgmt cluster is down (see D-041), the taint persists.
Fix: restore the mgmt cluster API; CAPI then removes the taint and addons schedule.

CNI-label / DOCFIX-032 -- network_driver under driver 1.4.0; pin calico explicitly (phase-08)

Note: under driver 1.3.0 (the version FIRST built) the legacy Magnum network_driver label was IGNORED and the capi-helm openstack-cluster chart's default CNI (Calico) always ran. Under the RELEASED 1.4.0 driver the network_driver template option IS honored (it maps through to the chart network_driver).
DOCFIX-032: pin --network-driver calico EXPLICITLY on the capi-k8s-v1-34 template (phase-08) rather than relying on the default staying Calico. Chart 0.25.1 ships ONLY Calico (flannel is not packaged), so flannel there would fail to converge -- do not set it. (Mgmt cluster CNI is separately Cilium, via k8s-snap.)

================================================================================

Hyperconverged host / mgmt-VM resilience

================================================================================

D-040 -- host OOM from low reserved-host-memory (phase-08)

Symptom: guests OOM-killed; a compute host may even present in Juju as State=down (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent).
Cause: reserved-host-memory default 512 MB does not cover the co-located LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM.
Fix: reserved-host-memory = 8192 on all compute units (baked into the hardened bundle). Diagnose a suspected OOM-vs-reboot with who -b / uptime (no recent boot) and journalctl -k | grep -i oom; the ovsdb "no response to inactivity probe ... disconnecting" storm is the swap-thrash signature.

D-041 -- single-node mgmt cluster does not self-heal (phase-08)

Symptom: after a host event the mgmt VM (capi-mgmt-v2) is SHUTOFF; FIP unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see uninitialized-taint).
Cause: the D-035 single-node mgmt cluster is a SPOF with no MachineHealthCheck (unlike the workload cluster).
Fix: openstack server start capi-mgmt-v2 (API serves ~40s later; a brief TLS handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for Roosevelt.

juju-macaroon -- "cannot get discharge ... EOF" (phase-07, phase-08)

Symptom: juju ssh (or other juju calls) fail mid-session with a discharge/EOF error.
Cause: the juju macaroon expired during a long session.
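Fix: re-run juju login, then retry. If the failing call is a juju ssh with </dev/null, see the 2026-06-10 addendum entry on the DOCFIX-021 interaction.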

================================================================================

Teardown / MAAS reset (phase-00)

================================================================================

DOCFIX-016 -- never `maas list` (API-key leak) (phase-00, phase-01, phase-04)

Risk: maas list prints the stored API key to stdout (and into any transcript/log).
Fix: the profile name is known (admin); call maas admin ... directly. Never run maas list in a runbook or paste block.

DOCFIX-017 -- no `maas whoami`; hardcode the eyeballed system_ids (phase-00)

Risk: scripting machine selection via maas <profile> whoami + owner filters is fragile and, in this lab, unnecessary.
Fix: the four host system_ids are fixed and eyeball-verified (openstack0=4na83t, openstack1=qdbqd6, openstack2=h8frng, openstack3=tmsafc) -- iterate those literals. (The older 01-destroy-model.md used maas list/whoami and released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.)

R7 -- sudo for libvirt / qemu-img (phase-00, phase-01)

The OSD qcow2 files (/var/lib/libvirt/images/<host>-1.qcow2) are root:root / 600; qemu-img info|create, virsh domstate, stat, rm against them all need sudo.

KI-P3-001 -- VIP / primary collision (phase-00, phase-04)

Symptom: a charm vip: address equals a MAAS-auto-assigned machine/container primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary).
Cause: MAAS auto-static allocation was not excluded over the VIP block (provider had NO VIP reservation), so MAAS handed primaries .225/.226/.227 onto the .224-.236 VIPs.
Fix (durable): on EVERY space carrying VIPs (provider AND metal) reserve the front-loaded VIP /26 in MAAS, distinct from the primary range and any neutron allocation_pool (phase-00 Phase 4). A reserved range stops future auto-assign onto a configured VIP. Negative test post-deploy: no service vip == any unit primary.
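The post-deploy negative test reduces to a set intersection. A minimal sketch, assuming the VIPs and the unit primaries have already been collected into two files with one address per line (how they are collected is out of scope here); vip_collisions is a hypothetical helper name.

```shell
# vip_collisions VIP_FILE PRIMARY_FILE
# Each file holds one IP per line. Prints every address present in both and
# returns non-zero if there is at least one collision.
vip_collisions() {
  hits=$(grep -Fx -f "$1" "$2" || true)
  [ -z "$hits" ] && return 0
  printf '%s\n' "$hits"
  return 1
}
```

The -F and -x flags matter: they force whole-line, literal matching, so .22 cannot match .226 and the dots are not treated as regex wildcards.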

DEVIATION-2 -- raise a KVM host's RAM, then MAAS-recommission to Ready (phase-00)

Context (2026-06-11): the openstack0-3 KVM guests were bumped 16384 -> 32768 MiB on the 196 GB hypervisor to relieve memory pressure. Pattern: with the guest SHUT OFF (and after the OSD wipe), virsh setmaxmem <dom> 32G --config then virsh setmem <dom> 32G --config; boot; then MAAS RECOMMISSION the node so MAAS re-reads hardware and lands it back at Ready at the new size (4x Ready at 32768 in ~3 min). Do the maxmem change while shut off -- a live setmaxmem is rejected.
D-040 reserved-host-memory 8192 is RETAINED (correctness floor, independent of host size). Re-measure the per-host container/service footprint against the 32 GiB envelope before the Roosevelt node-role split -- 16 GiB-era pressure numbers do not map 1:1.

================================================================================

Deploy-time (phase-01)

================================================================================

R14 -- VIP relocation .224-.236 -> .50-.60 (phase-01)

The public + internal API VIPs were front-loaded out of the old high-end .224-.236 block into .50-.60 (inside the reserved .2-.63 /26). Every bundle vip: is a dual provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed.

R15 -- the .10 phantom resolver (phase-01)

Symptom: an unreachable region resolver 10.12.8.10 appears in a node's resolver list (sometimes as Current DNS Server) despite the subnet dns_servers override.
Cause: MAAS advertises its region/rack controller as a DNS server on the MAAS-managed metal VLAN, independent of the subnet field; the override does not purge it.
Impact: NON-BLOCKING -- systemd-resolved deprioritizes .10 and falls through to .1. Latent fragility if .1 ever drops. Understand/eliminate for Roosevelt (no libvirt split there).

L1 -- no `set -e` on count-gate blocks; guard greps `|| true` (phase-01)

A guarded grep -c returning 0 is a VALID answer, not a failure. Under set -e a zero-count grep aborts the block. Pre-deploy verify blocks run WITHOUT set -e, and every count grep ends || true. (bash -n would not catch this -- it is behavior.)
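The behavior is easy to see side by side. In this sketch each probe runs in its own bash so that set -e applies cleanly; the input lines and the pattern zzz are arbitrary demo values.

```shell
# The input has no line matching "zzz": grep -c prints 0 but exits 1.
# Without the guard, set -e aborts before the count is ever reported.
unguarded=$(bash -c 'set -e; n=$(printf "a\nb\n" | grep -c zzz); echo "count=$n"' \
            || echo "aborted with exit $?")

# With "|| true" the zero count survives as a valid answer.
guarded=$(bash -c 'set -e; n=$(printf "a\nb\n" | grep -c zzz || true); echo "count=$n"' \
          || echo "aborted with exit $?")

echo "unguarded: $unguarded"
echo "guarded:   $guarded"
```

The guard is still worth keeping in blocks that run without set -e: it keeps the block correct if someone later pastes it into a stricter script.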

L3 -- metal-side dual-VIP eyeball check (phase-01)

The provider-side VIP guard greps only the first token of each dual vip:. The metal side (second token, 10.12.8.5x) must be eyeballed to confirm all 11 sit in .8.50-.60, clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju).

================================================================================

Vault / secrets (phase-02)

================================================================================

DOCFIX-006 -- vault init is one-shot; stdout-only redirect loses the keys (phase-02)

Symptom: vault operator init ... > file captures stdout only; if the key block went to stderr (or the run is interrupted) you are left with an unusable/empty file and the 5 shares + root token are GONE -- init runs exactly once and cannot be replayed.
Fix: vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt VERBATIM; gate on grep -c '^Unseal Key' == 5 and Initial Root Token present; then save the file OFF-HOST before anything else. Never improvise this command.
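The two-part gate on the captured file can be written as one function. A sketch only: init_capture_ok is a hypothetical helper name, and it checks the shape of the capture (five key lines plus a root-token line), not the validity of the values.

```shell
# init_capture_ok FILE
# Succeeds only if FILE holds exactly 5 "Unseal Key" lines and an
# "Initial Root Token" line. A zero count is a valid grep answer, hence "|| true".
init_capture_ok() {
  keys=$(grep -c '^Unseal Key' "$1" || true)
  [ "$keys" -eq 5 ] && grep -q '^Initial Root Token' "$1"
}
```

Run it against the tee'd file before doing anything else; only after it passes should the file be copied off-host.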

DOCFIX-011 -- authorize-charm parameter is `token` (phase-02)

The vault authorize-charm action takes token (a direct token string); there is no token-secret-id variant in this charm rev. Confirm via juju actions vault --schema. Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log).

DOCFIX-014 -- generate-root-ca is required (phase-02)

Symptom: after authorize-charm, vault stays BLOCKED "Missing CA cert".
Fix: run juju run vault/leader generate-root-ca -- it mints the charm-pki-local root and clears the block straight to active. (Omitting it leaves vault hung.)

L4 -- vault unseal via hidden prompt, not key-on-argv (phase-02)

Use Vault's own vault operator unseal (no argument) so it prompts hidden; the key is never on the command line / in a var / in ps / in scrollback. Do NOT use vault operator unseal $KEY (visible in ps on the unit). Unseal is re-runnable, so the verbatim-reference rule is looser here, but the security gain is real.

R3 -- "HA Enabled false" is correct for vault-on-mysql (phase-02)

Expected post-unseal: Initialized true / Sealed false / Storage Type mysql / HA Enabled false. Single-unit vault on the mysql backend is non-HA by design; any reference to "HA Enabled true (etcd backend)" is STALE (etcd was dropped).

================================================================================

Identity / openrc (phase-03)

================================================================================

DOCFIX-018 -- IP-only OS_AUTH_URL (phase-03)

This cloud is IP-only (no FQDN, no cloud DNS). The admin openrc must point at the keystone PUBLIC endpoint by IP: OS_AUTH_URL=https://10.12.4.50:5000/v3, with the vault root CA in OS_CACERT (B5 IP-SAN certs validate). No /etc/hosts, no FQDN.

DOCFIX-022 -- discover the admin project; do not hardcode it (phase-03)

Symptom: with TLS working, keystone returns HTTP 401.
Cause: wrong project scope. The scoping project name varies by charm rev (here it is admin, living in domain admin_domain; an older doc's OS_PROJECT_NAME=admin_domain 401s). Credential good, scope wrong.
Fix: a candidate loop -- try each of "admin admin_domain"; the first that issues a SCOPED token wins (phase-03 3.2). Costs 2 extra token requests; self-corrects across revs instead of re-introducing the 401-by-hardcode.
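The candidate loop has a simple shape once the token probe is factored out. This sketch is not the phase-03 3.2 block: first_working_project is a hypothetical helper, and the probe is passed in as a command name so the loop itself stays independent of the openstack CLI.

```shell
# first_working_project PROBE CANDIDATE...
# Calls PROBE with each candidate project name in order and prints the first
# one for which PROBE succeeds. Fails if none does.
first_working_project() {
  probe=$1
  shift
  for project in "$@"; do
    if "$probe" "$project" >/dev/null 2>&1; then
      printf '%s\n' "$project"
      return 0
    fi
  done
  return 1
}

# A real probe might look like this (an assumption, not taken from the runbook):
#   probe_scope() { OS_PROJECT_NAME="$1" openstack token issue -f value -c project_id; }
#   OS_PROJECT_NAME=$(first_working_project probe_scope admin admin_domain)
```

The probe must succeed only on a SCOPED token, which is why the commented example asks for project_id rather than just a token id.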

================================================================================

Core services: HAProxy + reverse-proxy (phase-03)

================================================================================

D-045 / DOCFIX-031 -- juju "active/idle" but an haproxy backend is DOWN (phase-03)

Symptom: juju status is all active/idle, yet a service VIP intermittently 503s or a unit's API is unreachable. juju health is BLIND to per-backend haproxy state.
Cause: a charm-rendered haproxy backend can be silently DOWN without the charm going non-idle -- e.g. (D-045) haproxy was NOT reloaded after the TLS/cert cascade, so its health checks ran plaintext against an SSL backend and marked it DOWN. juju-green is necessary, not sufficient.
Fix: sweep haproxy's OWN verdict on every unit via its admin socket, then remediate+reload. Per unit, read /var/run/haproxy/admin.sock (show stat) and grep ',DOWN,' (excluding the FRONTEND/BACKEND summary rows). For any flagged unit: sudo haproxy -c -f /etc/haproxy/haproxy.cfg (must say valid) then sudo systemctl reload haproxy (graceful master-worker; reload, not restart). Phase-03 3.1 gates on a zero-DOWN sweep cloud-wide -- it closes the juju-green-but-backend-DOWN hole.
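The per-unit filter can be sketched as a function over the show stat CSV. Assumptions are labeled: down_backends is a hypothetical helper, it relies on the first two CSV columns being pxname and svname, and the fetch shown in the comment assumes socat is installed on the unit.

```shell
# down_backends
# stdin: haproxy "show stat" CSV. Prints pxname/svname for every server row
# marked DOWN, skipping the FRONTEND/BACKEND summary rows.
down_backends() {
  awk -F, '$2 != "FRONTEND" && $2 != "BACKEND" && /,DOWN,/ { print $1 "/" $2 }'
}

# Fetching the CSV on a unit (assumes socat is present):
#   echo "show stat" | sudo socat stdio /var/run/haproxy/admin.sock | down_backends
```

Empty output is the pass condition for the zero-DOWN sweep; any printed line names the backend server to remediate before the reload.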

nginx-reverse-proxy -- jumphost -> internal-VIP proxy gotchas (phase-03)

Context: the jumphost reaches internal-only dashboards/APIs via an nginx reverse proxy (phase-03 3.3). Four traps, each with the as-built fix:
reload race: a systemctl reload nginx right after editing the vhost can be served by a still-draining old worker (a curl ~2s later hits stale behavior; the co-hosted MAAS proxy blips too). nginx -t FIRST; prefer restart for a definitive cutover when the listen/upstream set changed, reload only for content-equivalent edits.
proxy_ssl_name / SNI: the upstream presents a DNS-SAN cert (a juju-internal name, e.g. juju-ffe3b8-2-lxd-2); set proxy_ssl_name to that SAN, proxy_ssl_verify on, and the vault CA in proxy_ssl_trusted_certificate, or verification fails on the IP-only connect.
sed no-op: a sed -i that does not match silently changes nothing and the proxy keeps the old behavior -- assert the post-edit content, do not trust sed's exit code.
scheme-mismatch redirect loop: the backend issues https:// Location headers while the proxy listens http; without proxy_redirect https:// http:// (or a matching listen scheme) the browser loops. Match the scheme end-to-end or rewrite the redirect.
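For the sed no-op trap, the edit and the assertion can be bundled so one cannot be run without the other. A sketch assuming GNU sed (for -i without a suffix argument); edit_and_assert is a hypothetical helper name.

```shell
# edit_and_assert FILE SED_EXPR EXPECTED
# Applies SED_EXPR to FILE in place, then succeeds only if the literal string
# EXPECTED is present afterwards. sed exits 0 even when nothing matched, so
# the grep is the real gate.
edit_and_assert() {
  sed -i "$2" "$1" || return 1
  grep -qF -- "$3" "$1"
}
```

Only after this passes does it make sense to run nginx -t and then reload or restart as described above.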

================================================================================

Octavia enablement (phase-05)

================================================================================

L7 -- the openstack snap cannot read /tmp (phase-05, also phase-01 PKI sanity)

Symptom: openstack image create --file /tmp/... -> "[Errno 2] No such file or directory" even though sha256sum just read the same path.
Cause: the openstack CLI snap is confined and cannot read /tmp; it CAN read $HOME (home interface).
Fix: stage any file the snap must read under $HOME (e.g. $HOME/amphora-base/...), never /tmp.

octavia-configure-resources -- long-running action; o-hm0 transient is normal (phase-05)

configure-resources is long-running: juju's default action wait may time out ("timed out waiting for results") while the hook KEEPS RUNNING -- do NOT treat the wait-timeout as failure or re-fire blindly. Use a bounded --wait (an explicit timeout) and confirm completion via juju show-operation <N> (authoritative), not the streamed log.
NORMAL (not faults) during/after: lb-mgmt-net is IPv6-ULA (fc00::/..) by design; a "Virtual network for access to Amphorae is down" transient self-heals as o-hm0 comes up; the lb-mgmt network:distributed port shows DOWN (logical OVN port, never chassis-bound).

amp-image-tag-mismatch -- LP#1937003 (phase-05)

Octavia looks up the amphora image by octavia amp-image-tag; it MUST equal the tag the retrofit stamps (octavia-diskimage-retrofit amp-image-tag), both octavia-amphora. A mismatch means octavia cannot find the image even though it is built and ACTIVE. The amphora pipeline gate asserts the two are equal before building (phase-05 5.2).

================================================================================

Image seeding (phase-05/06/08)

================================================================================

FINDING-3 -- azimuth CDN 403s glance web-download; stage-and-verify is canonical (phase-06, phase-08)

Symptom: a glance web-download import (--import-method web-download) 202-accepts, then the image hangs in queued forever and never reaches active.
Cause: glance's web-download plugin fetches with urllib (User-Agent Python-urllib/3.x); the azimuth-images CDN (azimuth-images.stackhpc.cloud) returns HTTP 403 to that UA. A curl/HEAD probe with a different UA passes -- which is why an earlier probe false-passed while the real import failed.
Fix (canonical): STAGE-AND-VERIFY. curl the qcow2 to $HOME (snap-readable, NOT /tmp -- L7; curl's UA is not blocked), verify the checksum against the published manifest (azimuth-images manifest.json -- sha512 for kube images; the ubuntu cloud-images SHA256SUMS for noble), then openstack image create --file --import (the openstack snap's --import == glance-direct; image-conversion lands it raw). CORRECTION-1: a plain --file PUT (no --import) stores qcow2 -- fine for boot, but --import gives the raw Ceph fast-clone alignment.
Clear a stuck record before retry: gated openstack image delete <id> on the queued remnant (verify the EXACT id first -- FINDING-4 name-guard discipline).
Roosevelt: unify ALL image seeding (amphora base, noble mgmt, kube) on stage-and-verify for one provenance-verified path cloud-wide.
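The verify step of stage-and-verify comes down to comparing a computed digest with the published one. A minimal sketch, assuming the expected sha512 has already been copied out of the published manifest by hand or by a separate step; verify_sha512 is a hypothetical helper name.

```shell
# verify_sha512 FILE EXPECTED_HEX
# Succeeds only if the sha512 of FILE equals EXPECTED_HEX (lowercase hex).
verify_sha512() {
  actual=$(sha512sum "$1" | awk '{print $1}')
  [ "$actual" = "$2" ]
}

# Gate the upload on it (file name and variable are illustrative):
#   verify_sha512 "$HOME/images/kube.qcow2" "$EXPECTED" || { echo "checksum mismatch" >&2; exit 1; }
```

For images published with a SHA256SUMS file instead, the same shape applies with sha256sum in place of sha512sum.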

web-download -- tested ALTERNATIVE to stage-and-verify (phase-05/06/08)

Web-download (openstack image create --import --import-method web-download --uri <url>) is retained as a tested ALTERNATIVE, not the canonical path (superseded 2026-06-17; see design-decisions). Caveats: (1) it cannot checksum-verify the fetched file against a published digest (the CDN redirect strips it) -- weaker provenance; (2) it 403s on the azimuth CDN (FINDING-3), so it is unusable for kube images; (3) for ubuntu cloud-images it works on the hardened bundle (the 2026-06-08 403 was transient/pre-hardening). Use only as an expedient.

================================================================================

Notes

================================================================================

This index covers phases 00-08. It grows the same way for any future phase: keyed by D-NNN / DOCFIX-NNN / L-N / R-N / named-symptom, each entry symptom -> cause -> fix with a "phase NN" back-reference, and decision rationale left to design-decisions.md.
memcached track drift is recorded in appendix-B (B.1), not here (it is a version-lock note, not a troubleshooting entry).

Addendum 2026-06-10 -- CAPI/Magnum operations findings

Five entries from the 2026-06-10 recovery session. Full procedures with verified blocks: runbooks/ops-capi-recovery.md.

Parked-state signatures (mgmt VM deliberately stopped)

While capi-mgmt-v2 is stopped: Magnum reports UNHEALTHY with an EMPTY health_status_reason (distinct from the D-042 cosmetic, which has a populated reason); the Horizon Container Infra panel may 504 through the jumphost nginx proxy and coe CLI calls may stall; the workload cluster keeps serving (no runtime dependency on the mgmt cluster). If jumphost secrets were filed during parking, the convention is ~/sweep-YYYYMMDD/secrets/. See ops-capi-recovery Section 0 (expectations table) and Section 1 (parking block).

Amphora orphan/zombie sweep after host-pressure events

Causal chain (traced live 2026-06-10): host CPU/memory pressure -> amphora heartbeats go stale -> Octavia health-manager marks amphorae ERROR and launches auto-failovers -> failovers fail NoValidHost (no placement headroom) -> amphora servers accumulate with NO Octavia DB row. Two variants: an ERROR server (failed spawn) and an ACTIVE heartbeating zombie (health-manager logs "missing from the DB ... An operator must manually delete it" every 10 s). Remedy: verify-then-delete by SERVER UUID under admin scope -- the loadbalancer amphora list output is the DB truth; Nova name lookup is project-scoped (amphorae live in the Octavia services project). Procedure: ops-capi-recovery 5a. Do NOT retry failover against the same blocker; each attempt mints another zombie.

Octavia failover requires +1 amphora placement headroom

STANDALONE failover builds the replacement amphora BEFORE reaping the old one, so it transiently needs one extra amphora slot (charm-octavia: 1024 MB / 1 vCPU / 8 GB). Scheduler ceiling per host = physical_MB * ram_allocation_ratio (1.5) - reserved_host_memory (8192 per D-040). A cloud allocated to that ceiling cannot heal its own load balancers: the failover fast-fails to ERROR in ~15 seconds on NoValidHost. Verified to the megabyte 2026-06-10. Roosevelt sizing requirement: reserve at least one amphora slot per concurrent failover on top of workload allocation (feeds the node-role/rebalancing recommendation).

juju ssh `</dev/null` vs an expired macaroon (DOCFIX-021 interaction)

DOCFIX-021's </dev/null on juju ssh assumes valid macaroon auth. When the jumphost macaroon goes stale, juju falls back to an interactive password prompt; </dev/null feeds that prompt EOF and the symptom is the misleading "cannot get discharge from https://:17070/auth: EOF". Triage: run juju status interactively -- if it succeeds after a password prompt, the controller is healthy and only the credential cache is stale. Workaround for the session: stdin from </dev/tty. Fix at a calm moment: juju logout then juju login.

Horizon visibility of CAPI instances, LBs, and amphorae

CAPI/Magnum VMs are owned by the capi-mgmt project; an empty Project -> Compute -> Instances page under admin scope is correct, not a defect. Map: tenant VMs -> Instances in the OWNING project's scope (use the header project switcher; admin holds member on capi-mgmt per phase-06 6.0-BOOT); LB objects -> Project -> Network -> Load Balancers in the owning project's scope; amphora VMs -> Admin -> Compute -> Instances ONLY (they belong to the Octavia services project); everything at once -> CLI openstack server list --all-projects. Warning about the asymmetry: the Container Infra panel lists clusters cross-project under admin policy, which makes the strictly-scoped Nova panel look broken when it is not.
