
Appendix A -- Troubleshooting / Known-Issues Index

Keyed by the same D-NNN / DOCFIX-NNN / L-P6-N identifiers used inline in the phase runbooks. This is an OPERATIONAL index (symptom -> cause -> fix), NOT the decision log: full rationale lives in design-decisions.md and the per-decision files (D-0NN-*.md); the driver fix has its own magnum-capi-helm-driver-fix-runbook. Each entry notes the phase(s) that reference it. ASCII-only.

================================================================================

Remote execution / scripting

================================================================================

DOCFIX-021 -- heredoc / stdin consumption (phase-06, phase-07)

  • Symptom: a multi-line juju ssh/ssh ... bash -s or remote sudo block dies early or behaves as if truncated; later commands in the heredoc never run.
  • Cause: an inner ssh/sudo/juju ssh (or any stdin reader) consumes the rest of the heredoc/pipe that was feeding the outer command.
  • Fix: append </dev/null to every inner ssh/sudo/juju ssh invocation (use </dev/tty instead only when the call genuinely needs an interactive prompt).
  • Also: wrap multi-statement pasteable jumphost blocks in ( { ...; } ) so a stray exit cannot kill the interactive shell.
  • SECOND MANIFESTATION (phase-03): a charm ACTION's human output silently corrupts a captured artifact. juju run vault/leader get-root-ca wraps the PEM in an INDENTED YAML output: |- block; sed-by-marker preserves the indent and an indented -----BEGIN CERTIFICATE----- is not valid PEM -> openssl "Unable to load certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND. Fix: pull from the action JSON (real newlines, no indent): juju run vault/leader get-root-ca -m openstack --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'. (Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)
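The stdin-consumption failure described in the first bullets can be reproduced locally, with no remote host involved. The following is a minimal sketch, assuming `cat` as a stand-in for the inner ssh/sudo/juju ssh (any stdin reader behaves the same way); the variable names are illustrative only.

```shell
# 'cat' stands in for an inner ssh/sudo/juju ssh: any command that reads stdin.
# Unguarded: the inner reader swallows the rest of the script fed to bash -s.
unguarded=$(bash -s <<'EOF'
echo first
cat >/dev/null
echo second
EOF
)

# Guarded: </dev/null gives the inner reader its own (empty) stdin,
# so the remaining heredoc lines are still there for bash to execute.
guarded=$(bash -s <<'EOF'
echo first
cat </dev/null >/dev/null
echo second
EOF
)

printf 'unguarded ran: %s\n' "$(printf '%s' "$unguarded" | tr '\n' ' ')"
printf 'guarded ran:   %s\n' "$(printf '%s' "$guarded" | tr '\n' ' ')"
```

In the unguarded run only the first statement executes; the guarded run executes both.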

L-P6-4 -- admin-kubeconfig / secret transfer (phase-07)

  • Risk: staging the cluster-admin kubeconfig (or any secret) in /tmp, or letting a PTY mangle it in transit.
  • Fix: pipe base64 straight into a root-written file with umask 077, then chown to the service user and chmod 0600 -- never touch /tmp. (Pattern in phase-07 7.2.)
  • Hardening (Roosevelt): replace the cluster-admin kubeconfig with a scoped ServiceAccount kubeconfig carrying only the RBAC the driver needs.
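The write side of that transfer can be sketched as below. This is an illustration under assumptions, not the phase-07 7.2 block itself: `write_secret` is a hypothetical helper name, a local file stands in for the kubeconfig, and the chown to the service user is omitted because it needs root.

```shell
# write_secret DEST: decode base64 from stdin into DEST, created 0600.
# Hypothetical helper; in the runbook this body would run as root on the unit.
write_secret() {
  (
    umask 077            # new files are created rw------- (0600)
    base64 -d > "$1"     # decoded straight into the destination, no staging copy
    chmod 0600 "$1"      # enforce the mode even if DEST already existed
  )
}

# Local demonstration in a scratch directory under the current directory (not /tmp).
workdir=$(mktemp -d ./secret-demo.XXXXXX)
printf 'apiVersion: v1\nkind: Config\n' > "$workdir/src"
base64 < "$workdir/src" | write_secret "$workdir/dest"

perms=$(ls -l "$workdir/dest" | cut -c1-10)
if cmp -s "$workdir/src" "$workdir/dest"; then same=yes; else same=no; fi
echo "mode=$perms identical=$same"
rm -rf "$workdir"
```

Because the payload travels on stdin, this is one of the calls that must NOT get the DOCFIX-021 </dev/null guard.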

================================================================================

k8s-snap bootstrap (mgmt cluster)

================================================================================

DOCFIX-024 -- bootstrap config missing the cluster-config block (phase-06)

  • Symptom: k8s bootstrap "succeeds" but the node never reaches Ready; network and DNS are silently disabled; CoreDNS/Cilium absent.
  • Cause: a bootstrap --file whose top level lacks a cluster-config: block leaves ALL features (network, dns, ...) at disabled defaults. Setting only pod-cidr / service-cidr / extra-sans does NOT enable them.
  • Fix: include an explicit block:
    cluster-config:
      network: { enabled: true }
      dns:     { enabled: true }
    (See phase-06 6.4 for the full config.) Retry: snap remove k8s --purge then re-bootstrap.
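A cheap pre-bootstrap gate can catch the missing block before the snap is ever installed. The sketch below assumes a plain text check is acceptable; `has_cluster_config` is a hypothetical helper, the CIDR values are illustrative, and the check does not parse YAML or prove that the features are enabled.

```shell
# has_cluster_config: succeed only if stdin has a TOP-LEVEL cluster-config: key
# (starting in column 0). A text check only, not a YAML parse.
has_cluster_config() { grep -q '^cluster-config:'; }

# Illustrative configs: the first sets only CIDRs, the second adds the block.
bad='pod-cidr: 10.1.0.0/16
service-cidr: 10.152.183.0/24'

good='cluster-config:
  network: { enabled: true }
  dns:     { enabled: true }
pod-cidr: 10.1.0.0/16'

if printf '%s\n' "$bad"  | has_cluster_config; then bad_result=pass;  else bad_result=fail;  fi
if printf '%s\n' "$good" | has_cluster_config; then good_result=pass; else good_result=fail; fi
echo "cidr-only config: $bad_result; with cluster-config: $good_result"
```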

================================================================================

CAPI provider install (mgmt cluster)

================================================================================

DOCFIX-025a -- cert-manager Helm flag (phase-06)

  • Symptom: cert-manager install fails / CRDs absent when using --set installCRDs=true.
  • Cause: installCRDs was removed from the cert-manager chart (~v1.18). The current flag is crds.enabled=true.
  • Fix: helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true.

D-034 -- CAPI install ordering (ORC before clusterctl init) (phase-06)

  • Symptom: after clusterctl init, capo-controller-manager CrashLoopBackOff (observed ~6 restarts / ~15 min) before self-healing.
  • Cause: CAPO v0.14.4's openstackserver controller hard-depends on ORC's Image.openstack.k-orc.cloud CRD at startup. clusterctl init installs CAPO; if ORC is not yet present, CAPO crash-loops until it appears.
  • Fix: install ORC (its manifest provides the Image CRD) BEFORE clusterctl init. Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor.
  • Related rule: source every provider version from the chosen capi-helm-charts tag's dependencies.json (read live with jq); do not hardcode semver. (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.)

================================================================================

Networking / pod egress

================================================================================

D-035 -- dual-homed mgmt node pod-egress reverse-path failure (phase-06)

  • Symptom (the prior D-033 architecture): a pod's egress TCP connect to an external VIP hangs; the agnhost probe never reaches Completed. SYN leaves the correct NIC and the SYN-ACK arrives, but the reply is emitted back out the NIC instead of being redirected into the pod via cilium_host -- silent, asymmetric breakage. (The "do-07 pattern.")
  • Cause: Cilium reverse-path handling on a node with multiple NICs.
  • Fix (chosen): D-035 single-homed in-cloud tenant VM avoids it entirely; phase-06 GATE 2 (agnhost pod -> Keystone VIP, must Complete) is the explicit proof. (The transferable alternative -- Cilium device pinning -- is a Roosevelt note, not v1.)

================================================================================

Magnum conductor

================================================================================

D-037 -- conductor config-dir injection (NOT a systemd ExecStart drop-in) (phase-07)

  • Symptom: the [capi_helm] conf.d drop-in is ignored; the conductor behaves as if it was never written, even though a systemd drop-in "looks" applied.
  • Cause: these OpenStack debs (openstack-pkg-tools) do not exec the daemon directly from ExecStart=; systemd runs an LSB init script with the systemd-start action, and that script launches the daemon. A systemd drop-in appending --config-dir therefore passes it as a positional arg to the init script, which ignores it -- the flag never reaches the daemon. The args are assembled inside the init script from DAEMON_ARGS (base --config-file first), extensible only via /etc/default/<service>.
  • Fix: create /etc/default/magnum-conductor (0644; the charm does not manage it):
    DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"
    Verify with the init script's own show-args (dry-run) AND ps -ww -C magnum-conductor -o args on the live process -- behavioral, not string-presence.
  • Residual: if a future charm hook ever writes /etc/default/magnum-conductor, the append is lost and [capi_helm] silently stops being read. Re-check via show-args/ps.

L-P6-1 / L-P6-2 -- verify the launched cmdline, not the unit text (phase-07)

  • Rule: never assume the systemd ExecStart shape for OpenStack debs, and never treat "string present in the unit file" as "the daemon received the flag." Gate on the assembled/launched cmdline (show-args, then ps on the live process).

DOCFIX-035 -- helm not on the conductor's PATH (phase-07)

  • Symptom: the magnum-capi-helm driver fails shelling out to helm (cluster create errors on a helm invocation), yet command -v helm in an interactive juju ssh magnum/0 shell finds it.
  • Cause: the conductor runs via an LSB init script (invoked by systemd with the systemd-start action, see D-037) with the restricted init PATH (e.g. /usr/sbin:/usr/bin:/sbin:/bin), which EXCLUDES /usr/local/bin -- where a get.helm.sh tarball install lands. An interactive login shell has /usr/local/bin on PATH, so it masks the problem (the classic green-in-the-shell, broken-in-the-daemon trap).
  • Fix: install the binary to /usr/local/bin/helm AND symlink /usr/bin/helm -> it (/usr/bin IS on the restricted PATH). Checksum-verify the tarball (sha256 vs get.helm.sh .sha256sum) before install. VERIFY against the restricted PATH, not a login shell: env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm && helm version --short' must print /usr/bin/helm (phase-07 7.4).
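The restricted-PATH check generalizes to any binary the daemon shells out to. The following is a small sketch: `on_init_path` is a hypothetical helper name, and the PATH value is the example one from this entry, not one read from the init script.

```shell
# on_init_path CMD: resolve CMD the way the daemon would see it, i.e. with an
# empty environment and only the restricted PATH quoted in the entry above.
RESTRICTED_PATH=/usr/sbin:/usr/bin:/sbin:/bin
on_init_path() {
  env -i PATH="$RESTRICTED_PATH" sh -c 'command -v "$1"' sh "$1"
}

# 'sh' is probed only because it exists on any host; substitute helm on the unit.
if on_init_path sh >/dev/null; then
  echo "sh: visible on the restricted PATH"
fi
if ! on_init_path no-such-binary-xyz >/dev/null 2>&1; then
  echo "no-such-binary-xyz: NOT visible on the restricted PATH"
fi
```

On the magnum unit the equivalent call would be on_init_path helm, which should print /usr/bin/helm once the symlink is in place.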

L-P6-3 -- k8s version comes from the IMAGE, not a template label (phase-08)

  • Symptom: cluster create fails in the driver before provisioning.
  • Cause: the magnum-capi-helm driver reads kube_version from the Glance image properties and routes on os_distro; it does NOT take k8s version from a template label.
  • Fix: the workload image (e.g. ubuntu-jammy-kube-v1.34.8) MUST carry a kube_version property matching the image's Kubernetes version (e.g. v1.34.8) and os_distro=ubuntu. Verify before create (phase-08 8.0).

================================================================================

Driver / cluster health

================================================================================

D-042 -- driver contract-coherence; health "infrastructure: not found" (phase-07, phase-08, appendix-B)

  • Symptom: coe cluster show reports health_status = UNHEALTHY deterministically (survives a conductor restart); only the infrastructure sub-check fails ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready.
  • Cause: driver 1.3.0 reads apiVersion off spec.infrastructureRef to build its health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the resource versions); only the driver's direct health query breaks.
  • Fix: upgrade to the RELEASED magnum-capi-helm==1.4.0 (the "generalize-api-resources" feature). 1.4.0 builds each health GET from an explicit api_version via its [capi_helm] api_resources option, which DEFAULTS to v1beta1 for every CAPI kind -- and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only. Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the Layer-A CAPI core.
  • Operational caveat while unfixed: do NOT wire magnum auto-healing to health_status (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently.

================================================================================

Cluster lifecycle / Octavia

================================================================================

D-039 -- app-cred roles (load-balancer_member) / Octavia 403 (phase-08)

  • Symptom: cluster create or delete wedges; CAPO gets 403 querying the Octavia LB.
  • Cause: the Magnum-minted application credential lacks load-balancer_member (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state).
  • Fix: ensure the service path mints app-creds carrying load-balancer_member (+ member, reader). Verify before acceptance (phase-08 prereqs).

stuck-delete -- wedged CAPI cluster delete (phase-08)

  • Symptom: cluster stuck DELETE_IN_PROGRESS; helm release already gone; Cluster and OpenStackCluster CRs stuck Deleting (often on an Octavia 403, see D-039).
  • Recovery: clear the OpenStackCluster finalizer on the mgmt cluster -- kubectl -n <magnum-ns> patch openstackcluster <cluster>-<suffix> --type=merge -p '{"metadata":{"finalizers":[]}}'. The Cluster finalizer was only waiting on it, so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron resources in dependency order: router remove subnet -> router unset external-gateway -> router delete -> subnet delete -> network delete -> security group delete.
  • Name-guard (FINDING-4): NEVER patch/delete a CR by an inferred name. The OpenStackCluster is named <cluster>-<CAPI-suffix> where the suffix is random per create (NOT the Magnum cluster name). LIST first -- kubectl -n <magnum-ns> get openstackcluster -- and operate on the EXACT name returned. The magnum-ns is magnum-<project-id> (resolve the project id; never hardcode). A wrong-name patch silently no-ops and the delete stays wedged.
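The list-first rule can be enforced mechanically. The sketch below is an assumption-laden illustration: `resolve_exact_name` is a hypothetical helper, and the names in the demonstration are invented. It only narrows the candidates; the operator still confirms the printed name before patching.

```shell
# resolve_exact_name PREFIX: read CR names (one per line) on stdin and print the
# single name that starts with "PREFIX-". Fails on zero or several matches, so a
# guessed or ambiguous name never reaches kubectl patch/delete.
resolve_exact_name() {
  local prefix=$1 matches count
  matches=$(grep -E "^${prefix}-[^[:space:]]+$" || true)
  count=$(printf '%s' "$matches" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "refusing: $count match(es) for '${prefix}-*'" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}

# Invented listing, standing in for the real kubectl output.
names='demo-x7k2q
other-9fj3a'

name=$(printf '%s\n' "$names" | resolve_exact_name demo)
echo "would operate on: $name"
```

On the mgmt cluster the listing would plausibly come from kubectl -n <magnum-ns> get openstackcluster -o custom-columns=NAME:.metadata.name --no-headers, piped into the helper.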

LB-failover -- LB stuck provisioning_status=ERROR after a host event (phase-08)

  • Symptom: the kube-api Octavia LB shows operating_status ONLINE but provisioning_status ERROR after a host outage/OOM.
  • Cause: a control-plane op on the amphora failed during the outage.
  • Fix: openstack loadbalancer failover <lb-id> in ADMIN-project scope (amphora / failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE.

uninitialized-taint -- workload addons Pending (phase-08)

  • Symptom: new workload nodes are kubelet-Ready but addon pods (metrics-server, node-feature-discovery, etc.) stay Pending; nodes carry node.cluster.x-k8s.io/uninitialized.
  • Cause: that taint is removed by the CAPI machine controller on the MANAGEMENT cluster. If the mgmt cluster is down (see D-041), the taint persists.
  • Fix: restore the mgmt cluster API; CAPI then removes the taint and addons schedule.

CNI-label / DOCFIX-032 -- network_driver under driver 1.4.0; pin calico explicitly (phase-08)

  • Note: under the as-FIRST-built driver 1.3.0 the legacy Magnum network_driver label was IGNORED and the capi-helm openstack-cluster chart's default CNI (Calico) always ran. Under the RELEASED 1.4.0 driver the network_driver template option IS honored (it maps through to the chart network_driver).
  • DOCFIX-032: pin --network-driver calico EXPLICITLY on the capi-k8s-v1-34 template (phase-08) rather than relying on the default staying Calico. Chart 0.25.1 ships ONLY Calico (flannel is not packaged), so flannel there would fail to converge -- do not set it. (Mgmt cluster CNI is separately Cilium, via k8s-snap.)

================================================================================

Hyperconverged host / mgmt-VM resilience

================================================================================

D-040 -- host OOM from low reserved-host-memory (phase-08)

  • Symptom: guests OOM-killed; a compute host may even present in Juju as State=down (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent).
  • Cause: reserved-host-memory default 512 MB does not cover the co-located LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM.
  • Fix: reserved-host-memory = 8192 on all compute units (baked into the hardened bundle). Diagnose a suspected OOM-vs-reboot with who -b / uptime (no recent boot) and journalctl -k | grep -i oom; the ovsdb "no response to inactivity probe ... disconnecting" storm is the swap-thrash signature.

D-041 -- single-node mgmt cluster does not self-heal (phase-08)

  • Symptom: after a host event the mgmt VM (capi-mgmt-v2) is SHUTOFF; FIP unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see uninitialized-taint).
  • Cause: the D-035 single-node mgmt cluster is a SPOF with no MachineHealthCheck (unlike the workload cluster).
  • Fix: openstack server start capi-mgmt-v2 (API serves ~40s later; a brief TLS handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for Roosevelt.

juju-macaroon -- "cannot get discharge ... EOF" (phase-07, phase-08)

  • Symptom: juju ssh (or other juju calls) fail mid-session with a discharge/EOF error.
  • Cause: the juju macaroon expired during a long session.
  • Fix: re-run juju login, then retry.

================================================================================

Teardown / MAAS reset (phase-00)

================================================================================

DOCFIX-016 -- never maas list (API-key leak) (phase-00, phase-01, phase-04)

  • Risk: maas list prints the stored API key to stdout (and into any transcript/log).
  • Fix: the profile name is known (admin); call maas admin ... directly. Never run maas list in a runbook or paste block.

DOCFIX-017 -- no maas whoami; hardcode the eyeballed system_ids (phase-00)

  • Risk: scripting machine selection via maas <profile> whoami + owner filters is fragile and, in this lab, unnecessary.
  • Fix: the four host system_ids are fixed and eyeball-verified (openstack0=4na83t, openstack1=qdbqd6, openstack2=h8frng, openstack3=tmsafc) -- iterate those literals. (The older 01-destroy-model.md used maas list/whoami and released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.)

R7 -- sudo for libvirt / qemu-img (phase-00, phase-01)

  • The OSD qcow2 files (/var/lib/libvirt/images/<host>-1.qcow2) are root:root / 600; qemu-img info|create, virsh domstate, stat, rm against them all need sudo.

KI-P3-001 -- VIP / primary collision (phase-00, phase-04)

  • Symptom: a charm vip: address equals a MAAS-auto-assigned machine/container primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary).
  • Cause: MAAS auto-static allocation was not excluded over the VIP block (provider had NO VIP reservation), so MAAS handed primaries .225/.226/.227 onto the .224-.236 VIPs.
  • Fix (durable): on EVERY space carrying VIPs (provider AND metal) reserve the front-loaded VIP /26 in MAAS, distinct from the primary range and any neutron allocation_pool (phase-00 Phase 4). A reserved range stops future auto-assign onto a configured VIP. Negative test post-deploy: no service vip == any unit primary.

DEVIATION-2 -- raise a KVM host's RAM, then MAAS-recommission to Ready (phase-00)

  • Context (2026-06-11): the openstack0-3 KVM guests were bumped 16384 -> 32768 MiB on the 196 GB hypervisor to relieve memory pressure. Pattern: with the guest SHUT OFF (and after the OSD wipe), virsh setmaxmem <dom> 32G --config then virsh setmem <dom> 32G --config; boot; then MAAS RECOMMISSION the node so MAAS re-reads hardware and lands it back at Ready at the new size (4x Ready at 32768 in ~3 min). Do the maxmem change while shut off -- a live setmaxmem is rejected.
  • D-040 reserved-host-memory 8192 is RETAINED (correctness floor, independent of host size). Re-measure the per-host container/service footprint against the 32 GiB envelope before the Roosevelt node-role split -- 16 GiB-era pressure numbers do not map 1:1.

================================================================================

Deploy-time (phase-01)

================================================================================

R14 -- VIP relocation .224-.236 -> .50-.60 (phase-01)

  • The public + internal API VIPs were front-loaded out of the old high-end .224-.236 block into .50-.60 (inside the reserved .2-.63 /26). Every bundle vip: is a dual provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed.

R15 -- the .10 phantom resolver (phase-01)

  • Symptom: an unreachable region resolver 10.12.8.10 appears in a node's resolver list (sometimes as Current DNS Server) despite the subnet dns_servers override.
  • Cause: MAAS advertises its region/rack controller as a DNS server on the MAAS-managed metal VLAN, independent of the subnet field; the override does not purge it.
  • Impact: NON-BLOCKING -- systemd-resolved deprioritizes .10 and falls through to .1. Latent fragility if .1 ever drops. Understand/eliminate for Roosevelt (no libvirt split there).

L1 -- no set -e on count-gate blocks; guard greps || true (phase-01)

  • A guarded grep -c returning 0 is a VALID answer, not a failure. Under set -e a zero-count grep aborts the block. Pre-deploy verify blocks run WITHOUT set -e, and every count grep ends || true. (bash -n would not catch this -- it is behavior.)
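The behavior is easy to demonstrate in isolation. In this sketch each case runs in its own bash process so that set -e cannot leak into the calling shell; the pattern `vip` and the input are arbitrary.

```shell
# Unguarded: grep -c prints 0 but exits 1, the assignment inherits that status,
# and set -e aborts the block before the echo.
unguarded=$(bash -c 'set -e; n=$(printf "x\n" | grep -c vip); echo "reached n=$n"' || true)

# Guarded: || true keeps the zero count and lets the block continue.
guarded=$(bash -c 'set -e; n=$(printf "x\n" | grep -c vip || true); echo "reached n=$n"' || true)

echo "unguarded output: '${unguarded}'"
echo "guarded output:   '${guarded}'"
```

The unguarded case prints nothing at all, which is exactly why a zero count looks like a crashed gate rather than a valid answer.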

L3 -- metal-side dual-VIP eyeball check (phase-01)

  • The provider-side VIP guard greps only the first token of each dual vip:. The metal side (second token, 10.12.8.5x) must be eyeballed to confirm all 11 sit in .8.50-.60, clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju).

================================================================================

Vault / secrets (phase-02)

================================================================================

DOCFIX-006 -- vault init is one-shot; stdout-only redirect loses the keys (phase-02)

  • Symptom: vault operator init ... > file captures stdout only; if the key block went to stderr (or the run is interrupted) you are left with an unusable/empty file and the 5 shares + root token are GONE -- init runs exactly once and cannot be replayed.
  • Fix: vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt VERBATIM; gate on grep -c '^Unseal Key' == 5 and Initial Root Token present; then save the file OFF-HOST before anything else. Never improvise this command.
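The gate on the captured file can be written as one function. This is a sketch only: `vault_init_ok` is a hypothetical name, and the sample content below is fabricated placeholder text, not real key material.

```shell
# vault_init_ok: read the captured init output on stdin; pass only if it holds
# exactly 5 unseal keys AND an initial root token line.
vault_init_ok() {
  local content keys
  content=$(cat)
  keys=$(printf '%s\n' "$content" | grep -c '^Unseal Key' || true)
  [ "$keys" -eq 5 ] && printf '%s\n' "$content" | grep -q '^Initial Root Token'
}

# Placeholder captures for the demonstration (no real secrets).
complete=$(for i in 1 2 3 4 5; do echo "Unseal Key $i: FAKE-SHARE-$i"; done
           echo "Initial Root Token: FAKE-TOKEN")
truncated="Unseal Key 1: FAKE-SHARE-1"

if printf '%s\n' "$complete"  | vault_init_ok; then complete_gate=pass;  else complete_gate=fail;  fi
if printf '%s\n' "$truncated" | vault_init_ok; then truncated_gate=pass; else truncated_gate=fail; fi
echo "complete capture: $complete_gate; truncated capture: $truncated_gate"
```

Against the real artifact the call would be vault_init_ok < ~/vault-init/init.txt, run before the file is copied off-host.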

DOCFIX-011 -- authorize-charm parameter is token (phase-02)

  • The vault authorize-charm action takes token (a direct token string); there is no token-secret-id variant in this charm rev. Confirm via juju actions vault --schema. Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log).

DOCFIX-014 -- generate-root-ca is required (phase-02)

  • Symptom: after authorize-charm, vault stays BLOCKED "Missing CA cert".
  • Fix: run juju run vault/leader generate-root-ca -- it mints the charm-pki-local root and clears the block straight to active. (Omitting it leaves vault hung.)

L4 -- vault unseal via hidden prompt, not key-on-argv (phase-02)

  • Use Vault's own vault operator unseal (no argument) so it prompts hidden; the key is never on the command line / in a var / in ps / in scrollback. Do NOT use vault operator unseal $KEY (visible in ps on the unit). Unseal is re-runnable, so the verbatim-reference rule is looser here, but the security gain is real.

R3 -- "HA Enabled false" is correct for vault-on-mysql (phase-02)

  • Expected post-unseal: Initialized true / Sealed false / Storage Type mysql / HA Enabled false. Single-unit vault on the mysql backend is non-HA by design; any reference to "HA Enabled true (etcd backend)" is STALE (etcd was dropped).

================================================================================

Identity / openrc (phase-03)

================================================================================

DOCFIX-018 -- IP-only OS_AUTH_URL (phase-03)

  • This cloud is IP-only (no FQDN, no cloud DNS). The admin openrc must point at the keystone PUBLIC endpoint by IP: OS_AUTH_URL=https://10.12.4.50:5000/v3, with the vault root CA in OS_CACERT (B5 IP-SAN certs validate). No /etc/hosts, no FQDN.

DOCFIX-022 -- discover the admin project; do not hardcode it (phase-03)

  • Symptom: with TLS working, keystone returns HTTP 401.
  • Cause: wrong project scope. The scoping project name varies by charm rev (here it is admin, living in domain admin_domain; an older doc's OS_PROJECT_NAME=admin_domain 401s). Credential good, scope wrong.
  • Fix: a candidate loop -- try each of "admin admin_domain"; the first that issues a SCOPED token wins (phase-03 3.2). Costs 2 extra token requests; self-corrects across revs instead of re-introducing the 401-by-hardcode.
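The shape of that candidate loop is sketched below with the token request abstracted behind a probe command, so the logic can be exercised without a cloud. `first_scoping_project` and `fake_probe` are hypothetical names; the real probe lives in phase-03 3.2 and is not reproduced here.

```shell
# first_scoping_project PROBE CANDIDATE...: print the first candidate for which
# "PROBE candidate" succeeds. PROBE is any command that tries to obtain a
# project-scoped token for the given project name.
first_scoping_project() {
  local probe=$1 candidate
  shift
  for candidate in "$@"; do
    if "$probe" "$candidate" </dev/null >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Stand-in probe for the demonstration: pretends only "admin" yields a scoped token.
fake_probe() { [ "$1" = "admin" ]; }

winner=$(first_scoping_project fake_probe admin_domain admin)
echo "scoping project: $winner"
```

If no candidate succeeds the function returns non-zero, which is the signal to stop rather than export a guessed OS_PROJECT_NAME.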

================================================================================

Core services: HAProxy + reverse-proxy (phase-03)

================================================================================

D-045 / DOCFIX-031 -- juju "active/idle" but an haproxy backend is DOWN (phase-03)

  • Symptom: juju status is all active/idle, yet a service VIP intermittently 503s or a unit's API is unreachable. juju health is BLIND to per-backend haproxy state.
  • Cause: a charm-rendered haproxy backend can be silently DOWN without the charm going non-idle -- e.g. (D-045) haproxy was NOT reloaded after the TLS/cert cascade, so its health checks ran plaintext against an SSL backend and marked it DOWN. juju-green is necessary, not sufficient.
  • Fix: sweep haproxy's OWN verdict on every unit via its admin socket, then remediate+reload. Per unit, read /var/run/haproxy/admin.sock (show stat) and grep ',DOWN,' (excluding the FRONTEND/BACKEND summary rows). For any flagged unit: sudo haproxy -c -f /etc/haproxy/haproxy.cfg (must say valid) then sudo systemctl reload haproxy (graceful master-worker; reload, not restart). Phase-03 3.1 gates on a zero-DOWN sweep cloud-wide -- it closes the juju-green-but-backend-DOWN hole.
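A field-based variant of the grep sweep is sketched below. It is an alternative to the ',DOWN,' grep, not the phase-03 block: it relies on status being the 18th column of haproxy's "show stat" CSV, `down_servers` and `row` are hypothetical helpers, and the rows are synthetic.

```shell
# down_servers: read "show stat" CSV on stdin and print proxy/server for every
# per-server row whose status (field 18) starts with DOWN. The "# pxname,..."
# header and the FRONTEND/BACKEND summary rows are skipped.
down_servers() {
  awk -F, '
    /^#/ { next }
    $2 == "FRONTEND" || $2 == "BACKEND" { next }
    $18 ~ /^DOWN/ { print $1 "/" $2 }
  '
}

# Synthetic rows: fields 3-17 are zero-filled, field 18 carries the status.
row() { printf '%s,%s,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,%s,1\n' "$1" "$2" "$3"; }
sample=$(
  row tcp-in_api  FRONTEND OPEN
  row api_backend unit-0   UP
  row api_backend unit-1   DOWN
  row api_backend BACKEND  DOWN
)

flagged=$(printf '%s\n' "$sample" | down_servers)
echo "DOWN servers: ${flagged:-none}"
```

On a unit the input would come from the admin socket, for example via socat if it is installed there (an assumption); an empty result is the zero-DOWN condition the gate wants.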

nginx-reverse-proxy -- jumphost -> internal-VIP proxy gotchas (phase-03)

  • Context: the jumphost reaches internal-only dashboards/APIs via an nginx reverse proxy (phase-03 3.3). Four traps, each with the as-built fix:
  • reload race: a systemctl reload nginx right after editing the vhost can be served by a still-draining old worker (a curl ~2s later hits stale behavior; the co-hosted MAAS proxy blips too). nginx -t FIRST; prefer restart for a definitive cutover when the listen/upstream set changed, reload only for content-equivalent edits.
  • proxy_ssl_name / SNI: the upstream presents a DNS-SAN cert (a juju-internal name, e.g. juju-ffe3b8-2-lxd-2); set proxy_ssl_name to that SAN, proxy_ssl_verify on, and the vault CA in proxy_ssl_trusted_certificate, or verification fails on the IP-only connect.
  • sed no-op: a sed -i that does not match silently changes nothing and the proxy keeps the old behavior -- assert the post-edit content, do not trust sed's exit code.
  • scheme-mismatch redirect loop: the backend issues https:// Location headers while the proxy listens http; without proxy_redirect https:// http:// (or a matching listen scheme) the browser loops. Match the scheme end-to-end or rewrite the redirect.

================================================================================

Octavia enablement (phase-05)

================================================================================

L7 -- the openstack snap cannot read /tmp (phase-05, also phase-01 PKI sanity)

  • Symptom: openstack image create --file /tmp/... -> "[Errno 2] No such file or directory" even though sha256sum just read the same path.
  • Cause: the openstack CLI snap is confined and cannot read /tmp; it CAN read $HOME (home interface).
  • Fix: stage any file the snap must read under $HOME (e.g. $HOME/amphora-base/...), never /tmp.

octavia-configure-resources -- long-running action; o-hm0 transient is normal (phase-05)

  • configure-resources is long-running: juju's default action wait may time out ("timed out waiting for results") while the hook KEEPS RUNNING -- do NOT treat the wait-timeout as failure or re-fire blindly. Use a bounded --wait and confirm completion via juju show-operation <N> (authoritative), not the streamed log.
  • NORMAL (not faults) during/after: lb-mgmt-net is IPv6-ULA (fc00::/..) by design; a "Virtual network for access to Amphorae is down" transient self-heals as o-hm0 comes up; the lb-mgmt network:distributed port shows DOWN (logical OVN port, never chassis-bound).

amp-image-tag-mismatch -- LP#1937003 (phase-05)

  • Octavia looks up the amphora image by octavia amp-image-tag; it MUST equal the tag the retrofit stamps (octavia-diskimage-retrofit amp-image-tag), both octavia-amphora. A mismatch means octavia cannot find the image even though it is built and ACTIVE. The amphora pipeline gate asserts the two are equal before building (phase-05 5.2).

================================================================================

Image seeding (phase-05/06/08)

================================================================================

FINDING-3 -- azimuth CDN 403s glance web-download; stage-and-verify is canonical (phase-06, phase-08)

  • Symptom: a glance web-download import (--import-method web-download) 202-accepts, then the image hangs in queued forever and never reaches active.
  • Cause: glance's web-download plugin fetches with urllib (User-Agent Python-urllib/3.x); the azimuth-images CDN (azimuth-images.stackhpc.cloud) returns HTTP 403 to that UA. A curl/HEAD probe with a different UA passes -- which is why an earlier probe false-passed while the real import failed.
  • Fix (canonical): STAGE-AND-VERIFY. curl the qcow2 to $HOME (snap-readable, NOT /tmp -- L7; curl's UA is not blocked), verify the checksum against the published manifest (azimuth-images manifest.json -- sha512 for kube images; the ubuntu cloud-images SHA256SUMS for noble), then openstack image create --file --import (the openstack snap's --import == glance-direct; image-conversion lands it raw). CORRECTION-1: a plain --file PUT (no --import) stores qcow2 -- fine for boot, but --import gives the raw Ceph fast-clone alignment.
  • Clear a stuck record before retry: gated openstack image delete <id> on the queued remnant (verify the EXACT id first -- FINDING-4 name-guard discipline).
  • Roosevelt: unify ALL image seeding (amphora base, noble mgmt, kube) on stage-and-verify for one provenance-verified path cloud-wide.
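The verify step of stage-and-verify reduces to a digest comparison before anything is uploaded. The sketch below assumes sha512sum is available on the jumphost; `verify_sha512` is a hypothetical helper, and the file and digest are generated locally for the demonstration rather than taken from any manifest.

```shell
# verify_sha512 FILE EXPECTED_HEX: succeed only if FILE's sha512 equals EXPECTED_HEX.
verify_sha512() {
  local actual
  actual=$(sha512sum "$1" | awk '{print $1}')
  [ "$actual" = "$2" ]
}

# Demonstration in a scratch directory under the current directory (not /tmp).
workdir=$(mktemp -d ./image-demo.XXXXXX)
printf 'image-bytes\n' > "$workdir/image.qcow2"

# Stands in for the digest that would be read from the published manifest.
expected=$(printf 'image-bytes\n' | sha512sum | awk '{print $1}')

if verify_sha512 "$workdir/image.qcow2" "$expected"; then intact=ok; else intact=MISMATCH; fi

# A modified file must fail the same check.
printf 'extra\n' >> "$workdir/image.qcow2"
if verify_sha512 "$workdir/image.qcow2" "$expected"; then tampered=ok; else tampered=MISMATCH; fi

echo "intact: $intact; tampered: $tampered"
rm -rf "$workdir"
```

Only a file that passes this check should go on to openstack image create.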

web-download -- tested ALTERNATIVE to stage-and-verify (phase-05/06/08)

  • Web-download (openstack image create --import --import-method web-download --uri <url>) is retained as a tested ALTERNATIVE, not the canonical path (superseded 2026-06-17; see design-decisions). Caveats: (1) it cannot checksum-verify the fetched file against a published digest (the CDN redirect strips it) -- weaker provenance; (2) it 403s on the azimuth CDN (FINDING-3), so it is unusable for kube images; (3) for ubuntu cloud-images it works on the hardened bundle (the 2026-06-08 403 was transient/pre-hardening). Use only as an expedient.

================================================================================

Notes

================================================================================

  • This index covers phases 00-08. It grows the same way for any future phase: keyed by D-NNN / DOCFIX-NNN / L-N / R-N / named-symptom, each entry symptom -> cause -> fix with a "phase NN" back-reference, and decision rationale left to design-decisions.md.
  • memcached track drift is recorded in appendix-B (B.1), not here (it is a version-lock note, not a troubleshooting entry).

Addendum 2026-06-10 -- CAPI/Magnum operations findings

Five entries from the 2026-06-10 recovery session. Full procedures with verified blocks: runbooks/ops-capi-recovery.md.

Parked-state signatures (mgmt VM deliberately stopped)

While capi-mgmt-v2 is stopped: Magnum reports UNHEALTHY with an EMPTY health_status_reason (distinct from the D-042 cosmetic, which has a populated reason); the Horizon Container Infra panel may 504 through the jumphost nginx proxy and coe CLI calls may stall; the workload cluster keeps serving (no runtime dependency on the mgmt cluster). If jumphost secrets were filed during parking, the convention is ~/sweep-YYYYMMDD/secrets/. See ops-capi-recovery Section 0 (expectations table) and Section 1 (parking block).

Amphora orphan/zombie sweep after host-pressure events

Causal chain (traced live 2026-06-10): host CPU/memory pressure -> amphora heartbeats go stale -> Octavia health-manager marks amphorae ERROR and launches auto-failovers -> failovers fail NoValidHost (no placement headroom) -> amphora servers accumulate with NO Octavia DB row. Two variants: an ERROR server (failed spawn) and an ACTIVE heartbeating zombie (health-manager logs "missing from the DB ... An operator must manually delete it" every 10 s). Remedy: verify-then-delete by SERVER UUID under admin scope -- the loadbalancer amphora list output is the DB truth; Nova name lookup is project-scoped (amphorae live in the Octavia services project). Procedure: ops-capi-recovery 5a. Do NOT retry failover against the same blocker; each attempt mints another zombie.

Octavia failover requires +1 amphora placement headroom

STANDALONE failover builds the replacement amphora BEFORE reaping the old one, so it transiently needs one extra amphora slot (charm-octavia: 1024 MB / 1 vCPU / 8 GB). Scheduler ceiling per host = physical_MB * ram_allocation_ratio (1.5) - reserved_host_memory (8192 per D-040). A cloud allocated to that ceiling cannot heal its own load balancers: the failover fast-fails to ERROR in ~15 seconds on NoValidHost. Verified to the megabyte 2026-06-10. Roosevelt sizing requirement: reserve at least one amphora slot per concurrent failover on top of workload allocation (feeds the node-role/rebalancing recommendation).

juju ssh </dev/null vs an expired macaroon (DOCFIX-021 interaction)

DOCFIX-021's </dev/null on juju ssh assumes valid macaroon auth. When the jumphost macaroon goes stale, juju falls back to an interactive password prompt; </dev/null feeds that prompt EOF and the symptom is the misleading "cannot get discharge from https://:17070/auth: EOF". Triage: run juju status interactively -- if it succeeds after a password prompt, the controller is healthy and only the credential cache is stale. Workaround for the session: stdin from </dev/tty. Fix at a calm moment: juju logout then juju login.

Horizon visibility of CAPI instances, LBs, and amphorae

CAPI/Magnum VMs are owned by the capi-mgmt project; an empty Project -> Compute -> Instances page under admin scope is correct, not a defect. Map: tenant VMs -> Instances in the OWNING project's scope (use the header project switcher; admin holds member on capi-mgmt per phase-06 6.0-BOOT); LB objects -> Project -> Network -> Load Balancers in the owning project's scope; amphora VMs -> Admin -> Compute -> Instances ONLY (they belong to the Octavia services project); everything at once -> CLI openstack server list --all-projects. Warning about the asymmetry: the Container Infra panel lists clusters cross-project under admin policy, which makes the strictly-scoped Nova panel look broken when it is not.


SYMPTOM: link-subnet fails "IP address is already in use" but the IP is in no visible table (ipaddresses read empty, discovery cleared, not in DHCP dynamic range, interface on the correct VLAN).

CAUSE: A freshly re-enrolled host PXE-leases its own metal IP (10.12.8.4N) at commission; MAAS keeps it as a StaticIPAddress of alloc_type 6 (DISCOVERED), tied to the node. Distinct from the network-discovery table AND from user allocations -- neither discoveries clear-by-mac-and-ip nor a plain ipaddresses release clears it.

AUTHORITATIVE READ (use FIRST, before guessing): maas admin subnet ip-addresses -> lists every in-use IP with .alloc_type and .node_summary. alloc_type 6 = DISCOVERED. This is the definitive "who holds this IP and why".

FIX: maas admin ipaddresses release ip= force=true discovered=true (BOTH flags; force alone -> "does not exist"). Only release when the discovered record's node is the SAME host -- a different node means a real address conflict; stop and investigate.

NOW AUTOMATED: scripts/carve-host-interfaces.sh release_self_discovered() does this, gated to self-owned records only.