
Appendix A -- Troubleshooting / Known-Issues Index

Keyed by the same D-NNN / DOCFIX-NNN / L-P6-N identifiers used inline in the phase runbooks. This is an OPERATIONAL index (symptom -> cause -> fix), NOT the decision log: full rationale lives in design-decisions.md and the per-decision files (D-0NN-*.md); the driver fix has its own magnum-capi-helm-driver-fix-runbook. Each entry notes the phase(s) that reference it. ASCII-only.

================================================================================

Remote execution / scripting

================================================================================

DOCFIX-021 -- heredoc / stdin consumption (phase-06, phase-07)

  • Symptom: a multi-line juju ssh/ssh ... bash -s or remote sudo block dies early or behaves as if truncated; later commands in the heredoc never run.
  • Cause: an inner ssh/sudo/juju ssh (or any stdin reader) consumes the rest of the heredoc/pipe that was feeding the outer command.
  • Fix: append </dev/null to every inner ssh/sudo/juju ssh invocation (use </dev/tty instead only when the call genuinely needs an interactive prompt).
  • Also: wrap multi-statement pasteable jumphost blocks in ( { ...; } ) so a stray exit cannot kill the interactive shell.
  • SECOND MANIFESTATION (phase-03): a charm ACTION's human output silently corrupts a captured artifact. juju run vault/leader get-root-ca wraps the PEM in an INDENTED YAML output: |- block; sed-by-marker preserves the indent and an indented -----BEGIN CERTIFICATE----- is not valid PEM -> openssl "Unable to load certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND. Fix: pull from the action JSON (real newlines, no indent): juju run vault/leader get-root-ca -m openstack --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'. (Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)
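
The stdin-consumption failure can be reproduced locally without any remote host. In this minimal sketch, cat stands in for an inner ssh/sudo/juju ssh that reads stdin; the function names are illustrative and do not appear in the phase runbooks.

```shell
# Unfixed: the inner reader swallows the rest of the feed, so the loop
# body runs once and the remaining lines are never processed.
run_unfixed() {
  local n=0 line
  while read -r line; do
    n=$((n + 1))
    cat >/dev/null             # stand-in for an inner ssh reading stdin
  done
  echo "$n"
}

# Fixed: the inner reader gets /dev/null, so the outer feed survives.
run_fixed() {
  local n=0 line
  while read -r line; do
    n=$((n + 1))
    cat </dev/null >/dev/null  # same call with stdin detached
  done
  echo "$n"
}
```

Feeding both functions the same three lines (printf 'a\nb\nc\n' | run_unfixed) shows the difference: the unfixed loop reports 1 line processed, the fixed loop reports 3.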

L-P6-4 -- admin-kubeconfig / secret transfer (phase-07)

  • Risk: staging the cluster-admin kubeconfig (or any secret) in /tmp, or letting a PTY mangle it in transit.
  • Fix: pipe base64 straight into a root-written file with umask 077, then chown to the service user and chmod 0600 -- never touch /tmp. (Pattern in phase-07 7.2.)
  • Hardening (Roosevelt): replace the cluster-admin kubeconfig with a scoped ServiceAccount kubeconfig carrying only the RBAC the driver needs.
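
A minimal sketch of the receiving side of that transfer, assuming GNU coreutils base64 on the target. The function name and its argument are illustrative; the authoritative block is the one in phase-07 7.2.

```shell
# write_secret DEST: decode base64 from stdin into DEST.
# The file is created under umask 077, so it is never group- or
# world-readable, not even briefly. The trailing chmod also tightens a
# pre-existing DEST that had looser permissions.
write_secret() {
  local dest=$1
  (
    umask 077
    base64 -d >"$dest"
  ) && chmod 0600 "$dest"
}
```

Run it as root on the target with the base64 stream on stdin, then chown the file to the service user. Nothing is staged in /tmp at any point.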

================================================================================

k8s-snap bootstrap (mgmt cluster)

================================================================================

DOCFIX-024 -- bootstrap config missing the cluster-config block (phase-06)

  • Symptom: k8s bootstrap "succeeds" but the node never reaches Ready; network and DNS are silently disabled; CoreDNS/Cilium absent.
  • Cause: a bootstrap --file whose top level lacks a cluster-config: block leaves ALL features (network, dns, ...) at disabled defaults. Setting only pod-cidr / service-cidr / extra-sans does NOT enable them.
  • Fix: include an explicit block:
    cluster-config:
      network: { enabled: true }
      dns:     { enabled: true }
    (See phase-06 6.4 for the full config.) Retry: snap remove k8s --purge then re-bootstrap.

================================================================================

CAPI provider install (mgmt cluster)

================================================================================

DOCFIX-025a -- cert-manager Helm flag (phase-06)

  • Symptom: cert-manager install fails / CRDs absent when using --set installCRDs=true.
  • Cause: installCRDs was removed from the cert-manager chart (~v1.18). The current flag is crds.enabled=true.
  • Fix: helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true.

D-034 -- CAPI install ordering (ORC before clusterctl init) (phase-06)

  • Symptom: after clusterctl init, capo-controller-manager CrashLoopBackOff (observed ~6 restarts / ~15 min) before self-healing.
  • Cause: CAPO v0.14.4's openstackserver controller hard-depends on ORC's Image.openstack.k-orc.cloud CRD at startup. clusterctl init installs CAPO; if ORC is not yet present, CAPO crash-loops until it appears.
  • Fix: install ORC (its manifest provides the Image CRD) BEFORE clusterctl init. Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor.
  • Related rule: source every provider version from the chosen capi-helm-charts tag's dependencies.json (read live with jq); do not hardcode semver. (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.)
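
The ordering rule can be enforced with a small guard. This is a sketch, not the phase-06 block: the function names are illustrative, and the CRD resource name images.openstack.k-orc.cloud is assumed to be the plural form of the Image.openstack.k-orc.cloud kind named above.

```shell
# orc_image_crd_present: succeeds only if the ORC Image CRD is registered.
# The plural CRD name is an assumption derived from the kind and group.
orc_image_crd_present() {
  kubectl get crd images.openstack.k-orc.cloud >/dev/null 2>&1
}

# gated_clusterctl_init ARGS...: refuse to run clusterctl init without ORC.
gated_clusterctl_init() {
  if ! orc_image_crd_present; then
    echo "ORC Image CRD missing: install ORC before clusterctl init" >&2
    return 1
  fi
  clusterctl init "$@"
}
```

The guard only proves the CRD exists; it does not check that the ORC controller itself is healthy.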

================================================================================

Networking / pod egress

================================================================================

D-035 -- dual-homed mgmt node pod-egress reverse-path failure (phase-06)

  • Symptom (the prior D-033 architecture): a pod's egress TCP connect to an external VIP hangs; the agnhost probe never reaches Completed. SYN leaves the correct NIC and the SYN-ACK arrives, but the reply is emitted back out the NIC instead of being redirected into the pod via cilium_host -- silent, asymmetric breakage. (The "do-07 pattern.")
  • Cause: Cilium reverse-path handling on a node with multiple NICs.
  • Fix (chosen): D-035 single-homed in-cloud tenant VM avoids it entirely; phase-06 GATE 2 (agnhost pod -> Keystone VIP, must Complete) is the explicit proof. (The transferable alternative -- Cilium device pinning -- is a Roosevelt note, not v1.)

================================================================================

Magnum conductor

================================================================================

D-037 -- conductor config-dir injection (NOT a systemd ExecStart drop-in) (phase-07)

  • Symptom: the [capi_helm] conf.d drop-in is ignored; the conductor behaves as if it was never written, even though a systemd drop-in "looks" applied.
  • Cause: these OpenStack debs (openstack-pkg-tools) run the daemon through an LSB init script that systemd invokes with the script's systemd-start action, NOT through a direct ExecStart= of the daemon. A systemd drop-in that appends --config-dir therefore passes it as a positional arg to the init script, which ignores it, so the flag never reaches the daemon. The args are assembled inside the init script from DAEMON_ARGS (base --config-file first) and are extensible only via /etc/default/<service>.
  • Fix: create /etc/default/magnum-conductor (0644; the charm does not manage it):
    DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"
    Verify with the init script's own show-args (dry-run) AND ps -ww -C magnum-conductor -o args on the live process -- behavioral, not string-presence.
  • Residual: if a future charm hook ever writes /etc/default/magnum-conductor, the append is lost and [capi_helm] silently stops being read. Re-check via show-args/ps.
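
The behavioral check can be scripted as a filter over the launched command line. A minimal sketch; the function name is illustrative.

```shell
# has_config_dir: read one or more command lines on stdin and succeed only
# if some line passes --config-dir /etc/magnum/magnum.conf.d as a real
# argument (space- or =-separated).
has_config_dir() {
  grep -Eq -- '--config-dir[ =]/etc/magnum/magnum\.conf\.d( |$)'
}
```

Usage: ps -ww -C magnum-conductor -o args | has_config_dir. The same filter can be fed the init script's show-args output, so both gates use one definition of "the flag is present."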

L-P6-1 / L-P6-2 -- verify the launched cmdline, not the unit text (phase-07)

  • Rule: never assume the systemd ExecStart shape for OpenStack debs, and never treat "string present in the unit file" as "the daemon received the flag." Gate on the assembled/launched cmdline (show-args, then ps on the live process).

L-P6-3 -- k8s version comes from the IMAGE, not a template label (phase-08)

  • Symptom: cluster create fails in the driver before provisioning.
  • Cause: the magnum-capi-helm driver reads kube_version from the Glance image properties and routes on os_distro; it does NOT take k8s version from a template label.
  • Fix: the workload image (e.g. ubuntu-jammy-kube-v1.32.13) MUST carry kube_version (e.g. v1.32.13) and os_distro=ubuntu. Verify before create (phase-08 8.0).

================================================================================

Driver / cluster health

================================================================================

D-042 -- driver contract-coherence; health "infrastructure: not found" (phase-07, phase-08, appendix-B)

  • Symptom: coe cluster show reports health_status = UNHEALTHY deterministically (survives a conductor restart); only the infrastructure sub-check fails ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready.
  • Cause: driver 1.3.0 reads apiVersion off spec.infrastructureRef to build its health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the resource versions); only the driver's direct health query breaks.
  • Fix: upgrade to the RELEASED magnum-capi-helm==1.4.0 (the "generalize-api-resources" feature). 1.4.0 builds each health GET from an explicit api_version via its [capi_helm] api_resources option, which DEFAULTS to v1beta1 for every CAPI kind -- and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only. Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the Layer-A CAPI core.
  • Operational caveat while unfixed: do NOT wire magnum auto-healing to health_status (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently.

================================================================================

Cluster lifecycle / Octavia

================================================================================

D-039 -- app-cred roles (load-balancer_member) / Octavia 403 (phase-08)

  • Symptom: cluster create or delete wedges; CAPO gets 403 querying the Octavia LB.
  • Cause: the Magnum-minted application credential lacks load-balancer_member (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state).
  • Fix: ensure the service path mints app-creds carrying load-balancer_member (+ member, reader). Verify before acceptance (phase-08 prereqs).

stuck-delete -- wedged CAPI cluster delete (phase-08)

  • Symptom: cluster stuck DELETE_IN_PROGRESS; helm release already gone; Cluster and OpenStackCluster CRs stuck Deleting (often on an Octavia 403, see D-039).
  • Recovery: clear the OpenStackCluster finalizer on the mgmt cluster -- kubectl -n <magnum-ns> patch openstackcluster <cluster>-<suffix> --type=merge -p '{"metadata":{"finalizers":[]}}'. The Cluster finalizer was only waiting on it, so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron resources in dependency order: router remove subnet -> router unset external-gateway -> router delete -> subnet delete -> network delete -> security group delete.
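
The neutron cleanup order above can be sketched as one function with a dry-run switch. The function name, the DRY_RUN variable, and the four arguments are illustrative; a real cluster may leave more than one security group behind, so treat this as a template for a single set of resources.

```shell
# cleanup_orphans ROUTER SUBNET NETWORK SECGROUP
# Removes leftover neutron resources in dependency order and stops at the
# first failure. With DRY_RUN=1 the commands are printed, not executed.
cleanup_orphans() {
  local router=$1 subnet=$2 network=$3 secgroup=$4
  local run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  $run openstack router remove subnet "$router" "$subnet" &&
    $run openstack router unset --external-gateway "$router" &&
    $run openstack router delete "$router" &&
    $run openstack subnet delete "$subnet" &&
    $run openstack network delete "$network" &&
    $run openstack security group delete "$secgroup"
}
```

Running it first with DRY_RUN=1 prints the six commands for review before anything is deleted.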

LB-failover -- LB stuck provisioning_status=ERROR after a host event (phase-08)

  • Symptom: the kube-api Octavia LB shows operating_status ONLINE but provisioning_status ERROR after a host outage/OOM.
  • Cause: a control-plane op on the amphora failed during the outage.
  • Fix: openstack loadbalancer failover <lb-id> in ADMIN-project scope (amphora / failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE.

uninitialized-taint -- workload addons Pending (phase-08)

  • Symptom: new workload nodes are kubelet-Ready but addon pods (metrics-server, node-feature-discovery, etc.) stay Pending; nodes carry node.cluster.x-k8s.io/uninitialized.
  • Cause: that taint is removed by the CAPI machine controller on the MANAGEMENT cluster. If the mgmt cluster is down (see D-041), the taint persists.
  • Fix: restore the mgmt cluster API; CAPI then removes the taint and addons schedule.

CNI-label -- network_driver vs the chart-default Calico (1.4.0) (phase-08)

  • Note: under the as-FIRST-built driver 1.3.0 the legacy Magnum network_driver label was IGNORED and the capi-helm openstack-cluster chart's default CNI (Calico) always ran. Under the RELEASED 1.4.0 driver the network_driver template option IS honored (it maps through to the chart). To keep the as-built CNI (Calico), the capi-k8s-v1-32 template OMITS --network-driver (phase-08); set flannel there only to intentionally switch the CNI. (Mgmt cluster CNI is separately Cilium, via k8s-snap.)

================================================================================

Hyperconverged host / mgmt-VM resilience

================================================================================

D-040 -- host OOM from low reserved-host-memory (phase-08)

  • Symptom: guests OOM-killed; a compute host may even present in Juju as State=down (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent).
  • Cause: reserved-host-memory default 512 MB does not cover the co-located LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM.
  • Fix: reserved-host-memory = 8192 on all compute units (baked into the hardened bundle). Diagnose a suspected OOM-vs-reboot with who -b / uptime (no recent boot) and journalctl -k | grep -i oom; the ovsdb "no response to inactivity probe ... disconnecting" storm is the swap-thrash signature.

D-041 -- single-node mgmt cluster does not self-heal (phase-08)

  • Symptom: after a host event the mgmt VM (capi-mgmt-v2) is SHUTOFF; FIP unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see uninitialized-taint).
  • Cause: the D-035 single-node mgmt cluster is a SPOF with no MachineHealthCheck (unlike the workload cluster).
  • Fix: openstack server start capi-mgmt-v2 (API serves ~40s later; a brief TLS handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for Roosevelt.

juju-macaroon -- "cannot get discharge ... EOF" (phase-07, phase-08)

  • Symptom: juju ssh (or any other juju call) fails mid-session with a discharge/EOF error.
  • Cause: the juju macaroon expired during a long session.
  • Fix: re-run juju login, then retry.

================================================================================

Teardown / MAAS reset (phase-00)

================================================================================

DOCFIX-016 -- never maas list (API-key leak) (phase-00, phase-01, phase-04)

  • Risk: maas list prints the stored API key to stdout (and into any transcript/log).
  • Fix: the profile name is known (admin); call maas admin ... directly. Never run maas list in a runbook or paste block.

DOCFIX-017 -- no maas whoami; hardcode the eyeballed system_ids (phase-00)

  • Risk: scripting machine selection via maas <profile> whoami + owner filters is fragile and, in this lab, unnecessary.
  • Fix: the four host system_ids are fixed and eyeball-verified (openstack0=4na83t, openstack1=qdbqd6, openstack2=h8frng, openstack3=tmsafc) -- iterate those literals. (The older 01-destroy-model.md used maas list/whoami and released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.)

R7 -- sudo for libvirt / qemu-img (phase-00, phase-01)

  • The OSD qcow2 files (/var/lib/libvirt/images/<host>-1.qcow2) are root:root / 600; qemu-img info|create, virsh domstate, stat, rm against them all need sudo.

KI-P3-001 -- VIP / primary collision (phase-00, phase-04)

  • Symptom: a charm vip: address equals a MAAS-auto-assigned machine/container primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary).
  • Cause: MAAS auto-static allocation was not excluded over the VIP block (provider had NO VIP reservation), so MAAS handed primaries .225/.226/.227 onto the .224-.236 VIPs.
  • Fix (durable): on EVERY space carrying VIPs (provider AND metal) reserve the front-loaded VIP /26 in MAAS, distinct from the primary range and any neutron allocation_pool (phase-00 Phase 4). A reserved range stops future auto-assign onto a configured VIP. Negative test post-deploy: no service vip == any unit primary.
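
The post-deploy negative test reduces to a set intersection. A minimal sketch, assuming the configured VIPs and the unit primaries have already been collected into two files with one address per line; how those files are produced is left to the operator, and the function name is illustrative.

```shell
# vip_collisions VIP_FILE PRIMARY_FILE
# Prints every address that appears in both files and returns 1 if any
# collision exists; returns 0 and prints nothing otherwise.
vip_collisions() {
  local hits
  hits=$(grep -Fxf "$1" "$2" | sort -u)
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
}
```

The -F and -x flags force whole-line literal matching, so .5 never matches .50 and the dots are not treated as regex wildcards.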

================================================================================

Deploy-time (phase-01)

================================================================================

R14 -- VIP relocation .224-.236 -> .50-.60 (phase-01)

  • The public + internal API VIPs were front-loaded out of the old high-end .224-.236 block into .50-.60 (inside the reserved .2-.63 /26). Every bundle vip: is a dual provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed.

R15 -- the .10 phantom resolver (phase-01)

  • Symptom: an unreachable region resolver 10.12.8.10 appears in a node's resolver list (sometimes as Current DNS Server) despite the subnet dns_servers override.
  • Cause: MAAS advertises its region/rack controller as a DNS server on the MAAS-managed metal VLAN, independent of the subnet field; the override does not purge it.
  • Impact: NON-BLOCKING -- systemd-resolved deprioritizes .10 and falls through to .1. Latent fragility if .1 ever drops. Understand/eliminate for Roosevelt (no libvirt split there).

L1 -- no set -e on count-gate blocks; guard greps || true (phase-01)

  • A count grep returning 0 is a VALID answer, not a failure. But grep -c prints 0 and exits 1 when nothing matches, so under set -e a zero-count grep aborts the block. Pre-deploy verify blocks therefore run WITHOUT set -e, and every count grep ends with || true. (bash -n would not catch this: it is runtime behavior, not syntax.)
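
A minimal sketch of a guarded count grep. The helper name is illustrative and is not taken from the phase-01 blocks.

```shell
# count_matches PATTERN FILE: print the number of matching lines.
# grep -c exits 1 on a zero count; `|| true` keeps that valid zero from
# aborting a block that happens to run under set -e.
# Caveat: `|| true` also masks a missing FILE (grep exit 2, empty output),
# so compare the result against an expected number, not just "non-empty".
count_matches() {
  grep -c -- "$1" "$2" || true
}
```

For example, count_matches '10\.12\.4\.5[0-9]' bundle.yaml prints the number of lines that carry a provider-side VIP in the .50-.59 range, and prints 0 without failing when there are none.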

L3 -- metal-side dual-VIP eyeball check (phase-01)

  • The provider-side VIP guard greps only the first token of each dual vip:. The metal side (second token, 10.12.8.5x) must be eyeballed to confirm all 11 sit in .8.50-.60, clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju).

================================================================================

Vault / secrets (phase-02)

================================================================================

DOCFIX-006 -- vault init is one-shot; stdout-only redirect loses the keys (phase-02)

  • Symptom: vault operator init ... > file captures stdout only; if the key block went to stderr (or the run is interrupted) you are left with an unusable/empty file and the 5 shares + root token are GONE -- init runs exactly once and cannot be replayed.
  • Fix: vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt VERBATIM; gate on grep -c '^Unseal Key' == 5 and Initial Root Token present; then save the file OFF-HOST before anything else. Never improvise this command.
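
The gate on the captured file can be sketched as a single function. The function name is illustrative; it only inspects the file and never runs vault itself.

```shell
# gate_vault_init FILE: succeed only if FILE holds exactly 5 lines that
# start with "Unseal Key" and at least one "Initial Root Token" line.
gate_vault_init() {
  local file=$1 keys
  keys=$(grep -c '^Unseal Key' "$file" || true)
  [ "$keys" = "5" ] && grep -q '^Initial Root Token' "$file"
}
```

Usage: gate_vault_init ~/vault-init/init.txt && echo "capture OK, now copy it off-host". A missing or truncated file fails the gate.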

DOCFIX-011 -- authorize-charm parameter is token (phase-02)

  • The vault authorize-charm action takes token (a direct token string); there is no token-secret-id variant in this charm rev. Confirm via juju actions vault --schema. Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log).

DOCFIX-014 -- generate-root-ca is required (phase-02)

  • Symptom: after authorize-charm, vault stays BLOCKED "Missing CA cert".
  • Fix: run juju run vault/leader generate-root-ca -- it mints the charm-pki-local root and clears the block straight to active. (Omitting it leaves vault hung.)

L4 -- vault unseal via hidden prompt, not key-on-argv (phase-02)

  • Use Vault's own vault operator unseal (no argument) so it prompts hidden; the key is never on the command line / in a var / in ps / in scrollback. Do NOT use vault operator unseal $KEY (visible in ps on the unit). Unseal is re-runnable, so the verbatim-reference rule is looser here, but the security gain is real.

R3 -- "HA Enabled false" is correct for vault-on-mysql (phase-02)

  • Expected post-unseal: Initialized true / Sealed false / Storage Type mysql / HA Enabled false. Single-unit vault on the mysql backend is non-HA by design; any reference to "HA Enabled true (etcd backend)" is STALE (etcd was dropped).

================================================================================

Identity / openrc (phase-03)

================================================================================

DOCFIX-018 -- IP-only OS_AUTH_URL (phase-03)

  • This cloud is IP-only (no FQDN, no cloud DNS). The admin openrc must point at the keystone PUBLIC endpoint by IP: OS_AUTH_URL=https://10.12.4.50:5000/v3, with the vault root CA in OS_CACERT (B5 IP-SAN certs validate). No /etc/hosts, no FQDN.

DOCFIX-022 -- discover the admin project; do not hardcode it (phase-03)

  • Symptom: with TLS working, keystone returns HTTP 401.
  • Cause: wrong project scope. The scoping project name varies by charm rev (here it is admin, living in domain admin_domain; an older doc's OS_PROJECT_NAME=admin_domain 401s). Credential good, scope wrong.
  • Fix: a candidate loop -- try each of "admin admin_domain"; the first that issues a SCOPED token wins (phase-03 3.2). Costs 2 extra token requests; self-corrects across revs instead of re-introducing the 401-by-hardcode.
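
A sketch of such a candidate loop, not the phase-03 3.2 block itself. It assumes the rest of the admin openrc (OS_AUTH_URL, OS_CACERT, user and domain variables) is already exported, and that a request for the project_id column fails when no project-scoped token is issued; the function name is illustrative.

```shell
# discover_admin_project: print the first candidate project name for which
# keystone issues a project-scoped token; return 1 if none does.
discover_admin_project() {
  local candidate
  for candidate in admin admin_domain; do
    if OS_PROJECT_NAME=$candidate \
        openstack token issue -f value -c project_id \
        </dev/null >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}
```

Usage: export OS_PROJECT_NAME="$(discover_admin_project)". The token itself is discarded, so nothing secret lands in scrollback.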

================================================================================

Octavia enablement (phase-05)

================================================================================

L7 -- the openstack snap cannot read /tmp (phase-05, also phase-01 PKI sanity)

  • Symptom: openstack image create --file /tmp/... -> "[Errno 2] No such file or directory" even though sha256sum just read the same path.
  • Cause: the openstack CLI snap is confined and cannot read /tmp; it CAN read $HOME (home interface).
  • Fix: stage any file the snap must read under $HOME (e.g. $HOME/amphora-base/...), never /tmp.
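
A pre-flight guard can catch a /tmp path before the snap produces its misleading "No such file" error. A minimal sketch, assuming GNU readlink; the function name is illustrative.

```shell
# under_home PATH: succeed only if PATH resolves to a location inside $HOME.
# readlink -f needs every directory above the final component to exist.
under_home() {
  local real home
  home=$(readlink -f -- "$HOME") || return 1
  real=$(readlink -f -- "$1") || return 1
  case "$real" in
    "$home"/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: under_home "$img" || { echo "stage $img under \$HOME first" >&2; exit 1; } ahead of any openstack image create --file call. Here $img is a placeholder for the file to upload.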

octavia-configure-resources -- long-running action; o-hm0 transient is normal (phase-05)

  • configure-resources is long-running: juju's default action wait may time out ("timed out waiting for results") while the hook KEEPS RUNNING. Do NOT treat the wait-timeout as failure, and do not re-fire blindly. Use a bounded --wait and confirm completion via juju show-operation <N> (authoritative), not the streamed log.
  • NORMAL (not faults) during/after: lb-mgmt-net is IPv6-ULA (fc00::/..) by design; a "Virtual network for access to Amphorae is down" transient self-heals as o-hm0 comes up; the lb-mgmt network:distributed port shows DOWN (logical OVN port, never chassis-bound).

amp-image-tag-mismatch -- LP#1937003 (phase-05)

  • Octavia looks up the amphora image by octavia amp-image-tag; it MUST equal the tag the retrofit stamps (octavia-diskimage-retrofit amp-image-tag), both octavia-amphora. A mismatch means octavia cannot find the image even though it is built and ACTIVE. The amphora pipeline gate asserts the two are equal before building (phase-05 5.2).
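
The equality assertion can be sketched as follows. It assumes the applications are deployed under their charm names (octavia, octavia-diskimage-retrofit) and that the current juju model is the right one; the function name is illustrative and this is not the phase-05 5.2 gate verbatim.

```shell
# amp_tags_match: succeed only if both applications carry the same
# amp-image-tag. Returns 2 if either value cannot be read, 1 on mismatch.
amp_tags_match() {
  local octavia_tag retrofit_tag
  octavia_tag=$(juju config octavia amp-image-tag </dev/null) || return 2
  retrofit_tag=$(juju config octavia-diskimage-retrofit amp-image-tag </dev/null) || return 2
  if [ "$octavia_tag" != "$retrofit_tag" ]; then
    echo "amp-image-tag mismatch: octavia='$octavia_tag' retrofit='$retrofit_tag'" >&2
    return 1
  fi
}
```

The check compares the two values only against each other; confirming that both equal octavia-amphora is a separate one-line test.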

================================================================================

Notes

================================================================================

  • This index covers phases 00-08. It grows the same way for any future phase: keyed by D-NNN / DOCFIX-NNN / L-N / R-N / named-symptom, each entry symptom -> cause -> fix with a "phase NN" back-reference, and decision rationale left to design-decisions.md.
  • memcached track drift is recorded in appendix-B (B.1), not here (it is a version-lock note, not a troubleshooting entry).