Keyed by the same D-NNN / DOCFIX-NNN / L-P6-N identifiers used inline in the phase runbooks. This is an OPERATIONAL index (symptom -> cause -> fix), NOT the decision log: full rationale lives in design-decisions.md and the per-decision files (D-0NN-*.md); the driver fix has its own magnum-capi-helm-driver-fix-runbook. Each entry notes the phase(s) that reference it. ASCII-only.
================================================================================
================================================================================
SYMPTOM: A juju ssh/ssh ... bash -s or remote sudo block dies early or behaves as if truncated; later commands in the heredoc never run.
CAUSE: An inner ssh/sudo/juju ssh (or any stdin reader) consumes the rest of the heredoc/pipe that was feeding the outer command.
FIX: Append </dev/null to every inner ssh/sudo/juju ssh invocation (use </dev/tty instead only when the call genuinely needs an interactive prompt).

Wrap the block in ( { ...; } ) so a stray exit cannot kill the interactive shell.

CAUSE: juju run vault/leader get-root-ca wraps the PEM in an INDENTED YAML output: |- block; sed-by-marker preserves the indent, and an indented -----BEGIN CERTIFICATE----- is not valid PEM -> openssl "Unable to load certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND.
FIX: Pull from the action JSON (real newlines, no indent):
    juju run vault/leader get-root-ca -m openstack --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'
(Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)

/tmp, or letting a PTY mangle it in transit.
umask 077, then chown to the service user and chmod 0600 -- never touch /tmp. (Pattern in phase-07 7.2.)
================================================================================
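A minimal local reproduction of the stdin-consumption failure (later heredoc commands never run), offered as a sketch only. It assumes bash is available; cat stands in for the inner ssh/sudo/juju ssh, and the function names unguarded and guarded are hypothetical, not taken from the runbooks.

```shell
#!/usr/bin/env bash
# Local stand-in for "ssh host bash -s <<EOF": the script text arrives on stdin.
# cat plays the role of an inner ssh/sudo/juju ssh that reads stdin.

unguarded() {
  bash -s <<'EOF'
echo step-1
cat >/dev/null
echo step-2
EOF
}

guarded() {
  bash -s <<'EOF'
echo step-1
cat >/dev/null </dev/null
echo step-2
EOF
}

echo "--- unguarded ---"; unguarded
echo "--- guarded ---";   guarded
```

In unguarded, cat shares stdin with the outer bash -s and drains the remaining script text, so only step-1 should appear. In guarded, the inner reader is pinned to /dev/null and both steps run.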
================================================================================
SYMPTOM: k8s bootstrap "succeeds" but the node never reaches Ready; network and DNS are silently disabled; CoreDNS/Cilium absent.
CAUSE: A bootstrap --file whose top level lacks a cluster-config: block leaves ALL features (network, dns, ...) at disabled defaults. Setting only pod-cidr / service-cidr / extra-sans does NOT enable them.
FIX: Enable the features explicitly under cluster-config:
    cluster-config:
      network: { enabled: true }
      dns: { enabled: true }
(See phase-06 6.4 for the full config.) Retry: snap remove k8s --purge then re-bootstrap.
================================================================================
================================================================================
--set installCRDs=true.
CAUSE: installCRDs was removed from the cert-manager chart (~v1.18). The current flag is crds.enabled=true.
FIX: helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true.

SYMPTOM: After clusterctl init, capo-controller-manager CrashLoopBackOff (observed ~6 restarts / ~15 min) before self-healing.
CAUSE: The openstackserver controller hard-depends on ORC's Image.openstack.k-orc.cloud CRD at startup. clusterctl init installs CAPO; if ORC is not yet present, CAPO crash-loops until it appears.
FIX: Install ORC (the Image CRD) BEFORE clusterctl init. Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor.

capi-helm-charts tag's dependencies.json (read live with jq); do not hardcode semver. (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.)
================================================================================
================================================================================
cilium_host -- silent, asymmetric breakage. (The "do-07 pattern.")
================================================================================
================================================================================
SYMPTOM: The [capi_helm] conf.d drop-in is ignored; the conductor behaves as if it was never written, even though a systemd drop-in "looks" applied.
systemd-start, NOT a direct ExecStart=. A systemd drop-in appending --config-dir passes it as a positional arg to the init script, which ignores it -- the flag never reaches the daemon. The args are assembled inside the init script from DAEMON_ARGS (base --config-file first), extensible only via /etc/default/<service>.
FIX: Append to /etc/default/magnum-conductor (0644; the charm does not manage it):
    DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"
Verify with the init script's own show-args (dry-run) AND ps -ww -C magnum-conductor -o args on the live process -- behavioral, not string-presence.
/etc/default/magnum-conductor, the append is lost and [capi_helm] silently stops being read. Re-check via show-args/ps.
ExecStart shape for OpenStack debs, and never treat "string present in the unit file" as "the daemon received the flag." Gate on the assembled/launched cmdline (show-args, then ps on the live process).

helm (cluster create errors on a helm invocation), yet command -v helm in an interactive juju ssh magnum/0 shell finds it.
systemd-start) with the restricted init PATH (e.g. /usr/sbin:/usr/bin:/sbin:/bin), which EXCLUDES /usr/local/bin -- where a get.helm.sh tarball install lands. An interactive login shell has /usr/local/bin on PATH, so it masks the problem (the classic green-in-the-shell, broken-in-the-daemon trap).
FIX: Install to /usr/local/bin/helm AND symlink /usr/bin/helm -> it (/usr/bin IS on the restricted PATH). Checksum-verify the tarball (sha256 vs get.helm.sh .sha256sum) before install. VERIFY against the restricted PATH, not a login shell:
    env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm && helm version --short'
must print /usr/bin/helm (phase-07 7.4).

kube_version from the Glance image properties and routes on os_distro; it does NOT take k8s version from a template label.
ubuntu-jammy-kube-v1.34.8) MUST carry kube_version (e.g. v1.32.13) and os_distro=ubuntu. Verify before create (phase-08 8.0).
================================================================================
================================================================================
SYMPTOM: coe cluster show reports health_status = UNHEALTHY deterministically (survives a conductor restart); only the infrastructure sub-check fails ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready.
CAUSE: The driver reads apiVersion off spec.infrastructureRef to build its health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the resource versions); only the driver's direct health query breaks.
FIX: Pin magnum-capi-helm==1.4.0 (the "generalize-api-resources" feature). 1.4.0 builds each health GET from an explicit api_version via its [capi_helm] api_resources option, which DEFAULTS to v1beta1 for every CAPI kind -- and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only.
Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the Layer-A CAPI core.
health_status (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently.
================================================================================
================================================================================
load-balancer_member (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state).
load-balancer_member (+ member, reader). Verify before acceptance (phase-08 prereqs).

DELETE_IN_PROGRESS; helm release already gone; Cluster and OpenStackCluster CRs stuck Deleting (often on an Octavia 403, see D-039).
FIX: Clear the OpenStackCluster finalizer on the mgmt cluster:
    kubectl -n <magnum-ns> patch openstackcluster <cluster>-<suffix> --type=merge -p '{"metadata":{"finalizers":[]}}'
The Cluster finalizer was only waiting on it, so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron resources in dependency order: router remove subnet -> router unset external-gateway -> router delete -> subnet delete -> network delete -> security group delete.
<cluster>-<CAPI-suffix> where the suffix is random per create (NOT the Magnum cluster name). LIST first -- kubectl -n <magnum-ns> get openstackcluster -- and operate on the EXACT name returned. The magnum-ns is magnum-<project-id> (resolve the project id; never hardcode). A wrong-name patch silently no-ops and the delete stays wedged.

operating_status ONLINE but provisioning_status ERROR after a host outage/OOM.
FIX: openstack loadbalancer failover <lb-id> in ADMIN-project scope (amphora / failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE.

node.cluster.x-k8s.io/uninitialized.
network_driver label was IGNORED and the capi-helm openstack-cluster chart's default CNI (Calico) always ran. Under the RELEASED 1.4.0 driver the network_driver template option IS honored (it maps through to the chart network_driver).
FIX: Set --network-driver calico EXPLICITLY on the capi-k8s-v1-34 template (phase-08) rather than relying on the default staying Calico. Chart 0.25.1 ships ONLY Calico (flannel is not packaged), so flannel there would fail to converge -- do not set it. (Mgmt cluster CNI is separately Cilium, via k8s-snap.)
================================================================================
================================================================================
State=down (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent).
CAUSE: The reserved-host-memory default 512 MB does not cover the co-located LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM.
FIX: Set reserved-host-memory = 8192 on all compute units (baked into the hardened bundle). Diagnose a suspected OOM-vs-reboot with who -b / uptime (no recent boot) and journalctl -k | grep -i oom; the ovsdb "no response to inactivity probe ... disconnecting" storm is the swap-thrash signature.

capi-mgmt-v2) is SHUTOFF; FIP unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see uninitialized-taint).
FIX: openstack server start capi-mgmt-v2 (API serves ~40s later; a brief TLS handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for Roosevelt.

SYMPTOM: juju ssh (or other juju calls) fail mid-session with a discharge/EOF error.
juju login, then retry.
================================================================================
================================================================================
maas list (API-key leak) (phase-00, phase-01, phase-04)
maas list prints the stored API key to stdout (and into any transcript/log).
admin); call maas admin ... directly. Never run maas list in a runbook or paste block.

maas whoami; hardcode the eyeballed system_ids (phase-00)
maas <profile> whoami + owner filters is fragile and, in this lab, unnecessary.
maas list/whoami and released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.)

/var/lib/libvirt/images/<host>-1.qcow2) are root:root / 600; qemu-img info|create, virsh domstate, stat, rm against them all need sudo.

vip: address equals a MAAS-auto-assigned machine/container primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary).

virsh setmaxmem <dom> 32G --config then virsh setmem <dom> 32G --config; boot; then MAAS RECOMMISSION the node so MAAS re-reads hardware and lands it back at Ready at the new size (4x Ready at 32768 in ~3 min). Do the maxmem change while shut off -- a live setmaxmem is rejected.
reserved-host-memory 8192 is RETAINED (correctness floor, independent of host size). Re-measure the per-host container/service footprint against the 32 GiB envelope before the Roosevelt node-role split -- 16 GiB-era pressure numbers do not map 1:1.
================================================================================
================================================================================
vip: is a dual provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed.

SYMPTOM: 10.12.8.10 appears in a node's resolver list (sometimes as Current DNS Server) despite the subnet dns_servers override.

set -e on count-gate blocks; guard greps || true (phase-01)
grep -c returning 0 is a VALID answer, not a failure. Under set -e a zero-count grep aborts the block. Pre-deploy verify blocks run WITHOUT set -e, and every count grep ends || true. (bash -n would not catch this -- it is behavior.)

vip:. The metal side (second token, 10.12.8.5x) must be eyeballed to confirm all 11 sit in .8.50-.60, clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju).
================================================================================
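A small sketch of the count-gate rule (a zero from grep -c is an answer, not a failure). The helper name count_matches and the VIP values are illustrative assumptions; this is not the phase-01 verify block itself.

```shell
#!/usr/bin/env bash
# count_matches PATTERN FILE -> prints the number of matching lines.
# grep -c prints 0 AND exits 1 when nothing matches; "|| true" keeps that
# valid zero from aborting a caller that runs under set -e.
count_matches() {
  grep -c -E -- "$1" "$2" || true
}

# Illustrative gate over a made-up provider VIP list (example values only).
vips=$(mktemp)
printf '%s\n' 10.12.4.50 10.12.4.51 10.12.4.60 >"$vips"

# Count addresses that fall in the stale .10-.20 range.
stale=$(count_matches '^10\.12\.4\.(1[0-9]|20)$' "$vips")
echo "stale VIPs: $stale"
if [ "$stale" -eq 0 ]; then echo "gate: PASS"; else echo "gate: FAIL"; fi
rm -f "$vips"
```

Note that || true also masks a missing file (grep then prints nothing), which is why the gate compares the captured value numerically instead of trusting an exit code.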
================================================================================
CAUSE: vault operator init ... > file captures stdout only; if the key block went to stderr (or the run is interrupted) you are left with an unusable/empty file and the 5 shares + root token are GONE -- init runs exactly once and cannot be replayed.
FIX: Run the following VERBATIM:
    vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt
Gate on grep -c '^Unseal Key' == 5 and Initial Root Token present; then save the file OFF-HOST before anything else. Never improvise this command.

token (phase-02)
authorize-charm action takes token (a direct token string); there is no token-secret-id variant in this charm rev. Confirm via juju actions vault --schema. Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log).

juju run vault/leader generate-root-ca -- it mints the charm-pki-local root and clears the block straight to active. (Omitting it leaves vault hung.)

vault operator unseal (no argument) so it prompts hidden; the key is never on the command line / in a var / in ps / in scrollback. Do NOT use vault operator unseal $KEY (visible in ps on the unit). Unseal is re-runnable, so the verbatim-reference rule is looser here, but the security gain is real.
================================================================================
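The capture gate for vault operator init (5 unseal keys plus a root token line) could be scripted roughly as follows. This is a sketch: the function name verify_vault_init is hypothetical, and it only inspects the saved file, it never runs vault.

```shell
#!/usr/bin/env bash
# verify_vault_init FILE -> exit 0 only if FILE holds exactly 5 lines starting
# with "Unseal Key" and at least one line mentioning "Initial Root Token".
verify_vault_init() {
  local file=$1 keys
  # grep -c exits non-zero on a zero count or a missing file; || true keeps going
  keys=$(grep -c '^Unseal Key' "$file" 2>/dev/null || true)
  if [ "${keys:-0}" -ne 5 ]; then
    echo "FAIL: expected 5 'Unseal Key' lines, found ${keys:-0}" >&2
    return 1
  fi
  if ! grep -q 'Initial Root Token' "$file"; then
    echo "FAIL: no 'Initial Root Token' line" >&2
    return 1
  fi
  echo "OK: 5 unseal keys and a root token line present"
}
```

Usage would be verify_vault_init ~/vault-init/init.txt immediately after the tee, before the file is copied off-host.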
================================================================================
OS_AUTH_URL=https://10.12.4.50:5000/v3, with the vault root CA in OS_CACERT (B5 IP-SAN certs validate). No /etc/hosts, no FQDN.

admin, living in domain admin_domain; an older doc's OS_PROJECT_NAME=admin_domain 401s). Credential good, scope wrong.
================================================================================
================================================================================
SYMPTOM: juju status is all active/idle, yet a service VIP intermittently 503s or a unit's API is unreachable. juju health is BLIND to per-backend haproxy state.
FIX: Query /var/run/haproxy/admin.sock (show stat) and grep ',DOWN,' (excluding the FRONTEND/BACKEND summary rows). For any flagged unit: sudo haproxy -c -f /etc/haproxy/haproxy.cfg (must say valid) then sudo systemctl reload haproxy (graceful master-worker; reload, not restart). Phase-03 3.1 gates on a zero-DOWN sweep cloud-wide -- it closes the juju-green-but-backend-DOWN hole.

systemctl reload nginx right after editing the vhost can be served by a still-draining old worker (a curl ~2s later hits stale behavior; the co-hosted MAAS proxy blips too). nginx -t FIRST; prefer restart for a definitive cutover when the listen/upstream set changed, reload only for content-equivalent edits.

juju-ffe3b8-2-lxd-2); set proxy_ssl_name to that SAN, proxy_ssl_verify on, and the vault CA in proxy_ssl_trusted_certificate, or verification fails on the IP-only connect.

sed -i that does not match silently changes nothing and the proxy keeps the old behavior -- assert the post-edit content, do not trust sed's exit code.

https:// Location headers while the proxy listens http; without proxy_redirect https:// http:// (or a matching listen scheme) the browser loops. Match the scheme end-to-end or rewrite the redirect.
================================================================================
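The DOWN sweep over haproxy show stat output can be expressed as a small filter. This is a sketch under stated assumptions: the function name down_backends is hypothetical, and the sample invocation in the comment assumes socat is installed on the unit, which the source does not state.

```shell
#!/usr/bin/env bash
# down_backends: read haproxy "show stat" CSV on stdin and print proxy/server
# for every row that contains ,DOWN, and is not a FRONTEND/BACKEND summary row.
# In that CSV, field 1 is the proxy name and field 2 is the server name.
down_backends() {
  awk -F, '!/^#/ && $2 != "FRONTEND" && $2 != "BACKEND" && /,DOWN,/ { print $1 "/" $2 }'
}

# Possible use on a unit (assumption: socat is available there):
#   echo "show stat" | sudo socat stdio /var/run/haproxy/admin.sock | down_backends
```

An empty result is the zero-DOWN condition; any printed line names a backend server to check with haproxy -c and then reload.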
================================================================================
SYMPTOM: openstack image create --file /tmp/... -> "[Errno 2] No such file or directory" even though sha256sum just read the same path.
/tmp; it CAN read $HOME (home interface).
$HOME (e.g. $HOME/amphora-base/...), never /tmp.

configure-resources is long-running: juju's default action wait may time out ("timed out waiting for results") while the hook KEEPS RUNNING -- do NOT treat the wait-timeout as failure or re-fire blindly. Use a bound --wait and confirm completion via juju show-operation <N> (authoritative), not the streamed log.

network:distributed port shows DOWN (logical OVN port, never chassis-bound).

octavia amp-image-tag; it MUST equal the tag the retrofit stamps (octavia-diskimage-retrofit amp-image-tag), both octavia-amphora. A mismatch means octavia cannot find the image even though it is built and ACTIVE. The amphora pipeline gate asserts the two are equal before building (phase-05 5.2).
================================================================================
================================================================================
--import-method web-download) 202-accepts, then the image hangs in queued forever and never reaches active.
Python-urllib/3.x); the azimuth-images CDN (azimuth-images.stackhpc.cloud) returns HTTP 403 to that UA. A curl/HEAD probe with a different UA passes -- which is why an earlier probe false-passed while the real import failed.
$HOME (snap-readable, NOT /tmp -- L7; curl's UA is not blocked), verify the checksum against the published manifest (azimuth-images manifest.json -- sha512 for kube images; the ubuntu cloud-images SHA256SUMS for noble), then openstack image create --file --import (the openstack snap's --import == glance-direct; image-conversion lands it raw). CORRECTION-1: a plain --file PUT (no --import) stores qcow2 -- fine for boot, but --import gives the raw Ceph fast-clone alignment.
openstack image delete <id> on the queued remnant (verify the EXACT id first -- FINDING-4 name-guard discipline).

openstack image create --import --import-method web-download --uri <url>) is retained as a tested ALTERNATIVE, not the canonical path (superseded 2026-06-17; see design-decisions). Caveats:
  (1) it cannot checksum-verify the fetched file against a published digest (the CDN redirect strips it) -- weaker provenance;
  (2) it 403s on the azimuth CDN (FINDING-3), so it is unusable for kube images;
  (3) for ubuntu cloud-images it works on the hardened bundle (the 2026-06-08 403 was transient/pre-hardening).
Use only as an expedient.
================================================================================
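The checksum step of the download-then-import path might look like this for a sha512 digest. It is a sketch: verify_sha512 is a hypothetical helper, and how the expected digest is read out of the published manifest is left out because the manifest layout is not described here.

```shell
#!/usr/bin/env bash
# verify_sha512 FILE EXPECTED_HEX -> exit 0 only when the file's sha512 matches.
verify_sha512() {
  local file=$1 expected=$2 actual
  [ -r "$file" ] || { echo "unreadable: $file" >&2; return 2; }
  actual=$(sha512sum "$file" | awk '{ print $1 }')
  if [ "$actual" = "$expected" ]; then
    echo "checksum OK: $file"
  else
    echo "checksum MISMATCH: $file" >&2
    return 1
  fi
}
```

Running the import only after this returns 0 keeps a corrupted or substituted download out of Glance.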
================================================================================
Five entries from the 2026-06-10 recovery session. Full procedures with verified blocks: runbooks/ops-capi-recovery.md.
While capi-mgmt-v2 is stopped: Magnum reports UNHEALTHY with an EMPTY health_status_reason (distinct from the D-042 cosmetic, which has a populated reason); the Horizon Container Infra panel may 504 through the jumphost nginx proxy and coe CLI calls may stall; the workload cluster keeps serving (no runtime dependency on the mgmt cluster). If jumphost secrets were filed during parking, the convention is ~/sweep-YYYYMMDD/secrets/. See ops-capi-recovery Section 0 (expectations table) and Section 1 (parking block).
Causal chain (traced live 2026-06-10): host CPU/memory pressure -> amphora heartbeats go stale -> Octavia health-manager marks amphorae ERROR and launches auto-failovers -> failovers fail NoValidHost (no placement headroom) -> amphora servers accumulate with NO Octavia DB row. Two variants: an ERROR server (failed spawn) and an ACTIVE heartbeating zombie (health-manager logs "missing from the DB ... An operator must manually delete it" every 10 s). Remedy: verify-then-delete by SERVER UUID under admin scope -- the loadbalancer amphora list output is the DB truth; Nova name lookup is project-scoped (amphorae live in the Octavia services project). Procedure: ops-capi-recovery 5a. Do NOT retry failover against the same blocker; each attempt mints another zombie.
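The verify step of verify-then-delete amounts to a set difference: server UUIDs that Nova knows about minus the compute IDs the Octavia DB knows about. A minimal sketch, assuming both lists have already been written one UUID per line to two files; the function name and the file arguments are hypothetical, and the authoritative procedure remains ops-capi-recovery 5a.

```shell
#!/usr/bin/env bash
# orphan_servers NOVA_UUIDS_FILE DB_COMPUTE_IDS_FILE
# Prints every UUID in the first file that has no exact match in the second,
# i.e. amphora servers with no Octavia DB row (deletion candidates to VERIFY).
orphan_servers() {
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
  # grep exits 1 when it prints nothing; || true makes "no orphans" a success.
  grep -F -x -v -f "$2" "$1" || true
}
```

The output is a candidate list only; each UUID still needs to be confirmed under admin scope before any delete.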
STANDALONE failover builds the replacement amphora BEFORE reaping the old one, so it transiently needs one extra amphora slot (charm-octavia: 1024 MB / 1 vCPU / 8 GB). Scheduler ceiling per host = physical_MB * ram_allocation_ratio (1.5)
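A worked version of the ceiling arithmetic, using only the terms that appear in the line above (physical_MB times the 1.5 ratio) and the 1024 MB amphora size. Anything the full formula may go on to subtract is not modelled, and the function names and the allocated figure passed in are hypothetical.

```shell
#!/usr/bin/env bash
# scheduler_ceiling_mb PHYSICAL_MB -> physical_MB * 1.5, in whole MB.
scheduler_ceiling_mb() {
  echo $(( $1 * 15 / 10 ))    # ram_allocation_ratio 1.5 written as 15/10 to stay in integers
}

# has_failover_headroom CEILING_MB ALLOCATED_MB
# Exit 0 if one extra 1024 MB amphora still fits under the ceiling.
has_failover_headroom() {
  [ $(( $1 - $2 )) -ge 1024 ]
}

ceiling=$(scheduler_ceiling_mb 32768)
echo "ceiling for a 32768 MB host: ${ceiling} MB"
```

For a 32768 MB host this gives 49152 MB; a STANDALONE failover on that host needs at least 1024 MB of that ceiling still unallocated.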
</dev/null vs an expired macaroon (DOCFIX-021 interaction)
DOCFIX-021's </dev/null on juju ssh assumes valid macaroon auth. When the jumphost macaroon goes stale, juju falls back to an interactive password prompt; </dev/null feeds that prompt EOF and the symptom is the misleading "cannot get discharge from https://:17070/auth: EOF".
Triage: run juju status interactively -- if it succeeds after a password prompt, the controller is healthy and only the credential cache is stale.
Workaround for the session: stdin from </dev/tty.
Fix at a calm moment: juju logout then juju login.
CAPI/Magnum VMs are owned by the capi-mgmt project; an empty Project -> Compute -> Instances page under admin scope is correct, not a defect. Map: tenant VMs -> Instances in the OWNING project's scope (use the header project switcher; admin holds member on capi-mgmt per phase-06 6.0-BOOT); LB objects -> Project -> Network -> Load Balancers in the owning project's scope; amphora VMs -> Admin -> Compute -> Instances ONLY (they belong to the Octavia services project); everything at once -> CLI openstack server list --all-projects. Warning about the asymmetry: the Container Infra panel lists clusters cross-project under admin policy, which makes the strictly-scoped Nova panel look broken when it is not.
SYMPTOM: link-subnet fails "IP address is already in use" but the IP is in no visible table (ipaddresses read empty, discovery cleared, not in DHCP dynamic range, interface on the correct VLAN).
CAUSE: A freshly re-enrolled host PXE-leases its own metal IP (10.12.8.4N) at commission; MAAS keeps it as a StaticIPAddress of alloc_type 6 (DISCOVERED), tied to the node. Distinct from the network-discovery table AND from user allocations -- neither discoveries clear-by-mac-and-ip nor a plain ipaddresses release clears it.
AUTHORITATIVE READ (use FIRST, before guessing): maas admin subnet ip-addresses -> lists every in-use IP with .alloc_type and .node_summary. alloc_type 6 = DISCOVERED. This is the definitive "who holds this IP and why".
FIX: maas admin ipaddresses release ip= force=true discovered=true (BOTH flags; force alone -> "does not exist"). Only release when the discovered record's node is the SAME host -- a different node means a real address conflict; stop and investigate.
NOW AUTOMATED: scripts/carve-host-interfaces.sh release_self_discovered() does this, gated to self-owned records only.
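The self-owned gate can be stated as a three-way check. This is an illustrative sketch, not the release_self_discovered() implementation in scripts/carve-host-interfaces.sh (which is not shown here); the function name and argument order are hypothetical, and the caller is assumed to have already read alloc_type and the owning node from the subnet ip-addresses output.

```shell
#!/usr/bin/env bash
# may_release_discovered ALLOC_TYPE RECORD_NODE THIS_HOST
# Exit 0 only when the record is DISCOVERED (alloc_type 6) and is tied to the
# same host that is being carved; anything else is treated as a real conflict.
may_release_discovered() {
  local alloc_type=$1 record_node=$2 this_host=$3
  [ "$alloc_type" = "6" ] && [ -n "$record_node" ] && [ "$record_node" = "$this_host" ]
}

# Example with made-up host names:
if may_release_discovered 6 host-a host-a; then
  echo "self-owned DISCOVERED record: safe to release"
fi
if ! may_release_discovered 6 host-b host-a; then
  echo "record belongs to another node: stop and investigate"
fi
```

Keeping the gate separate from the release call means a mismatch stops the script before any force=true discovered=true request is sent.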