Keyed by the same D-NNN / DOCFIX-NNN / L-P6-N identifiers used inline in the phase runbooks. This is an OPERATIONAL index (symptom -> cause -> fix), NOT the decision log: full rationale lives in design-decisions.md and the per-decision files (D-0NN-*.md); the driver fix has its own magnum-capi-helm-driver-fix-runbook. Each entry notes the phase(s) that reference it. ASCII-only.
================================================================================
================================================================================
juju ssh/ssh ... bash -s or remote sudo block dies early or behaves as if truncated; later commands in the heredoc never run.
ssh/sudo/juju ssh (or any stdin reader) consumes the rest of the heredoc/pipe that was feeding the outer command.
</dev/null to every inner ssh/sudo/juju ssh invocation (use </dev/tty instead only when the call genuinely needs an interactive prompt).
( { ...; } ) so a stray exit cannot kill the interactive shell.

juju run vault/leader get-root-ca wraps the PEM in an INDENTED YAML output: |- block; sed-by-marker preserves the indent and an indented -----BEGIN CERTIFICATE----- is not valid PEM -> openssl "Unable to load certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND. Fix: pull from the action JSON (real newlines, no indent): juju run vault/leader get-root-ca -m openstack --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'. (Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)

/tmp, or letting a PTY mangle it in transit.
umask 077, then chown to the service user and chmod 0600 -- never touch /tmp. (Pattern in phase-07 7.2.)
================================================================================
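The stdin-swallowing failure described in this section can be reproduced locally, with no remote host. The following is a minimal sketch, assuming bash; inner_reader is a hypothetical stand-in for ssh/sudo/juju ssh (anything that reads stdin when stdin is left attached), and the function names are illustrative only.

```shell
#!/usr/bin/env bash
# inner_reader is a hypothetical stand-in for ssh / sudo / juju ssh:
# like them, it reads stdin whenever stdin is left attached.
inner_reader() { cat >/dev/null; }

# Unsafe: inner_reader inherits the loop's stdin and drains the remaining
# lines, so the loop body runs once and the later lines are never processed.
count_unsafe() {
  local n=0 line
  while read -r line; do
    inner_reader
    n=$((n + 1))
  done <<'EOF'
one
two
three
EOF
  echo "$n"
}

# Safe: </dev/null detaches the inner command from the loop's stdin,
# so every line of the heredoc reaches the loop.
count_safe() {
  local n=0 line
  while read -r line; do
    inner_reader </dev/null
    n=$((n + 1))
  done <<'EOF'
one
two
three
EOF
  echo "$n"
}

echo "iterations without </dev/null: $(count_unsafe)"
echo "iterations with    </dev/null: $(count_safe)"
```

The unsafe loop runs once and the safe loop runs three times. A pasted heredoc shows the same shape of failure when an inner ssh drains it.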
================================================================================
k8s bootstrap "succeeds" but the node never reaches Ready; network and DNS are silently disabled; CoreDNS/Cilium absent.
--file whose top level lacks a cluster-config: block leaves ALL features (network, dns, ...) at disabled defaults. Setting only pod-cidr / service-cidr / extra-sans does NOT enable them.
cluster-config:
  network: { enabled: true }
  dns: { enabled: true }
(See phase-06 6.4 for the full config.) Retry: snap remove k8s --purge then re-bootstrap.
================================================================================
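A pre-flight check for the missing cluster-config: block could look like the sketch below. It is a plain text test, not a YAML parser, and it only confirms that the key exists at the top level. It does not confirm that network and dns are enabled under it. The function name has_cluster_config is an illustration, not part of any runbook.

```shell
#!/usr/bin/env bash
# has_cluster_config FILE
# Succeeds only if FILE contains an unindented "cluster-config:" key.
# Text check only: it does not validate what sits under the key.
has_cluster_config() {
  grep -q -- '^cluster-config:' "$1"
}
```

Usage, assuming the bootstrap file path is held in a (hypothetical) variable: has_cluster_config "$BOOTSTRAP_FILE" || echo "refusing to bootstrap: no cluster-config: block".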
================================================================================
--set installCRDs=true.
installCRDs was removed from the cert-manager chart (~v1.18). The current flag is crds.enabled=true.
helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true.

clusterctl init, capo-controller-manager CrashLoopBackOff (observed ~6 restarts / ~15 min) before self-healing.
openstackserver controller hard-depends on ORC's Image.openstack.k-orc.cloud CRD at startup. clusterctl init installs CAPO; if ORC is not yet present, CAPO crash-loops until it appears.
Image CRD) BEFORE clusterctl init. Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor.

capi-helm-charts tag's dependencies.json (read live with jq); do not hardcode semver. (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.)
================================================================================
================================================================================
cilium_host -- silent, asymmetric breakage. (The "do-07 pattern.")
================================================================================
================================================================================
[capi_helm] conf.d drop-in is ignored; the conductor behaves as if it was never written, even though a systemd drop-in "looks" applied.
systemd-start, NOT a direct ExecStart=. A systemd drop-in appending --config-dir passes it as a positional arg to the init script, which ignores it -- the flag never reaches the daemon. The args are assembled inside the init script from DAEMON_ARGS (base --config-file first), extensible only via /etc/default/<service>.
/etc/default/magnum-conductor (0644; the charm does not manage it):
    DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"
Verify with the init script's own show-args (dry-run) AND ps -ww -C magnum-conductor -o args on the live process -- behavioral, not string-presence.
/etc/default/magnum-conductor, the append is lost and [capi_helm] silently stops being read. Re-check via show-args/ps.
ExecStart shape for OpenStack debs, and never treat "string present in the unit file" as "the daemon received the flag." Gate on the assembled/launched cmdline (show-args, then ps on the live process).

helm (cluster create errors on a helm invocation), yet command -v helm in an interactive juju ssh magnum/0 shell finds it.
systemd-start) with the restricted init PATH (e.g. /usr/sbin:/usr/bin:/sbin:/bin), which EXCLUDES /usr/local/bin -- where a get.helm.sh tarball install lands. An interactive login shell has /usr/local/bin on PATH, so it masks the problem (the classic green-in-the-shell, broken-in-the-daemon trap).
/usr/local/bin/helm AND symlink /usr/bin/helm -> it (/usr/bin IS on the restricted PATH). Checksum-verify the tarball (sha256 vs get.helm.sh .sha256sum) before install. VERIFY against the restricted PATH, not a login shell: env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm && helm version --short' must print /usr/bin/helm (phase-07 7.4).

kube_version from the Glance image properties and routes on os_distro; it does NOT take k8s version from a template label.
ubuntu-jammy-kube-v1.34.8) MUST carry kube_version (e.g. v1.32.13) and os_distro=ubuntu. Verify before create (phase-08 8.0).
================================================================================
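The restricted-PATH verification for helm generalizes to any tool the daemon must find. The sketch below reuses the example PATH quoted in this section; the function name resolve_on_restricted_path is hypothetical.

```shell
#!/usr/bin/env bash
# Example restricted init PATH quoted in the entry above (it excludes /usr/local/bin).
RESTRICTED_PATH=/usr/sbin:/usr/bin:/sbin:/bin

# resolve_on_restricted_path TOOL
# Prints where TOOL resolves when only RESTRICTED_PATH is in effect and the
# rest of the environment is cleared; fails if TOOL does not resolve there.
resolve_on_restricted_path() {
  env -i PATH="$RESTRICTED_PATH" sh -c 'command -v "$1"' sh "$1"
}
```

For the helm case the gate would be: [ "$(resolve_on_restricted_path helm)" = /usr/bin/helm ]. A tool installed only under /usr/local/bin fails this check even though an interactive login shell finds it.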
================================================================================
coe cluster show reports health_status = UNHEALTHY deterministically (survives a conductor restart); only the infrastructure sub-check fails ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready.
apiVersion off spec.infrastructureRef to build its health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the resource versions); only the driver's direct health query breaks.
magnum-capi-helm==1.4.0 (the "generalize-api-resources" feature). 1.4.0 builds each health GET from an explicit api_version via its [capi_helm] api_resources option, which DEFAULTS to v1beta1 for every CAPI kind -- and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only. Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the Layer-A CAPI core.
health_status (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently.
================================================================================
================================================================================
load-balancer_member (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state).
load-balancer_member (+ member, reader). Verify before acceptance (phase-08 prereqs).

DELETE_IN_PROGRESS; helm release already gone; Cluster and OpenStackCluster CRs stuck Deleting (often on an Octavia 403, see D-039).
OpenStackCluster finalizer on the mgmt cluster -- kubectl -n <magnum-ns> patch openstackcluster <cluster>-<suffix> --type=merge -p '{"metadata":{"finalizers":[]}}'. The Cluster finalizer was only waiting on it, so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron resources in dependency order: router remove subnet -> router unset external-gateway -> router delete -> subnet delete -> network delete -> security group delete.
<cluster>-<CAPI-suffix> where the suffix is random per create (NOT the Magnum cluster name). LIST first -- kubectl -n <magnum-ns> get openstackcluster -- and operate on the EXACT name returned. The magnum-ns is magnum-<project-id> (resolve the project id; never hardcode). A wrong-name patch silently no-ops and the delete stays wedged.

operating_status ONLINE but provisioning_status ERROR after a host outage/OOM.
openstack loadbalancer failover <lb-id> in ADMIN-project scope (amphora / failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE.

node.cluster.x-k8s.io/uninitialized.

network_driver label was IGNORED and the capi-helm openstack-cluster chart's default CNI (Calico) always ran. Under the RELEASED 1.4.0 driver the network_driver template option IS honored (it maps through to the chart network_driver).
--network-driver calico EXPLICITLY on the capi-k8s-v1-34 template (phase-08) rather than relying on the default staying Calico. Chart 0.25.1 ships ONLY Calico (flannel is not packaged), so flannel there would fail to converge -- do not set it.
(Mgmt cluster CNI is separately Cilium, via k8s-snap.)
================================================================================
================================================================================
State=down (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent).
reserved-host-memory default 512 MB does not cover the co-located LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM.
reserved-host-memory = 8192 on all compute units (baked into the hardened bundle). Diagnose a suspected OOM-vs-reboot with who -b / uptime (no recent boot) and journalctl -k | grep -i oom; the ovsdb "no response to inactivity probe ... disconnecting" storm is the swap-thrash signature.

capi-mgmt-v2) is SHUTOFF; FIP unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see uninitialized-taint).
openstack server start capi-mgmt-v2 (API serves ~40s later; a brief TLS handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for Roosevelt.

juju ssh (or other juju calls) fail mid-session with a discharge/EOF error.
juju login, then retry.
================================================================================
================================================================================
maas list (API-key leak) (phase-00, phase-01, phase-04)
maas list prints the stored API key to stdout (and into any transcript/log).
admin); call maas admin ... directly. Never run maas list in a runbook or paste block.

maas whoami; hardcode the eyeballed system_ids (phase-00)
maas <profile> whoami + owner filters is fragile and, in this lab, unnecessary.
maas list/whoami and released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.)

/var/lib/libvirt/images/<host>-1.qcow2) are root:root / 600; qemu-img info|create, virsh domstate, stat, rm against them all need sudo.

vip: address equals a MAAS-auto-assigned machine/container primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary).

virsh setmaxmem <dom> 32G --config then virsh setmem <dom> 32G --config; boot; then MAAS RECOMMISSION the node so MAAS re-reads hardware and lands it back at Ready at the new size (4x Ready at 32768 in ~3 min). Do the maxmem change while shut off -- a live setmaxmem is rejected.
reserved-host-memory 8192 is RETAINED (correctness floor, independent of host size). Re-measure the per-host container/service footprint against the 32 GiB envelope before the Roosevelt node-role split -- 16 GiB-era pressure numbers do not map 1:1.
================================================================================
================================================================================
vip: is a dual provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed.

10.12.8.10 appears in a node's resolver list (sometimes as Current DNS Server) despite the subnet dns_servers override.

set -e on count-gate blocks; guard greps || true (phase-01)
grep -c returning 0 is a VALID answer, not a failure. Under set -e a zero-count grep aborts the block. Pre-deploy verify blocks run WITHOUT set -e, and every count grep ends || true. (bash -n would not catch this -- it is behavior.)

vip:. The metal side (second token, 10.12.8.5x) must be eyeballed to confirm all 11 sit in .8.50-.60, clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju).
================================================================================
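The count-gate rule (a zero from grep -c is an answer, not a failure) and the provider-VIP range guard can be sketched together. The three-line sample below is invented for illustration, and the real gate expects 11 provider VIPs. The sketch also assumes each vip: pair is written quoted on one line with the provider address first. count_lines is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_lines PATTERN FILE
# Prints the number of matching lines. "|| true" keeps a zero count
# (grep exit status 1) from aborting a caller that runs under set -e.
# Caveat: it also masks genuine grep errors such as a missing file.
count_lines() {
  grep -cE -- "$1" "$2" || true
}

# Invented three-line excerpt; NOT the real bundle (which carries 11 VIPs).
sample=$(mktemp)
cat >"$sample" <<'EOF'
    vip: "10.12.4.50 10.12.8.50"
    vip: "10.12.4.51 10.12.8.51"
    vip: "10.12.4.60 10.12.8.60"
EOF

# Provider side is the first token: count those in .50-.60 and in the stale .10-.20.
in_range=$(count_lines 'vip: "10\.12\.4\.(5[0-9]|60) ' "$sample")
stale=$(count_lines 'vip: "10\.12\.4\.(1[0-9]|20) ' "$sample")
echo "provider VIPs in .50-.60: $in_range; in stale .10-.20: $stale"
rm -f "$sample"
```

On the sample this reports 3 in range and 0 stale. The zero comes back as the string 0 rather than as a failed command, so the caller can compare it explicitly.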
================================================================================
vault operator init ... > file captures stdout only; if the key block went to stderr (or the run is interrupted) you are left with an unusable/empty file and the 5 shares + root token are GONE -- init runs exactly once and cannot be replayed.
vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt VERBATIM; gate on grep -c '^Unseal Key' == 5 and Initial Root Token present; then save the file OFF-HOST before anything else. Never improvise this command.

token (phase-02)
authorize-charm action takes token (a direct token string); there is no token-secret-id variant in this charm rev. Confirm via juju actions vault --schema. Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log).

juju run vault/leader generate-root-ca -- it mints the charm-pki-local root and clears the block straight to active. (Omitting it leaves vault hung.)

vault operator unseal (no argument) so it prompts hidden; the key is never on the command line / in a var / in ps / in scrollback. Do NOT use vault operator unseal $KEY (visible in ps on the unit). Unseal is re-runnable, so the verbatim-reference rule is looser here, but the security gain is real.
================================================================================
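The post-init gate (five Unseal Key lines plus an Initial Root Token line) can be written as a reusable check on the captured file. This is a sketch only. It inspects an existing capture and never runs vault itself, and the function name vault_init_capture_ok is hypothetical.

```shell
#!/usr/bin/env bash
# vault_init_capture_ok FILE
# Succeeds only if FILE holds exactly 5 lines starting "Unseal Key" and at
# least one line starting "Initial Root Token". Never prints the secrets.
vault_init_capture_ok() {
  local file=$1 keys
  # "|| true": a zero count is a valid answer, handled by the comparison below.
  keys=$(grep -c '^Unseal Key' -- "$file" 2>/dev/null || true)
  if [ "$keys" != "5" ]; then
    echo "expected 5 unseal keys, found ${keys:-none}" >&2
    return 1
  fi
  if ! grep -q '^Initial Root Token' -- "$file"; then
    echo "Initial Root Token line missing" >&2
    return 1
  fi
}
```

Usage: vault_init_capture_ok ~/vault-init/init.txt || echo "STOP: capture incomplete, do not proceed". Passing the gate only shows the file is complete; copying it off-host is still a separate step.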
================================================================================
OS_AUTH_URL=https://10.12.4.50:5000/v3, with the vault root CA in OS_CACERT (B5 IP-SAN certs validate). No /etc/hosts, no FQDN.

admin, living in domain admin_domain; an older doc's OS_PROJECT_NAME=admin_domain 401s). Credential good, scope wrong.
================================================================================
================================================================================
juju status is all active/idle, yet a service VIP intermittently 503s or a unit's API is unreachable. juju health is BLIND to per-backend haproxy state.
/var/run/haproxy/admin.sock (show stat) and grep ',DOWN,' (excluding the FRONTEND/BACKEND summary rows). For any flagged unit: sudo haproxy -c -f /etc/haproxy/haproxy.cfg (must say valid) then sudo systemctl reload haproxy (graceful master-worker; reload, not restart). Phase-03 3.1 gates on a zero-DOWN sweep cloud-wide -- it closes the juju-green-but-backend-DOWN hole.

systemctl reload nginx right after editing the vhost can be served by a still-draining old worker (a curl ~2s later hits stale behavior; the co-hosted MAAS proxy blips too). nginx -t FIRST; prefer restart for a definitive cutover when the listen/upstream set changed, reload only for content-equivalent edits.

juju-ffe3b8-2-lxd-2); set proxy_ssl_name to that SAN, proxy_ssl_verify on, and the vault CA in proxy_ssl_trusted_certificate, or verification fails on the IP-only connect.

sed -i that does not match silently changes nothing and the proxy keeps the old behavior -- assert the post-edit content, do not trust sed's exit code.

https:// Location headers while the proxy listens http; without proxy_redirect https:// http:// (or a matching listen scheme) the browser loops. Match the scheme end-to-end or rewrite the redirect.
================================================================================
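The DOWN sweep (grep ',DOWN,' minus the FRONTEND/BACKEND summary rows) can be sketched as a stdin filter. The function name down_backends is hypothetical, and the sketch assumes the second CSV field is the server name, as in haproxy's show stat output.

```shell
#!/usr/bin/env bash
# down_backends
# Reads "show stat" CSV on stdin and prints rows containing ",DOWN,",
# skipping the per-proxy FRONTEND/BACKEND summary rows (second CSV field).
# "|| true": printing nothing is the healthy result, not an error.
down_backends() {
  grep ',DOWN,' | grep -Ev '^[^,]*,(FRONTEND|BACKEND),' || true
}
```

How the CSV is fetched from /var/run/haproxy/admin.sock is left to the phase-03 block. One common approach, assumed here and not taken from this index, is: echo "show stat" | sudo socat stdio /var/run/haproxy/admin.sock | down_backends. The gate is then that the output is empty on every unit.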
================================================================================
openstack image create --file /tmp/... -> "[Errno 2] No such file or directory" even though sha256sum just read the same path.
/tmp; it CAN read $HOME (home interface).
$HOME (e.g. $HOME/amphora-base/...), never /tmp.

configure-resources is long-running: juju's default action wait may time out ("timed out waiting for results") while the hook KEEPS RUNNING -- do NOT treat the wait-timeout as failure or re-fire blindly. Use a bound --wait and confirm completion via juju show-operation <N> (authoritative), not the streamed log.

network:distributed port shows DOWN (logical OVN port, never chassis-bound).

octavia amp-image-tag; it MUST equal the tag the retrofit stamps (octavia-diskimage-retrofit amp-image-tag), both octavia-amphora. A mismatch means octavia cannot find the image even though it is built and ACTIVE. The amphora pipeline gate asserts the two are equal before building (phase-05 5.2).
================================================================================
================================================================================
--import-method web-download) 202-accepts, then the image hangs in queued forever and never reaches active.
Python-urllib/3.x); the azimuth-images CDN (azimuth-images.stackhpc.cloud) returns HTTP 403 to that UA. A curl/HEAD probe with a different UA passes -- which is why an earlier probe false-passed while the real import failed.
$HOME (snap-readable, NOT /tmp -- L7; curl's UA is not blocked), verify the checksum against the published manifest (azimuth-images manifest.json -- sha512 for kube images; the ubuntu cloud-images SHA256SUMS for noble), then openstack image create --file --import (the openstack snap's --import == glance-direct; image-conversion lands it raw). CORRECTION-1: a plain --file PUT (no --import) stores qcow2 -- fine for boot, but --import gives the raw Ceph fast-clone alignment.
openstack image delete <id> on the queued remnant (verify the EXACT id first -- FINDING-4 name-guard discipline).

openstack image create --import --import-method web-download --uri <url>) is retained as a tested ALTERNATIVE, not the canonical path (superseded 2026-06-17; see design-decisions). Caveats: (1) it cannot checksum-verify the fetched file against a published digest (the CDN redirect strips it) -- weaker provenance; (2) it 403s on the azimuth CDN (FINDING-3), so it is unusable for kube images; (3) for ubuntu cloud-images it works on the hardened bundle (the 2026-06-08 403 was transient/pre-hardening). Use only as an expedient.
================================================================================
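The verify-before-upload step for kube images (sha512 against the published manifest) can be sketched as below. How the expected digest is read out of manifest.json is not shown, because its layout is not described in this index. The function name verify_sha512 is hypothetical, and sha512sum is assumed to be the coreutils tool.

```shell
#!/usr/bin/env bash
# verify_sha512 FILE EXPECTED_HEX
# Succeeds only if EXPECTED_HEX is non-empty and equals FILE's SHA-512 digest.
verify_sha512() {
  local file=$1 expected=$2 actual
  # An empty expected digest must never count as a match.
  [ -n "$expected" ] || return 1
  actual=$(sha512sum -- "$file" 2>/dev/null | awk '{print $1}')
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

The intended use is to gate the upload on it: run openstack image create only when verify_sha512 succeeds on the file under $HOME with the digest from the manifest. For the noble cloud image the same shape applies with sha256sum and the SHA256SUMS entry.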
================================================================================
Five entries from the 2026-06-10 recovery session. Full procedures with verified blocks: runbooks/ops-capi-recovery.md.
While capi-mgmt-v2 is stopped: Magnum reports UNHEALTHY with an EMPTY health_status_reason (distinct from the D-042 cosmetic, which has a populated reason); the Horizon Container Infra panel may 504 through the jumphost nginx proxy and coe CLI calls may stall; the workload cluster keeps serving (no runtime dependency on the mgmt cluster). If jumphost secrets were filed during parking, the convention is ~/sweep-YYYYMMDD/secrets/. See ops-capi-recovery Section 0 (expectations table) and Section 1 (parking block).
Causal chain (traced live 2026-06-10): host CPU/memory pressure -> amphora heartbeats go stale -> Octavia health-manager marks amphorae ERROR and launches auto-failovers -> failovers fail NoValidHost (no placement headroom) -> amphora servers accumulate with NO Octavia DB row. Two variants: an ERROR server (failed spawn) and an ACTIVE heartbeating zombie (health-manager logs "missing from the DB ... An operator must manually delete it" every 10 s). Remedy: verify-then-delete by SERVER UUID under admin scope -- the loadbalancer amphora list output is the DB truth; Nova name lookup is project-scoped (amphorae live in the Octavia services project). Procedure: ops-capi-recovery 5a. Do NOT retry failover against the same blocker; each attempt mints another zombie.
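The verify step of verify-then-delete is a set difference: amphora server UUIDs that Nova knows about, minus the compute ids that Octavia has a DB row for. A minimal sketch, assuming each list has already been exported to a file with one UUID per line (the export commands are deliberately not shown; see ops-capi-recovery 5a). The function name orphan_servers is hypothetical.

```shell
#!/usr/bin/env bash
# orphan_servers NOVA_IDS_FILE OCTAVIA_IDS_FILE
# Prints the UUIDs present in NOVA_IDS_FILE but absent from OCTAVIA_IDS_FILE,
# one per line, sorted. These are delete CANDIDATES only; each one still
# needs an eyeball check before any delete.
orphan_servers() {
  comm -23 <(sort -u -- "$1") <(sort -u -- "$2")
}
```

The output is a review list, not input for a delete loop. Octavia's amphora list is the DB truth per the entry above, so any UUID it still holds never appears in the output.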
STANDALONE failover builds the replacement amphora BEFORE reaping the old one, so it transiently needs one extra amphora slot (charm-octavia: 1024 MB / 1 vCPU / 8 GB). Scheduler ceiling per host = physical_MB * ram_allocation_ratio (1.5)
</dev/null vs an expired macaroon (DOCFIX-021 interaction)
DOCFIX-021's </dev/null on juju ssh assumes valid macaroon auth. When the jumphost macaroon goes stale, juju falls back to an interactive password prompt; </dev/null feeds that prompt EOF and the symptom is the misleading "cannot get discharge from https://:17070/auth: EOF". Triage: run juju status interactively -- if it succeeds after a password prompt, the controller is healthy and only the credential cache is stale. Workaround for the session: stdin from </dev/tty. Fix at a calm moment: juju logout then juju login.
CAPI/Magnum VMs are owned by the capi-mgmt project; an empty Project -> Compute -> Instances page under admin scope is correct, not a defect. Map: tenant VMs -> Instances in the OWNING project's scope (use the header project switcher; admin holds member on capi-mgmt per phase-06 6.0-BOOT); LB objects -> Project -> Network -> Load Balancers in the owning project's scope; amphora VMs -> Admin -> Compute -> Instances ONLY (they belong to the Octavia services project); everything at once -> CLI openstack server list --all-projects. Warning about the asymmetry: the Container Infra panel lists clusters cross-project under admin policy, which makes the strictly-scoped Nova panel look broken when it is not.