Keyed by the same D-NNN / DOCFIX-NNN / L-P6-N identifiers used inline in the phase runbooks. This is an OPERATIONAL index (symptom -> cause -> fix), NOT the decision log: full rationale lives in design-decisions.md and the per-decision files (D-0NN-*.md); the driver fix has its own magnum-capi-helm-driver-fix-runbook. Each entry notes the phase(s) that reference it. ASCII-only.
================================================================================
================================================================================
juju ssh/ssh ... bash -s or remote sudo block dies early or behaves as if truncated; later commands in the heredoc never run.
An inner ssh/sudo/juju ssh (or any stdin reader) consumes the rest of the heredoc/pipe that was feeding the outer command.
Fix: append </dev/null to every inner ssh/sudo/juju ssh invocation (use </dev/tty instead only when the call genuinely needs an interactive prompt).
Wrap pasted blocks in ( { ...; } ) so a stray exit cannot kill the interactive shell.

juju run vault/leader get-root-ca wraps the PEM in an INDENTED YAML output: |- block; sed-by-marker preserves the indent and an indented -----BEGIN CERTIFICATE----- is not valid PEM -> openssl "Unable to load certificate" -> keystone NO_CERTIFICATE_OR_CRL_FOUND.
Fix: pull from the action JSON (real newlines, no indent):
juju run vault/leader get-root-ca -m openstack --format json | jq -r '[.. | strings | select(test("BEGIN CERTIFICATE"))][0]'
(Same class as DOCFIX-006: never trust action human output for a captured secret/cert.)

/tmp, or letting a PTY mangle it in transit.
umask 077, then chown to the service user and chmod 0600 -- never touch /tmp. (Pattern in phase-07 7.2.)
================================================================================
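The stdin-consumption failure described above can be reproduced locally with no remote host. A minimal sketch, assuming cat as a stand-in for the inner ssh/sudo/juju ssh call (any stdin reader behaves the same way); the function names run_unguarded and run_guarded are illustrative and do not come from the runbooks.

```shell
# Stand-in for an inner ssh/sudo/juju ssh: any command that reads stdin.
# The outer "bash -s" reads its script from the heredoc; the inner reader
# inherits that same stdin unless it is redirected.

run_unguarded() {
  bash -s <<'EOF'
echo step1
cat >/dev/null
echo step2
EOF
}

run_guarded() {
  bash -s <<'EOF'
echo step1
cat >/dev/null </dev/null
echo step2
EOF
}
```

In run_unguarded the inner reader swallows the remaining script text, so only step1 is printed; in run_guarded the reader sees EOF at once and step2 still runs.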
================================================================================
k8s bootstrap "succeeds" but the node never reaches Ready; network and DNS are silently disabled; CoreDNS/Cilium absent.
A bootstrap --file whose top level lacks a cluster-config: block leaves ALL features (network, dns, ...) at disabled defaults. Setting only pod-cidr / service-cidr / extra-sans does NOT enable them.
Fix:
cluster-config:
  network: { enabled: true }
  dns: { enabled: true }
(See phase-06 6.4 for the full config.) Retry: snap remove k8s --purge then re-bootstrap.
================================================================================
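A pre-bootstrap gate for the missing cluster-config: case could look like the sketch below. It is a text-level check only, under the assumption that the key sits at column 0 of the bootstrap file; it does not parse YAML or confirm that network and dns are actually enabled. The function name and the example path are hypothetical.

```shell
# has_cluster_config FILE
# Exit 0 if FILE has a top-level (column 0) cluster-config: key, else non-zero.
has_cluster_config() {
  grep -q '^cluster-config:' "$1"
}

# Example gate before bootstrapping (path is hypothetical):
# has_cluster_config "$HOME/bootstrap.yaml" || echo "REFUSE: no cluster-config: block" >&2
```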
================================================================================
--set installCRDs=true.
installCRDs was removed from the cert-manager chart (~v1.18). The current flag is crds.enabled=true.
helm install cert-manager jetstack/cert-manager ... --set crds.enabled=true.

After clusterctl init, capo-controller-manager CrashLoopBackOff (observed ~6 restarts / ~15 min) before self-healing.
CAPO's openstackserver controller hard-depends on ORC's Image.openstack.k-orc.cloud CRD at startup. clusterctl init installs CAPO; if ORC is not yet present, CAPO crash-loops until it appears.
Fix: install ORC (which provides the Image CRD) BEFORE clusterctl init. Hardened order: cert-manager -> ORC -> clusterctl init -> CAAPH -> janitor.
capi-helm-charts tag's dependencies.json (read live with jq); do not hardcode semver. (Full rationale: design-decisions D-034; driver-coherence amendment: D-042.)
================================================================================
================================================================================
cilium_host -- silent, asymmetric breakage. (The "do-07 pattern.")
================================================================================
================================================================================
[capi_helm] conf.d drop-in is ignored; the conductor behaves as if it was never written, even though a systemd drop-in "looks" applied.
systemd-start, NOT a direct ExecStart=. A systemd drop-in appending --config-dir passes it as a positional arg to the init script, which ignores it -- the flag never reaches the daemon. The args are assembled inside the init script from DAEMON_ARGS (base --config-file first), extensible only via /etc/default/<service>.
Fix: append to /etc/default/magnum-conductor (0644; the charm does not manage it):
DAEMON_ARGS="$DAEMON_ARGS --config-dir /etc/magnum/magnum.conf.d"
Verify with the init script's own show-args (dry-run) AND ps -ww -C magnum-conductor -o args on the live process -- behavioral, not string-presence.
/etc/default/magnum-conductor, the append is lost and [capi_helm] silently stops being read. Re-check via show-args/ps.
ExecStart shape for OpenStack debs, and never treat "string present in the unit file" as "the daemon received the flag." Gate on the assembled/launched cmdline (show-args, then ps on the live process).

kube_version from the Glance image properties and routes on os_distro; it does NOT take k8s version from a template label.
ubuntu-jammy-kube-v1.32.13) MUST carry kube_version (e.g. v1.32.13) and os_distro=ubuntu. Verify before create (phase-08 8.0).
================================================================================
================================================================================
coe cluster show reports health_status = UNHEALTHY deterministically (survives a conductor restart); only the infrastructure sub-check fails ("Infrastructure resource not found"); cluster + control-plane + nodegroup are Ready.
apiVersion off spec.infrastructureRef to build its health GET, but the CAPI v1.13 (v1beta2 contract) ref carries apiGroup+kind+name with NO apiVersion. COSMETIC -- the create path is unaffected (the chart templates the resource versions); only the driver's direct health query breaks.
magnum-capi-helm==1.4.0 (the "generalize-api-resources" feature). 1.4.0 builds each health GET from an explicit api_version via its [capi_helm] api_resources option, which DEFAULTS to v1beta1 for every CAPI kind -- and CAPI v1.13.2 / CAPO v0.14.4 still serve v1beta1, so the default works (no override needed; phase-07 7.3-7.6). Set a per-kind override only if a kind is v1beta2-only. Rule (amends D-034): the Layer-B driver pin must be contract-coherent with the Layer-A CAPI core.
health_status (a persistent false UNHEALTHY could misfire); CAPI MachineHealthCheck heals independently.
================================================================================
================================================================================
load-balancer_member (a pre-D-039 frozen app-cred cannot query Octavia to confirm LB state).
load-balancer_member (+ member, reader). Verify before acceptance (phase-08 prereqs).

DELETE_IN_PROGRESS; helm release already gone; Cluster and OpenStackCluster CRs stuck Deleting (often on an Octavia 403, see D-039).
OpenStackCluster finalizer on the mgmt cluster -- kubectl -n <magnum-ns> patch openstackcluster <cluster>-<suffix> --type=merge -p '{"metadata":{"finalizers":[]}}'. The Cluster finalizer was only waiting on it, so the Cluster auto-finalizes and deletes. Then manually clean orphaned neutron resources in dependency order: router remove subnet -> router unset external-gateway -> router delete -> subnet delete -> network delete -> security group delete.

operating_status ONLINE but provisioning_status ERROR after a host outage/OOM.
openstack loadbalancer failover <lb-id> in ADMIN-project scope (amphora / failover ops 403 under tenant member scope). Watch ERROR -> PENDING_UPDATE -> ACTIVE (~100s); a single STANDALONE amphora gives a brief blip; operating_status holds ONLINE.

node.cluster.x-k8s.io/uninitialized.
network_driver label was IGNORED and the capi-helm openstack-cluster chart's default CNI (Calico) always ran. Under the RELEASED 1.4.0 driver the network_driver template option IS honored (it maps through to the chart). To keep the as-built CNI (Calico), the capi-k8s-v1-32 template OMITS --network-driver (phase-08); set flannel there only to intentionally switch the CNI. (Mgmt cluster CNI is separately Cilium, via k8s-snap.)
================================================================================
================================================================================
State=down (heavy swap thrash stalls OVS/OVN heartbeats and the machine agent).
reserved-host-memory default 512 MB does not cover the co-located LXD/Ceph/MySQL services on these hyperconverged hosts -> nova over-commits real RAM.
reserved-host-memory = 8192 on all compute units (baked into the hardened bundle). Diagnose a suspected OOM-vs-reboot with who -b / uptime (no recent boot) and journalctl -k | grep -i oom; the ovsdb "no response to inactivity probe ... disconnecting" storm is the swap-thrash signature.

capi-mgmt-v2) is SHUTOFF; FIP unreachable; magnum cannot reach the mgmt API; workload addons go Pending (see uninitialized-taint).
openstack server start capi-mgmt-v2 (API serves ~40s later; a brief TLS handshake timeout on the first kubectl is expected). Follow-up: HA mgmt cluster for Roosevelt.

juju ssh (or other juju calls) fail mid-session with a discharge/EOF error.
juju login, then retry.
================================================================================
================================================================================
maas list (API-key leak) (phase-00, phase-01, phase-04)
maas list prints the stored API key to stdout (and into any transcript/log).
admin); call maas admin ... directly. Never run maas list in a runbook or paste block.

maas whoami; hardcode the eyeballed system_ids (phase-00)
maas <profile> whoami + owner filters is fragile and, in this lab, unnecessary.
maas list/whoami and released 5 VMs incl. the retired D-033 capi-mgmt; the current rebuild releases 4.)

/var/lib/libvirt/images/<host>-1.qcow2) are root:root / 600; qemu-img info|create, virsh domstate, stat, rm against them all need sudo.

vip: address equals a MAAS-auto-assigned machine/container primary (observed: cinder public VIP .226 == magnum container 1/lxd/3 primary).
================================================================================
================================================================================
vip: is a dual provider+metal pair "10.12.4.5x 10.12.8.5x" (D-020). Pre-deploy guard: total provider VIPs=11, all in .50-.60, zero in the stale .10-.20 (phase-01 1.1). Any per-cloud consumer of a VIP (the Horizon reverse proxy, monitoring) must be repointed.

10.12.8.10 appears in a node's resolver list (sometimes as Current DNS Server) despite the subnet dns_servers override.

set -e on count-gate blocks; guard greps || true (phase-01)
grep -c printing a count of 0 (and therefore exiting 1) is a VALID answer, not a failure. Under set -e a zero-count grep aborts the block. Pre-deploy verify blocks run WITHOUT set -e, and every count grep ends || true. (bash -n would not catch this -- it is behavior.)

vip:. The metal side (second token, 10.12.8.5x) must be eyeballed to confirm all 11 sit in .8.50-.60, clear of metal infra (.8.10 maas / .8.20 lxd / .8.21 capi / .8.30 juju).
================================================================================
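The set -e / grep -c interaction noted above can be shown in isolation. A minimal sketch: count_matches is a hypothetical helper that applies the || true guard, and the two demo functions run in child shells so that their set -e cannot leak into the caller. None of these names come from the runbooks.

```shell
# count_matches PATTERN FILE
# Prints the number of matching lines; a count of 0 is not an error.
count_matches() {
  grep -c -- "$1" "$2" || true
}

# Unguarded: grep -c prints 0 but exits 1, so set -e aborts before the echo.
demo_unguarded() {
  bash -c 'set -e; n=$(grep -c -- "$1" "$2"); echo "count=$n"' _ "$1" "$2"
}

# Guarded: || true absorbs the exit status, so the count of 0 is reported.
demo_guarded() {
  bash -c 'set -e; n=$(grep -c -- "$1" "$2" || true); echo "count=$n"' _ "$1" "$2"
}
```

Note that the guard also hides real grep errors (for example a missing file yields an empty count), so a gate should still compare the captured value against the expected number.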
================================================================================
vault operator init ... > file captures stdout only; if the key block went to stderr (or the run is interrupted) you are left with an unusable/empty file and the 5 shares + root token are GONE -- init runs exactly once and cannot be replayed.
vault operator init -key-shares=5 -key-threshold=3 2>&1 | tee ~/vault-init/init.txt VERBATIM; gate on grep -c '^Unseal Key' == 5 and Initial Root Token present; then save the file OFF-HOST before anything else. Never improvise this command.

token (phase-02)
authorize-charm action takes token (a direct token string); there is no token-secret-id variant in this charm rev. Confirm via juju actions vault --schema. Authorize with a SHORT-LIVED CHILD token (juju run persists action params in the op log).

juju run vault/leader generate-root-ca -- it mints the charm-pki-local root and clears the block straight to active. (Omitting it leaves vault hung.)

vault operator unseal (no argument) so it prompts hidden; the key is never on the command line / in a var / in ps / in scrollback. Do NOT use vault operator unseal $KEY (visible in ps on the unit). Unseal is re-runnable, so the verbatim-reference rule is looser here, but the security gain is real.
================================================================================
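The capture gate for vault operator init (five '^Unseal Key' lines plus an Initial Root Token line) could be wrapped as below. This is a sketch of the check only, not of the init command itself, which stays verbatim as given above; the function name gate_vault_init is hypothetical.

```shell
# gate_vault_init FILE
# Exit 0 only if FILE holds exactly 5 unseal-key lines and a root-token line.
gate_vault_init() {
  f=$1
  keys=$(grep -c '^Unseal Key' "$f" || true)
  if [ "$keys" != "5" ]; then
    echo "FAIL: expected 5 unseal keys, found ${keys:-none}" >&2
    return 1
  fi
  if ! grep -q '^Initial Root Token' "$f"; then
    echo "FAIL: no Initial Root Token line" >&2
    return 1
  fi
  echo "OK: 5 shares + root token captured"
}

# Example: gate_vault_init ~/vault-init/init.txt && echo "now copy the file OFF-HOST"
```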
================================================================================
OS_AUTH_URL=https://10.12.4.50:5000/v3, with the vault root CA in OS_CACERT (B5 IP-SAN certs validate). No /etc/hosts, no FQDN.

admin, living in domain admin_domain; an older doc's OS_PROJECT_NAME=admin_domain 401s). Credential good, scope wrong.
================================================================================
================================================================================
openstack image create --file /tmp/... -> "[Errno 2] No such file or directory" even though sha256sum just read the same path.
/tmp; it CAN read $HOME (home interface).
$HOME (e.g. $HOME/amphora-base/...), never /tmp.

configure-resources is long-running: juju's default action wait may time out ("timed out waiting for results") while the hook KEEPS RUNNING -- do NOT treat the wait-timeout as failure or re-fire blindly. Use a bound --wait and confirm completion via juju show-operation <N> (authoritative), not the streamed log.

network:distributed port shows DOWN (logical OVN port, never chassis-bound).

octavia amp-image-tag; it MUST equal the tag the retrofit stamps (octavia-diskimage-retrofit amp-image-tag), both octavia-amphora. A mismatch means octavia cannot find the image even though it is built and ACTIVE. The amphora pipeline gate asserts the two are equal before building (phase-05 5.2).
================================================================================
================================================================================
Five entries from the 2026-06-10 recovery session. Full procedures with verified blocks: runbooks/ops-capi-recovery.md.
While capi-mgmt-v2 is stopped: Magnum reports UNHEALTHY with an EMPTY health_status_reason (distinct from the D-042 cosmetic, which has a populated reason); the Horizon Container Infra panel may 504 through the jumphost nginx proxy and coe CLI calls may stall; the workload cluster keeps serving (no runtime dependency on the mgmt cluster). If jumphost secrets were filed during parking, the convention is ~/sweep-YYYYMMDD/secrets/. See ops-capi-recovery Section 0 (expectations table) and Section 1 (parking block).
Causal chain (traced live 2026-06-10): host CPU/memory pressure -> amphora heartbeats go stale -> Octavia health-manager marks amphorae ERROR and launches auto-failovers -> failovers fail NoValidHost (no placement headroom) -> amphora servers accumulate with NO Octavia DB row. Two variants: an ERROR server (failed spawn) and an ACTIVE heartbeating zombie (health-manager logs "missing from the DB ... An operator must manually delete it" every 10 s). Remedy: verify-then-delete by SERVER UUID under admin scope -- the loadbalancer amphora list output is the DB truth; Nova name lookup is project-scoped (amphorae live in the Octavia services project). Procedure: ops-capi-recovery 5a. Do NOT retry failover against the same blocker; each attempt mints another zombie.
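The verify step of verify-then-delete is a set difference: amphora servers that Nova knows about minus the compute IDs Octavia still has rows for. A minimal sketch, assuming the two UUID lists have already been exported to files, one UUID per line, under admin scope. How they are exported is left to ops-capi-recovery 5a; the function name and file names here are hypothetical.

```shell
# orphan_servers NOVA_IDS OCTAVIA_IDS
# NOVA_IDS:    file of amphora server UUIDs as seen by Nova
# OCTAVIA_IDS: file of compute IDs taken from the Octavia amphora list (DB truth)
# Prints UUIDs present in Nova but absent from the Octavia DB listing.
orphan_servers() {
  nova_sorted=$(mktemp) || return 1
  db_sorted=$(mktemp) || return 1
  LC_ALL=C sort -u "$1" > "$nova_sorted"
  LC_ALL=C sort -u "$2" > "$db_sorted"
  # comm -23 keeps lines that appear only in the first (Nova) list.
  LC_ALL=C comm -23 "$nova_sorted" "$db_sorted"
  rm -f "$nova_sorted" "$db_sorted"
}

# Example: orphan_servers nova-amphora-ids.txt octavia-compute-ids.txt
```

The output is a candidate list only; each UUID should still be checked by hand before any delete, and nothing in the sketch deletes anything.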
STANDALONE failover builds the replacement amphora BEFORE reaping the old one, so it transiently needs one extra amphora slot (charm-octavia: 1024 MB / 1 vCPU / 8 GB). Scheduler ceiling per host = physical_MB * ram_allocation_ratio (1.5)
</dev/null vs an expired macaroon (DOCFIX-021 interaction)
DOCFIX-021's </dev/null on juju ssh assumes valid macaroon auth. When the jumphost macaroon goes stale, juju falls back to an interactive password prompt; </dev/null feeds that prompt EOF and the symptom is the misleading "cannot get discharge from https://:17070/auth: EOF". Triage: run juju status interactively -- if it succeeds after a password prompt, the controller is healthy and only the credential cache is stale. Workaround for the session: stdin from </dev/tty. Fix at a calm moment: juju logout then juju login.
CAPI/Magnum VMs are owned by the capi-mgmt project; an empty Project -> Compute -> Instances page under admin scope is correct, not a defect. Map: tenant VMs -> Instances in the OWNING project's scope (use the header project switcher; admin holds member on capi-mgmt per phase-06 6.0-BOOT); LB objects -> Project -> Network -> Load Balancers in the owning project's scope; amphora VMs -> Admin -> Compute -> Instances ONLY (they belong to the Octavia services project); everything at once -> CLI openstack server list --all-projects. Warning about the asymmetry: the Container Infra panel lists clusters cross-project under admin policy, which makes the strictly-scoped Nova panel look broken when it is not.