Troubleshooting - method, reflexes, and routing
The repo's runbooks/appendix-A-troubleshooting.md is the operational symptom->cause->fix index, keyed by D-NNN/DOCFIX-NNN (it now includes the mysql-innodb-cluster recovery signatures and a point-of-use identifier index). ALWAYS check it before diagnosing from first principles - most failures on this cloud have been seen before and have a validated fix. ops-capi-recovery.md covers the CAPI/Magnum stack; runbooks/ops-restart-procedure.md covers full-cloud recovery. This file gives you the method and the reflexes that must fire BEFORE and WHILE you read those.
Triage method
- Reproduce and capture verbatim. The exact error text routes you in appendix-A. Paraphrased symptoms match nothing.
- Eliminate cheap layers first. Web UI misbehaving: incognito/client state (~10s) before any server hypothesis. CLI misbehaving: confirm the token scope (project/domain) before blaming the service - wrong ambient scope produces misleading 404s/403s, especially with domain-per-client identity (an un-domain-qualified lookup resolves in the WRONG domain).
- Establish what is actually true, live.
bash scripts/cloud-assert.sh first - it runs every service-own-verdict gate in one read-only sweep and localizes the failing layer. Then the targeted read-only audit of that surface before touching anything. juju status, the service's own state (haproxy admin socket, mysql cluster-status, vault status, ovn-appctl), the relevant logs. Hard rule 2: measurements, not memory.
- Match against appendix-A, then design-decisions if it smells architectural. If matched: follow the recorded fix exactly (verbatim- reference rule for anything one-shot).
- If unmatched: smallest-possible hypothesis, read-only test to confirm or kill it, and log the finding. New root causes become appendix-A / DOCFIX material - capturing them is part of the fix.
- Least-destructive first, individually gated. Reload before restart, restart before rebuild, rebuild before redeploy. Never batch.
Reflexes (internalize these; they override intuition)
- juju-green is necessary, not sufficient.
active/idle is BLIND to: a DOWN haproxy backend (D-045 - hid a dead nova-api for 3 days), a missing magnum trustee domain (D-046 - magnum reports ready regardless), an unparsed-but-attached policy override, and per-backend/service state generally. Gate on the service's OWN verdict (admin sockets, functional probes), never on juju status alone.
- "Reports OK while broken" generalizes. Charm-side validation is often parse-level only (the keystone policy zip validates YAML, not semantics). Acceptance is BEHAVIORAL: prove the capability works, not that the config is present.
- Distinguish down from thrashing. A host that looks "down" to juju/OVN (ovsdb inactivity-probe storms) may be swap-thrashing, not down (D-040). Check
who -b / uptime / journalctl -k | grep -i oom before treating it as an outage.
- Read the reason field before the status field. Magnum health UNHEALTHY has three signatures (D-042 amendment): reason EMPTY = conductor cannot reach the mgmt API (VM down?); all-Ready except infrastructure "not found" = the cosmetic driver-contract miss; a reason citing LB failure = real, check Octavia. Same status, three different responses.
- Single-node mgmt VM does not self-heal (D-035/D-041). Workload nodes wedged with the
uninitialized taint, magnum reconcile dead, addons Pending -> check capi-mgmt-v2 is ACTIVE before anything else. Manual start is POLICY (down is a signal), not a defect.
- Known blast-radius traps:
juju destroy-model decomposes MAAS pod-composed machines - machine retention is juju remove-machine --keep-instance (D-061); mysql-innodb-cluster never bootstraps at num_units 1 - deploy at 3 (D-062); reboot-cluster-from-complete-outage is destructive against an already-healthy cluster - check cluster status FIRST; vault restarts sealed by design - sealed-after-reboot is expected, not a fault; the magnum domain-setup action must be re-run after every redeploy (D-046).
- Green-in-the-shell, broken-in-the-daemon. Interactive shells have a different PATH, env, and stdin than LSB-init daemons. Verify under the daemon's conditions (restricted PATH, live-process args) - see script-authoring.
- The absence of an expected resource is a finding, not a gap to fill. If something that should exist does not (a domain, a rule, an image property), find out WHY it is absent before recreating it - it may have been removed deliberately (check design-decisions) or its absence may be the actual root cause several layers up.
Escalation and blockers
- Missing vault unseal keys, lost one-shot secrets, or anything requiring material only the operator holds: STOP and escalate. Do not improvise around a recovery blocker (a vault re-init wipes the TLS plane).
- If a fix would deviate from a runbook or contradict a D-NNN: log it as a proposal and get a ruling. Mid-incident is exactly when discipline pays.
- Two failed fix attempts on the same hypothesis = the hypothesis is wrong. Step back to the audit stage; widen what you are measuring.