Troubleshooting - method, reflexes, and routing

The repo's runbooks/appendix-A-troubleshooting.md is the operational symptom->cause->fix index, keyed by D-NNN/DOCFIX-NNN (it now includes the mysql-innodb-cluster recovery signatures and a point-of-use identifier index). ALWAYS check it before diagnosing from first principles - most failures on this cloud have been seen before and have a validated fix. ops-capi-recovery.md covers the CAPI/Magnum stack; runbooks/ops-restart-procedure.md covers full-cloud recovery. This file gives you the method and the reflexes that must fire BEFORE and WHILE you read those.

Triage method

Reproduce and capture verbatim. The exact error text routes you in appendix-A. Paraphrased symptoms match nothing.
Eliminate cheap layers first. Web UI misbehaving: incognito/client state (~10s) before any server hypothesis. CLI misbehaving: confirm the token scope (project/domain) before blaming the service - wrong ambient scope produces misleading 404s/403s, especially with domain-per-client identity (an un-domain-qualified lookup resolves in the WRONG domain).
Establish what is actually true, live. bash scripts/cloud-assert.sh first - it runs every service-own-verdict gate in one read-only sweep and localizes the failing layer. Then the targeted read-only audit of that surface before touching anything. juju status, the service's own state (haproxy admin socket, mysql cluster-status, vault status, ovn-appctl), the relevant logs. Hard rule 2: measurements, not memory.
Match against appendix-A, then design-decisions if it smells architectural. If matched: follow the recorded fix exactly (verbatim- reference rule for anything one-shot).
If unmatched: smallest-possible hypothesis, read-only test to confirm or kill it, and log the finding. New root causes become appendix-A / DOCFIX material - capturing them is part of the fix.
Least-destructive first, individually gated. Reload before restart, restart before rebuild, rebuild before redeploy. Never batch.

Reflexes (internalize these; they override intuition)

juju-green is necessary, not sufficient. active/idle is BLIND to: a DOWN haproxy backend (D-045 - hid a dead nova-api for 3 days), a missing magnum trustee domain (D-046 - magnum reports ready regardless), an unparsed-but-attached policy override, and per-backend/service state generally. Gate on the service's OWN verdict (admin sockets, functional probes), never on juju status alone.
"Reports OK while broken" generalizes. Charm-side validation is often parse-level only (the keystone policy zip validates YAML, not semantics). Acceptance is BEHAVIORAL: prove the capability works, not that the config is present.
Distinguish down from thrashing. A host that looks "down" to juju/OVN (ovsdb inactivity-probe storms) may be swap-thrashing, not down (D-040). Check who -b / uptime / journalctl -k | grep -i oom before treating it as an outage.
Read the reason field before the status field. Magnum health UNHEALTHY has three signatures (D-042 amendment): reason EMPTY = conductor cannot reach the mgmt API (VM down?); all-Ready except infrastructure "not found" = the cosmetic driver-contract miss; a reason citing LB failure = real, check Octavia. Same status, three different responses.
Single-node mgmt VM does not self-heal (D-035/D-041). Workload nodes wedged with the uninitialized taint, magnum reconcile dead, addons Pending -> check capi-mgmt-v2 is ACTIVE before anything else. Manual start is POLICY (down is a signal), not a defect.
Known blast-radius traps: juju destroy-model decomposes MAAS pod-composed machines - machine retention is juju remove-machine --keep-instance (D-061); mysql-innodb-cluster never bootstraps at num_units 1 - deploy at 3 (D-062); reboot-cluster-from-complete-outage is destructive against an already-healthy cluster - check cluster status FIRST; vault restarts sealed by design - sealed-after-reboot is expected, not a fault; the magnum domain-setup action must be re-run after every redeploy (D-046).
Green-in-the-shell, broken-in-the-daemon. Interactive shells have a different PATH, env, and stdin than LSB-init daemons. Verify under the daemon's conditions (restricted PATH, live-process args) - see script-authoring.
The absence of an expected resource is a finding, not a gap to fill. If something that should exist does not (a domain, a rule, an image property), find out WHY it is absent before recreating it - it may have been removed deliberately (check design-decisions) or its absence may be the actual root cause several layers up.

Escalation and blockers

Missing vault unseal keys, lost one-shot secrets, or anything requiring material only the operator holds: STOP and escalate. Do not improvise around a recovery blocker (a vault re-init wipes the TLS plane).
If a fix would deviate from a runbook or contradict a D-NNN: log it as a proposal and get a ruling. Mid-incident is exactly when discipline pays.
Two failed fix attempts on the same hypothesis = the hypothesis is wrong. Step back to the audit stage; widen what you are measuring.