# Troubleshooting - method, reflexes, and routing

The repo's `runbooks/appendix-A-troubleshooting.md` is the operational
symptom->cause->fix index, keyed by D-NNN/DOCFIX-NNN (it now includes the
mysql-innodb-cluster recovery signatures and a point-of-use identifier index).
ALWAYS check it before diagnosing from first principles - most failures on
this cloud have been seen before and have a validated fix.
`ops-capi-recovery.md` covers the CAPI/Magnum stack;
`runbooks/ops-restart-procedure.md` covers full-cloud recovery.
This file gives you the method and the reflexes that must fire BEFORE and
WHILE you read those.

## Triage method

1. **Reproduce and capture verbatim.** The exact error text routes you in
   appendix-A. Paraphrased symptoms match nothing.
2. **Eliminate cheap layers first.** Web UI misbehaving: incognito/client
   state (~10s) before any server hypothesis. CLI misbehaving: confirm the
   token scope (project/domain) before blaming the service - wrong ambient
   scope produces misleading 404s/403s, especially with domain-per-client
   identity (an un-domain-qualified lookup resolves in the WRONG domain).
3. **Establish what is actually true, live.** `bash scripts/cloud-assert.sh`
   first - it runs every service-own-verdict gate in one read-only sweep and
   localizes the failing layer. Then the targeted read-only audit of that
   surface before touching anything. juju status, the service's own state
   (haproxy admin socket, mysql cluster-status, vault status, ovn-appctl),
   the relevant logs. Hard rule 2: measurements, not memory.
4. **Match against appendix-A**, then design-decisions if it smells
   architectural. If matched: follow the recorded fix exactly (verbatim-
   reference rule for anything one-shot).
5. **If unmatched:** smallest-possible hypothesis, read-only test to confirm
   or kill it, and log the finding. New root causes become appendix-A /
   DOCFIX material - capturing them is part of the fix.
6. **Least-destructive first, individually gated.** Reload before restart,
   restart before rebuild, rebuild before redeploy. Never batch.

## Reflexes (internalize these; they override intuition)

- **juju-green is necessary, not sufficient.** `active/idle` is BLIND to:
  a DOWN haproxy backend (D-045 - hid a dead nova-api for 3 days), a missing
  magnum trustee domain (D-046 - magnum reports ready regardless), an
  unparsed-but-attached policy override, and per-backend/service state
  generally. Gate on the service's OWN verdict (admin sockets, functional
  probes), never on juju status alone.
- **"Reports OK while broken" generalizes.** Charm-side validation is often
  parse-level only (the keystone policy zip validates YAML, not semantics).
  Acceptance is BEHAVIORAL: prove the capability works, not that the config
  is present.
- **Distinguish down from thrashing.** A host that looks "down" to juju/OVN
  (ovsdb inactivity-probe storms) may be swap-thrashing, not down (D-040).
  Check `who -b` / `uptime` / `journalctl -k | grep -i oom` before treating
  it as an outage.
- **Read the reason field before the status field.** Magnum health UNHEALTHY
  has three signatures (D-042 amendment): reason EMPTY = conductor cannot
  reach the mgmt API (VM down?); all-Ready except infrastructure "not found"
  = the cosmetic driver-contract miss; a reason citing LB failure = real,
  check Octavia. Same status, three different responses.
- **Single-node mgmt VM does not self-heal (D-035/D-041).** Workload nodes
  wedged with the `uninitialized` taint, magnum reconcile dead, addons
  Pending -> check `capi-mgmt-v2` is ACTIVE before anything else. Manual
  start is POLICY (down is a signal), not a defect.
- **Known blast-radius traps:** `juju destroy-model` decomposes MAAS
  pod-composed machines - machine retention is `juju remove-machine
  --keep-instance` (D-061); mysql-innodb-cluster never bootstraps at
  num_units 1 - deploy at 3 (D-062); `reboot-cluster-from-complete-outage`
  is destructive against an already-healthy cluster - check cluster status
  FIRST; vault restarts sealed by design - sealed-after-reboot is expected,
  not a fault; the magnum `domain-setup` action must be re-run after every
  redeploy (D-046).
- **Green-in-the-shell, broken-in-the-daemon.** Interactive shells have a
  different PATH, env, and stdin than LSB-init daemons. Verify under the
  daemon's conditions (restricted PATH, live-process args) - see
  script-authoring.
- **The absence of an expected resource is a finding, not a gap to fill.**
  If something that should exist does not (a domain, a rule, an image
  property), find out WHY it is absent before recreating it - it may have
  been removed deliberately (check design-decisions) or its absence may be
  the actual root cause several layers up.

## Escalation and blockers

- Missing vault unseal keys, lost one-shot secrets, or anything requiring
  material only the operator holds: STOP and escalate. Do not improvise
  around a recovery blocker (a vault re-init wipes the TLS plane).
- If a fix would deviate from a runbook or contradict a D-NNN: log it as a
  proposal and get a ruling. Mid-incident is exactly when discipline pays.
- Two failed fix attempts on the same hypothesis = the hypothesis is wrong.
  Step back to the audit stage; widen what you are measuring.
