# Operating discipline

How work is executed on this cloud. These conventions exist because the
runbooks and scripts ARE the product: a fix that works but is undocumented,
unrepeatable, or Roosevelt-incompatible is a failure.

## The gated execution model

Every command block is labeled so a command line is never mistaken for prose,
and so mutation risk is explicit (same convention as the repo runbooks):

- `RUN -- <loc>` - the block CHANGES state; run it at <loc> (jumphost, a unit
  via `juju ssh`, the mgmt VM...).
- `CHECK (read-only) -- <loc>` - safe to re-run.
- `GATE:` - hard stop. Do not proceed unless the stated condition holds.
- `Expect:` - what a passing result looks like. Always state it: the operator
  should never have to guess whether output is good.
- `> CAUTION:` - destructive, secret-handling, or irreversible.

Sequence discipline: read-only audit -> present the mutation -> operator
approves/runs -> verify the result -> next step. One gated mutation at a time;
never batch destructive steps. If output comes back unexpected, STOP and
re-derive from the live state - do not improvise a correction inline.

**Chat (no shell):** you prepare blocks, the operator runs and pastes output.
Treat pasted output as the only evidence; a block you wrote but saw no output
for did not happen.

**Live shell:** you may execute `CHECK (read-only)` blocks yourself. `RUN` and
`> CAUTION:` blocks still get presented and human-approved first - state what
will change and why it is the minimal correct action.

## Irreversible / one-shot / secret steps

- Start every consequential session with `bash scripts/run-logged.sh <label>`
  (opens a script(1)-logged shell to ~/as-executed/; index the session in
  logs/as-executed-index.md -- content NEVER commits, the index always does).
- Retrieve the exact prior working command from the as-executed log or runbook
  VERBATIM. Never improvise vault init, unseal, authorize-charm, CA issuance,
  or anything one-shot. (DOCFIX-006: a mis-redirected `vault operator init`
  loses the unseal keys forever.)
- Secrets never transit argv, clipboard, scrollback, or /tmp. Capture straight
  to a 0600 file under $HOME with `umask 077`; unseal via hidden prompt (L4);
  transfer via base64 pipe into a root-written 0600 file (L-P6-4). Never echo
  a secret to verify it - verify by length/format from the file.
- Never run `maas list` (prints the API key - DOCFIX-016). Never trust a juju
  action's human-formatted output for a captured secret or cert - pull from
  `--format json` (indented YAML block-scalars corrupt PEMs; DOCFIX-021).
- Authorize charms with short-lived child tokens: `juju run` persists action
  params in the operation log (DOCFIX-011).

## Record-keeping: D-NNN / DOCFIX-NNN / BUNDLEFIX-NNN

- `D-NNN` - design decisions (`docs/design-decisions.md`). Append-only:
  superseded entries stay in place, marked, with the superseding entry
  appended. `DOCFIX-NNN` - runbook fixes. `BUNDLEFIX-NNN` - bundle fixes.
- ALWAYS grep the repo for the next free number before assigning one -
  collisions have happened. State "next-free verified" when you assign.
- PARALLEL WORK numbering: when more than one workstream edits the shared numbered
  ledgers (this repo + a parallel Claude Code stream), do NOT each independently grab
  "next-free" -- that collides. One stream owns a given number/block; each DECLARES every
  identifier it consumes in its commit; the other re-greps AFTER the parallel push lands
  and resumes above it. `scripts/ledger-scan.sh` derives next-free from decision HEADERS
  (not prose mentions -- a "next-free D-072" pointer must not inflate the count).
- Mid-task findings are logged as proposals, not acted on (hard rule 1).
  A finding that changes a runbook becomes a DOCFIX; one that changes
  architecture becomes a D-NNN with status PROPOSED until the operator rules.
- Status vocabulary: PROPOSED/OPEN -> ADOPTED/DECIDED -> SUPERSEDED (by
  D-MMM). Do not implement PROPOSED items.
- Before acting on any change request, grep design-decisions for the
  governing decision on that surface (and its dependents - e.g. a runbook
  step may rely on the current state). Finding a PROPOSED/OPEN decision
  means presenting its recorded options for a ruling, not picking one.

## Delivery rules (files handed to the operator)

- Multi-file changes ship as repo-relative ZIPs, never loose files (the
  Windows/GitHub Desktop loose-file workflow misplaces them).
- Everything committed is ASCII-only, LF-only. Validate before delivery:
  non-ASCII with `grep -nP '[^\x00-\x7F]'` (or a Python byte read); CR bytes
  with a Python `data.count(b"\x0d")` - a `grep $'\r'` false-positives on
  `$r...` tokens. Non-ASCII in OpenStack config has caused silent daemon
  failures (mod_wsgi UnicodeDecodeError).
- On-box script delivery over `juju ssh` goes as a base64 pipe decoded to a
  file, then the FILE is executed - never a raw heredoc (paste-mangling and
  stdin-consumption both bite; see script-authoring on `read` vs pipes).
- Windows-side steps are PowerShell-native (no bash heredocs, no backslash
  continuations). The operator commits from Windows; the jumphost only pulls.

- Credential exposures and security obligations get a ROW in
  docs/security-ledger.md at discovery time (owner + status) - never only a
  script-comment note; the ledger is reviewed at every phase-00 and handoff.
- Under operator-granted blanket approval, the delivery contract is: implement,
  then hand over ONE cumulative repo-relative ZIP plus a changelog where every
  item states what changed, the evidence for why, and a per-item revert. The
  changelog is the review surface; opinion-weighted calls are flagged as such.

## Session continuity: the in-flight ledger

Long sessions on this cloud get COMPACTED, often several times. Work that lives only in
chat scrollback is lost at compaction. `docs/session-ledger.md` is the durable record of
what is IN FLIGHT (active build, backlog, operator-gated items, open decisions, numbering
state, state facts). Standing practice:
- Read the ledger AND run `scripts/ledger-scan.sh` at session start; reconcile.
- The scan is the DRIFT CHECK: it derives open decisions (last Status line per block, so an
  amendment closes it), OPEN security rows, and next-free numbers straight from the repo.
  The narrative ledger must not claim CLOSED what the scan shows OPEN, nor omit what it
  surfaces. A ledger without this check rots (cf. D-070's unexercised-snapshot critique).
- Update the ledger at every deliverable/commit -- it is a standing deliverable like the
  changelog. Changelog = what changed; ledger = what is still open.

## Boundary mutations: adds-before-removes with a readback gate

When tightening a security boundary (SG rules, firewall, policy) that a live flow depends on,
NEVER remove the permissive rule before the specific replacements are proven present. Create
the narrow rules, READ THEM BACK (fail if any is missing), and only THEN remove the wide rule
by its MEASURED id. There must be no window in which a required flow (e.g. the magnum
conductor -> mgmt apiserver poll) is unauthorized. Verify the tightened boundary by FUNCTIONAL
outcome afterwards, not by assuming the rule text is sufficient.

## Attestation gates for unautomatable safety items

Some acceptance items cannot be scripted -- e.g. a SECOND person performing the Vault unseal
(D-069). Never auto-pass such an item. Gate it on a ledger ROW (docs/security-ledger.md):
report PASS_PENDING_MANUAL while the row is OPEN, PASS only when it is attested CLOSED/rehearsed,
HOLD if the row is missing. Auto-passing an undone safety item is the exact failure D-070 retired.

## Debate and correction norms

- Challenge weak reasoning with sources; the operator prefers industry best
  practice over the quick fix and will push back hard on unverified claims.
  When challenged, verify (docs, source, live probe) rather than defend.
- Own mistakes plainly, in one or two sentences, then fix. No self-flagellation.
- When a prior decision looks wrong, propose superseding it through the D-NNN
  process - do not quietly deviate from it.

## Troubleshooting entry discipline

Before any server-side hypothesis for "X used to work, now it doesn't" on a
web UI: eliminate client state first (incognito window, ~10 seconds), then a
server-side curl, THEN hypothesize. Before any command that acts on tenant
resources: confirm which project/domain scope the shell holds
(`openstack token issue`-level certainty) - `server create` and friends use
ambient scope silently. Full triage method: references/troubleshooting.md.
