Newer
Older
openstack-caracal-ipv4 / .claude / skills / openstack-cloud-ops / references / operating-discipline.md

Operating discipline

How work is executed on this cloud. These conventions exist because the runbooks and scripts ARE the product: a fix that works but is undocumented, unrepeatable, or Roosevelt-incompatible is a failure.

The gated execution model

Every command block is labeled so a command line is never mistaken for prose, and so mutation risk is explicit (same convention as the repo runbooks):

  • RUN -- <loc> - the block CHANGES state; run it at (jumphost, a unit via juju ssh, the mgmt VM...).
  • CHECK (read-only) -- <loc> - safe to re-run.
  • GATE: - hard stop. Do not proceed unless the stated condition holds.
  • Expect: - what a passing result looks like. Always state it: the operator should never have to guess whether output is good.
  • > CAUTION: - destructive, secret-handling, or irreversible.

Sequence discipline: read-only audit -> present the mutation -> operator approves/runs -> verify the result -> next step. One gated mutation at a time; never batch destructive steps. If output comes back unexpected, STOP and re-derive from the live state - do not improvise a correction inline.

Chat (no shell): you prepare blocks, the operator runs and pastes output. Treat pasted output as the only evidence; a block you wrote but saw no output for did not happen.

Live shell: you may execute CHECK (read-only) blocks yourself. RUN and > CAUTION: blocks still get presented and human-approved first - state what will change and why it is the minimal correct action.

Irreversible / one-shot / secret steps

  • Start every consequential session with bash scripts/run-logged.sh <label> (opens a script(1)-logged shell to ~/as-executed/; index the session in logs/as-executed-index.md -- content NEVER commits, the index always does).
  • Retrieve the exact prior working command from the as-executed log or runbook VERBATIM. Never improvise vault init, unseal, authorize-charm, CA issuance, or anything one-shot. (DOCFIX-006: a mis-redirected vault operator init loses the unseal keys forever.)
  • Secrets never transit argv, clipboard, scrollback, or /tmp. Capture straight to a 0600 file under $HOME with umask 077; unseal via hidden prompt (L4); transfer via base64 pipe into a root-written 0600 file (L-P6-4). Never echo a secret to verify it - verify by length/format from the file.
  • Never run maas list (prints the API key - DOCFIX-016). Never trust a juju action's human-formatted output for a captured secret or cert - pull from --format json (indented YAML block-scalars corrupt PEMs; DOCFIX-021).
  • Authorize charms with short-lived child tokens: juju run persists action params in the operation log (DOCFIX-011).

Record-keeping: D-NNN / DOCFIX-NNN / BUNDLEFIX-NNN

  • D-NNN - design decisions (docs/design-decisions.md). Append-only: superseded entries stay in place, marked, with the superseding entry appended. DOCFIX-NNN - runbook fixes. BUNDLEFIX-NNN - bundle fixes.
  • ALWAYS grep the repo for the next free number before assigning one - collisions have happened. State "next-free verified" when you assign.
  • PARALLEL WORK numbering: when more than one workstream edits the shared numbered ledgers (this repo + a parallel Claude Code stream), do NOT each independently grab "next-free" -- that collides. One stream owns a given number/block; each DECLARES every identifier it consumes in its commit; the other re-greps AFTER the parallel push lands and resumes above it. scripts/ledger-scan.sh derives next-free from decision HEADERS (not prose mentions -- a "next-free D-072" pointer must not inflate the count). Keep "Next-free:" pointer lines on ONE line (a word-wrapped continuation escapes the line-based exclusion); or just rely on the scan, which is the next-free authority.
  • Mid-task findings are logged as proposals, not acted on (hard rule 1). A finding that changes a runbook becomes a DOCFIX; one that changes architecture becomes a D-NNN with status PROPOSED until the operator rules.
  • Status vocabulary: PROPOSED/OPEN -> ADOPTED/DECIDED -> SUPERSEDED (by D-MMM). Do not implement PROPOSED items.
  • Before acting on any change request, grep design-decisions for the governing decision on that surface (and its dependents - e.g. a runbook step may rely on the current state). Finding a PROPOSED/OPEN decision means presenting its recorded options for a ruling, not picking one.

Delivery rules (files handed to the operator)

  • Multi-file changes ship as repo-relative ZIPs, never loose files (the Windows/GitHub Desktop loose-file workflow misplaces them).
  • Everything committed is ASCII-only, LF-only. Validate before delivery: non-ASCII with grep -nP '[^\x00-\x7F]' (or a Python byte read); CR bytes with a Python data.count(b"\x0d") - a grep $'\r' false-positives on $r... tokens. Non-ASCII in OpenStack config has caused silent daemon failures (mod_wsgi UnicodeDecodeError).
  • On-box script delivery over juju ssh goes as a base64 pipe decoded to a file, then the FILE is executed - never a raw heredoc (paste-mangling and stdin-consumption both bite; see script-authoring on read vs pipes).
  • Windows-side steps are PowerShell-native (no bash heredocs, no backslash continuations). The operator commits from Windows; the jumphost only pulls.

  • Credential exposures and security obligations get a ROW in docs/security-ledger.md at discovery time (owner + status) - never only a script-comment note; the ledger is reviewed at every phase-00 and handoff.

  • Under operator-granted blanket approval, the delivery contract is: implement, then hand over ONE cumulative repo-relative ZIP plus a changelog where every item states what changed, the evidence for why, and a per-item revert. The changelog is the review surface; opinion-weighted calls are flagged as such.

Session continuity: the in-flight ledger

Long sessions on this cloud get COMPACTED, often several times. Work that lives only in chat scrollback is lost at compaction. docs/session-ledger.md is the durable record of what is IN FLIGHT (active build, backlog, operator-gated items, open decisions, numbering state, state facts). Standing practice:

  • Read the ledger AND run scripts/ledger-scan.sh at session start; reconcile.
  • The scan is the DRIFT CHECK: it derives open decisions (last Status line per block, so an amendment closes it), OPEN security rows, and next-free numbers straight from the repo. The narrative ledger must not claim CLOSED what the scan shows OPEN, nor omit what it surfaces. A ledger without this check rots (cf. D-070's unexercised-snapshot critique).
  • Update the ledger at every deliverable/commit -- it is a standing deliverable like the changelog. Changelog = what changed; ledger = what is still open.

Boundary mutations: adds-before-removes with a readback gate

When tightening a security boundary (SG rules, firewall, policy) that a live flow depends on, NEVER remove the permissive rule before the specific replacements are proven present. Create the narrow rules, READ THEM BACK (fail if any is missing), and only THEN remove the wide rule by its MEASURED id. There must be no window in which a required flow (e.g. the magnum conductor -> mgmt apiserver poll) is unauthorized. Verify the tightened boundary by FUNCTIONAL outcome afterwards, not by assuming the rule text is sufficient.

Attestation gates for unautomatable safety items

Some acceptance items cannot be scripted -- e.g. a SECOND person performing the Vault unseal (D-069). Never auto-pass such an item. Gate it on a ledger ROW (docs/security-ledger.md): report PASS_PENDING_MANUAL while the row is OPEN, PASS only when it is attested CLOSED/rehearsed, HOLD if the row is missing. Auto-passing an undone safety item is the exact failure D-070 retired.

Debate and correction norms

  • Challenge weak reasoning with sources; the operator prefers industry best practice over the quick fix and will push back hard on unverified claims. When challenged, verify (docs, source, live probe) rather than defend.
  • Own mistakes plainly, in one or two sentences, then fix. No self-flagellation.
  • When a prior decision looks wrong, propose superseding it through the D-NNN process - do not quietly deviate from it.

Troubleshooting entry discipline

Before any server-side hypothesis for "X used to work, now it doesn't" on a web UI: eliminate client state first (incognito window, ~10 seconds), then a server-side curl, THEN hypothesize. Before any command that acts on tenant resources: confirm which project/domain scope the shell holds (openstack token issue-level certainty) - server create and friends use ambient scope silently. Full triage method: references/troubleshooting.md.