Newer
Older
openstack-caracal-ipv4 / skills / openstack-cloud-ops / references / operating-discipline.md

Operating discipline

How work is executed on this cloud. These conventions exist because the runbooks and scripts ARE the product: a fix that works but is undocumented, unrepeatable, or Roosevelt-incompatible is a failure.

The gated execution model

Every command block is labeled so a command line is never mistaken for prose, and so mutation risk is explicit (same convention as the repo runbooks):

  • RUN -- <loc> - the block CHANGES state; run it at (jumphost, a unit via juju ssh, the mgmt VM...).
  • CHECK (read-only) -- <loc> - safe to re-run.
  • GATE: - hard stop. Do not proceed unless the stated condition holds.
  • Expect: - what a passing result looks like. Always state it: the operator should never have to guess whether output is good.
  • > CAUTION: - destructive, secret-handling, or irreversible.

Sequence discipline: read-only audit -> present the mutation -> operator approves/runs -> verify the result -> next step. One gated mutation at a time; never batch destructive steps. If output comes back unexpected, STOP and re-derive from the live state - do not improvise a correction inline.

Chat (no shell): you prepare blocks, the operator runs and pastes output. Treat pasted output as the only evidence; a block you wrote but saw no output for did not happen.

Live shell: you may execute CHECK (read-only) blocks yourself. RUN and > CAUTION: blocks still get presented and human-approved first - state what will change and why it is the minimal correct action.

Irreversible / one-shot / secret steps

  • Start every consequential session with bash scripts/run-logged.sh <label> (opens a script(1)-logged shell to ~/as-executed/; index the session in logs/as-executed-index.md -- content NEVER commits, the index always does).
  • Retrieve the exact prior working command from the as-executed log or runbook VERBATIM. Never improvise vault init, unseal, authorize-charm, CA issuance, or anything one-shot. (DOCFIX-006: a mis-redirected vault operator init loses the unseal keys forever.)
  • Secrets never transit argv, clipboard, scrollback, or /tmp. Capture straight to a 0600 file under $HOME with umask 077; unseal via hidden prompt (L4); transfer via base64 pipe into a root-written 0600 file (L-P6-4). Never echo a secret to verify it - verify by length/format from the file.
  • Never run maas list (prints the API key - DOCFIX-016). Never trust a juju action's human-formatted output for a captured secret or cert - pull from --format json (indented YAML block-scalars corrupt PEMs; DOCFIX-021).
  • Authorize charms with short-lived child tokens: juju run persists action params in the operation log (DOCFIX-011).

Record-keeping: D-NNN / DOCFIX-NNN / BUNDLEFIX-NNN

  • D-NNN - design decisions (docs/design-decisions.md). Append-only: superseded entries stay in place, marked, with the superseding entry appended. DOCFIX-NNN - runbook fixes. BUNDLEFIX-NNN - bundle fixes.
  • ALWAYS grep the repo for the next free number before assigning one - collisions have happened. State "next-free verified" when you assign.
  • Mid-task findings are logged as proposals, not acted on (hard rule 1). A finding that changes a runbook becomes a DOCFIX; one that changes architecture becomes a D-NNN with status PROPOSED until the operator rules.
  • Status vocabulary: PROPOSED/OPEN -> ADOPTED/DECIDED -> SUPERSEDED (by D-MMM). Do not implement PROPOSED items.
  • Before acting on any change request, grep design-decisions for the governing decision on that surface (and its dependents - e.g. a runbook step may rely on the current state). Finding a PROPOSED/OPEN decision means presenting its recorded options for a ruling, not picking one.

Delivery rules (files handed to the operator)

  • Multi-file changes ship as repo-relative ZIPs, never loose files (the Windows/GitHub Desktop loose-file workflow misplaces them).
  • Everything committed is ASCII-only, LF-only. Validate before delivery: non-ASCII with grep -nP '[^\x00-\x7F]' (or a Python byte read); CR bytes with a Python data.count(b"\x0d") - a grep $'\r' false-positives on $r... tokens. Non-ASCII in OpenStack config has caused silent daemon failures (mod_wsgi UnicodeDecodeError).
  • On-box script delivery over juju ssh goes as a base64 pipe decoded to a file, then the FILE is executed - never a raw heredoc (paste-mangling and stdin-consumption both bite; see script-authoring on read vs pipes).
  • Windows-side steps are PowerShell-native (no bash heredocs, no backslash continuations). The operator commits from Windows; the jumphost only pulls.

  • Credential exposures and security obligations get a ROW in docs/security-ledger.md at discovery time (owner + status) - never only a script-comment note; the ledger is reviewed at every phase-00 and handoff.

  • Under operator-granted blanket approval, the delivery contract is: implement, then hand over ONE cumulative repo-relative ZIP plus a changelog where every item states what changed, the evidence for why, and a per-item revert. The changelog is the review surface; opinion-weighted calls are flagged as such.

Debate and correction norms

  • Challenge weak reasoning with sources; the operator prefers industry best practice over the quick fix and will push back hard on unverified claims. When challenged, verify (docs, source, live probe) rather than defend.
  • Own mistakes plainly, in one or two sentences, then fix. No self-flagellation.
  • When a prior decision looks wrong, propose superseding it through the D-NNN process - do not quietly deviate from it.

Troubleshooting entry discipline

Before any server-side hypothesis for "X used to work, now it doesn't" on a web UI: eliminate client state first (incognito window, ~10 seconds), then a server-side curl, THEN hypothesize. Before any command that acts on tenant resources: confirm which project/domain scope the shell holds (openstack token issue-level certainty) - server create and friends use ambient scope silently. Full triage method: references/troubleshooting.md.