# Script authoring - house style and hardening

Every script and paste block in this project follows these rules. They are
not style preferences: each one encodes a failure that actually happened.
Read this in full before writing ANY bash or python for this cloud.

## Zeroth decision: does it already exist?

Before writing ANY operational script or check block, search the repo:
`grep -rli <topic> scripts/ runbooks/` - the deploy/verify surface is heavily
scripted and duplicating an existing script creates drift (e.g. the haproxy
backend sweep already exists as `scripts/phase-03-core-verify.sh` 3.1b).
Route to, extend, or fix the existing artifact; write new only when nothing
covers the need, and say which search came up empty.

## First decision: what kind of block is this?

The error-handling regime depends on execution mode. Choosing wrong is itself
a bug:

1. **Executed script** (`bash script.sh`, own process): `set -uo pipefail`
   at minimum; `set -e` acceptable and usually right, with the capture
   caveats below. Exit codes are the interface.
2. **Interactive paste block** (operator pastes into their shell): NEVER a
   bare `exit` (it kills their shell) and no `set -e` (it can kill their
   shell's options or abort mid-paste). Wrap the whole block in a subshell
   `( { ...; } )` so a stray exit is contained; signal failure by printing
   `FAIL: ...` lines the operator can read.
3. **Verify/count-gate block** (greps that legitimately return zero matches):
   run WITHOUT `set -e`, and end every count-grep with `|| true` - a zero
   count is a valid answer, not an error (L1). `bash -n` cannot catch this;
   it is behavior, not syntax.

## The header contract (executed scripts)

Every script opens with a comment block stating: path + argument synopsis;
what it does (one paragraph, referencing the D-NNN/DOCFIX it implements);
**whether it mutates anything** ("Mutates NOTHING" / what it changes and the
gate protecting it); usage line; exit-code contract; and "ASCII + LF".
House exit codes: `0` PROCEED, `1` HOLD (a gate failed), `2` precondition
missing (tool absent, wrong model, helper not found). Then:

    set -euo pipefail            # or set -uo pipefail; see regime above
    shopt -s inherit_errexit 2>/dev/null || true
    IFS=$'\n\t'

Source shared constants instead of restating them:
`. "$SCRIPT_DIR/lib-net.sh"` (planes, VIP bands, helpers) and `lib-hosts.sh`
(hostnames, octets, system_id resolution). If a value you need is not in a
lib, consider adding it there rather than inlining a literal.

## Hardening rules (each one is a scar)

**SIGPIPE races break guards in BOTH directions.** `cmd | grep -q X` under
pipefail: on match, grep exits, the producer takes SIGPIPE (141), the pipeline
reports failure despite the match. In an `... || die` verify this FALSE-DIES on
success; in an `... && die` guard it FAILS OPEN -- the 2026-07 sweep found a
duplicate-CIDR guard that let collisions through exactly this way. Treat every
`| grep -q` on a live pipe as a defect regardless of which way the test points.

**Pipefail + SIGPIPE race.** `cmd | grep -q X` under pipefail falsely fails:
`grep -q` closes the pipe on first match, SIGPIPE (141) kills the producer
(`juju ssh` especially). Capture, then test:

    OUT=$(cmd 2>&1 || true)
    grep -q "pattern" <<<"$OUT"

**`set -e` kills id-captures silently.** `ID=$(openstack ... || die ...)` -
if the subshell exits non-zero before your handler fires, `set -e` aborts the
assignment line with no message. Append `|| true` to each capture, then
validate the captured value explicitly (see whole-output validation below).

**Whole-output validation, never extract-then-check.** Do not pipe raw output
through `awk`/`grep` to extract a field and then test the fragment - a
partial failure yields a plausible-looking fragment. Capture the WHOLE output,
validate its shape (e.g. `is_id(){ [[ "$1" =~ ^[0-9a-f]{32}$ ]]; }` for a
keystone id), and only then use it.

**Centralize `</dev/null` in a wrapper, not per call.** The house pattern is a
2-line helper (`rc()`, `rcap()`, `J()`) that appends `</dev/null 2>&1` once;
every call site then stays clean and un-forgettable. Heredoc-payload ssh
(`ssh ... bash -s <<'EOF'`) is the ONE exemption -- stdin IS the delivery there.

**Inner stdin consumption.** Any `ssh`/`sudo`/`juju ssh` inside a heredoc,
pipe, or loop eats the remaining stdin and truncates the block. Append
`</dev/null` to EVERY inner invocation (`</dev/tty` only when it genuinely
must prompt).

**`read` and `... | bash` are mutually exclusive.** A script piped to bash
has the pipe as stdin, so `read` returns empty at EOF - this once silently
created a passwordless user. For paste-safety AND working prompts: base64-
decode to a file, then run the file (it inherits the terminal stdin).

**juju invocation shape.** `juju ssh -m <model>` - the `-m` flag goes BEFORE
the target. `juju run <unit> <action>` output: use `--format json` for
anything captured; confirm long actions via `juju show-operation <N>`, not
the streamed log (a wait-timeout does not mean the hook failed).

**Snap confinement.** The openstack CLI snap cannot read `/tmp` - stage files
under `$HOME`. Same for `juju attach-resource` payloads.

**Client output ordering.** `openstack -f value -c X -c Y` returns columns in
ALPHABETICAL order, not flag order. When order matters: `-f json | jq`, or
single-column queries. After any jq returning null, run `jq 'keys'` - key
casing is command-specific (Title-Case in lists, hyphenated in quota show).

**Environment isolation.** Any block that switches identity runs in a
subshell that first unsets all OS_* vars:
`( for v in $(env | awk -F= '/^OS_/{print $1}'); do unset "$v"; done; export ... )`
Thread `OS_CACERT` explicitly into isolated subshells - it gets stripped.
Confirm scope before acting: which project/domain the token holds, not which
you intended.

**Stable keys, not drifting IDs.** Look up MAAS subnets by CIDR, machines by
hostname, CAPI CRs by LISTING then operating on the exact returned name
(the OpenStackCluster suffix is random per create - a wrong-name patch
silently no-ops). system_ids are re-minted on re-enrollment.

**Verify the launched cmdline, not the config text.** For OpenStack debs run
via LSB-init-wrapped systemd: a flag "present in a file" proves nothing.
Gate on behavior - the init script's `show-args`, `ps -ww -C <daemon> -o args`
on the live process, and probe under the daemon's RESTRICTED PATH
(`env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm'`) - an
interactive shell's PATH masks daemon-PATH failures.

**sed is not a verifier.** A non-matching `sed -i` exits 0 having changed
nothing. Assert the post-edit content; never trust sed's exit code as proof
of the edit.

**ASCII + LF, validated.** Non-ASCII: `grep -nP '[^\x00-\x7F]'` or a Python
byte read. CR check: Python `data.count(b"\x0d")` - `grep $'\r'` false-
positives on `$r...` tokens. Non-ASCII in conf.d has silently killed daemons.

**Python helpers live in .py files** tested against fixtures - no inline
python-in-bash beyond one-liners. Do not assume `jq` exists off-jumphost;
gate on it (`need_jq`) or use python3.

**Secrets in scripts:** whitelist-write to 0600 files under a 0700 dir
(`umask 077` first), never echo, never argv, measure secret length from the
file rather than asserting an expected length (this deployment's app-cred
secrets are 86 chars, not the commonly assumed 43 - never hardcode either).

## Testing: nothing ships on `bash -n`

`bash -n` validates parse, not behavior - and most of the failures above are
behavioral. Every nontrivial script gets an offline regression harness at
`tests/<script-name>/run-tests.sh` in the established pattern:

- `fakebin/` contains fake `juju` / `openstack` / `maas` / `kubectl` /
  `ssh` executables that replay fixture output and log the calls they
  receive; real coreutils stay real.
- The harness sets `PATH="$BIN:$PATH"`, a scratch `HOME`, and env-injected
  fixture paths, runs the target, and asserts BOTH the exit code and an
  output regex per case: `run <want_rc> <regex> <label>`.
- Cover: the happy path, each gate's failure branch (asserting HOLD not
  crash), each precondition (exit 2), and any DOCFIX behavior the script
  claims (e.g. "verify-first never creates a rule when one exists").
- Read an existing harness (e.g. `tests/phase-07-conductor-graft/`) before
  writing a new one; match its shape.

For MUTATING scripts, the harness fakebin is STATEFUL: the fake `maas`/`juju`
advances a phase file on each mutation call so post-mutation verification reads
post-mutation reality (see tests/phase-00-teardown-d061 -- canary survival,
decompose-detection, substrate-collision aborts). A `--no-prompt` flag on a
destructive script exists FOR its harness, nothing else.

**Harnesses are HERMETIC: fakes shadow, deletion un-shadows.** Never simulate
an unavailable tool by DELETING its fake -- on a real host that un-shadows the
real binary and the test falls through to live infrastructure (a jumphost
gauntlet run reached real Charmhub this way). Simulate failure with a FAILING
fake (prints a realistic error, exits nonzero); keep the fail-loud-on-unmatched
arm in every fake so an uncovered call can never silently hit a real CLI.

**A script migration commit MUST carry its harness.** The D-060 revert updated
the scripts and left their harnesses testing the retired D-058 world -- red-at-
HEAD tests that trained everyone to ignore red. If you change a script's
behavior or vocabulary, the same commit updates fixtures and expectations.

**Static contract:** `bash scripts/repo-lint.sh` must exit 0 before anything is
delivered (L1 ASCII/LF, L2 stale tokens, L3 ghost refs, L4 deprecated refs,
L5 numbering, L6 bare invocations). A guard script that must NAME stale tokens
opts out per-file with a `repo-lint: allow-stale-tokens` marker in its header.
Committed binaries (e.g. policies/overrides.zip) need a `*.zip binary`
.gitattributes rule or the LF normalization corrupts them in transit.
Introducing a tool dependency (jq, python module) means adding its presence
gate in the same change.

Read-only verify scripts DETECT and report; remediation stays a gated human
step even when the fix is obvious - the script's job is evidence, the
operator's job is the mutation decision.
