# Script authoring - house style and hardening Every script and paste block in this project follows these rules. They are not style preferences: each one encodes a failure that actually happened. Read this in full before writing ANY bash or python for this cloud. ## Zeroth decision: does it already exist? Before writing ANY operational script or check block, search the repo: `grep -rli scripts/ runbooks/` - the deploy/verify surface is heavily scripted and duplicating an existing script creates drift (e.g. the haproxy backend sweep already exists as `scripts/phase-03-core-verify.sh` 3.1b). Route to, extend, or fix the existing artifact; write new only when nothing covers the need, and say which search came up empty. ## First decision: what kind of block is this? The error-handling regime depends on execution mode. Choosing wrong is itself a bug: 1. **Executed script** (`bash script.sh`, own process): `set -uo pipefail` at minimum; `set -e` acceptable and usually right, with the capture caveats below. Exit codes are the interface. 2. **Interactive paste block** (operator pastes into their shell): NEVER a bare `exit` (it kills their shell) and no `set -e` (it can kill their shell's options or abort mid-paste). Wrap the whole block in a subshell `( { ...; } )` so a stray exit is contained; signal failure by printing `FAIL: ...` lines the operator can read. 3. **Verify/count-gate block** (greps that legitimately return zero matches): run WITHOUT `set -e`, and end every count-grep with `|| true` - a zero count is a valid answer, not an error (L1). `bash -n` cannot catch this; it is behavior, not syntax. ## The header contract (executed scripts) Every script opens with a comment block stating: path + argument synopsis; what it does (one paragraph, referencing the D-NNN/DOCFIX it implements); **whether it mutates anything** ("Mutates NOTHING" / what it changes and the gate protecting it); usage line; exit-code contract; and "ASCII + LF". House exit codes: `0` PROCEED, `1` HOLD (a gate failed), `2` precondition missing (tool absent, wrong model, helper not found). Then: set -euo pipefail # or set -uo pipefail; see regime above shopt -s inherit_errexit 2>/dev/null || true IFS=$'\n\t' Source shared constants instead of restating them: `. "$SCRIPT_DIR/lib-net.sh"` (planes, VIP bands, helpers) and `lib-hosts.sh` (hostnames, octets, system_id resolution). If a value you need is not in a lib, consider adding it there rather than inlining a literal. ## Hardening rules (each one is a scar) **SIGPIPE races break guards in BOTH directions.** `cmd | grep -q X` under pipefail: on match, grep exits, the producer takes SIGPIPE (141), the pipeline reports failure despite the match. In an `... || die` verify this FALSE-DIES on success; in an `... && die` guard it FAILS OPEN -- the 2026-07 sweep found a duplicate-CIDR guard that let collisions through exactly this way. Treat every `| grep -q` on a live pipe as a defect regardless of which way the test points. **Pipefail + SIGPIPE race.** `cmd | grep -q X` under pipefail falsely fails: `grep -q` closes the pipe on first match, SIGPIPE (141) kills the producer (`juju ssh` especially). Capture, then test: OUT=$(cmd 2>&1 || true) grep -q "pattern" <<<"$OUT" **`set -e` kills id-captures silently.** `ID=$(openstack ... || die ...)` - if the subshell exits non-zero before your handler fires, `set -e` aborts the assignment line with no message. Append `|| true` to each capture, then validate the captured value explicitly (see whole-output validation below). **Whole-output validation, never extract-then-check.** Do not pipe raw output through `awk`/`grep` to extract a field and then test the fragment - a partial failure yields a plausible-looking fragment. Capture the WHOLE output, validate its shape (e.g. `is_id(){ [[ "$1" =~ ^[0-9a-f]{32}$ ]]; }` for a keystone id), and only then use it. **Centralize `&1` once; every call site then stays clean and un-forgettable. Heredoc-payload ssh (`ssh ... bash -s <<'EOF'`) is the ONE exemption -- stdin IS the delivery there. **Inner stdin consumption.** Any `ssh`/`sudo`/`juju ssh` inside a heredoc, pipe, or loop eats the remaining stdin and truncates the block. Append `` - the `-m` flag goes BEFORE the target. `juju run ` output: use `--format json` for anything captured; confirm long actions via `juju show-operation `, not the streamed log (a wait-timeout does not mean the hook failed). **Snap confinement.** The openstack CLI snap cannot read `/tmp` - stage files under `$HOME`. Same for `juju attach-resource` payloads. **Client output ordering.** `openstack -f value -c X -c Y` returns columns in ALPHABETICAL order, not flag order. When order matters: `-f json | jq`, or single-column queries. After any jq returning null, run `jq 'keys'` - key casing is command-specific (Title-Case in lists, hyphenated in quota show). **Environment isolation.** Any block that switches identity runs in a subshell that first unsets all OS_* vars: `( for v in $(env | awk -F= '/^OS_/{print $1}'); do unset "$v"; done; export ... )` Thread `OS_CACERT` explicitly into isolated subshells - it gets stripped. Confirm scope before acting: which project/domain the token holds, not which you intended. **Stable keys, not drifting IDs.** Look up MAAS subnets by CIDR, machines by hostname, CAPI CRs by LISTING then operating on the exact returned name (the OpenStackCluster suffix is random per create - a wrong-name patch silently no-ops). system_ids are re-minted on re-enrollment. **Verify the launched cmdline, not the config text.** For OpenStack debs run via LSB-init-wrapped systemd: a flag "present in a file" proves nothing. Gate on behavior - the init script's `show-args`, `ps -ww -C -o args` on the live process, and probe under the daemon's RESTRICTED PATH (`env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm'`) - an interactive shell's PATH masks daemon-PATH failures. **sed is not a verifier.** A non-matching `sed -i` exits 0 having changed nothing. Assert the post-edit content; never trust sed's exit code as proof of the edit. **ASCII + LF, validated.** Non-ASCII: `grep -nP '[^\x00-\x7F]'` or a Python byte read. CR check: Python `data.count(b"\x0d")` - `grep $'\r'` false- positives on `$r...` tokens. Non-ASCII in conf.d has silently killed daemons. **Python helpers live in .py files** tested against fixtures - no inline python-in-bash beyond one-liners. Do not assume `jq` exists off-jumphost; gate on it (`need_jq`) or use python3. **Secrets in scripts:** whitelist-write to 0600 files under a 0700 dir (`umask 077` first), never echo, never argv, measure secret length from the file rather than asserting an expected length (this deployment's app-cred secrets are 86 chars, not the commonly assumed 43 - never hardcode either). ## Testing: nothing ships on `bash -n` `bash -n` validates parse, not behavior - and most of the failures above are behavioral. Every nontrivial script gets an offline regression harness at `tests//run-tests.sh` in the established pattern: - `fakebin/` contains fake `juju` / `openstack` / `maas` / `kubectl` / `ssh` executables that replay fixture output and log the calls they receive; real coreutils stay real. - The harness sets `PATH="$BIN:$PATH"`, a scratch `HOME`, and env-injected fixture paths, runs the target, and asserts BOTH the exit code and an output regex per case: `run `. - Cover: the happy path, each gate's failure branch (asserting HOLD not crash), each precondition (exit 2), and any DOCFIX behavior the script claims (e.g. "verify-first never creates a rule when one exists"). - Read an existing harness (e.g. `tests/phase-07-conductor-graft/`) before writing a new one; match its shape. For MUTATING scripts, the harness fakebin is STATEFUL: the fake `maas`/`juju` advances a phase file on each mutation call so post-mutation verification reads post-mutation reality (see tests/phase-00-teardown-d061 -- canary survival, decompose-detection, substrate-collision aborts). A `--no-prompt` flag on a destructive script exists FOR its harness, nothing else. **Harnesses are HERMETIC: fakes shadow, deletion un-shadows.** Never simulate an unavailable tool by DELETING its fake -- on a real host that un-shadows the real binary and the test falls through to live infrastructure (a jumphost gauntlet run reached real Charmhub this way). Simulate failure with a FAILING fake (prints a realistic error, exits nonzero); keep the fail-loud-on-unmatched arm in every fake so an uncovered call can never silently hit a real CLI. **A script migration commit MUST carry its harness.** The D-060 revert updated the scripts and left their harnesses testing the retired D-058 world -- red-at- HEAD tests that trained everyone to ignore red. If you change a script's behavior or vocabulary, the same commit updates fixtures and expectations. **Static contract:** `bash scripts/repo-lint.sh` must exit 0 before anything is delivered (L1 ASCII/LF, L2 stale tokens, L3 ghost refs, L4 deprecated refs, L5 numbering, L6 bare invocations). A guard script that must NAME stale tokens opts out per-file with a `repo-lint: allow-stale-tokens` marker in its header. Committed binaries (e.g. policies/overrides.zip) need a `*.zip binary` .gitattributes rule or the LF normalization corrupts them in transit. Introducing a tool dependency (jq, python module) means adding its presence gate in the same change. Read-only verify scripts DETECT and report; remediation stays a gated human step even when the fix is obvious - the script's job is evidence, the operator's job is the mutation decision.