Newer
Older
openstack-caracal-ipv4 / .claude / skills / openstack-cloud-ops / references / script-authoring.md

Script authoring - house style and hardening

Every script and paste block in this project follows these rules. They are not style preferences: each one encodes a failure that actually happened. Read this in full before writing ANY bash or python for this cloud.

Zeroth decision: does it already exist?

Before writing ANY operational script or check block, search the repo: grep -rli <topic> scripts/ runbooks/ - the deploy/verify surface is heavily scripted and duplicating an existing script creates drift (e.g. the haproxy backend sweep already exists as scripts/phase-03-core-verify.sh 3.1b). Route to, extend, or fix the existing artifact; write new only when nothing covers the need, and say which search came up empty.

First decision: what kind of block is this?

The error-handling regime depends on execution mode. Choosing wrong is itself a bug:

  1. Executed script (bash script.sh, own process): set -uo pipefail at minimum; set -e acceptable and usually right, with the capture caveats below. Exit codes are the interface.
  2. Interactive paste block (operator pastes into their shell): NEVER a bare exit (it kills their shell) and no set -e (it can kill their shell's options or abort mid-paste). Wrap the whole block in a subshell ( { ...; } ) so a stray exit is contained; signal failure by printing FAIL: ... lines the operator can read.
  3. Verify/count-gate block (greps that legitimately return zero matches): run WITHOUT set -e, and end every count-grep with || true - a zero count is a valid answer, not an error (L1). bash -n cannot catch this; it is behavior, not syntax.

The header contract (executed scripts)

Every script opens with a comment block stating: path + argument synopsis; what it does (one paragraph, referencing the D-NNN/DOCFIX it implements); whether it mutates anything ("Mutates NOTHING" / what it changes and the gate protecting it); usage line; exit-code contract; and "ASCII + LF". House exit codes: 0 PROCEED, 1 HOLD (a gate failed), 2 precondition missing (tool absent, wrong model, helper not found). Then:

set -euo pipefail            # or set -uo pipefail; see regime above
shopt -s inherit_errexit 2>/dev/null || true
IFS=$'\n\t'

Source shared constants instead of restating them: . "$SCRIPT_DIR/lib-net.sh" (planes, VIP bands, helpers) and lib-hosts.sh (hostnames, octets, system_id resolution). If a value you need is not in a lib, consider adding it there rather than inlining a literal.

Hardening rules (each one is a scar)

SIGPIPE races break guards in BOTH directions. cmd | grep -q X under pipefail: on match, grep exits, the producer takes SIGPIPE (141), the pipeline reports failure despite the match. In an ... || die verify this FALSE-DIES on success; in an ... && die guard it FAILS OPEN -- the 2026-07 sweep found a duplicate-CIDR guard that let collisions through exactly this way. Treat every | grep -q on a live pipe as a defect regardless of which way the test points.

Pipefail + SIGPIPE race. cmd | grep -q X under pipefail falsely fails: grep -q closes the pipe on first match, SIGPIPE (141) kills the producer (juju ssh especially). Capture, then test:

OUT=$(cmd 2>&1 || true)
grep -q "pattern" <<<"$OUT"

set -e kills id-captures silently. ID=$(openstack ... || die ...) - if the subshell exits non-zero before your handler fires, set -e aborts the assignment line with no message. Append || true to each capture, then validate the captured value explicitly (see whole-output validation below).

Whole-output validation, never extract-then-check. Do not pipe raw output through awk/grep to extract a field and then test the fragment - a partial failure yields a plausible-looking fragment. Capture the WHOLE output, validate its shape (e.g. is_id(){ [[ "$1" =~ ^[0-9a-f]{32}$ ]]; } for a keystone id), and only then use it.

Centralize </dev/null in a wrapper, not per call. The house pattern is a 2-line helper (rc(), rcap(), J()) that appends </dev/null 2>&1 once; every call site then stays clean and un-forgettable. Heredoc-payload ssh (ssh ... bash -s <<'EOF') is the ONE exemption -- stdin IS the delivery there.

Inner stdin consumption. Any ssh/sudo/juju ssh inside a heredoc, pipe, or loop eats the remaining stdin and truncates the block. Append </dev/null to EVERY inner invocation (</dev/tty only when it genuinely must prompt).

read and ... | bash are mutually exclusive. A script piped to bash has the pipe as stdin, so read returns empty at EOF - this once silently created a passwordless user. For paste-safety AND working prompts: base64- decode to a file, then run the file (it inherits the terminal stdin).

juju invocation shape. juju ssh -m <model> - the -m flag goes BEFORE the target. juju run <unit> <action> output: use --format json for anything captured; confirm long actions via juju show-operation <N>, not the streamed log (a wait-timeout does not mean the hook failed).

Snap confinement. The openstack CLI snap cannot read /tmp - stage files under $HOME. Same for juju attach-resource payloads.

Client output ordering. openstack -f value -c X -c Y returns columns in ALPHABETICAL order, not flag order. When order matters: -f json | jq, or single-column queries. After any jq returning null, run jq 'keys' - key casing is command-specific (Title-Case in lists, hyphenated in quota show).

Environment isolation. Any block that switches identity runs in a subshell that first unsets all OS* vars: `( for v in $(env | awk -F= '/^OS/{print $1}'); do unset "$v"; done; export ... )ThreadOS_CACERT` explicitly into isolated subshells - it gets stripped. Confirm scope before acting: which project/domain the token holds, not which you intended.

Stable keys, not drifting IDs. Look up MAAS subnets by CIDR, machines by hostname, CAPI CRs by LISTING then operating on the exact returned name (the OpenStackCluster suffix is random per create - a wrong-name patch silently no-ops). system_ids are re-minted on re-enrollment.

Verify the launched cmdline, not the config text. For OpenStack debs run via LSB-init-wrapped systemd: a flag "present in a file" proves nothing. Gate on behavior - the init script's show-args, ps -ww -C <daemon> -o args on the live process, and probe under the daemon's RESTRICTED PATH (env -i PATH=/usr/sbin:/usr/bin:/sbin:/bin sh -c 'command -v helm') - an interactive shell's PATH masks daemon-PATH failures.

sed is not a verifier. A non-matching sed -i exits 0 having changed nothing. Assert the post-edit content; never trust sed's exit code as proof of the edit.

ASCII + LF, validated. Non-ASCII: grep -nP '[^\x00-\x7F]' or a Python byte read. CR check: Python data.count(b"\x0d") - grep $'\r' false- positives on $r... tokens. Non-ASCII in conf.d has silently killed daemons.

Python helpers live in .py files tested against fixtures - no inline python-in-bash beyond one-liners. Do not assume jq exists off-jumphost; gate on it (need_jq) or use python3.

Secrets in scripts: whitelist-write to 0600 files under a 0700 dir (umask 077 first), never echo, never argv, measure secret length from the file rather than asserting an expected length (this deployment's app-cred secrets are 86 chars, not the commonly assumed 43 - never hardcode either).

Testing: nothing ships on bash -n

bash -n validates parse, not behavior - and most of the failures above are behavioral. Every nontrivial script gets an offline regression harness at tests/<script-name>/run-tests.sh in the established pattern:

  • fakebin/ contains fake juju / openstack / maas / kubectl / ssh executables that replay fixture output and log the calls they receive; real coreutils stay real.
  • The harness sets PATH="$BIN:$PATH", a scratch HOME, and env-injected fixture paths, runs the target, and asserts BOTH the exit code and an output regex per case: run <want_rc> <regex> <label>.
  • Cover: the happy path, each gate's failure branch (asserting HOLD not crash), each precondition (exit 2), and any DOCFIX behavior the script claims (e.g. "verify-first never creates a rule when one exists").
  • Read an existing harness (e.g. tests/phase-07-conductor-graft/) before writing a new one; match its shape.

For MUTATING scripts, the harness fakebin is STATEFUL: the fake maas/juju advances a phase file on each mutation call so post-mutation verification reads post-mutation reality (see tests/phase-00-teardown-d061 -- canary survival, decompose-detection, substrate-collision aborts). A --no-prompt flag on a destructive script exists FOR its harness, nothing else.

Harnesses are HERMETIC: fakes shadow, deletion un-shadows. Never simulate an unavailable tool by DELETING its fake -- on a real host that un-shadows the real binary and the test falls through to live infrastructure (a jumphost gauntlet run reached real Charmhub this way). Simulate failure with a FAILING fake (prints a realistic error, exits nonzero); keep the fail-loud-on-unmatched arm in every fake so an uncovered call can never silently hit a real CLI.

A script migration commit MUST carry its harness. The D-060 revert updated the scripts and left their harnesses testing the retired D-058 world -- red-at- HEAD tests that trained everyone to ignore red. If you change a script's behavior or vocabulary, the same commit updates fixtures and expectations.

Static contract: bash scripts/repo-lint.sh must exit 0 before anything is delivered (L1 ASCII/LF, L2 stale tokens, L3 ghost refs, L4 deprecated refs, L5 numbering, L6 bare invocations). A guard script that must NAME stale tokens opts out per-file with a repo-lint: allow-stale-tokens marker in its header. Committed binaries (e.g. policies/overrides.zip) need a *.zip binary .gitattributes rule or the LF normalization corrupts them in transit. Introducing a tool dependency (jq, python module) means adding its presence gate in the same change.

Read-only verify scripts DETECT and report; remediation stays a gated human step even when the fix is obvious - the script's job is evidence, the operator's job is the mutation decision.