CUMULATIVE: this covers BOTH work blocks (the DOCFIX-066..072 sweep patches and the process-improvement build). One ZIP, extract over repo root. Every item lists its revert. Executed under blanket approval; nothing here touched live infrastructure -- all changes are repo content. Validation state at packaging: merged-tree repo lint PASS (1 documented WARN), bundle gate PASS, four test harnesses 33/33 cases, all python compiles. Identifier numbering verified next-free at HEAD 690779a; RE-GREP at commit (per discipline).
================================================================================
================================================================================
FILE: runbooks/phase-00-teardown-maas-reset.md (REWRITE) WHY: old spine drove the DEPRECATED phase-00-teardown.sh ("releases to Ready" -- the premise that decomposed machines 3x). New spine: destroy path as the validated redeploy route (with the previously MISSING reenroll step), release path documented with the honest re-acquisition caveat, all invocations bash-prefixed. REVERT: git checkout HEAD~ -- runbooks/phase-00-teardown-maas-reset.md (reverting re-arms the decompose trap; recommend not).
FILE: runbooks/phase-01-bundle-deploy.md (CNF block + one prose line) WHY: IP.1 was the pre-R14 literal 10.12.4.233; current VIP is .57. Now derived from the bundle at generation time, echoed, regex-gated. VERIFY-LIVE still queued: read the DEPLOYED cert SAN before deciding whether any live action is ever needed (none recommended -- regenerates next redeploy). REVERT: restore the literal block from git history.
FILE: runbooks/phase-01-bundle-deploy.md ("Constants and env-literals") WHY: carried the pre-D-052 plane map (incl. retired lbaas), hardcoded MAAS subnet ids (violates PATTERN-1) and system_ids (violates DOCFIX-040). Now points at lib-net.sh / lib-hosts.sh; only verified-current literals remain. REVERT: git history (not recommended; values were wrong).
FILES: phase-00 runbook (folded into #1); apply-notes optional git update-index --chmod=+x scripts/*.sh. WHY: all 37 scripts are mode 100644 (Windows commit path); bare invocation fails on a fresh clone. bash-prefix is the durable invariant; repo-lint L6 now guards regressions. REVERT: n/a (wording only) -- to reject the rule, remove lint check L6.
FILES: scripts/provider-bundle-check.py (checks 4-7 added); scripts/review-bundle.py (git rm -- manual action, see apply notes); tests/provider-bundle-check/ (NEW: 8-case harness). WHY: review-bundle.py was pre-D-052; FAIL=71 pure noise against HEAD (alarm fatigue). Its still-valid checks (relation syntax/existence) are absorbed; added: D-062 unit count, VIP-octet uniqueness, DOCFIX-071 policy wiring + zip/source content drift guard. REVERT: git checkout the old provider-bundle-check.py; git restore review-bundle.py. (Harness T-cases will then fail -- delete tests/ dir too.)
FILES: bundle.yaml (keystone resources: stanza, +6 lines); policies/overrides.zip (NEW, binary; built from committed source); .gitattributes (+ *.zip binary); runbooks/phase-03-core-verify.md (NEW Step 3.4: PO: gate + behavioral G3 gate); runbooks/appendix-C-identity-rbac.md (attach block: subshell-wrapped, /tmp -> repo path [snap trap], reframed as live-update path). WHY: the D-064 attach was unreachable from the schedule -- a clean redeploy shipped WITHOUT the SCS Domain Manager RBAC; the only written procedure used the snap-blocked /tmp form. REVERT: remove the 6 bundle lines + the zip; restore appendix-C block from history. (Reverting re-opens the blocker; the phase-03 gate would then FAIL by design -- that is the guard working.)
FILE: docs/design-decisions.md (D-043 RESOLVED -> ADOPTED option (a)). WHY: bundle set resume-guests-state-on-host-boot=True while D-043 was still PROPOSED. Adopted: auto-resume is the tenant-VM norm; D-041 unchanged for control-plane. JUDGMENT CALL FLAGGED EARLIER: capi-mgmt-v2 WILL auto-resume; manual-start policy now governs deliberate stops only. If you want a real exclusion, that needs a mechanism -- say so and I will draft options. REVERT: append a reversal entry (append-only doc) + remove the bundle option.
FILE: docs/design-decisions.md -- D-002 (channel matrix reconciled: etcd/easyrsa dead, memcached latest-only, vault 1.16 per D-068), D-009 (rabbitmq min-cluster-size Roosevelt delta), D-011 (item 6 = manual unseal, matching phase-08), D-051 (bundle-native delivery), D-061 (validation scope: survival vs re-acquisition; destroy = validated spine). REVERT: each is a discrete appended section -- delete the section.
================================================================================
================================================================================
FILES: scripts/preflight.sh (NEW), scripts/channel_assert.py (NEW), tests/preflight/ (NEW, 7 cases), runbooks/phase-01-bundle-deploy.md (prerequisites GATE added -- pre-flight was previously invoked by NO runbook). WHY: gate surface was six artifacts remembered independently. Orchestrates repo-lint -> bundle gate -> channel assert -> live pre-flight, worst-exit aggregation, stage-2 reminders. channel_assert fulfils D-002's "verify against Charmhub each deploy" claim (previously unimplemented): every pinned channel must exist on Charmhub; typo/retired-track = FAIL; offline = WARN. REVERT: delete the three new paths + the phase-01 GATE paragraph.
FILES: scripts/repo_lint.py + scripts/repo-lint.sh (NEW), tests/repo-lint/ (NEW, 9 cases). Plus SANITIZE remediation of every ASCII-rule violation it found: docs/v1-pre-deploy-fixes.md, docs/netbox-vip-queue.md, .gitignore, netbox/README.md, netbox/ipv4-prefixes-import.py, netbox/ipv6-mark-reserved.py (punctuation transliteration only; python files re-compiled OK). Plus two catches it made against MY OWN morning patch and bundle HEAD, both fixed: a missed .233 prose line in phase-01; a stale storage/replication CIDR comment in bundle.yaml. CHECKS: L1 encoding, L2 stale tokens (with explicit per-file opt-out marker for guard scripts), L3 ghost script refs, L4 deprecated refs, L5 numbering collisions + next-free report, L6 bare invocations. REVERT: delete scripts + tests; restore pre-sanitize files from history.
FILES: scripts/cloud-assert.sh (NEW), tests/cloud-assert/ (NEW, 9 fakebin cases), runbooks/ops-restart-procedure.md (NEW -- the restart procedure, previously operator-local, adapted to current references with [REVALIDATE] markers and committed). WHY: the D-045/D-046/D-051/D-042 family = "juju green, service broken". One idempotent read-only sweep of every service-own-verdict gate (vault seal, mysql 1xR/W, OVN unity, chassis, hypervisors, LBs, PO:, trustee domain, coe 403, conductor LIVE args). Missing admin scope = HELD exit 2, never a silent pass. --capture writes a committed asbuilt// BOM (Roosevelt drift baseline). Replaces the never-committed post-maintenance-health-check.sh. REVERT: delete the three paths.
FILES: scripts/run-logged.sh (NEW), docs/as-executed-log-convention.md (NEW), logs/as-executed-index.md (NEW, committed index; content stays jumphost-only). WHY: the verbatim-retrieval rule for one-shot steps depended on an artifact with no defined location/format (DOCFIX-006 is the cautionary tale). REVERT: delete the three paths.
FILE: runbooks/appendix-A-troubleshooting.md (two appended sections): mysql-innodb-cluster recovery (D-062 signatures + the destructive-action guard) and the point-of-use identifier index (DOCFIX-027/028/029/034/037, BUNDLEFIX-001..006) closing the dangling-reference gap. REVERT: delete the two appended sections.
FILE: docs/security-ledger.md (NEW). Seeded: SEC-001 libvirt credential exposure (was living only in a script header), SEC-002 juju action-log token rule, SEC-003 vault custody, SEC-004 repo-public flag. REVERT: delete the file.
FILE: docs/design-decisions.md (appended). WHAT: split custody (no individual holds threshold), second-person unseal rehearsal as an acceptance item, custody re-cut review at every re-init. Custodian ASSIGNMENT deliberately left as operator input (SEC-003). REVERT: append a reversal entry.
FILE: docs/design-decisions.md (appended). WHAT: D-012 was never exercised (no virsh snapshot step exists anywhere); rebuild-from-runbooks (D-017/D-018) is THE restore path; baseline-capture role moves to cloud-assert --capture. Counter-argument recorded in the entry (mid-rehearsal rollback convenience) with the revisit condition. REVERT: append a reversal entry (this one is the most opinion-weighted item in the set -- if you disagree anywhere, it is probably here).
================================================================================
================================================================================ git rm scripts/review-bundle.py # DOCFIX-070 git rm scripts/phase-00-teardown.sh # deprecated; runbook no longer
# references it by path (repo-lint
# L3/L4 assume it gone)
python3 scripts/repo_lint.py . # expect: 0 fail, 1 legacy WARN python3 scripts/provider-bundle-check.py bundle.yaml # expect: PASS, 6 [ok] for t in tests/*/run-tests.sh; do bash "$t"; done # expect: 4x ALL PASS (33 cases)
Verify-live: octavia deployed-cert SAN read; juju info channel probe on the jumphost (channel_assert now automates it at preflight); second-person unseal rehearsal (D-069). Operator inputs: vault custodian assignment (SEC-003); ruling if a real capi-mgmt auto-resume exclusion is wanted (item 7).
================================================================================
================================================================================ Method: automated audit of every script and runbook paste block against the house hardening rules (SIGPIPE, capture-die, inner-ssh stdin, /tmp-snap, gawk, sed -i, column-order, juju-run capture, bare-exit-in-paste), then MANUAL verification of every hit before any change. Audit honesty notes: H3 (inner ssh) and H7 (paste-block exit) closed at ZERO true positives -- the rc/rcap/J wrapper convention and the subshell/remote-quote discipline held everywhere; H2/H4/H6/H8 clean.
FILES: scripts/phase-02-vault-preflight.sh, phase-03-core-verify.sh (model- presence gate: race read a PRESENT model as absent -> false HOLD); phase-06-bootstrap.sh (role verify-first: race re-ran grants); phase-06-net-setup.sh (SG-rule verify-first: race -> duplicate create -> 409 aborts the run under set -e); phase-07-conductor-graft.sh + its runbook twin (helm version check: converted to substring param-expansion, no pipe at all); tenant-onboard.sh x2 -- the serious pair: the manager-grant verify could FALSE-DIE on a successful grant, and the duplicate-CIDR guard FAILED OPEN (match -> SIGPIPE 141 -> '&& die' skipped -> colliding subnet proceeds). NEW TEST: tests/tenant-onboard/run-tests.sh proves the CIDR guard fires on a collision and stays quiet on a clear CIDR. REVERT: each is a commented, anchored block -- git history per file.
FILE: scripts/tenant-onboard.sh (2 sites) + a jq presence gate added. WHY: -f value -c ID -c Name | awk '{print $1}' rode alphabetical column ordering (ID<Name today). Converted to -f json | jq per the house rule. Display-only multi-column -f value uses elsewhere (phase-05/06 confirms, tenant-acceptance echo) were left as-is deliberately -- output is read by eyes, not parsed; converting them changes operator-facing format for no robustness gain. ADVISORY: do not parse those lines positionally later.
FILE: scripts/phase-04-internal-cert-san-verify.sh. WHY: a jq https-only pre-filter made the non-TLS SKIP branch unreachable -- an unexpectedly plain-HTTP internal endpoint was silently hidden, which is precisely a finding the operator must see. Feed now carries ALL internal endpoints; the existing SKIP line reports non-TLS visibly. This also un- breaks tests/phase-04-internal-cert-san (the harness encoded the correct behavior; the script had regressed under it). REVERT: restore the jq select() -- not recommended.
The most destructive scripts in the repo (teardown-release/-destroy) shipped with NO tests despite a --no-prompt flag documented "tested automation only". Stateful fakebin (maas machine state advances on remove/destroy; instant sleep): 8 cases incl. canary-stops-after-one, DECOMPOSE-detection fails loud and blocks destroy-model, substrate-sid collision aborts pre-mutation, --release-storage vs --destroy-storage flag assertions, orphan deletion. REVERT: delete the test dir (coverage loss only).
RETIRE (manual git actions -- harnesses for retired scripts test nothing): git rm -r tests/phase-00-teardown # script git rm'd this patchset;
# replaced by tests/phase-00-teardown-d061
git rm -r tests/provider-vip-standup # script already gone at HEAD git rm -r tests/phase-00-maas-recidr # script already gone at HEAD REPAIR SPECS (stale-but-live; harnesses still test the D-058 world their scripts left at D-060 -- red-at-HEAD today, which trains ignoring red):
Runbook paste blocks: every exit verified nested in a subshell or a remote- quoted command -- zero operator-shell-killing exits. Inner ssh/juju-ssh: all argv-style invocations route through </dev/null-appending wrappers or terminate with </dev/null; heredoc-payload ssh (bash -s) correctly exempt. No /tmp-snap, gawk-ism, unasserted sed -i, or capture-die instances anywhere.
================================================================================
================================================================================
One command runs every tests/*/run-tests.sh: summary line each, worst exit, optional substring filter, exit 2 on no match. Deliberately NOT folded into preflight.sh: preflight gates the DEPLOY TARGET, this gates the TOOLS -- different cadence, different failure meaning (run it after any script/harness change per the change-delivery loop). REVERT: delete the file.
make_fixtures.py + run-tests.sh regenerated from the CURRENT script vocabulary and lib-net plane table (was: D-058 provider-vip/VID-104/10.12.20 expectations, red at HEAD). Four scenarios: fresh full create plan (asserts the retired provider-vip can never reappear), all-SKIP done, wrong-plane DRIFT + refuse, wrong-VID drift. REVERT: git history (restores a red harness).
Fixture generator (new make_fixtures.py) + expectations rebuilt: fresh host full Pattern-A plan (uplink L3 unlink -> br-ex OVS + static; br-metal/.103/ br-internal chain; raw data/storage/replication statics; enp11s0 idle), missing-subnet and wrong-VID hard gates, all-SKIP idempotent re-run; forbids br-prov-api/enp1s0.104/provider-vip forever. REVERT: git history (red harness).
The D-011 runner's drafting list still targeted DROPPED features (Designate resolution -- D-019; snapshot existence -- D-012, superseded by D-070) and predated the manual-unseal ruling. List now points at the existing building blocks (cloud-assert / tenant-acceptance / run-tests-all) and the genuinely missing items, so the eventual implementation is built to the real spec.
The repo carried only the packaged .skill zip (opaque to grep/diff/repo-lint). The unpacked SKILL.md + references/ (v1.2) now live beside it as the reviewable source of truth; regenerate the .skill from this folder on any change (or drop the zip -- operator's call). Divergence rule applies: repo source wins over any installed copy.
Runbook->script contract cross-check: CLEAN (both automated hits verified false -- reenroll --check and carve --apply exist under parser styles the checker missed). Loop-efficiency scan: clean (flagged sites are watchers/polls by design). Exit-code audit: clean (tenant-acceptance's 11-14 are a declared per-phase contract). deploy-watch, maas-fabric-prune, juju-spaces-check, osd-blank-check: read; disciplined; no findings.