openstack-caracal-ipv4 / docs / changelog-20260703-process-hardening.md
@JANeumatrix JANeumatrix 8 hours ago 16 KB Patches

Change log -- 2026-07-02/03 redeploy-readiness + process-hardening patchset

CUMULATIVE: this covers BOTH work blocks (the DOCFIX-066..072 sweep patches and the process-improvement build). One ZIP, extract over repo root. Every item lists its revert. Executed under blanket approval; nothing here touched live infrastructure -- all changes are repo content. Validation state at packaging: merged-tree repo lint PASS (1 documented WARN), bundle gate PASS, four test harnesses 33/33 cases, all python compiles. Identifier numbering verified next-free at HEAD 690779a; RE-GREP at commit (per discipline).

================================================================================

Block 1 -- redeploy-readiness fixes (DOCFIX-066..072)

================================================================================

1. DOCFIX-066 -- teardown runbook rebuilt around the D-061 fork

FILE: runbooks/phase-00-teardown-maas-reset.md (REWRITE)

WHY: the old spine drove the DEPRECATED phase-00-teardown.sh ("releases to Ready" -- the premise that decomposed machines 3x). New spine: the destroy path is the validated redeploy route (with the previously MISSING reenroll step), the release path is documented with the honest re-acquisition caveat, and all invocations are bash-prefixed.

REVERT: git checkout HEAD~ -- runbooks/phase-00-teardown-maas-reset.md (reverting re-arms the decompose trap; recommend not).

2. DOCFIX-067 -- octavia PKI SAN derived, not baked

FILE: runbooks/phase-01-bundle-deploy.md (CNF block + one prose line)

WHY: IP.1 was the pre-R14 literal 10.12.4.233; the current VIP is .57. The value is now derived from the bundle at generation time, echoed, and regex-gated.

VERIFY-LIVE (still queued): read the DEPLOYED cert SAN before deciding whether any live action is ever needed (none recommended -- it regenerates on the next redeploy).

REVERT: restore the literal block from git history.
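The derive, echo, and regex-gate sequence described for DOCFIX-067 could look roughly like the sketch below. It is not the runbook's CNF block: the bundle layout (applications indented two spaces, an option literally named vip under octavia) and the function names are assumptions made for illustration only.

```shell
# Sketch only. Assumed: two-space-indented application keys and a "vip:" option
# under the octavia application. All function names here are hypothetical.

derive_octavia_vip() {
    # Print the first vip: value found inside the octavia application stanza.
    awk '
        /^  octavia: *$/       { in_app = 1; next }
        in_app && /^  [^ ]/    { in_app = 0 }
        in_app && $1 == "vip:" { print $2; exit }
    ' "$1"
}

is_ipv4() {
    # True only for four dot-separated decimal octets, each 0-255.
    local ip="$1" old_ifs="$IFS" octet
    case "$ip" in
        ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
    esac
    IFS=.
    set -- $ip
    IFS="$old_ifs"
    [ "$#" -eq 4 ] || return 1
    for octet in "$@"; do
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

san_ip_from_bundle() {
    # Capture first, gate second, echo last so the operator sees the value used.
    local vip
    vip=$(derive_octavia_vip "$1")
    is_ipv4 "$vip" || { echo "FAIL: derived VIP '$vip' is not an IPv4 literal" >&2; return 1; }
    echo "IP.1 = $vip"
}
```

Called as san_ip_from_bundle bundle.yaml, the sketch prints the IP.1 line or fails closed when the derived value is missing or malformed, so a stale literal cannot slip through silently.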

3. DOCFIX-068 -- phase-01 constants de-staled

FILE: runbooks/phase-01-bundle-deploy.md ("Constants and env-literals")

WHY: carried the pre-D-052 plane map (incl. retired lbaas), hardcoded MAAS subnet ids (violates PATTERN-1) and system_ids (violates DOCFIX-040). Now points at lib-net.sh / lib-hosts.sh; only verified-current literals remain.

REVERT: git history (not recommended; values were wrong).

4. DOCFIX-069 -- exec-bit reality encoded

FILES: phase-00 runbook (folded into #1); apply-notes optional git update-index --chmod=+x scripts/*.sh.

WHY: all 37 scripts are mode 100644 (Windows commit path); bare invocation fails on a fresh clone. bash-prefix is the durable invariant; repo-lint L6 now guards regressions.

REVERT: n/a (wording only) -- to reject the rule, remove lint check L6.
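As an illustration of what an L6-style guard has to catch, the sketch below flags lines that start a scripts/*.sh path as a command instead of handing it to bash. It is a hypothetical re-creation, not the logic shipped in repo_lint.py, and it deliberately ignores harder cases such as sudo- or env-prefixed invocations.

```shell
# Sketch only: find_bare_invocations is a hypothetical name, and the pattern is
# an assumption about what "bare invocation" means for this repo.

find_bare_invocations() {
    # Usage: find_bare_invocations FILE...
    # Prints file:line:text for each line that begins with scripts/<name>.sh
    # (optionally ./-prefixed). Exit status 1 means no hits, as with grep.
    # The trailing /dev/null forces the filename prefix even for a single file.
    grep -nE '^[[:space:]]*(\./)?scripts/[A-Za-z0-9._-]+\.sh([[:space:]]|$)' "$@" /dev/null
}
```

On a fresh clone, git ls-files -s scripts/ shows the recorded mode of each script, which is a quick way to confirm the 100644 state that makes the bash prefix necessary.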

5. DOCFIX-070 -- one bundle gate instead of two disagreeing ones

FILES: scripts/provider-bundle-check.py (checks 4-7 added); scripts/review-bundle.py (git rm -- manual action, see apply notes); tests/provider-bundle-check/ (NEW: 8-case harness).

WHY: review-bundle.py was pre-D-052; its FAIL=71 against HEAD was pure noise (alarm fatigue). Its still-valid checks (relation syntax/existence) are absorbed. Added: D-062 unit count, VIP-octet uniqueness, DOCFIX-071 policy wiring + zip/source content drift guard.

REVERT: git checkout the old provider-bundle-check.py; git restore review-bundle.py. (Harness T-cases will then fail -- delete the tests/provider-bundle-check/ dir too.)

6. DOCFIX-071 -- keystone policy ships IN the bundle

FILES: bundle.yaml (keystone resources: stanza, +6 lines); policies/overrides.zip (NEW, binary; built from committed source); .gitattributes (+ *.zip binary); runbooks/phase-03-core-verify.md (NEW Step 3.4: PO: gate + behavioral G3 gate); runbooks/appendix-C-identity-rbac.md (attach block: subshell-wrapped, /tmp -> repo path [snap trap], reframed as live-update path).

WHY: the D-064 attach was unreachable from the schedule -- a clean redeploy shipped WITHOUT the SCS Domain Manager RBAC; the only written procedure used the snap-blocked /tmp form.

REVERT: remove the 6 bundle lines + the zip; restore appendix-C block from history. (Reverting re-opens the blocker; the phase-03 gate would then FAIL by design -- that is the guard working.)

7. DOCFIX-072 / D-043 -- decision brought level with the bundle

FILE: docs/design-decisions.md (D-043 RESOLVED -> ADOPTED option (a)).

WHY: bundle set resume-guests-state-on-host-boot=True while D-043 was still PROPOSED. Adopted: auto-resume is the tenant-VM norm; D-041 unchanged for control-plane.

JUDGMENT CALL FLAGGED EARLIER: capi-mgmt-v2 WILL auto-resume; manual-start policy now governs deliberate stops only. If you want a real exclusion, that needs a mechanism -- say so and I will draft options.

REVERT: append a reversal entry (append-only doc) + remove the bundle option.

8. D-doc alignment amendments (appended, append-only discipline)

FILE: docs/design-decisions.md -- D-002 (channel matrix reconciled: etcd/easyrsa dead, memcached latest-only, vault 1.16 per D-068), D-009 (rabbitmq min-cluster-size Roosevelt delta), D-011 (item 6 = manual unseal, matching phase-08), D-051 (bundle-native delivery), D-061 (validation scope: survival vs re-acquisition; destroy = validated spine).

REVERT: each is a discrete appended section -- delete the section.

================================================================================

Block 2 -- process improvements (DOCFIX-073..078, D-069, D-070)

================================================================================

9. DOCFIX-073 -- preflight.sh: THE single pre-deploy gate

FILES: scripts/preflight.sh (NEW), scripts/channel_assert.py (NEW), tests/preflight/ (NEW, 7 cases), runbooks/phase-01-bundle-deploy.md (prerequisites GATE added -- pre-flight was previously invoked by NO runbook).

WHY: gate surface was six artifacts remembered independently. preflight.sh orchestrates repo-lint -> bundle gate -> channel assert -> live pre-flight, with worst-exit aggregation and stage-2 reminders. channel_assert fulfils D-002's "verify against Charmhub each deploy" claim (previously unimplemented): every pinned channel must exist on Charmhub; typo/retired-track = FAIL; offline = WARN.

REVERT: delete the three new paths + the phase-01 GATE paragraph.
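Worst-exit aggregation, as named for preflight.sh, means every stage runs and the gate exits with the highest code any stage returned. The sketch below shows one way to do that. The 0 = PASS, 1 = WARN, 2 = FAIL/HELD mapping and the run_stage name are assumptions; the changelog does not state the real exit-code convention.

```shell
# Sketch only. Assumed exit-code convention: 0 PASS, 1 WARN, 2 FAIL/HELD.
# run_stage is a hypothetical helper, not taken from scripts/preflight.sh.

worst=0

run_stage() {
    # Usage: run_stage NAME CMD [ARGS...]
    # Runs one stage, reports its verdict, and keeps the highest exit code seen.
    # A failing stage does not stop later stages from running.
    local name="$1" rc=0
    shift
    "$@" || rc=$?
    echo "stage=$name rc=$rc"
    if [ "$rc" -gt "$worst" ]; then
        worst=$rc
    fi
    return 0
}
```

A caller would invoke run_stage once per stage in order (repo-lint, bundle gate, channel assert, live pre-flight) and finish with exit "$worst". Calling run_stage inside a command substitution would lose the aggregate, because the variable update would happen in a subshell.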

10. DOCFIX-074 -- repo-lint: the drift sweep, productized

FILES: scripts/repo_lint.py + scripts/repo-lint.sh (NEW), tests/repo-lint/ (NEW, 9 cases).

Plus SANITIZE remediation of every ASCII-rule violation it found: docs/v1-pre-deploy-fixes.md, docs/netbox-vip-queue.md, .gitignore, netbox/README.md, netbox/ipv4-prefixes-import.py, netbox/ipv6-mark-reserved.py (punctuation transliteration only; python files re-compiled OK).

Plus two catches it made against MY OWN morning patch and bundle HEAD, both fixed: a missed .233 prose line in phase-01; a stale storage/replication CIDR comment in bundle.yaml.

CHECKS: L1 encoding, L2 stale tokens (with explicit per-file opt-out marker for guard scripts), L3 ghost script refs, L4 deprecated refs, L5 numbering collisions + next-free report, L6 bare invocations.

REVERT: delete scripts + tests; restore pre-sanitize files from history.
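The next-free half of check L5 can be approximated with a whole-word scan for identifiers followed by a numeric sort. This is a hypothetical sketch, not the repo_lint.py implementation: next_free_id is an invented name, and range shorthand such as DOCFIX-066..072 only contributes its first number here, so real logic would need to expand ranges.

```shell
# Sketch only: next_free_id is a hypothetical helper. Relies on grep -o/-w/-h,
# which GNU and BSD grep provide.

next_free_id() {
    # Usage: next_free_id PREFIX FILE...
    # Prints PREFIX-NNN, where NNN is one past the highest number already used.
    # -w keeps a prefix like "D" from matching inside tokens such as VID-104.
    # Every stage of the pipeline reads to EOF, so no reader closes early.
    local prefix="$1" max
    shift
    max=$(grep -owhE "${prefix}-[0-9]+" "$@" | sed "s/^${prefix}-0*//" | sort -n | tail -n 1)
    printf '%s-%03d\n' "$prefix" "$(( ${max:-0} + 1 ))"
}
```

Run against the docs and runbooks trees, this gives a starting point for the re-grep at commit time that the changelog header calls for.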

11. DOCFIX-075 -- cloud-assert.sh: the behavioral verifier + BOM capture

FILES: scripts/cloud-assert.sh (NEW), tests/cloud-assert/ (NEW, 9 fakebin cases), runbooks/ops-restart-procedure.md (NEW -- the restart procedure, previously operator-local, adapted to current references with [REVALIDATE] markers and committed).

WHY: the D-045/D-046/D-051/D-042 family = "juju green, service broken". One idempotent read-only sweep of every service-own-verdict gate (vault seal, mysql 1xR/W, OVN unity, chassis, hypervisors, LBs, PO:, trustee domain, coe 403, conductor LIVE args). Missing admin scope = HELD exit 2, never a silent pass. --capture writes a committed asbuilt// BOM (Roosevelt drift baseline). Replaces the never-committed post-maintenance-health-check.sh.

REVERT: delete the three paths.

12. DOCFIX-076 -- as-executed log convention

FILES: scripts/run-logged.sh (NEW), docs/as-executed-log-convention.md (NEW), logs/as-executed-index.md (NEW, committed index; content stays jumphost-only).

WHY: the verbatim-retrieval rule for one-shot steps depended on an artifact with no defined location/format (DOCFIX-006 is the cautionary tale).

REVERT: delete the three paths.

13. DOCFIX-077 -- appendix-A completions

FILE: runbooks/appendix-A-troubleshooting.md (two appended sections): mysql-innodb-cluster recovery (D-062 signatures + the destructive-action guard) and the point-of-use identifier index (DOCFIX-027/028/029/034/037, BUNDLEFIX-001..006) closing the dangling-reference gap.

REVERT: delete the two appended sections.

14. DOCFIX-078 -- security exposure ledger

FILE: docs/security-ledger.md (NEW).

SEEDED: SEC-001 libvirt credential exposure (was living only in a script header), SEC-002 juju action-log token rule, SEC-003 vault custody, SEC-004 repo-public flag.

REVERT: delete the file.

15. D-069 -- vault unseal-key custody (ADOPTED, policy)

FILE: docs/design-decisions.md (appended).

WHAT: split custody (no individual holds threshold), second-person unseal rehearsal as an acceptance item, custody re-cut review at every re-init. Custodian ASSIGNMENT deliberately left as operator input (SEC-003).

REVERT: append a reversal entry.

16. D-070 -- supersedes D-012 (no KVM snapshot restore path)

FILE: docs/design-decisions.md (appended).

WHAT: D-012 was never exercised (no virsh snapshot step exists anywhere); rebuild-from-runbooks (D-017/D-018) is THE restore path; baseline-capture role moves to cloud-assert --capture. Counter-argument recorded in the entry (mid-rehearsal rollback convenience) with the revisit condition.

REVERT: append a reversal entry (this one is the most opinion-weighted item in the set -- if you disagree anywhere, it is probably here).

================================================================================

Manual git actions (cannot ship in a zip)

================================================================================

    git rm scripts/review-bundle.py        # DOCFIX-070
    git rm scripts/phase-00-teardown.sh    # deprecated; runbook no longer
                                           # references it by path (repo-lint
                                           # L3/L4 assume it gone)

optional:

    git update-index --chmod=+x scripts/*.sh

Pre-commit verify (jumphost or Windows+python3)

    python3 scripts/repo_lint.py .                         # expect: 0 fail, 1 legacy WARN
    python3 scripts/provider-bundle-check.py bundle.yaml   # expect: PASS, 6 [ok]
    for t in tests/*/run-tests.sh; do bash "$t"; done      # expect: 4x ALL PASS (33 cases)

Still queued (unchanged by this patchset)

Verify-live: octavia deployed-cert SAN read; juju info channel probe on the jumphost (channel_assert now automates it at preflight); second-person unseal rehearsal (D-069).

Operator inputs: vault custodian assignment (SEC-003); ruling if a real capi-mgmt auto-resume exclusion is wanted (item 7).

================================================================================

Block 3 -- code hardening sweep (all repo scripts + harness estate)

================================================================================

Method: automated audit of every script and runbook paste block against the house hardening rules (SIGPIPE, capture-die, inner-ssh stdin, /tmp-snap, gawk, sed -i, column-order, juju-run capture, bare-exit-in-paste), then MANUAL verification of every hit before any change.

Audit honesty notes: H3 (inner ssh) and H7 (paste-block exit) closed at ZERO true positives -- the rc/rcap/J wrapper convention and the subshell/remote-quote discipline held everywhere; H2/H4/H6/H8 clean.

17. DOCFIX-079 -- seven SIGPIPE (H1) fixes: capture-then-test conversions

FILES:

  • scripts/phase-02-vault-preflight.sh, phase-03-core-verify.sh (model-presence gate: race read a PRESENT model as absent -> false HOLD)
  • phase-06-bootstrap.sh (role verify-first: race re-ran grants)
  • phase-06-net-setup.sh (SG-rule verify-first: race -> duplicate create -> 409 aborts the run under set -e)
  • phase-07-conductor-graft.sh + its runbook twin (helm version check: converted to substring param-expansion, no pipe at all)
  • tenant-onboard.sh x2 -- the serious pair: the manager-grant verify could FALSE-DIE on a successful grant, and the duplicate-CIDR guard FAILED OPEN (match -> SIGPIPE 141 -> '&& die' skipped -> colliding subnet proceeds)

NEW TEST: tests/tenant-onboard/run-tests.sh proves the CIDR guard fires on a collision and stays quiet on a clear CIDR.

REVERT: each is a commented, anchored block -- git history per file.
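The fail-open chain above arises when a producer is piped into an early-exiting reader such as grep -q under pipefail: on a match the reader closes the pipe, the producer can die with SIGPIPE (141), the pipeline reports failure, and the '&& die' never runs. A capture-then-test conversion removes the pipe entirely. The sketch below shows the pattern in general form. cidr_in_use and its three-way return convention are illustrative assumptions, not the code in tenant-onboard.sh.

```shell
# Sketch only: cidr_in_use is a hypothetical helper showing capture-then-test.
# Return convention assumed here: 0 = in use, 1 = clear, 2 = could not verify.

cidr_in_use() {
    # Usage: cidr_in_use CIDR LISTER [ARGS...]
    # LISTER must print one CIDR per line. Its full output is captured first,
    # then matched in-process, so no reader can close early on the producer.
    local want="$1" listing nl
    shift
    nl='
'
    listing=$("$@") || { echo "FAIL: listing command failed" >&2; return 2; }
    case "$nl$listing$nl" in
        *"$nl$want$nl"*) return 0 ;;   # exact whole-line match
    esac
    return 1
}
```

A caller should treat anything other than status 1 as a reason to stop, so that a failed listing blocks subnet creation instead of letting it proceed unverified.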

18. DOCFIX-080 -- tenant-onboard image-ID resolution off column-order luck

FILE: scripts/tenant-onboard.sh (2 sites) + a jq presence gate added.

WHY: -f value -c ID -c Name | awk '{print $1}' rode alphabetical column ordering (ID<Name today). Converted to -f json | jq per the house rule. Display-only multi-column -f value uses elsewhere (phase-05/06 confirms, tenant-acceptance echo) were left as-is deliberately -- output is read by eyes, not parsed; converting them changes operator-facing format for no robustness gain.

ADVISORY: do not parse those lines positionally later.
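Selecting by key instead of by column position can be sketched as below. The helper name is hypothetical and the filter is an illustration of the -f json | jq house rule, not the exact expression used in tenant-onboard.sh. It assumes the JSON formatter emits an array of objects keyed "ID" and "Name".

```shell
# Sketch only: image_id_by_name is a hypothetical helper. Assumes stdin is a
# JSON array of objects that carry "ID" and "Name" keys.

image_id_by_name() {
    # Usage: <json-producing command> | image_id_by_name NAME
    # Prints the ID of every image whose Name matches exactly. The lookup is by
    # key, so a change in column order cannot change which field is returned.
    command -v jq >/dev/null 2>&1 || { echo "FAIL: jq not installed" >&2; return 2; }
    jq -r --arg name "$1" '.[] | select(.Name == $name) | .ID'
}
```

Typical use would be openstack image list -f json -c ID -c Name | image_id_by_name "$image_name", with the caller checking that exactly one ID came back before using it.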

19. DOCFIX-081 -- cert-san verifier: dead SKIP branch / silent non-TLS drop

FILE: scripts/phase-04-internal-cert-san-verify.sh.

WHY: a jq https-only pre-filter made the non-TLS SKIP branch unreachable -- an unexpectedly plain-HTTP internal endpoint was silently hidden, which is precisely a finding the operator must see. Feed now carries ALL internal endpoints; the existing SKIP line reports non-TLS visibly. This also un-breaks tests/phase-04-internal-cert-san (the harness encoded the correct behavior; the script had regressed under it).

REVERT: restore the jq select() -- not recommended.
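The difference between pre-filtering and classifying can be shown with a small loop that receives every endpoint and decides per line. This is a sketch of the idea only: classify_endpoints and the CHECK/SKIP wording are assumptions, not the output format of the phase-04 script.

```shell
# Sketch only: classify_endpoints is a hypothetical helper. The verdict strings
# are illustrative and do not reproduce the real script's output.

classify_endpoints() {
    # Reads one endpoint URL per line on stdin and prints a verdict for each.
    # Nothing is filtered upstream, so plain-HTTP endpoints surface as SKIP
    # lines instead of disappearing from the report.
    local url
    while IFS= read -r url; do
        case "$url" in
            '')        ;;                                   # ignore blank lines
            https://*) echo "CHECK $url" ;;
            *)         echo "SKIP  $url (non-TLS internal endpoint)" ;;
        esac
    done
}
```

With an upstream https-only filter, the SKIP branch could never run. Feeding the full list keeps the unexpected plain-HTTP endpoint in front of the operator.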

20. NEW -- tests/phase-00-teardown-d061: harness for the D-061 pair

The most destructive scripts in the repo (teardown-release/-destroy) shipped with NO tests despite a --no-prompt flag documented "tested automation only".

Stateful fakebin (maas machine state advances on remove/destroy; instant sleep): 8 cases incl. canary-stops-after-one, DECOMPOSE-detection fails loud and blocks destroy-model, substrate-sid collision aborts pre-mutation, --release-storage vs --destroy-storage flag assertions, orphan deletion.

REVERT: delete the test dir (coverage loss only).

21. Harness-estate findings (pre-existing at HEAD; verified against pristine)

RETIRE (manual git actions -- harnesses for retired scripts test nothing):

    git rm -r tests/phase-00-teardown      # script git rm'd this patchset;
                                           # replaced by tests/phase-00-teardown-d061
    git rm -r tests/provider-vip-standup   # script already gone at HEAD
    git rm -r tests/phase-00-maas-recidr   # script already gone at HEAD

REPAIR SPECS (stale-but-live; harnesses still test the D-058 world their scripts left at D-060 -- red-at-HEAD today, which trains ignoring red):

  • tests/phase-00-maas-standup: regenerate make_fixtures.py fixtures and the WOULD/SKIP/DRIFT expectation strings for the D-052/D-053 six-plane table (source of truth: scripts/lib-net.sh) -- drop provider-vip/VID-104/10.12.20.0/22 expectations; fresh-cloud case expects the six-plane create plan + current reserved ranges; drift cases keyed to the current wording ("occupied by the wrong plane", "re-CIDR/migration").
  • tests/carve-host-interfaces: rebuild fix/*.json interface trees and expectations to Pattern A (br-ex OVS on enp1s0, br-metal/.103/br-internal, raw data/storage/replication statics; metal-internal fixtures must carry VID 103) -- drop enp1s0.104 / br-prov-api expectations.

Process lesson encoded for the future: a script migration commit MUST carry its harness (these went stale in the D-060 revert commit).

22. Audit-clean attestations (for the record)

Runbook paste blocks: every exit verified nested in a subshell or a remote-quoted command -- zero operator-shell-killing exits.

Inner ssh/juju-ssh: all argv-style invocations route through </dev/null-appending wrappers or terminate with </dev/null; heredoc-payload ssh (bash -s) correctly exempt.

No /tmp-snap, gawk-ism, unasserted sed -i, or capture-die instances anywhere.
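Two of the attested rules are easy to show in miniature. The sketch below is illustrative only: the gate body, the placeholder VIP value, and the run_on_each helper are assumptions, not excerpts from the runbooks or from the rc/rcap/J wrappers.

```shell
# Sketch only. Rule 1: an exit inside a pasted block must sit in a subshell,
# so it ends the subshell and not the operator's interactive shell.
(
    vip="10.12.4.57"                        # placeholder value for the sketch
    [ -n "$vip" ] || { echo "HOLD: no VIP"; exit 1; }
    echo "gate passed for $vip"
)
echo "paste block rc=$?"                    # the calling shell is still alive here

# Rule 2: a command run inside a read loop gets stdin from /dev/null, so it
# cannot swallow the remaining lines of the list the loop is reading.
run_on_each() {
    # Usage: run_on_each CMD [ARGS...] < host-list
    # Runs CMD once per input line, appending the line as the last argument.
    local host
    while IFS= read -r host; do
        "$@" "$host" </dev/null
    done
}
```

ssh reads stdin by default, which is why an argv-style ssh inside such a loop silently stops after the first host unless stdin is detached. A heredoc-payload ssh (bash -s) needs its stdin and is rightly exempt.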