diff --git a/docs/handoff-20260703-open-items.md b/docs/handoff-20260703-open-items.md new file mode 100644 index 0000000..9b95428 --- /dev/null +++ b/docs/handoff-20260703-open-items.md @@ -0,0 +1,164 @@ +# Handoff -- open items register (2026-07-03 session close) + +STATUS SNAPSHOT at handoff: repo HEAD abc144a; gauntlet ALL GREEN (26 +harnesses); repo-lint 0 fail (1 documented legacy WARN); Claude Code live on +the jumphost with CLAUDE.md + permission rules + PreToolUse guard (13/13); +skill single-sourced at .claude/skills/openstack-cloud-ops (v1.3). Two +blockers from the redeploy-readiness sweep (DOCFIX-066 teardown spine, +DOCFIX-071 policy delivery) are FIXED and merged. Full change record: +docs/changelog-20260703-process-hardening.md (items 1-32, per-item reverts). + +Numbering at handoff (re-grep before assigning -- L5 prints it): +next-free D-071, DOCFIX-082, BUNDLEFIX-009. + +Conventions for whoever picks this up: repo wins over any session memory; +grep design-decisions before changing a built surface; every script change +ships with its harness green + a changelog entry with a revert; mutations are +individually human-gated (the permission ask-rules enforce this on the +jumphost -- do not work around them). + +-------------------------------------------------------------------------------- +## 1. Immediate (small, well-defined; good first Claude Code tasks) +-------------------------------------------------------------------------------- + +H-1 (DOCFIX-082 candidate) -- tenant-offboard.sh hardening + The new offboard script is well-built (audit-default, typed gate, protected- + domain blocklist, dependency-ordered) but runs `set -u` only. It MUTATES in + --apply: a silently failed pipeline mid-sweep strands resources while + reporting progress. FIX: `set -uo pipefail` + capture-then-test conversions + per references/script-authoring.md (both SIGPIPE directions); state the + chosen error regime in the header; strengthen is_id to the 32-hex form. + ACCEPT: tests/tenant-offboard green incl. a new failure-injection case; + changelog entry with revert. + +H-2 (DOCFIX-082 or 083) -- vault-kv-inner-probe.sh: secret off argv + The AppRole probe builds the login body with role_id/secret_id and (verify) + passes it via curl argv -- visible in `ps` on the unit. TTL is 60s so + exposure is bounded, but the house rule is absolute: secrets never transit + argv. FIX: feed the body via stdin (`curl --data @-`). ACCEPT: probe + harness green; a grep proves no secret-bearing var appears in a curl argv. + +H-3 -- record-keeping for the two new scripts + tenant-offboard.sh + tenant-offboard tests and vault-kv-health.sh (+ inner + probe, + tests) have NO changelog entries or identifiers. Add entries + (what/why/revert) under the next-free DOCFIX numbers; vault-kv-health cites + D-068 item 3 -- link it there too. + +H-4 -- claude.ai skill copy parity + The chat-side installed skill visible at session close lacked the v1.3 + hermetic-harness scar. Confirm the re-upload landed (a fresh chat session + sees the current mount); if not, re-upload the .skill regenerated from + .claude/skills/openstack-cloud-ops. Standing habit: any edit to the skill + source -> repackage -> re-upload. + +H-5 -- Claude Code guardrail smoke test (if not already done) + Fresh session on the jumphost: `/permissions` shows the three rule sets; + `maas list` is hard-blocked by the hook; a `juju status` runs unprompted; + any `--apply` prompts. One minute; proves settings loaded. + +-------------------------------------------------------------------------------- +## 2. Verify-live queue (read-only CHECKs; jumphost) +-------------------------------------------------------------------------------- + +V-1 (DOCFIX-067 severity ruling) -- octavia deployed-cert SAN read + The runbook now derives the SAN dynamically; the LIVE cloud's controller + cert was generated under the old literal. Read what is actually deployed: + + CHECK (read-only) -- jumphost + ```bash + ( { + CRT=$(juju ssh -m openstack octavia/leader -- \ + 'sudo cat /etc/octavia/certs/controller_cert.pem 2>/dev/null || sudo ls /etc/octavia/certs' &1) + printf '%s\n' "$CRT" | openssl x509 -noout -text 2>/dev/null \ + | grep -A2 'Subject Alternative Name' || printf '%s\n' "$CRT" | head -5 + } ) + ``` + (Path may differ -- if the ls fallback fires, locate the controller cert per + phase-01 1.0-GEN and re-read.) RULE ON RESULT: if SAN carries the old + 10.12.4.233 and octavia is green (expected), record RISK-accepted in the + changelog -- regenerated at next redeploy; do NOT hot-rotate certs for this. + +V-2 (D-069) -- second-person unseal rehearsal + Someone other than the initializing operator performs the manual 3-of-5 + unseal per ops-restart-procedure Stage 3. This is an ACCEPTANCE item, not a + nicety: "keys work" and "a second human can use them" are different facts. + Requires SEC-003 custodian assignment first (section 3). + +V-3 -- channel assert: AUTOMATED (no action) + preflight.sh P3 now runs the Charmhub channel assert live on every pre- + deploy run; the manual `juju info` probe from the old queue is retired. + +-------------------------------------------------------------------------------- +## 3. Operator rulings pending (nothing proceeds without these) +-------------------------------------------------------------------------------- + +R-1 (SEC-003 / D-069) -- vault unseal-key custodian assignment + Policy is ADOPTED (split custody, no individual holds threshold); the + ASSIGNMENT (who, what media) is operator input, recorded off-repo; flip the + ledger row when done. + +R-2 (D-043 caveat) -- capi-mgmt-v2 auto-resume exclusion + Cloud-wide auto-resume means the mgmt VM WILL resume after host reboots; + its manual-start policy now governs deliberate stops only. If a REAL + exclusion is wanted, it needs a mechanism (options to draft on request); + otherwise close the caveat with a one-line D-043 amendment accepting it. + +R-3 (D-063, PROPOSED/OPEN) -- capi-mgmt SG hardening + Options a/b/c recorded in design-decisions; option (a) requires the + MEASURED post-NAT conductor source, never inferred. Phase-07 depends on + 6443 reachability -- rule before the next redeploy or explicitly defer. + +-------------------------------------------------------------------------------- +## 4. The deploy path (the reason for all of the above) +-------------------------------------------------------------------------------- + +D-1 -- Pattern-A full redeploy (VR0 DC0) + Now runnable end-to-end from the docs alone: phase-00 (D-061 DESTROY path: + teardown-destroy -> 8_lbaas -> OSD wipe -> reenroll -> carve x4 -> standup) + -> `bash scripts/preflight.sh` PASS -> phases 01-08 gated (keystone policy + now arrives via the bundle; phase-03 Step 3.4 gates PO: + behavioral G3) + -> `bash scripts/cloud-assert.sh --capture` -> commit the asbuilt/ BOM. + Session discipline: `bash scripts/run-logged.sh phase-NN-` first. + +D-2 -- D-011 acceptance: implement validate.sh + The placeholder's TODO is now aligned to the AMENDED decisions (manual + unseal, no snapshots, no Designate) and points at the building blocks + (cloud-assert / tenant-acceptance / run-tests-all). Still to author: VIP + reachability from jumphost + tenant VM, the LB round-robin/failover pattern + test, timed fresh-tenant Magnum e2e, and V-2 as a checklist item. + +D-3 -- confirm D-042 fix status + Memory says the seven-stage magnum-capi-helm contract-ref runbook was + staged pending execution; the repo record should say whether it ran. + Confirm from the changelog/decisions and either execute or close. + +-------------------------------------------------------------------------------- +## 5. v1-close checklist (execute after D-011 passes) +-------------------------------------------------------------------------------- + +C-1 Consolidate the 10 per-phase do-documents into docs/v1-deploy-runbook.md + (structure preserved in git history via the consolidation commit). +C-2 Consolidate/close docs/docfix-draft-20260702.md -- everything implemented + graduates to the changelog; anything decision-shaped to design-decisions. +C-3 Repo visibility back to PRIVATE (SEC-004; Settings -> Options). +C-4 Rotate the libvirt SSH credential (SEC-001 -- exposed 2026-06-26). +C-5 Ledger review pass (docs/security-ledger.md) -- every row OWNED + dated. + +-------------------------------------------------------------------------------- +## 6. Deferred, with explicit revisit triggers +-------------------------------------------------------------------------------- + +- OS sandboxing (bubblewrap) for Claude Code Bash: revisit at Roosevelt or if + a prompt-injection-shaped incident ever occurs on the jumphost. +- Managed settings (disableBypassPermissionsMode): when a SECOND operator + gets jumphost access. +- Plugin graduation (wrap the skill + slash-commands /preflight /cloud-assert + + hooks): when Claude Code is the daily driver and prompt-count friction is + measurable. The skill remains the source either way. +- Jenkins CI poll job for repo-lint/gauntlet: when the tenant-rehearsal + Jenkins graduates to an ops instance. +- rabbitmq min-cluster-size:3 (D-009 amendment): Roosevelt bundle delta only. +- SSH on GitBucket (port 29418), IPv6 dual-stack, NetBox import bundle: + v2-deferred as previously recorded. + +-- end of register --