Newer
Older
openstack-caracal-ipv4 / docs / handoff-20260703-open-items.md

Handoff -- open items register (2026-07-03 session close)

STATUS SNAPSHOT at handoff: repo HEAD abc144a; gauntlet ALL GREEN (26 harnesses); repo-lint 0 fail (1 documented legacy WARN); Claude Code live on the jumphost with CLAUDE.md + permission rules + PreToolUse guard (13/13); skill single-sourced at .claude/skills/openstack-cloud-ops (v1.3). Two blockers from the redeploy-readiness sweep (DOCFIX-066 teardown spine, DOCFIX-071 policy delivery) are FIXED and merged. Full change record: docs/changelog-20260703-process-hardening.md (items 1-32, per-item reverts).

Numbering at handoff (re-grep before assigning -- L5 prints it): next-free D-071, DOCFIX-082, BUNDLEFIX-009.

Conventions for whoever picks this up: repo wins over any session memory; grep design-decisions before changing a built surface; every script change ships with its harness green + a changelog entry with a revert; mutations are individually human-gated (the permission ask-rules enforce this on the jumphost -- do not work around them).


1. Immediate (small, well-defined; good first Claude Code tasks)


H-1 (DOCFIX-082 candidate) -- tenant-offboard.sh hardening The new offboard script is well-built (audit-default, typed gate, protected- domain blocklist, dependency-ordered) but runs set -u only. It MUTATES in --apply: a silently failed pipeline mid-sweep strands resources while reporting progress. FIX: set -uo pipefail + capture-then-test conversions per references/script-authoring.md (both SIGPIPE directions); state the chosen error regime in the header; strengthen is_id to the 32-hex form. ACCEPT: tests/tenant-offboard green incl. a new failure-injection case; changelog entry with revert.

H-2 (DOCFIX-082 or 083) -- vault-kv-inner-probe.sh: secret off argv The AppRole probe builds the login body with role_id/secret_id and (verify) passes it via curl argv -- visible in ps on the unit. TTL is 60s so exposure is bounded, but the house rule is absolute: secrets never transit argv. FIX: feed the body via stdin (curl --data @-). ACCEPT: probe harness green; a grep proves no secret-bearing var appears in a curl argv.

H-3 -- record-keeping for the two new scripts tenant-offboard.sh + tenant-offboard tests and vault-kv-health.sh (+ inner probe, + tests) have NO changelog entries or identifiers. Add entries (what/why/revert) under the next-free DOCFIX numbers; vault-kv-health cites D-068 item 3 -- link it there too.

H-4 -- claude.ai skill copy parity The chat-side installed skill visible at session close lacked the v1.3 hermetic-harness scar. Confirm the re-upload landed (a fresh chat session sees the current mount); if not, re-upload the .skill regenerated from .claude/skills/openstack-cloud-ops. Standing habit: any edit to the skill source -> repackage -> re-upload.

H-5 -- Claude Code guardrail smoke test (if not already done) Fresh session on the jumphost: /permissions shows the three rule sets; maas list is hard-blocked by the hook; a juju status runs unprompted; any --apply prompts. One minute; proves settings loaded.


2. Verify-live queue (read-only CHECKs; jumphost)


V-1 (DOCFIX-067 severity ruling) -- octavia deployed-cert SAN read The runbook now derives the SAN dynamically; the LIVE cloud's controller cert was generated under the old literal. Read what is actually deployed:

CHECK (read-only) -- jumphost

  ( {
    CRT=$(juju ssh -m openstack octavia/leader -- \
      'sudo cat /etc/octavia/certs/controller_cert.pem 2>/dev/null || sudo ls /etc/octavia/certs' </dev/null 2>&1)
    printf '%s\n' "$CRT" | openssl x509 -noout -text 2>/dev/null \
      | grep -A2 'Subject Alternative Name' || printf '%s\n' "$CRT" | head -5
  } )

(Path may differ -- if the ls fallback fires, locate the controller cert per phase-01 1.0-GEN and re-read.) RULE ON RESULT: if SAN carries the old 10.12.4.233 and octavia is green (expected), record RISK-accepted in the changelog -- regenerated at next redeploy; do NOT hot-rotate certs for this.

V-2 (D-069) -- second-person unseal rehearsal Someone other than the initializing operator performs the manual 3-of-5 unseal per ops-restart-procedure Stage 3. This is an ACCEPTANCE item, not a nicety: "keys work" and "a second human can use them" are different facts. Requires SEC-003 custodian assignment first (section 3).

V-3 -- channel assert: AUTOMATED (no action) preflight.sh P3 now runs the Charmhub channel assert live on every pre- deploy run; the manual juju info probe from the old queue is retired.


3. Operator rulings pending (nothing proceeds without these)


R-1 (SEC-003 / D-069) -- vault unseal-key custodian assignment Policy is ADOPTED (split custody, no individual holds threshold); the ASSIGNMENT (who, what media) is operator input, recorded off-repo; flip the ledger row when done.

R-2 (D-043 caveat) -- capi-mgmt-v2 auto-resume exclusion Cloud-wide auto-resume means the mgmt VM WILL resume after host reboots; its manual-start policy now governs deliberate stops only. If a REAL exclusion is wanted, it needs a mechanism (options to draft on request); otherwise close the caveat with a one-line D-043 amendment accepting it.

R-3 (D-063, PROPOSED/OPEN) -- capi-mgmt SG hardening Options a/b/c recorded in design-decisions; option (a) requires the MEASURED post-NAT conductor source, never inferred. Phase-07 depends on 6443 reachability -- rule before the next redeploy or explicitly defer.


4. The deploy path (the reason for all of the above)


D-1 -- Pattern-A full redeploy (VR0 DC0) Now runnable end-to-end from the docs alone: phase-00 (D-061 DESTROY path: teardown-destroy -> 8_lbaas -> OSD wipe -> reenroll -> carve x4 -> standup) -> bash scripts/preflight.sh PASS -> phases 01-08 gated (keystone policy now arrives via the bundle; phase-03 Step 3.4 gates PO: + behavioral G3) -> bash scripts/cloud-assert.sh --capture -> commit the asbuilt/ BOM. Session discipline: bash scripts/run-logged.sh phase-NN-<topic> first.

D-2 -- D-011 acceptance: implement validate.sh The placeholder's TODO is now aligned to the AMENDED decisions (manual unseal, no snapshots, no Designate) and points at the building blocks (cloud-assert / tenant-acceptance / run-tests-all). Still to author: VIP reachability from jumphost + tenant VM, the LB round-robin/failover pattern test, timed fresh-tenant Magnum e2e, and V-2 as a checklist item.

D-3 -- confirm D-042 fix status Memory says the seven-stage magnum-capi-helm contract-ref runbook was staged pending execution; the repo record should say whether it ran. Confirm from the changelog/decisions and either execute or close.


5. v1-close checklist (execute after D-011 passes)


C-1 Consolidate the 10 per-phase do-documents into docs/v1-deploy-runbook.md (structure preserved in git history via the consolidation commit). C-2 Consolidate/close docs/docfix-draft-20260702.md -- everything implemented graduates to the changelog; anything decision-shaped to design-decisions. C-3 Repo visibility back to PRIVATE (SEC-004; Settings -> Options). C-4 Rotate the libvirt SSH credential (SEC-001 -- exposed 2026-06-26). C-5 Ledger review pass (docs/security-ledger.md) -- every row OWNED + dated.


6. Deferred, with explicit revisit triggers


  • OS sandboxing (bubblewrap) for Claude Code Bash: revisit at Roosevelt or if a prompt-injection-shaped incident ever occurs on the jumphost.
  • Managed settings (disableBypassPermissionsMode): when a SECOND operator gets jumphost access.
  • Plugin graduation (wrap the skill + slash-commands /preflight /cloud-assert
    • hooks): when Claude Code is the daily driver and prompt-count friction is measurable. The skill remains the source either way.
  • Jenkins CI poll job for repo-lint/gauntlet: when the tenant-rehearsal Jenkins graduates to an ops instance.
  • rabbitmq min-cluster-size:3 (D-009 amendment): Roosevelt bundle delta only.
  • SSH on GitBucket (port 29418), IPv6 dual-stack, NetBox import bundle: v2-deferred as previously recorded.

-- end of register --