# DOCFIX draft -- redeploy-readiness review (bundle + channels + runbooks)

STATUS: DRAFT / OPEN -- accreting during the 2026-07-02 review session.
Numbers are PROVISIONAL from next-free DOCFIX-066 (verified against HEAD
690779a: DOCFIX-065 / D-068 / BUNDLEFIX-008 consumed). Renumber-check again
at commit time. ASCII + LF.

Scope of this review:
  1. bundle.yaml -- YAML validity, structural consistency, known anti-patterns.
  2. Charm channel pins -- staleness review against current Charmhub guidance.
  3. Runbook sweep -- cross-reference integrity, stale values, anything that
     would break the next redeploy.

Severity key: BLOCKER (breaks redeploy) / RISK (may break or mislead) /
NIT (consistency only).

--------------------------------------------------------------------------------
## Findings
--------------------------------------------------------------------------------

(appended as found)

### DOCFIX-066 (BLOCKER) -- teardown runbook drives the DEPRECATED teardown script
File: runbooks/phase-00-teardown-maas-reset.md (step 2, plan table, lines 22/30/67/74/85).
The runbook's execution spine is `scripts/phase-00-teardown.sh --apply` with the narrative
"hosts release to MAAS Ready" -- the exact premise DOCFIX-057/D-061 proved WRONG on this
virsh-pod MAAS (destroy-model DECOMPOSES pod-composed machines; observed 3x). The script
itself carries a DO-NOT-USE banner, so the runbook and script now contradict each other;
an operator following the runbook on the next redeploy either hits the deprecation
mid-teardown or, if they push past it, triggers a fourth decompose + full reenroll/recarve.
The D-061 replacements (phase-00-teardown-release.sh --keep-instance + canary;
phase-00-teardown-destroy.sh) exist but are never mentioned in the runbook.
FIX: rewrite the runbook spine as the D-061 fork -- (a) machine-preserving path:
teardown-release.sh with the MANDATORY first-run canary (--apply --canary, verify
openstack0 survives in MAAS, then all-four); (b) from-scratch path: teardown-destroy.sh
+ reenroll + recarve. State which path the standard redeploy uses. Step-5 "hosts Ready"
premise and the OSD-wipe/8_lbaas step ordering must be revalidated per path (release path
leaves hosts Deployed, not Ready -- the wipe/carve preconditions differ).
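
Sketch of the canary-first ordering for path (a), in Python for illustration only (the
real spine stays the shell scripts). ASSUMPTIONS, not confirmed against the scripts: the
exact flag combination on each call, and that the all-four run is the same command
without --canary. `run` and `in_maas` are hypothetical injected helpers (command runner,
MAAS presence check).

```python
from typing import Callable, List

RELEASE = "scripts/phase-00-teardown-release.sh"
CANARY_HOST = "openstack0"  # canary host named in the D-061 procedure


def release_with_canary(run: Callable[[List[str]], None],
                        in_maas: Callable[[str], bool]) -> None:
    """Release one canary host, prove it survived in MAAS, then release the rest."""
    # Assumed flag combination; confirm against the script's own usage text.
    run(["bash", RELEASE, "--keep-instance", "--apply", "--canary"])
    if not in_maas(CANARY_HOST):
        # Canary was decomposed: stop before touching the remaining hosts.
        raise RuntimeError("HOLD: %s missing from MAAS after canary" % CANARY_HOST)
    # Assumed: the all-four run is the same command without --canary.
    run(["bash", RELEASE, "--keep-instance", "--apply"])
```

The point of the shape is that the second call is unreachable unless the canary check
passes; a decompose then costs one host, not four.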

### DOCFIX-067 (RISK -- verify live before ruling) -- octavia PKI cert SAN carries the pre-R14 VIP
File: runbooks/phase-01-bundle-deploy.md 1.0-GEN.c (lines ~292, ~318).
The controller-cert CNF sets `IP.1 = 10.12.4.233` -- the OLD octavia VIP. R14 relocated
all VIPs to .50-.60; bundle HEAD has octavia at 10.12.4.57. The octavia-pki overlay is
regenerated every phase-01, so the LIVE cloud's controller cert most likely carries .233
today; octavia passed phase-05 validation regardless, so the amphora side evidently does
not verify that SAN IP -- functional impact UNPROVEN, inconsistency CERTAIN, and it is a
latent break if SAN verification ever tightens (or when Roosevelt re-uses this block).
The DNS.1/DNS.2 SANs also reference the D-019-dropped FQDN scheme (harmless, same sweep).
FIX: derive the SAN IP dynamically (lib-net VIP_PREFIX_PROVIDER + octavia's octet from
bundle/juju -- rule 3), not a literal. VERIFY-LIVE first (gated CHECK, jumphost):
read the overlay/live cert SAN and confirm what is actually deployed:
  openssl x509 -in <controller cert path per phase-01> -noout -text | grep -A2 'Subject Alternative Name'
Rule on severity after the read: if the live cert has .233 and octavia is green, keep RISK
(doc fix + regenerate at next redeploy); do not hot-rotate certs mid-cloud for this.
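
Sketch of the derive-then-check idea, assuming VIP_PREFIX_PROVIDER holds a three-octet
prefix such as "10.12.4" (an assumption about lib-net, not read from it) and that the
octet (57 at bundle HEAD) is obtained separately. The function names are illustrative;
the SAN parser only reads the text that `openssl x509 -noout -text` prints.

```python
import ipaddress
import re
from typing import List


def derive_vip(prefix: str, octet: int) -> str:
    """Join a three-octet prefix and a host octet into a validated IPv4 address."""
    # IPv4Address raises ValueError on anything malformed (e.g. octet 300).
    return str(ipaddress.IPv4Address("%s.%d" % (prefix, octet)))


def san_ips(openssl_text: str) -> List[str]:
    """Extract the 'IP Address:' entries from openssl's text dump of a cert."""
    return re.findall(r"IP Address:([0-9A-Fa-f.:]+)", openssl_text)


def san_is_current(openssl_text: str, expected_vip: str) -> bool:
    """True when the expected VIP is among the cert's SAN IPs."""
    return expected_vip in san_ips(openssl_text)
```

Fed the verify-live output, `san_is_current(text, derive_vip("10.12.4", 57))` returning
False with `.233` in `san_ips(text)` would confirm the stale-SAN reading above.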

### DOCFIX-068 (RISK) -- phase-01 "Constants and env-literals" block is pre-D-052 stale
File: runbooks/phase-01-bundle-deploy.md lines ~23-26. The block states the RETIRED plane
map (2=metal .8, 6=data .12, 7=storage .16, 8=replication .20, 9=lbaas .32 -- wrong
plane->CIDR pairs under D-052/D-053, incl. the retired `lbaas` space), plus hardcoded MAAS
subnet IDs (violates lib-net PATTERN-1: IDs drift, resolve by CIDR) and hardcoded
system_ids (violates DOCFIX-040: re-minted per enrollment; lib-hosts resolves). Mixed
freshness: the "50 apps, 97 relations" expectation in the same block MATCHES bundle HEAD.
Misleading at the worst moment (mid-deploy reference values).
FIX: replace the stale lines with pointers to scripts/lib-net.sh (planes) and
scripts/lib-hosts.sh (host identity); retain only verified-current literals.
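
Illustration of the resolve-by-CIDR idea behind PATTERN-1, in Python; this is NOT
lib-net's implementation. It assumes the input is the parsed JSON list that the MAAS CLI
returns for a subnets read, where each entry carries `id` and `cidr`. The CIDR in the
usage note is a documentation-range placeholder, not a real plane.

```python
from typing import Any, Dict, List


def subnet_id_for_cidr(subnets: List[Dict[str, Any]], cidr: str) -> int:
    """Return the id of the single subnet whose cidr matches; fail loudly otherwise."""
    hits = [s["id"] for s in subnets if s.get("cidr") == cidr]
    if len(hits) != 1:
        # Zero or duplicate matches both mean the plane map cannot be trusted.
        raise LookupError("expected one subnet for %s, found %d" % (cidr, len(hits)))
    return hits[0]
```

Usage: `subnet_id_for_cidr(json.loads(raw), "192.0.2.0/24")` with `raw` captured at run
time, so a re-minted subnet ID never needs a doc edit.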

### DOCFIX-069 (RISK) -- zero exec bits + bare script invocations
git index: ALL 37 files under scripts/ are mode 100644 (GitHub Desktop workflow strips
+x). Fresh clone on the jumphost -> every BARE invocation fails "Permission denied".
runbooks/phase-00-teardown-maas-reset.md invokes bare in ~10 places (teardown, carve,
standup); other runbooks appear bash-prefixed (sweep found no other bare hits).
FIX (durable, matches the Windows commit constraint): bash-prefix every script invocation
in runbooks (`bash scripts/x.sh ...`). Optional belt: `git update-index --chmod=+x
scripts/*.sh` -- but Windows-side recommits can strip again, so the bash prefix is the
invariant; do both if desired.
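
A lint sketch for both halves of this finding, assuming runbook commands sit at the
start of a line (optionally after `$ ` or `sudo `) and that the mode listing comes from
`git ls-files -s scripts/`. The regex is a heuristic, not a shell parser; prose mentions
of a script path mid-sentence are deliberately ignored.

```python
import re
from typing import List

# A command line that starts with a script path and has no interpreter in front.
BARE = re.compile(r"^\s*(?:\$\s+)?(?:sudo\s+)?(?:\./)?scripts/[\w.-]+\.sh\b")


def bare_invocations(runbook_text: str) -> List[str]:
    """Return runbook lines that invoke scripts/*.sh without a bash prefix."""
    return [ln.strip() for ln in runbook_text.splitlines() if BARE.match(ln)]


def non_executable(ls_files_output: str) -> List[str]:
    """From `git ls-files -s` output, list the paths recorded as mode 100644."""
    paths = []
    for ln in ls_files_output.splitlines():
        # Each line is "<mode> <object> <stage>", a TAB, then the path.
        meta, _, path = ln.partition("\t")
        if meta.split()[:1] == ["100644"]:
            paths.append(path)
    return paths
```

An empty `bare_invocations` result across all runbooks is the invariant; a non-empty
`non_executable` result is then informational only.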

### DOCFIX-070 (RISK) -- scripts/review-bundle.py is pre-D-052 stale; NOT CLEAN is noise
Against bundle HEAD it reports FAIL=71: expects space `metal` (retired), DUAL VIPs (D-020
form; D-052 moved to triples), no per-endpoint bindings (D-052 introduced them), vault
1.8 (D-068 pinned 1.16), baselines 51 apps/98 rels (now 50/97: VIP set changed -- vault
dropped its VIP, ceph-radosgw gained one -- blessed by provider-bundle-check.py, which
PASSES clean). Hazard: a pre-deploy NOT CLEAN verdict that must be ignored trains alarm
fatigue and will eventually mask a real defect.
OPTIONS: (a) update review-bundle.py expectations to the D-052/D-060/D-062/D-068 model;
(b) retire it (git rm) and fold any still-unique checks (relation-endpoint syntax,
phantom-key detection reworked for the per-endpoint model) into provider-bundle-check.py;
(c) banner it historical. RECOMMEND (b): one authoritative gate beats two disagreeing
ones -- same reasoning as the D-060 d057-bundle-check retirement.

### NIT-A -- D-002 channel matrix drift (design-decisions)
The D-002 table has drifted in three places: (1) it still lists `etcd, easyrsa ->
latest/stable` (both dropped; R3 / phase-02 record vault-on-mysql); (2) it omits
memcached (bundle: latest/stable -- apparently the only track that charm publishes;
upstream's "never latest/stable" applies to OpenStack-project charms, which memcached is
not); (3) its vault row (1.8) is superseded by D-068 (1.16). Append-only fix: a dated
amendment note under D-002, not an edit.
VERIFY-LIVE (gated CHECK, jumphost) before finalizing:
  for c in memcached rabbitmq-server vault hacluster; do juju info "$c" 2>/dev/null | sed -n '/channels:/,$p' | head -12; done
Expected: memcached publishes only latest/*; rabbitmq-server tops out at 3.9; vault
carries 1.16/stable; hacluster 2.4/stable.

### NIT-B -- channel-pin review conclusion (informational; no change)
All bundle pins judged CURRENT for Caracal/jammy (application counts in parentheses):
2024.1/stable core (18), OVN 24.03/stable, ceph squid/stable, mysql 8.0/stable (12),
hacluster 2.4/stable (11),
rabbitmq-server 3.9/stable (terminal track for this charm), vault 1.16/stable (D-068),
memcached latest/stable (sole track; see NIT-A). Upstream charm-guide delivery page is
frozen (last updated 2023-12) -- Charmhub/juju info is the only live authority; the
NIT-A verify block doubles as the pre-deploy channel assert. Candidate: fold that assert
into scripts/pre-flight-checks.sh (D-002 claims pre-flight verifies channels -- confirm
it actually does; not yet audited).
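
Sketch of a bundle-side pin assert, assuming the bundle's `applications` mapping has
already been parsed into a dict (YAML loading is left out) and that each entry carries
`charm` and `channel`. The EXPECTED table holds only pins named in this finding and is
keyed by charm, since one charm (hacluster) backs many applications. It checks the
bundle against the reviewed table; it does NOT query Charmhub, so the juju info probe
stays the live authority.

```python
from typing import Any, Dict, List

# Reviewed pins from this finding; extend as the review table grows.
EXPECTED = {
    "rabbitmq-server": "3.9/stable",
    "vault": "1.16/stable",
    "hacluster": "2.4/stable",
    "memcached": "latest/stable",
}


def channel_mismatches(applications: Dict[str, Dict[str, Any]],
                       expected: Dict[str, str] = EXPECTED) -> List[str]:
    """Compare each application's channel against the reviewed pin for its charm."""
    problems = []
    for name, app in sorted(applications.items()):
        charm = str(app.get("charm", ""))
        if charm.startswith("ch:"):
            charm = charm[3:]  # tolerate the explicit Charmhub scheme prefix
        want = expected.get(charm)
        if want is not None and app.get("channel") != want:
            problems.append("%s: %s pinned %r, reviewed %r"
                            % (name, charm, app.get("channel"), want))
    return problems
```

A non-empty return is a HOLD candidate for pre-flight; charms absent from the table are
skipped rather than failed, so the table can grow incrementally.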

### NIT-C -- ASCII-rule violations in docs/
docs/v1-pre-deploy-fixes.md (277 non-ASCII bytes), docs/netbox-vip-queue.md (81). The
repo rule is ASCII-only for all committed files (mod_wsgi lesson). Low functional risk
(docs, not conf), but the rule is stated absolute -- sanitize or record a carve-out.
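
Minimal sketch of the byte-level check behind these counts. It counts BYTES above 0x7F
(so one multi-byte UTF-8 character counts more than once), which is an assumption about
how the figures above were measured. Function names are illustrative.

```python
from typing import List, Tuple


def non_ascii_count(data: bytes) -> int:
    """Count bytes outside the 7-bit ASCII range."""
    return sum(1 for b in data if b > 0x7F)


def non_ascii_lines(data: bytes) -> List[Tuple[int, int]]:
    """Return (line_number, offending_byte_count) for each line with non-ASCII bytes."""
    out = []
    for n, line in enumerate(data.split(b"\n"), start=1):
        bad = non_ascii_count(line)
        if bad:
            out.append((n, bad))
    return out
```

Usage: read each tracked file with `open(path, "rb").read()` and fail the lint when
`non_ascii_count` is non-zero; `non_ascii_lines` gives the sanitize worklist.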

### NIT-D -- identifier index gaps
DOCFIX-027/028/029/034/037 and BUNDLEFIX-001..006 are defined only at point of use
(runbook/bundle comments) and absent from appendix-A / the changelog index -- appendix-A
claims to be the index "keyed by the same identifiers used inline". Add one-line index
entries (or mark point-of-use-only identifiers as such).
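
Sketch of the gap check, assuming identifiers always appear in the three-digit
DOCFIX-nnn / BUNDLEFIX-nnn form. Range shorthand such as `BUNDLEFIX-001..006` is NOT
expanded (only the first identifier of a range is seen), so ranges should be written out
or handled separately.

```python
import re
from typing import List

IDENT = re.compile(r"\b(?:DOCFIX|BUNDLEFIX)-\d{3}\b")


def unindexed(inline_text: str, index_text: str) -> List[str]:
    """Identifiers that appear inline but have no entry in the index text."""
    used = set(IDENT.findall(inline_text))
    indexed = set(IDENT.findall(index_text))
    return sorted(used - indexed)
```

Here `inline_text` would be the concatenated runbooks plus bundle comments and
`index_text` appendix-A plus the changelog index; an empty result closes the finding.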

### NIT-E -- appendix-A lacks a mysql-innodb-cluster recovery entry
D-062 material (blocked 'Instance not yet configured' = single-unit seed; half-join
instanceErrors = mid-life rescan; reboot-cluster-from-complete-outage ONLY on confirmed
outage -- destructive against a healthy cluster) exists in design-decisions + the restart
procedure but has no appendix-A symptom entry. Add one; also consider committing the
restart-procedure doc to the repo (it currently lives outside it).

--------------------------------------------------------------------------------
## Verify-live queue (gated CHECKs for the jumphost before findings finalize)
--------------------------------------------------------------------------------
1. Octavia controller cert SAN (DOCFIX-067) -- read the deployed overlay/cert.
2. juju info channel probe (NIT-A/B) -- memcached / rabbitmq-server / vault / hacluster.
3. pre-flight-checks.sh -- confirm whether it performs the D-002 channel assert.

--------------------------------------------------------------------------------
## Deployment-flow parity findings (decision vs bundle vs schedule)
--------------------------------------------------------------------------------

### DOCFIX-071 (BLOCKER) -- D-064 keystone policy attach is not reachable from the deploy schedule
Evidence: bundle.yaml keystone has use-policyd-override=True but NO resources: stanza;
`attach-resource keystone` appears in NO phase runbook or script -- only appendix-C:73-74.
phase-01:183 knowingly deploys into "PO (broken)" and phase-02:167 re-notes it as
FINDING-1 "not a regression"; no phase ever resolves it. The live cloud got the policy
via a session action (D-064), never folded into the schedule. NEXT REDEPLOY as written:
keystone stays PO (broken), the SCS Domain Manager RBAC (the commercial tenant-isolation
core, D-051) is ABSENT, and tenant onboarding fails at G3.
Compounding defect: the appendix-C block zips to and attaches FROM /tmp -- the
documented snap-confinement trap (attach-resource cannot read /tmp on this jumphost;
use $HOME). The only written procedure is the known-broken form.
FIX OPTIONS (debate):
 (a) Bundle-native resource: add to keystone `resources: {policyd-override:
     ./policies/overrides.zip}` and commit the zip beside its source yaml. Deploy-time
     attach becomes automatic -- the bundle describes the WHOLE desired state, zero
     manual step, zero Roosevelt delta. Sync risk (zip vs yaml drift) is closed by a
     pre-flight assert: rebuild the zip from policies/, byte-compare against committed,
     HOLD on mismatch.
 (b) Schedule step: a gated attach block in phase-03 (post-TLS-settle), $HOME-pathed,
     gating on `PO:` in juju status.
 EITHER WAY: the G3 BEHAVIORAL gate (manager can self-service own domain; admin-grant
 and cross-domain DENIED; cloud-admin unaffected) must be a phase step -- D-051's own
 warning: the charm validates YAML only, `PO:` proves parse, not policy. RECOMMEND
 (a) + G3 gate in phase-03: post-deploy manual steps are exactly the class D-046
 proved unreliable ("reports ready regardless").
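
Sketch of the option (a) drift guard. Note the swap: it compares archive MEMBER CONTENTS
against the source files rather than raw zip bytes, because zip archives record
per-member timestamps and a naive rebuild-and-byte-compare can report drift when nothing
changed. A true byte-compare would need a deterministic build (fixed timestamps, fixed
member order). ASSUMPTION: `sources` is keyed by archive member name, with the policy
yaml files at the zip root.

```python
import io
import zipfile
from typing import Dict, List


def zip_members(blob: bytes) -> Dict[str, bytes]:
    """Map each file member of a zip archive to its uncompressed content."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {i.filename: zf.read(i) for i in zf.infolist() if not i.is_dir()}


def policy_drift(committed_zip: bytes, sources: Dict[str, bytes]) -> List[str]:
    """Names that are missing, extra, or different between the zip and its sources."""
    packed = zip_members(committed_zip)
    names = sorted(set(packed) | set(sources))
    return [n for n in names if packed.get(n) != sources.get(n)]
```

Pre-flight would HOLD on any non-empty result. This guards zip-vs-yaml sync only; the
G3 behavioral gate is still what proves the policy.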

### DOCFIX-072 (RISK) -- bundle implements the still-PROPOSED D-043
bundle.yaml nova-compute sets resume-guests-state-on-host-boot: True while D-043
(tenant-VM auto-resume) remains PROPOSED / decision-pending. The bundle is ahead of the
decision record -- the exact drift the discipline forbids (and the restart-procedure doc
already assumes the option is in force). FIX: rule on D-043 -- RECOMMEND adopting its
option (a) (auto-resume + monitoring; industry norm for tenant VMs; customers at
Roosevelt expect VMs back after host maintenance; D-041's down-is-a-signal stance is
preserved for CONTROL-PLANE services, which auto-resume does not touch) -- and mark the
decision ADOPTED with the bundle line as its implementation. Alternative: strip the
option until ruled; NOT recommended (regresses the validated restart procedure).

### NIT-F -- D-011.6 text not amended to the phase-08 ruling
design-decisions D-011 item 6 still reads "Vault unseal + auto-unseal-after-reboot
pattern verified"; phase-08 D-011.6 rules MANUAL unseal is the v1 standard (auto-unseal
NOT configured). Append an amendment note to D-011 so the acceptance bar and the
acceptance runbook agree.

### NIT-G (Roosevelt-forward) -- rabbitmq-server scale-up will race without min-cluster-size
Testcloud: num_units=1, no min-cluster-size -- correct per D-009 (decorative HA). But the
D-009 promise is "Roosevelt scale-up is mechanical: 1 -> 3 and rerun". For rabbitmq that
is NOT sufficient: without `min-cluster-size: 3` the charm accepts client relations
before the cluster forms (same failure CLASS as D-062's mysql formation race; upstream
charm docs call min-cluster-size best practice). Record now as a Roosevelt bundle-delta
note on D-009 so the mechanical scale-up story stays true.
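
Sketch of a bundle-delta guard for this note, assuming the rabbitmq-server application
entry has been parsed into a dict with `num_units` and `options`. ASSUMPTION: the
intended value is min-cluster-size equal to num_units (3 for the 1 -> 3 scale-up named
above); adjust if the Roosevelt ruling differs.

```python
from typing import Any, Dict, Optional


def rabbitmq_scaleup_problem(app: Dict[str, Any]) -> Optional[str]:
    """Return a message when a multi-unit rabbitmq-server lacks a matching min-cluster-size."""
    units = int(app.get("num_units", 1))
    if units <= 1:
        # Single unit: no cluster to form, so the option is not required.
        return None
    mcs = (app.get("options") or {}).get("min-cluster-size")
    if mcs != units:
        return "num_units=%d but min-cluster-size=%r" % (units, mcs)
    return None
```

The testcloud shape (one unit, no option) passes; a bare 1 -> 3 edit fails until the
option is added in the same change, which keeps the scale-up mechanical.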

--------------------------------------------------------------------------------
## Patchset status (2026-07-02, patchset-20260702-redeploy-readiness)
--------------------------------------------------------------------------------
IMPLEMENTED in the delivered ZIP (numbers verified next-free at HEAD 690779a;
re-grep at commit): DOCFIX-066 (teardown runbook rewritten around the D-061 fork,
destroy path = validated spine, reenroll step added, all invocations bash-prefixed),
DOCFIX-067 (octavia SAN IP derived from bundle at generation time; verify-live of the
deployed cert still queued), DOCFIX-068 (phase-01 constants -> lib-net/lib-hosts),
DOCFIX-069 (bash-prefix; optional chmod noted in apply-notes), DOCFIX-070 (checks
absorbed into provider-bundle-check.py + 8-case harness; review-bundle.py to git rm),
DOCFIX-071 (bundle-native keystone policy resource + committed zip + drift guard +
phase-03 Step 3.4 two-stage gate + appendix-C /tmp fix + subshell wrap), DOCFIX-072
(D-043 resolved: PROPOSED -> ADOPTED, option (a)). D-doc amendments appended:
D-002, D-009, D-011, D-043,
D-051, D-061. STILL OPEN: NIT-C (docs/ ASCII sanitize), NIT-D (identifier index),
NIT-E (appendix-A mysql entry), verify-live queue items 1-3.


## Patchset status addendum (2026-07-03, Block 2)
IMPLEMENTED: DOCFIX-073 (preflight + channel assert + phase-01 gate), DOCFIX-074
(repo-lint + full ASCII sanitize incl. .gitignore/netbox; closed NIT-C), DOCFIX-075
(cloud-assert + committed ops-restart-procedure; closed the health-check gap),
DOCFIX-076 (as-executed convention + run-logged + index), DOCFIX-077 (appendix-A
mysql entry + identifier index; closed NIT-D/E), DOCFIX-078 (security ledger),
D-069 (vault custody policy), D-070 (supersedes D-012). Verify-live queue item 2
(channel probe) is now AUTOMATED by preflight P3. Remaining operator inputs:
SEC-003 custodian assignment; capi-mgmt auto-resume exclusion ruling; octavia
deployed-cert SAN read. See docs/changelog-20260703-process-hardening.md.
