openstack-caracal-ipv4 / docs / docfix-draft-20260702.md
@JANeumatrix JANeumatrix 12 hours ago 14 KB Patches

DOCFIX draft -- redeploy-readiness review (bundle + channels + runbooks)

STATUS: DRAFT / OPEN -- accreting during the 2026-07-02 review session. Numbers are PROVISIONAL from next-free DOCFIX-066 (verified against HEAD 690779a: DOCFIX-065 / D-068 / BUNDLEFIX-008 consumed). Renumber-check again at commit time. ASCII + LF.

Scope of this review:

  1. bundle.yaml -- YAML validity, structural consistency, known anti-patterns.
  2. Charm channel pins -- staleness review against current Charmhub guidance.
  3. Runbook sweep -- cross-reference integrity, stale values, anything that would break the next redeploy.

Severity key: BLOCKER (breaks redeploy) / RISK (may break or mislead) / NIT (consistency only).


Findings


(appended as found)

DOCFIX-066 (BLOCKER) -- teardown runbook drives the DEPRECATED teardown script

File: runbooks/phase-00-teardown-maas-reset.md (steps 2, plan table, lines 22/30/67/74/85). The runbook's execution spine is scripts/phase-00-teardown.sh --apply with the narrative "hosts release to MAAS Ready" -- the exact premise DOCFIX-057/D-061 proved WRONG on this virsh-pod MAAS (destroy-model DECOMPOSES pod-composed machines; observed 3x). The script itself carries a DO-NOT-USE banner, so the runbook and script now contradict each other; an operator following the runbook on the next redeploy either hits the deprecation mid-teardown or, if they push past it, triggers a fourth decompose + full reenroll/recarve. The D-061 replacements (phase-00-teardown-release.sh --keep-instance + canary; phase-00-teardown-destroy.sh) exist but are never mentioned in the runbook. FIX: rewrite the runbook spine as the D-061 fork -- (a) machine-preserving path: teardown-release.sh with the MANDATORY first-run canary (--apply --canary, verify openstack0 survives in MAAS, then all-four); (b) from-scratch path: teardown-destroy.sh + reenroll + recarve. State which path the standard redeploy uses. Step-5 "hosts Ready" premise and the OSD-wipe/8_lbaas step ordering must be revalidated per path (release path leaves hosts Deployed, not Ready -- the wipe/carve preconditions differ).

DOCFIX-067 (RISK -- verify live before ruling) -- octavia PKI cert SAN carries the pre-R14 VIP

File: runbooks/phase-01-bundle-deploy.md 1.0-GEN.c (lines ~292, ~318). The controller-cert CNF sets IP.1 = 10.12.4.233 -- the OLD octavia VIP. R14 relocated all VIPs to .50-.60; bundle HEAD has octavia at 10.12.4.57. The octavia-pki overlay is regenerated every phase-01, so the LIVE cloud's controller cert most likely carries .233 today; octavia passed phase-05 validation regardless, so the amphora side evidently does not verify that SAN IP -- functional impact UNPROVEN, inconsistency CERTAIN, and it is a latent break if SAN verification ever tightens (or when Roosevelt re-uses this block). The DNS.1/DNS.2 SANs also reference the D-019-dropped FQDN scheme (harmless, same sweep). FIX: derive the SAN IP dynamically (lib-net VIP_PREFIX_PROVIDER + octavia's octet from bundle/juju -- rule 3), not a literal. VERIFY-LIVE first (gated CHECK, jumphost): read the overlay/live cert SAN and confirm what is actually deployed:

  openssl x509 -in -noout -text | grep -A2 'Subject Alternative Name'

Rule on severity after the read: if the live cert has .233 and octavia is green, keep RISK (doc fix + regenerate at next redeploy); do not hot-rotate certs mid-cloud for this.
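A minimal sketch of the derive-then-verify idea in the FIX above, written in Python to match the repo's existing check scripts. It assumes VIP_PREFIX_PROVIDER is a three-octet prefix such as 10.12.4 and that octavia's octet has already been read from the bundle; the function names are hypothetical and are not part of lib-net.

```python
import ipaddress
import re


def derive_san_ip(vip_prefix, octet):
    """Join a three-octet prefix (assumed form: '10.12.4') and a host octet.

    ipaddress.IPv4Address raises ValueError on malformed input, so a bad
    prefix or an out-of-range octet fails loudly instead of reaching the CNF.
    """
    return str(ipaddress.IPv4Address(f"{vip_prefix}.{octet}"))


def san_ips(openssl_text):
    """Extract the 'IP Address:' SAN entries from `openssl x509 -noout -text` output."""
    return re.findall(r"IP Address:([0-9.]+)", openssl_text)


def san_matches(openssl_text, expected_ip):
    """True when the expected VIP is among the certificate's SAN IPs."""
    return expected_ip in san_ips(openssl_text)
```

Fed the text of the deployed cert, san_matches(text, derive_san_ip("10.12.4", 57)) would return False if the cert still carries .233, which is the condition the verify-live read is meant to settle.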

DOCFIX-068 (RISK) -- phase-01 "Constants and env-literals" block is pre-D-052 stale

File: runbooks/phase-01-bundle-deploy.md lines ~23-26. The block states the RETIRED plane map (2=metal .8, 6=data .12, 7=storage .16, 8=replication .20, 9=lbaas .32 -- wrong plane->CIDR pairs under D-052/D-053, incl. the retired lbaas space), plus hardcoded MAAS subnet IDs (violates lib-net PATTERN-1: IDs drift, resolve by CIDR) and hardcoded system_ids (violates DOCFIX-040: re-minted per enrollment; lib-hosts resolves). Mixed freshness: the "50 apps, 97 relations" expectation in the same block MATCHES bundle HEAD. Misleading at the worst moment (mid-deploy reference values). FIX: replace the stale lines with pointers to scripts/lib-net.sh (planes) and scripts/lib-hosts.sh (host identity); retain only verified-current literals.
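The "resolve by CIDR, not by ID" rule can be illustrated with a short Python sketch. It assumes the input is the parsed JSON list returned by the MAAS CLI's subnets read call, where each entry carries "id" and "cidr" keys; the helper name and the CIDR in the usage note are placeholders, not values from lib-net.

```python
import ipaddress


def subnet_id_for_cidr(subnets, cidr):
    """Return the id of the single subnet whose CIDR equals `cidr`.

    Comparison goes through ipaddress so that equivalent spellings of the
    same network compare equal. Zero or multiple hits raise LookupError
    rather than silently picking one.
    """
    want = ipaddress.ip_network(cidr)
    hits = [s["id"] for s in subnets if ipaddress.ip_network(s["cidr"]) == want]
    if len(hits) != 1:
        raise LookupError(f"expected exactly one subnet for {cidr}, found {len(hits)}")
    return hits[0]
```

For example, subnet_id_for_cidr(subnets, "192.0.2.0/24") keeps working after a MAAS rebuild re-mints the numeric IDs, which a hardcoded ID in the runbook would not.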

DOCFIX-069 (RISK) -- zero exec bits + bare script invocations

git index: ALL 37 files under scripts/ are mode 100644 (GitHub Desktop workflow strips +x). Fresh clone on the jumphost -> every BARE invocation fails "Permission denied". runbooks/phase-00-teardown-maas-reset.md invokes bare in ~10 places (teardown, carve, standup); other runbooks appear bash-prefixed (sweep found no other bare hits). FIX (durable, matches the Windows commit constraint): bash-prefix every script invocation in runbooks (bash scripts/x.sh ...). Optional belt: git update-index --chmod=+x scripts/*.sh -- but Windows-side recommits can strip again, so the bash prefix is the invariant; do both if desired.
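Both halves of this finding lend themselves to a mechanical check. The sketch below is one possible shape, not an existing repo script: bare_invocations flags runbook lines that start a command with a scripts/*.sh path instead of bash, and non_executable reads the output of git ls-files -s (format: mode, object, stage, tab, path) to list files stored as 100644. Both function names are hypothetical.

```python
import re

# A scripts/*.sh path at the start of a command line, optionally after a
# "$ " prompt or a leading "./". Lines starting with "bash" do not match.
BARE = re.compile(r"^\s*(?:\$\s+)?(?:\./)?scripts/[\w.\-]+\.sh\b")


def bare_invocations(runbook_text):
    """Return (line_number, stripped_line) pairs for bare script invocations."""
    hits = []
    for n, line in enumerate(runbook_text.splitlines(), start=1):
        if BARE.match(line):
            hits.append((n, line.strip()))
    return hits


def non_executable(ls_files_stage_output):
    """From `git ls-files -s` output, list paths whose index mode is 100644."""
    paths = []
    for line in ls_files_stage_output.splitlines():
        meta, _, path = line.partition("\t")
        if meta.split()[:1] == ["100644"]:
            paths.append(path)
    return paths
```

Because the bash prefix is the invariant, bare_invocations is the check worth gating on; non_executable is informational only while Windows-side commits can strip the bit again.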

DOCFIX-070 (RISK) -- scripts/review-bundle.py is pre-D-052 stale; NOT CLEAN is noise

Against bundle HEAD it reports FAIL=71: expects space metal (retired), DUAL VIPs (D-020 form; D-052 moved to triples), no per-endpoint bindings (D-052 introduced them), vault 1.8 (D-068 pinned 1.16), baselines 51 apps/98 rels (now 50/97: VIP set changed -- vault dropped its VIP, ceph-radosgw gained one -- blessed by provider-bundle-check.py, which PASSES clean). Hazard: a pre-deploy NOT CLEAN verdict that must be ignored trains alarm fatigue and will eventually mask a real defect. OPTIONS: (a) update review-bundle.py expectations to the D-052/D-060/D-062/D-068 model; (b) retire it (git rm) and fold any still-unique checks (relation-endpoint syntax, phantom-key detection reworked for the per-endpoint model) into provider-bundle-check.py; (c) banner it historical. RECOMMEND (b): one authoritative gate beats two disagreeing ones -- same reasoning as the D-060 d057-bundle-check retirement.
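The app/relation baseline mentioned above is simple enough to show as a sketch. It assumes the bundle has already been parsed into a dict (for instance with PyYAML's yaml.safe_load) and uses the 50/97 figures this review records for bundle HEAD as defaults; the function names are illustrative and do not describe how provider-bundle-check.py is actually written.

```python
def bundle_counts(bundle):
    """Return (application_count, relation_count) for a parsed bundle dict."""
    apps = bundle.get("applications") or {}
    rels = bundle.get("relations") or []
    return len(apps), len(rels)


def baseline_drift(bundle, expected_apps=50, expected_rels=97):
    """Return human-readable mismatches; an empty list means the baseline holds."""
    apps, rels = bundle_counts(bundle)
    problems = []
    if apps != expected_apps:
        problems.append(f"applications: expected {expected_apps}, found {apps}")
    if rels != expected_rels:
        problems.append(f"relations: expected {expected_rels}, found {rels}")
    return problems
```

Keeping the expected figures in exactly one gate is the point of option (b): a second script with its own 51/98 defaults is how the two verdicts came to disagree.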

NIT-A -- D-002 channel matrix drift (design-decisions)

The D-002 table still lists etcd, easyrsa -> latest/stable (etcd/easyrsa dropped; R3 / phase-02 record vault-on-mysql), omits memcached (bundle: latest/stable -- apparently the only track that charm publishes; upstream's "never latest/stable" applies to OpenStack-project charms, which memcached is not), and its vault row (1.8) is superseded by D-068 (1.16). Append-only fix: a dated amendment note under D-002, not an edit. VERIFY-LIVE (gated CHECK, jumphost) before finalizing:

  for c in memcached rabbitmq-server vault hacluster; do juju info "$c" 2>/dev/null | sed -n '/channels:/,$p' | head -12; done

Expected: memcached publishes only latest/*; rabbitmq-server tops out at 3.9; vault carries 1.16/stable; hacluster 2.4/stable.
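The probe above could be turned into an assert along the following lines. This is a sketch under a stated assumption: that the channels section of juju info prints one indented "track/risk:" entry per line. That layout should be confirmed against real output on the jumphost before relying on the regex. The function names and the sample pins passed in are illustrative.

```python
import re

# Assumed layout: an indented "track/risk:" label followed by release details.
CHANNEL = re.compile(
    r"^[ \t]+([\w.\-]+/(?:stable|candidate|beta|edge)):[ \t]*(\S*)",
    re.MULTILINE,
)


def published_channels(juju_info_text):
    """Map each 'track/risk' label to the first token printed after it."""
    return dict(CHANNEL.findall(juju_info_text))


def missing_pins(pins, info_text_by_charm):
    """Return 'charm: channel' for every pin not found in that charm's info text."""
    missing = []
    for charm, channel in sorted(pins.items()):
        if channel not in published_channels(info_text_by_charm.get(charm, "")):
            missing.append(f"{charm}: {channel}")
    return missing
```

A charm with no captured output counts as missing, so a failed juju info call (silenced by 2>/dev/null in the probe) surfaces as a finding rather than a pass.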

NIT-B -- channel-pin review conclusion (informational; no change)

All bundle pins judged CURRENT for Caracal/jammy: 2024.1/stable core (18), OVN 24.03/stable, ceph squid/stable, mysql 8.0/stable (12), hacluster 2.4/stable (11), rabbitmq-server 3.9/stable (terminal track for this charm), vault 1.16/stable (D-068), memcached latest/stable (sole track; see NIT-A). Upstream charm-guide delivery page is frozen (last updated 2023-12) -- Charmhub/juju info is the only live authority; the NIT-A verify block doubles as the pre-deploy channel assert. Candidate: fold that assert into scripts/pre-flight-checks.sh (D-002 claims pre-flight verifies channels -- confirm it actually does; not yet audited).

NIT-C -- ASCII-rule violations in docs/

docs/v1-pre-deploy-fixes.md (277 non-ASCII bytes), docs/netbox-vip-queue.md (81). The repo rule is ASCII-only for all committed files (mod_wsgi lesson). Low functional risk (docs, not conf), but the rule is stated absolute -- sanitize or record a carve-out.
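Byte counts like the ones above can be reproduced with a few lines of Python. This is a sketch of one way to do it, not the tool that produced the 277/81 figures; the function names are hypothetical.

```python
def non_ascii_count(data):
    """Count bytes above 0x7F in a bytes object."""
    return sum(1 for b in data if b > 0x7F)


def non_ascii_report(paths):
    """Map each path with at least one non-ASCII byte to its byte count."""
    report = {}
    for path in paths:
        with open(path, "rb") as fh:
            count = non_ascii_count(fh.read())
        if count:
            report[str(path)] = count
    return report
```

The count is in bytes, not characters: a single UTF-8 encoded dash or bullet contributes three bytes, so a file with 81 offending bytes may hold far fewer offending characters.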

NIT-D -- identifier index gaps

DOCFIX-027/028/029/034/037 and BUNDLEFIX-001..006 are defined only at point of use (runbook/bundle comments) and absent from appendix-A / the changelog index -- appendix-A claims to be the index "keyed by the same identifiers used inline". Add one-line index entries (or mark point-of-use-only identifiers as such).
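Finding such gaps is a set difference between identifiers used inline and identifiers present in the index. A minimal sketch, assuming three-digit DOCFIX-NNN / BUNDLEFIX-NNN identifiers written out in full; shorthand such as "DOCFIX-027/028" or "BUNDLEFIX-001..006" only yields its first member and would need expanding first. The function name is hypothetical.

```python
import re

# Fully written identifiers only, e.g. DOCFIX-027 or BUNDLEFIX-001.
IDENT = re.compile(r"\b(?:DOCFIX|BUNDLEFIX)-\d{3}\b")


def unindexed(inline_text, index_text):
    """Return identifiers that appear inline but not in the index, sorted."""
    return sorted(set(IDENT.findall(inline_text)) - set(IDENT.findall(index_text)))
```

Here inline_text would be the concatenated runbooks and bundle comments, and index_text the contents of appendix-A plus the changelog index.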

NIT-E -- appendix-A lacks a mysql-innodb-cluster recovery entry

D-062 material (blocked 'Instance not yet configured' = single-unit seed; half-join instanceErrors = mid-life rescan; reboot-cluster-from-complete-outage ONLY on confirmed outage -- destructive against a healthy cluster) exists in design-decisions + the restart procedure but has no appendix-A symptom entry. Add one; also consider committing the restart-procedure doc to the repo (it currently lives outside it).


Verify-live queue (gated CHECKs for the jumphost before findings finalize)

  1. Octavia controller cert SAN (DOCFIX-067) -- read the deployed overlay/cert.
  2. juju info channel probe (NIT-A/B) -- memcached / rabbitmq-server / vault / hacluster.
  3. pre-flight-checks.sh -- confirm whether it performs the D-002 channel assert.

Deployment-flow parity findings (decision vs bundle vs schedule)


DOCFIX-071 (BLOCKER) -- D-064 keystone policy attach is not reachable from the deploy schedule

Evidence: bundle.yaml keystone has use-policyd-override=True but NO resources: stanza; attach-resource keystone appears in NO phase runbook or script -- only appendix-C:73-74. phase-01:183 knowingly deploys into "PO (broken)" and phase-02:167 re-notes it as FINDING-1 "not a regression"; no phase ever resolves it. The live cloud got the policy via a session action (D-064), never folded into the schedule. NEXT REDEPLOY as written: keystone stays PO (broken), the SCS Domain Manager RBAC (the commercial tenant-isolation core, D-051) is ABSENT, and tenant onboarding fails at G3.

Compounding defect: the appendix-C block zips to and attaches FROM /tmp -- the documented snap-confinement trap (attach-resource cannot read /tmp on this jumphost; use $HOME). The only written procedure is the known-broken form.

FIX OPTIONS (debate):

  (a) Bundle-native resource: add to keystone resources: {policyd-override: ./policies/overrides.zip} and commit the zip beside its source yaml. Deploy-time attach becomes automatic -- the bundle describes the WHOLE desired state, zero manual step, zero Roosevelt delta. Sync risk (zip vs yaml drift) is closed by a pre-flight assert: rebuild the zip from policies/, byte-compare against committed, HOLD on mismatch.
  (b) Schedule step: a gated attach block in phase-03 (post-TLS-settle), $HOME-pathed, gating on PO: in juju status.

EITHER WAY: the G3 BEHAVIORAL gate (manager can self-service own domain; admin-grant and cross-domain DENIED; cloud-admin unaffected) must be a phase step -- D-051's own warning: the charm validates YAML only, PO: proves parse, not policy.

RECOMMEND (a) + G3 gate in phase-03: post-deploy manual steps are exactly the class D-046 proved unreliable ("reports ready regardless").
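The option (a) drift guard only works if rebuilding the zip is reproducible; otherwise a byte-compare fails on timestamps alone. The sketch below shows one way to get a deterministic archive in Python: sorted member names, a fixed timestamp, no compression, and pinned platform and permission fields. It assumes the committed overrides.zip was produced by this same builder and that the policy sources are *.yaml files directly under policies/. All function names are hypothetical.

```python
import io
import zipfile
from pathlib import Path

FIXED_TIME = (1980, 1, 1, 0, 0, 0)  # earliest timestamp the zip format stores


def build_zip(files):
    """Build zip bytes from {name: bytes} so equal inputs give equal bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_STORED  # avoid zlib-version variance
            info.create_system = 3  # pin to "Unix"; the default differs on Windows
            info.external_attr = 0o644 << 16  # fixed permission bits
            zf.writestr(info, files[name])
    return buf.getvalue()


def read_policies(policy_dir):
    """Read every *.yaml directly under policy_dir into {filename: bytes}."""
    return {p.name: p.read_bytes() for p in sorted(Path(policy_dir).glob("*.yaml"))}


def has_drift(policy_dir, committed_zip):
    """True when the rebuilt archive differs from the committed one (HOLD)."""
    return build_zip(read_policies(policy_dir)) != Path(committed_zip).read_bytes()
```

Pinning create_system matters here because commits also originate from a Windows workstation; without it the same sources would produce different bytes on the jumphost. If the committed zip cannot be guaranteed to come from this builder, comparing member names and contents instead of raw bytes is the more forgiving alternative.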

DOCFIX-072 (RISK) -- bundle implements the still-PROPOSED D-043

bundle.yaml nova-compute sets resume-guests-state-on-host-boot: True while D-043 (tenant-VM auto-resume) remains PROPOSED / decision-pending. The bundle is ahead of the decision record -- the exact drift the discipline forbids (and the restart-procedure doc already assumes the option is in force). FIX: rule on D-043 -- RECOMMEND adopting its option (a) (auto-resume + monitoring; industry norm for tenant VMs; customers at Roosevelt expect VMs back after host maintenance; D-041's down-is-a-signal stance is preserved for CONTROL-PLANE services, which auto-resume does not touch) -- and mark the decision ADOPTED with the bundle line as its implementation. Alternative: strip the option until ruled; NOT recommended (regresses the validated restart procedure).

NIT-F -- D-011.6 text not amended to the phase-08 ruling

design-decisions D-011 item 6 still reads "Vault unseal + auto-unseal-after-reboot pattern verified"; phase-08 D-011.6 rules MANUAL unseal is the v1 standard (auto-unseal NOT configured). Append an amendment note to D-011 so the acceptance bar and the acceptance runbook agree.

NIT-G (Roosevelt-forward) -- rabbitmq-server scale-up will race without min-cluster-size

Testcloud: num_units=1, no min-cluster-size -- correct per D-009 (decorative HA). But the D-009 promise is "Roosevelt scale-up is mechanical: 1 -> 3 and rerun". For rabbitmq that is NOT sufficient: without min-cluster-size: 3 the charm accepts client relations before the cluster forms (same failure CLASS as D-062's mysql formation race; upstream charm docs call min-cluster-size best practice). Record now as a Roosevelt bundle-delta note on D-009 so the mechanical scale-up story stays true.
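The Roosevelt delta can be expressed as a bundle check so that a 1 -> 3 scale-up without the option is caught before deploy. A minimal sketch over a parsed bundle dict; the function name is hypothetical, and the rule it encodes (min-cluster-size must equal num_units whenever num_units exceeds 1) is this note's reading of the finding rather than an upstream requirement.

```python
def rabbit_cluster_findings(bundle, app="rabbitmq-server"):
    """Flag a multi-unit rabbitmq-server whose min-cluster-size is unset or mismatched."""
    cfg = (bundle.get("applications") or {}).get(app) or {}
    units = cfg.get("num_units", 0)
    min_size = (cfg.get("options") or {}).get("min-cluster-size")
    if units > 1 and min_size != units:
        return [f"{app}: num_units={units} but min-cluster-size={min_size!r}"]
    return []
```

The testcloud form (one unit, option absent) passes untouched, so the check can sit in the shared gate without a Roosevelt-specific switch.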


Patchset status (2026-07-02, patchset-20260702-redeploy-readiness)


IMPLEMENTED in the delivered ZIP (numbers verified next-free at HEAD 690779a; re-grep at commit): DOCFIX-066 (teardown runbook rewritten around the D-061 fork, destroy path = validated spine, reenroll step added, all invocations bash-prefixed), DOCFIX-067 (octavia SAN IP derived from bundle at generation time; verify-live of the deployed cert still queued), DOCFIX-068 (phase-01 constants -> lib-net/lib-hosts), DOCFIX-069 (bash-prefix; optional chmod noted in apply-notes), DOCFIX-070 (checks absorbed into provider-bundle-check.py + 8-case harness; review-bundle.py to git rm), DOCFIX-071 (bundle-native keystone policy resource + committed zip + drift guard + phase-03 Step 3.4 two-stage gate + appendix-C /tmp fix + subshell wrap), DOCFIX-072 (D-043 RESOLVED->ADOPTED(a)). D-doc amendments appended: D-002, D-009, D-011, D-043, D-051, D-061. STILL OPEN: NIT-C (docs/ ASCII sanitize), NIT-D (identifier index), NIT-E (appendix-A mysql entry), verify-live queue items 1-3.

Patchset status addendum (2026-07-03, Block 2)

IMPLEMENTED: DOCFIX-073 (preflight + channel assert + phase-01 gate), DOCFIX-074 (repo-lint + full ASCII sanitize incl. .gitignore/netbox; closed NIT-C), DOCFIX-075 (cloud-assert + committed ops-restart-procedure; closed the health-check gap), DOCFIX-076 (as-executed convention + run-logged + index), DOCFIX-077 (appendix-A mysql entry + identifier index; closed NIT-D/E), DOCFIX-078 (security ledger), D-069 (vault custody policy), D-070 (supersedes D-012). Verify-live queue item 2 (channel probe) is now AUTOMATED by preflight P3. Remaining operator inputs: SEC-003 custodian assignment; capi-mgmt auto-resume exclusion ruling; octavia deployed-cert SAN read. See docs/changelog-20260703-process-hardening.md.