diff --git a/bundle.yaml b/bundle.yaml
index d80e20f..c6bd4b4 100644
--- a/bundle.yaml
+++ b/bundle.yaml
@@ -130,6 +130,14 @@
       access: metal-internal
       certificates: metal-internal
       cluster: metal-internal
+      # BUNDLEFIX-007 (D-067 amendment, 2026-07-02): external MUST equal access. The charm computes
+      # vault_url (access) AND vault_url_external (external) and, because the vault-kv interface
+      # ignores remote_binding (deprecated, LP#1895185), the external publish CLOBBERS vault_url on
+      # ALL kv consumer relations. If external falls to the '' default (metal-admin), every consumer
+      # (barbican-vault) is told to dial the metal-admin address while its AppRole secret_id is
+      # CIDR-bound to its metal-internal /32 -> deterministic login reject -> Barbican 500.
+      # Present on charm-vault stable/1.8 AND master (checked 2026-07-02) -- keep this line on upgrades.
+      external: metal-internal
       ha: metal-internal
       secrets: metal-internal
       shared-db: metal-internal
diff --git a/docs/design-decisions.md b/docs/design-decisions.md
index 9f8384c..c4a63e7 100644
--- a/docs/design-decisions.md
+++ b/docs/design-decisions.md
@@ -1320,6 +1320,48 @@
 D-057 (prior topology-didn't-follow-binding defect -- same family).
 **Corrects:** the in-session "secret_id TTL expiry" hypothesis (refuted by the refresh-secrets test).
 
+### D-067 -- AMENDMENT (2026-07-02, post-fix): corrected mechanism; FIXED live; CLOSED for v1
+
+The "bundle correct / live drifted" framing above is WRONG and is corrected here (append-only
+discipline; the body above stands as the historical record). Root cause, read from charm source at
+the exact vendored versions:
+
+1. Live bindings were NEVER drifted -- `juju show-application` matched the bundle exactly (all
+   secrets-path endpoints metal-internal). The drift was one relation-data key: vault/0 advertised
+   binding-correct ingress (10.12.12.117) but an explicit `vault_url` of http://10.12.8.190:8200.
+2. Mechanism: charm-vault `send_vault_url_and_ca()` (re-fires on every non-update-status hook)
+   computes vault_url from the `access` binding AND vault_url_external from the `external` binding,
+   publishing BOTH when they differ. The vault-kv interface (vendored commit 6f7848c, per
+   src/build.lock) IGNORES the `remote_binding` selector (deprecated, LP#1895185) and writes the
+   single `vault_url` key on ALL relations -- so the external publish CLOBBERS the access URL for
+   every kv consumer. Simultaneously the AppRole secret_id is CIDR-bound to the CONSUMER's
+   secrets-relation ingress /32 (10.12.12.110/32) -- self-inconsistent whenever access != external.
+3. Why external resolved to metal-admin: the bundle OMITS vault's `external` binding, so it falls
+   to the `''` default (metal-admin). D-052's judgment call ("vault external -> metal-admin,
+   operator/unseal path") rested on a mistaken premise: in this charm the `external` endpoint's only
+   functional use is this kv-URL advertisement (plus an inert VIP check); operator/unseal access
+   does not traverse it, and the listener binds [::]:8200 regardless. So a fresh deploy from the
+   bundle REPRODUCES the failure -- this was a bundle+charm defect, not live drift.
+
+Fix as executed (2026-07-02, gated): `juju bind -m openstack vault external=metal-internal`. The two
+computed URLs became equal, the charm's own equality guard suppressed the second publish, relation
+data flipped to http://10.12.12.117:8200 within one poll cycle, barbican-vault re-rendered
+barbican.conf, and no refresh-secrets was needed. Validated end-to-end: AppRole login HTTP 200 from
+the authentic source (kernel route src 10.12.12.110), then a full admin `openstack secret`
+store/get/payload-compare/delete round-trip through the exact POST /v1/secrets path that 500'd.
+Bundle hardened with BUNDLEFIX-007 (`external: metal-internal` + comment); the double-publish is
+present on charm-vault master as of 2026-07-02, so the guard is permanent, including the D-068
+1.16 pin.
+
+Observation logged during validation (not actioned): barbican's `Secret href` renders as
+https://None:9312/... (host_href unset/None). Cosmetic for barbicanclient/castellan consumers
+(clients extract the UUID and dial their own catalog endpoint) -- stage-6 cluster cert refs will
+exercise it for real; investigate only if cert-ref retrieval misbehaves.
+
+**Status:** CLOSED for v1 (fixed + validated live 2026-07-02). **Adds:** BUNDLEFIX-007.
+**Corrects:** the "bundle is CORRECT; the LIVE env drifted" mechanism above. **Related:** D-052
+(premise correction noted), D-068 (carry BUNDLEFIX-007 through the 1.16 pin).
+
 ---
 
 ## D-068: PROPOSED -- Vault substrate hardening (Roosevelt)
diff --git a/docs/v1-redeploy-changelog.md b/docs/v1-redeploy-changelog.md
index 0c7ee16..76f8aab 100644
--- a/docs/v1-redeploy-changelog.md
+++ b/docs/v1-redeploy-changelog.md
@@ -1361,3 +1361,29 @@
 capi-mgmt scope preamble + flavor floor; 8.1 D-039 role + keypair pre-checks; octavia prereq
 real-exit capture), to be written at phase-08 close. D-063 = capi-mgmt-sg 0.0.0.0/0 hardening,
 PROPOSED/OPEN. DOCFIX-063 = phase-07 reconciliation, six fixes.)
+
+## 2026-07-02 (session 2) -- D-067 FIXED live + CLOSED; root cause corrected to bundle+charm defect
+
+Fixed the Barbican/Vault cert-gen blocker with ONE gated mutation:
+`juju bind vault external=metal-internal`. Root cause CORRECTED from "live drift" to bundle+charm:
+the bundle omitted vault's `external` binding (-> '' default metal-admin); charm-vault publishes
+vault_url (access) then vault_url_external (external), and the vault-kv interface ignores
+remote_binding (LP#1895185, verified at vendored commit 6f7848c), so the external URL clobbers
+vault_url on all kv consumers while the AppRole stays CIDR-bound to the consumer's metal-internal
+/32. Diagnosis was code-grounded (charm-vault stable/1.8 + build.lock-pinned interface), gated:
+Gate 1 (config/network-get/hacluster preflight) -> Gate 2 (bind + relation-data + conf-render
+polls; base64-to-file delivery after a raw-paste truncation) -> Gate 3 (AppRole login 200 from
+authentic source .110 + admin secret store/get/payload/delete round-trip). No refresh-secrets
+needed; no CIDR widening. BUNDLEFIX-007 adds `external: metal-internal` permanently (double-publish
+present on charm-vault master). D-052's "vault external -> metal-admin operator path" premise
+corrected (endpoint's only functional use is the kv URL advertisement). D-067 amendment appended;
+tenant-onboard.sh stage-6 gating note updated to RESOLVED.
+
+Also this session: handoff commit 22a1eef verified 7/8 (bundle.yaml was DELETED by the loose-file
+commit pattern; restored + verified in 10e9186 -- diff vs pre-delete is exactly the vault 1.16 pin).
+Observation (logged, not actioned): barbican Secret href renders https://None:9312/... (host_href
+None) -- cosmetic for UUID-extracting clients; watch during stage-6 cert refs.
+
+### Next-free numbers
+Design decision: D-069. Doc fix: DOCFIX-065 (unchanged). Bundle fix: BUNDLEFIX-008 (007 ASSIGNED
+above; repo tree grep showed 002-004 but 001-006 exist in history, so 007 is the first safe free).
diff --git a/scripts/tenant-onboard.sh b/scripts/tenant-onboard.sh
index e4a3cbf..cfcf837 100644
--- a/scripts/tenant-onboard.sh
+++ b/scripts/tenant-onboard.sh
@@ -1,8 +1,7 @@
 #!/usr/bin/env bash
 # tenant-onboard.sh -- Option-3 multi-tenant onboarding (D-066), Omega Cloud v1
-# STATUS: DRAFT 2026-07-02. Stages 0-5 validated live (tenant acme). Stage 6 (cluster create) is
-# GATED on D-067 (barbican<->Vault metal-internal rebind) -- it will fail at cert-gen until that is
-# fixed. Run stages 0-5 to build the tenant; run stage 6 only after D-067 is resolved.
+# STATUS: DRAFT 2026-07-02. Stages 0-5 validated live (tenant acme). Stage 6 gate CLEARED:
+# D-067 (barbican<->Vault) FIXED + validated live 2026-07-02 (see D-067 amendment / BUNDLEFIX-007).
 #
 # Model (D-066): operator creates domain + manager; manager creates project + -cluster (password,
 # trust-capable, cluster lifecycle) + -svc (unrestricted app cred, non-trust automation). Cluster
@@ -148,8 +147,7 @@
 
 stage6(){ # cluster create as -cluster PASSWORD [GATED ON D-067]
   echo "== stage6: cluster create as ${CLIENT}-cluster (PASSWORD) =="
-  echo " NOTE: GATED ON D-067 -- until the barbican<->Vault metal-internal rebind is done, this dies"
-  echo " at cert-gen (Barbican 500 / Vault AppRole CIDR reject). Proceed only post-D-067."
+  echo " NOTE: D-067 RESOLVED 2026-07-02 (vault external binding -> metal-internal; BUNDLEFIX-007)."
   admin_env
   local DOM; DOM=$(openstack domain show "$CLIENT" -f value -c id 2>&1)
   local CF="$OUT/${CLIENT}-cluster-cred.txt"; local PID; PID=$(awk -F= '/^project_id=/{print $2}' "$CF")
@@ -172,7 +170,7 @@
   stage4|4) stage4 ;;
   stage5|5) stage5 ;;
   stage6|6) stage6 ;;
-  all) stage0; stage1; stage2; stage3; stage4; stage5; echo "== stages 0-5 done. stage6 (cluster) is gated on D-067 -- run explicitly: tenant-onboard.sh $CLIENT 6 ==" ;;
+  all) stage0; stage1; stage2; stage3; stage4; stage5; echo "== stages 0-5 done. run stage6 (cluster) explicitly: tenant-onboard.sh $CLIENT 6 ==" ;;
   *) die "unknown stage: $STAGE" ;;
 esac
 echo "handover creds in: $OUT (0600)"
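
Review note: the clobber described in the D-067 amendment amounts to two writes to one relation key, where the last writer wins. The following is a minimal sketch of that behaviour, assuming the mechanism is exactly as the amendment states it. The `publish` and `advertise` helpers are hypothetical stand-ins written for this note; they are not charm-vault or vault-kv code, and only the IP addresses and port are taken from the amendment.

```bash
#!/usr/bin/env bash
# Hypothetical model of the D-067 clobber; NOT charm-vault code.
set -euo pipefail

# Stands in for the single vault_url key that the kv interface writes on every relation.
declare -A relation_data

publish() {
  # Every publish lands on the same key, because remote_binding is ignored.
  relation_data[vault_url]="$1"
}

advertise() {
  # advertise <access_ip> <external_ip>: prints the URL a kv consumer ends up seeing.
  local access_url="http://$1:8200"
  local external_url="http://$2:8200"
  publish "$access_url"
  # Equality guard: the second publish only happens when the two URLs differ.
  if [[ "$external_url" != "$access_url" ]]; then
    publish "$external_url"
  fi
  printf '%s\n' "${relation_data[vault_url]}"
}

# Before the rebind: external resolves to the metal-admin address and overwrites the access URL.
before=$(advertise 10.12.12.117 10.12.8.190)
# After the rebind: both bindings resolve to metal-internal, so the guard suppresses the overwrite.
after=$(advertise 10.12.12.117 10.12.12.117)

echo "before rebind: $before"
echo "after rebind:  $after"
```

Under these assumptions the consumer is handed the metal-admin URL before the rebind and the metal-internal URL after it, which matches the relation-data values recorded in the amendment.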