Session findings -- 2026-07-02 (multi-tenant tenant->cluster buildout)
Executive summary
The tenant IDENTITY/TRUST path is DONE and PROVEN. A tenant password identity now creates a Magnum cluster through create_user (D-064), create_trust (D-065), and into certificate generation. Cluster COMPLETION is blocked one step later by an OPERATOR-side Barbican/Vault substrate defect (D-067), independent of the tenant model. The Option-3 tenant account model (D-066) is adopted and will be used from the first tenant onward. Next session: fix D-067 (live), then full tenant buildout + tenant-facing tests.
The trust-blocker chain (how we got from "cluster 403s" to "done + one substrate bug")
- D-064 (prior): create_user template fix unblocked trustee-user creation.
- create_trust then 403'd for EVERY caller (admin included), even trustor==self via direct openstack trust create. Root cause: base policy shipped identity:create_trust with the non-resolving user_id:%(trust.trustor_user_id)s (Caracal populates target.trust.trustor_user_id). -> D-065: override with the target-prefixed form keystone itself ships. PROVEN by toggling the override off (still 403 -> base policy owns it) then on.
- After D-065, create_trust via APP CRED still failed: keystone _check_application_credential blocks trust creation from app-cred tokens "regardless of the unrestricted flag" (this build's docstring). -> D-066: cluster-create MUST be PASSWORD auth; adopt Option-3 account split. allow_insecure_application_credential_trust_escalation REJECTED (isolation).
- Password create_trust PASSED. Cluster then failed at cert-gen -> Barbican 500 -> castellan vault_key_manager -> Vault AppRole login rejected: "source address 10.12.8.176 unauthorized through CIDR restrictions". -> D-067.
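The D-065 override in the chain above can be sketched as a one-rule policy fragment. This is an illustration only: the rule string is the target-prefixed form named in the D-065 bullet, but the POLICY_FRAGMENT variable, the temp-file path, and the delivery mechanism are assumptions, not the mechanism used in this deployment.

```shell
# Sketch (bash): render the D-065 override as a one-rule policy fragment.
# ASSUMPTION: POLICY_FRAGMENT and its temp path are hypothetical; how the
# fragment reaches keystone is deployment-specific and not shown here.
POLICY_FRAGMENT="$(mktemp)"

# The quoted heredoc delimiter keeps %(...)s literal (no shell expansion).
cat > "$POLICY_FRAGMENT" <<'EOF'
# Base policy shipped "user_id:%(trust.trustor_user_id)s", which did not resolve.
"identity:create_trust": "user_id:%(target.trust.trustor_user_id)s"
EOF

cat "$POLICY_FRAGMENT"
```

Toggling a fragment like this off and back on is the same A/B check used to prove that the base policy owned the 403.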
D-067 root cause (and a corrected mis-diagnosis)
Barbican reaches Vault on the METAL-ADMIN plane (vault_url=10.12.8.190, egress 10.12.8.176). Vault's barbican-vault AppRole binds the secret_id to the METAL-INTERNAL CIDR (where east-west service traffic belongs, D-052/D-053), so a login from an off-plane source address is rejected. The bundle is CORRECT (vault/barbican/barbican-vault all bind secrets endpoints to metal-internal, lines 130/667/700); the LIVE env drifted. Fix = live rebind to metal-internal (gated, next session), NOT CIDR-widen.
CORRECTED: mid-session I hypothesized "secret_id TTL expiry". REFUTED -- juju run vault/leader refresh-secrets rotated the secret_id (barbican.conf re-rendered, service restarted) and the login STILL failed with the CIDR error. It was never expiry; it is plane/CIDR.
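The CIDR rejection described above comes down to a membership test on the source address. Below is a minimal bash sketch of that test, using arithmetic only and no network access. The helper names ip_to_int and in_cidr are illustrative, and 10.12.9.0/24 is a placeholder: the real metal-internal CIDR is not recorded in these notes.

```shell
# Sketch (bash): is a source address inside a CIDR? Arithmetic only.
# ASSUMPTION: 10.12.9.0/24 is a placeholder, NOT the real metal-internal CIDR.

ip_to_int() {
  # Convert a dotted-quad IPv4 address to a 32-bit integer.
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

in_cidr() {
  # Succeeds when address $1 falls inside network $2 (form a.b.c.d/bits).
  local ip=$1 net=${2%/*} bits=${2#*/}
  local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$net") & mask ) ))
}

src=10.12.8.176           # barbican egress address from the Vault error
bound_cidr=10.12.9.0/24   # hypothetical stand-in for the AppRole's bound CIDR

if in_cidr "$src" "$bound_cidr"; then
  echo "$src is inside $bound_cidr: the CIDR check would pass"
else
  echo "$src is outside $bound_cidr: rejected however fresh the secret_id is"
fi
```

The check never looks at the secret_id itself, which is consistent with rotation via refresh-secrets not changing the outcome.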
What is validated live (tenant acme)
- Manager persona self-service via CLI (create_project/user/grant) -- D-064 G3. PASS.
- Tenant isolation: anti-escalation (admin grant DENIED); cross-domain resource reads DENIED/hidden; domain enumeration OWN-DOMAIN-ONLY (tighter than appendix-C's SCS worst-case -- appendix-C corrected).
- App-cred + keypair self-mint; tenant L3 (net/subnet/router/ext-gw, SNAT proven) by a non-admin app-cred identity.
- Cluster template create (image by UUID -- name form has a quoting/derivation hazard).
- Cluster create through create_user + create_trust (password) into cert-gen.
Decisions logged
- D-066: Option-3 tenant accounts (domain-admin/cluster/svc); cluster-create requires password auth.
- D-067: barbican-vault -> Vault must use metal-internal (live drift; the cert-gen blocker). ADOPTED, fix pending.
- D-068: PROPOSED -- Vault substrate hardening (1.16 pin [bundle done], TLS, AppRole lifecycle).
Probe-discipline lessons (now runbook conventions -- these recurred and cost time)
- Validate raw output WHOLE, never extract-then-check. A tr -dc 0-9 MARK guard turned an error string ("...10.12.8.30:17070...") into MARK=123101283017070 and passed. Use case "$raw" in ''|*[!0-9]*) fail;; *) ok;; esac.
- Whitelist-print secrets, never blacklist-redact. approle_secret_id leaked past a secret-keyed redact (the key is _secret_id). Print only an allowlist of safe fields; never pipe secrets.
- No exit/bare-return in interactive PASTE blocks (they escape to the login shell; this logged the operator out). Subshell-wrap ( ... ). NOTE: executed .sh scripts may use exit normally.
- Privileged reads over juju ssh: use sudo cat file | ..., never sudo cmd < file (the redirect is opened by the UNPRIVILEGED calling shell -> Permission denied).
- Use the deployment's DECLARED endpoint/scheme, not the conventional one (assumed Vault https; it serves http -- every probe errored on scheme until corrected).
- A parser that can print NOTHING has a silent third state -- read raw + self-report inputs (field lengths, raw body) so a malformed-request 400 can't masquerade as an auth failure.
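Two of the conventions above (whole-value validation and allowlist printing) can be sketched in bash as follows. The function names, the sample error text, and the sample config fields are made up for illustration; only the case pattern comes from the notes.

```shell
# Sketch (bash). ASSUMPTION: function names and sample data are illustrative.

# 1) Validate the raw value WHOLE: empty or any non-digit character fails.
is_all_digits() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *)           return 0 ;;
  esac
}

raw='dial tcp 10.12.8.30:17070: connection refused'   # made-up error text
extracted=$(printf '%s' "$raw" | tr -dc 0-9)           # the flawed approach

is_all_digits "$extracted" && echo "extract-then-check: PASS (wrong)"
is_all_digits "$raw"       || echo "whole-value check: FAIL (correct)"

# 2) Print only allowlisted fields; anything not named is never printed.
print_allowlisted() {
  # $1: file of key=value lines; remaining args: keys that are safe to show.
  local file=$1; shift
  local allow=" $* " key value
  while IFS='=' read -r key value; do
    case "$allow" in
      *" $key "*) printf '%s=%s\n' "$key" "$value" ;;
    esac
  done < "$file"
}

sample=$(mktemp)   # made-up config; the secret value is a dummy
printf '%s\n' 'vault_url=http://vault.example' \
              'kv_mountpoint=example-mount' \
              'approle_secret_id=dummy-not-a-real-secret' > "$sample"

print_allowlisted "$sample" vault_url kv_mountpoint
```

The allowlist reads the file directly rather than through a pipeline, and a new secret-bearing key is hidden by default because it was never named, which is the failure mode the blacklist redact missed.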
Roosevelt hardening backlog (from this session)
- D-067/D-068: metal-internal binding discipline for ALL vault-kv consumers; Vault 1.16 + TLS; AppRole secret_id lifecycle (TTL/renewal + proactive auth health probe).
- Endpoint/credential "follow the topology" is now a recurring class (with D-057): consider stable VIP/DNS endpoints for substrate services so leader/re-IP changes don't silently break consumers.
Next session plan
- Repair live env: read-only binding diagnosis (juju show-application vault barbican barbican-vault, spaces<->subnets), then GATED juju bind of the barbican<->Vault secrets path to metal-internal; re-run refresh-secrets if needed; confirm barbican AppRole login HTTP 200 from metal-internal.
- Re-run tenant cluster-create (acme, ${CLIENT}-cluster password) -> cert-gen clears -> watch to CREATE_COMPLETE; capture the CAPO child-cred mint identity (confirms D-066).
- Full tenant buildout via scripts/tenant-onboard.sh; then clean-room beta (zero admin fallback).
- Tenant-facing tests: kubeconfig, nodes/CNI/CCM, a tenant LB, tenant isolation from a second tenant.
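The read-only diagnosis in the first step of the plan could be staged as a dry run that only prints what it would execute. This is a sketch: the diagnose_bindings function, the run helper, and the DRY_RUN variable are illustrative names, and the gated juju bind step is deliberately left out.

```shell
# Sketch (bash): read-only binding diagnosis, dry run by default.
# ASSUMPTION: diagnose_bindings, run and DRY_RUN are illustrative names.
# The function body is a subshell ( ... ), so nothing inside it can
# terminate an interactive login shell when pasted.
diagnose_bindings() (
  run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
      printf 'would run: %s\n' "$*"
    else
      "$@"
    fi
  }

  # Read-only: application endpoint bindings, then space <-> subnet mapping.
  run juju show-application vault barbican barbican-vault
  run juju spaces
)

diagnose_bindings
```

Setting DRY_RUN=0 executes the same two read-only commands against the live controller; the rebind itself stays a separate, gated step.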