# Session findings -- 2026-07-02 (multi-tenant tenant->cluster buildout)

## Executive summary
The tenant IDENTITY/TRUST path is DONE and PROVEN. A tenant password identity now creates a Magnum
cluster through create_user (D-064), create_trust (D-065), and into certificate generation. Cluster
COMPLETION is blocked one step later by an OPERATOR-side Barbican/Vault substrate defect (D-067),
independent of the tenant model. The Option-3 tenant account model (D-066) is adopted and applies
from the first tenant onward. Next session: fix D-067 (live), then full tenant buildout + tenant-facing tests.

## The trust-blocker chain (how we got from "cluster 403s" to "done + one substrate bug")
1. D-064 (prior): create_user template fix unblocked trustee-user creation.
2. create_trust then 403'd for EVERY caller (admin included), even with trustor==self via a direct
   `openstack trust create`. Root cause: the base policy shipped `identity:create_trust` as
   `user_id:%(trust.trustor_user_id)s`, which never resolves because Caracal populates
   `target.trust.trustor_user_id` instead.
   -> D-065: override with the target-prefixed form keystone itself ships. PROVEN by toggling the
   override off (still 403 -> base policy owns it), then back on.
3. After D-065, create_trust via APP CRED still failed: keystone's `_check_application_credential`
   blocks trust creation from app-cred tokens "regardless of the unrestricted flag" (this build's
   docstring). -> D-066: cluster-create MUST use PASSWORD auth; adopt the Option-3 account split.
   `allow_insecure_application_credential_trust_escalation` REJECTED (it would weaken tenant isolation).
4. Password create_trust PASSED. Cluster then failed at cert-gen -> Barbican 500 -> castellan
   vault_key_manager -> Vault AppRole login rejected: "source address 10.12.8.176 unauthorized
   through CIDR restrictions". -> D-067.
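
The D-065 override in item 2 amounts to a single policy rule. A minimal sketch, assuming a YAML policy override file; the helper name `emit_trust_override` and the output path in the usage note are hypothetical, and how the override is delivered to keystone in this deployment is not shown.

```shell
# Hypothetical helper: print the one-rule override described in D-065.
# The rule string is the target-prefixed form named in item 2 above.
emit_trust_override() {
  cat <<'EOF'
# D-065: check the trustor field that this build actually populates.
"identity:create_trust": "user_id:%(target.trust.trustor_user_id)s"
EOF
}
```

Usage (the path is a placeholder): `emit_trust_override > /tmp/trust-override.yaml`, then diff it against the override that is actually deployed before toggling anything.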

## D-067 root cause (and a corrected mis-diagnosis)
Barbican reaches Vault on the METAL-ADMIN plane (vault_url=10.12.8.190, egress 10.12.8.176). Vault's
barbican-vault AppRole binds the secret_id to the METAL-INTERNAL CIDR (where east-west service traffic
belongs, D-052/D-053). Off-plane source -> rejected. The bundle is CORRECT (vault/barbican/barbican-vault
all bind secrets endpoints to metal-internal, lines 130/667/700); the LIVE env drifted. Fix = live
rebind to metal-internal (gated, next session), NOT CIDR-widen.

CORRECTED: mid-session I hypothesized "secret_id TTL expiry". REFUTED -- `juju run vault/leader
refresh-secrets` rotated the secret_id (barbican.conf re-rendered, service restarted) and the login
STILL failed with the CIDR error. It was never expiry; it is plane/CIDR.
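
The plane/CIDR mismatch can be checked offline before touching live bindings. A minimal sketch in POSIX shell, assuming IPv4 only; `ip_to_int` and `in_cidr` are hypothetical helpers, and the CIDR passed in must come from the deployment, since the actual metal-internal range is not recorded in these notes.

```shell
# Convert a dotted-quad IPv4 address to an integer.
# Assumption: octets have no leading zeros (they would be read as octal).
ip_to_int() (
  set -f            # no globbing while splitting
  IFS=.
  set -- $1         # split the address into four positional parameters
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if address $1 falls inside CIDR $2 (for example 10.0.0.0/24).
in_cidr() (
  net=${2%/*}
  bits=${2#*/}
  mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
  [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
)
```

Usage, with `METAL_INTERNAL_CIDR` as a placeholder variable: `in_cidr 10.12.8.176 "$METAL_INTERNAL_CIDR" || echo "off-plane source"`. The functions run in subshells, so they leave no variables behind in an interactive session.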

## What is validated live (tenant acme)
- Manager persona self-service via CLI (create_project/user/grant) -- D-064 G3. PASS.
- Tenant isolation: anti-escalation (admin grant DENIED); cross-domain resource reads DENIED/hidden;
  domain enumeration OWN-DOMAIN-ONLY (tighter than appendix-C's SCS worst-case -- appendix-C corrected).
- App-cred + keypair self-mint; tenant L3 (net/subnet/router/ext-gw, SNAT proven) by a non-admin
  app-cred identity.
- Cluster template create (image by UUID -- name form has a quoting/derivation hazard).
- Cluster create through create_user + create_trust (password) into cert-gen.

## Decisions logged
- D-066: Option-3 tenant accounts (domain-admin/cluster/svc); cluster-create requires password auth.
- D-067: barbican-vault -> Vault must use metal-internal (live drift; the cert-gen blocker). ADOPTED, fix pending.
- D-068: PROPOSED -- Vault substrate hardening (1.16 pin [bundle done], TLS, AppRole lifecycle).

## Probe-discipline lessons (now runbook conventions -- these recurred and cost time)
1. Validate raw output WHOLE, never extract-then-check. A `tr -dc 0-9` MARK guard turned an error
   string ("...10.12.8.30:17070...") into MARK=123101283017070 and passed. Use `case "$raw" in
   ''|*[!0-9]*) fail;; *) ok;; esac`.
2. Whitelist-print secrets, never blacklist-redact. `approle_secret_id` leaked past a `secret`-keyed
   redact (the key pattern is `*_secret_id*`). Print only an allowlist of safe fields; never pipe secrets.
3. No `exit`/bare-`return` in interactive PASTE blocks (they escape to the login shell; this logged
   the operator out). Subshell-wrap `( ... )`. NOTE: executed .sh scripts may use exit normally.
4. Privileged reads over `juju ssh` use `sudo cat file | ...`, never `sudo cmd < file` (the redirect
   is opened by the UNPRIVILEGED calling shell -> Permission denied).
5. Use the deployment's DECLARED endpoint/scheme, not the conventional one (assumed Vault https; it
   serves http -- every probe errored on scheme until corrected).
6. A parser that can print NOTHING has a silent third state -- read raw + self-report inputs (field
   lengths, raw body) so a malformed-request 400 can't masquerade as an auth failure.
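
Lessons 1 and 2 can be captured as small reusable guards. A minimal sketch in POSIX shell; the function names `is_all_digits` and `print_allowlisted` are hypothetical, as is the `key=value` line format assumed for the input.

```shell
# Lesson 1: accept the raw value only if it is non-empty and entirely digits.
# The whole string is tested; nothing is extracted first.
is_all_digits() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Lesson 2: print only explicitly allowlisted keys from key=value lines on stdin.
# Usage: print_allowlisted key1 key2 ... < input
print_allowlisted() (
  allow=" $* "
  while IFS='=' read -r key value; do
    [ -n "$key" ] || continue
    case "$allow" in
      *" $key "*) printf '%s=%s\n' "$key" "$value" ;;
    esac
  done
)
```

For contrast, a made-up error string such as `dial 10.12.8.30:17070` passed through `tr -dc 0-9` becomes `101283017070`, which any digits-only check accepts; `is_all_digits 'dial 10.12.8.30:17070'` fails as it should. With `print_allowlisted role_id`, a line keyed `approle_secret_id` is never printed, because it was never on the list.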

## Roosevelt hardening backlog (from this session)
- D-067/D-068: metal-internal binding discipline for ALL vault-kv consumers; Vault 1.16 + TLS;
  AppRole secret_id lifecycle (TTL/renewal + proactive auth health probe).
- Endpoint/credential "follow the topology" is now a recurring class (with D-057): consider stable
  VIP/DNS endpoints for substrate services so leader/re-IP changes don't silently break consumers.

## Next session plan
1. Repair live env: read-only binding diagnosis (`juju show-application vault barbican barbican-vault`,
   spaces<->subnets), then GATED `juju bind` of the barbican<->Vault secrets path to metal-internal;
   re-run refresh-secrets if needed; confirm barbican AppRole login HTTP 200 from metal-internal.
2. Re-run tenant cluster-create (acme, ${CLIENT}-cluster password) -> cert-gen clears -> watch to
   CREATE_COMPLETE; capture the CAPO child-cred mint identity (confirms D-066).
3. Full tenant buildout via scripts/tenant-onboard.sh; then clean-room `beta` (zero admin fallback).
4. Tenant-facing tests: kubeconfig, nodes/CNI/CCM, a tenant LB, tenant isolation from a second tenant.
