Newer
Older
openstack-caracal-ipv4 / runbooks / phase-03-core-verify.md

Phase 03 -- Core Verify (settle, admin-openrc, Horizon)

After vault's cert cascade (phase-02), confirm the cloud settled to active/idle (except the expected post-deploy blocks), build the IP-only admin-openrc, verify API reachability, and repoint the external Horizon reverse proxy.

Decisions: B5 (IP-only endpoints; no FQDN), D-021 (octavia stays BLOCKED awaiting configure-resources -- expected, cleared in phase-05), D-044 (Horizon Secure-cookie override on the plain-HTTP proxy leg; Step 3.3, PER-REBUILD), D-045 / DOCFIX-031 (haproxy backends confirmed LOADED via a functional sweep, NOT juju status; Step 3.1). Troubleshooting: appendix-A -- DOCFIX-021 (action human-output corrupts captured artifacts), DOCFIX-018 (IP-only OS_AUTH_URL), DOCFIX-022 (admin project discovered, not hardcoded), D-045/DOCFIX-031 (haproxy plaintext-check-vs-SSL backend DOWN), nginx reverse-proxy lessons.


Prerequisites (must be true entering phase-03)

  • phase-02 done: vault unsealed + authorized; root CA generated; the cert cascade is running/settling.
  • The Vault root CA is available via the vault charm action (pulled below).

Constants and env-literals (TAG: confirm per site on rebuild)

  • ENV(keystone-vip) 10.12.4.50 (keystone PUBLIC endpoint = provider VIP; verify vs bundle)
  • ENV(admin-domain) admin_domain (charmed-keystone admin user + project domain)
  • ENV(dashboard-vip) 10.12.4.58 (Horizon provider VIP; was .234 pre-R14)
  • admin project: DISCOVERED at runtime (do not hardcode -- DOCFIX-022).

Run-location legend

  • # RUN: jumphost -- vopenstack-jesse as jessea123; juju + openstack + openssl.

Step 3.1 -- Settle the cert cascade + acceptance walk

# RUN: jumphost The cascade here is NARROW (mysql bootstrapped before vault init, so only the Vault consumers clear: ovn-central x3, ovn-chassis x3, ovn-chassis-octavia, neutron-api-plugin-ovn, barbican-vault). Watch, then walk units AND subordinates.

juju status --color --watch 30s -m openstack     # Ctrl-C once settled

Acceptance walk (counts non-active/idle across units + subordinates):

juju status -m openstack --format=yaml | python3 -c "
import yaml,sys
d=yaml.safe_load(sys.stdin); apps=d.get('applications',{}); bad=[]
def chk(n,u):
    ws=(u.get('workload-status') or {}).get('current',''); js=(u.get('juju-status') or {}).get('current','')
    msg=(u.get('workload-status') or {}).get('message','')
    if ws!='active' or js!='idle': bad.append('%s: workload=%s juju=%s msg=%s'%(n,ws,js,msg))
for app,info in apps.items():
    for un,ud in (info.get('units') or {}).items():
        chk(un,ud)
        for sn,sd in (ud.get('subordinates') or {}).items(): chk(sn,sd)
print('Non-active/idle units: %d'%len(bad))
for b in bad: print('  '+b)
"

GATE: expected non-active/idle = 1 (octavia/0 BLOCKED "Awaiting configure-resources", the D-021 next step) or briefly 2 (+ glance-simplestreams-sync, normal pre-run). Any TLS consumer (the five above) persisting waiting/error past ~15 min is the concern -- STOP and read its log + relations (do NOT assume TLS; a prior stall was a MySQL 1045 desync):

juju status --relations -m openstack ovn-central ovn-chassis ovn-chassis-octavia neutron-api-plugin-ovn barbican-vault
# juju ssh -m openstack <unit> -- 'sudo tail -120 /var/log/juju/unit-<unit-dashed>.log' </dev/null

Step 3.1 backend-health gate (DOCFIX-031 / D-045) -- juju status is BLIND to a dead backend

# RUN: jumphost The acceptance walk above gates on juju active/idle. That is NOT sufficient: a unit can be active/idle while a charm-rendered haproxy backend is silently DOWN -- observed 2026-06-12, nova-cc nova-api down ~3.2 days with juju green (root cause D-045: haproxy not reloaded after the cert cascade -> plaintext checks vs the SSL backend). Probe haproxy's own verdict on every unit:

( {
  echo "=== POST-TLS GATE: haproxy backend health sweep across all units ==="
  for unit in $(juju status -m openstack --format=json | python3 -c 'import json,sys; d=json.load(sys.stdin); [print(u) for a in d.get("applications",{}).values() for u in (a.get("units") or {})]'); do
    juju ssh -m openstack "$unit" -- "test -S /var/run/haproxy/admin.sock || exit 0; sudo python3 -c 'import socket;s=socket.socket(socket.AF_UNIX);s.connect(\"/var/run/haproxy/admin.sock\");s.sendall(b\"show stat\n\");print(s.makefile().read())' | grep -vE 'FRONTEND|BACKEND' | grep ',DOWN,'" </dev/null 2>/dev/null | sed "s|^|[$unit] DOWN: |"
  done
  echo "=== sweep complete -- no DOWN lines above means every haproxy backend is UP ==="
} )

GATE: zero [unit] DOWN: lines. On a DOWN line (check token L7STS/400 == plaintext-vs-SSL), remediate the flagged unit (set U, then validate-and-reload):

U=nova-cloud-controller/0
juju ssh -m openstack "$U" -- 'sudo haproxy -c -f /etc/haproxy/haproxy.cfg' </dev/null   # gate: must say valid
juju ssh -m openstack "$U" -- 'sudo systemctl reload haproxy' </dev/null                 # graceful master-worker

Re-run the sweep until clean. (Signature confirm if needed: plaintext curl http://SVC-IP:876x/ returns 400, TLS curl -k https://SVC-IP:876x/ returns 200 -- the transport is the difference.)

Step 3.2 -- Build admin-openrc (IP-only; canonical block)

# RUN: jumphost Keystone PUBLIC = the provider VIP IP over HTTPS with the vault CA (no FQDN, no /etc/hosts -- B5). This canonical block folds in three fixes: the CA is pulled via --format json + jq because the action's human output wraps the PEM in an INDENTED YAML block that is not valid PEM (appendix-A: DOCFIX-021); the OS_AUTH_URL is the VIP IP (DOCFIX-018); and the admin project is DISCOVERED by a scope-test loop rather than hardcoded, because the scoping project name varies by charm rev (DOCFIX-022 -- the cause of a prior HTTP 401). ( set -e ) keeps OS_PASSWORD inside the subshell and aborts cleanly on any failure.

KEYSTONE_VIP=10.12.4.50              # keystone PUBLIC endpoint = provider VIP (verify vs bundle on rebuild)
ADMIN_DOMAIN=admin_domain            # charmed-keystone admin user + project domain
PROJECT_CANDIDATES="admin admin_domain"   # tried in order; first that SCOPES wins (DOCFIX-022 variance)
CA="$HOME/vault-init/vault-ca-root.pem"
RC="$HOME/admin-openrc"

( set -e
  mkdir -p "$HOME/vault-init"
  # 1. Vault root CA -> file (JSON extract; DOCFIX-021 -- human output indents the PEM)
  juju run vault/leader get-root-ca -m openstack --format json \
    | jq -r '[.. | strings | select(test("-----BEGIN CERTIFICATE-----"))][0]' > "$CA"
  openssl x509 -in "$CA" -noout -subject -dates
  # 2. Admin password -> var (JSON extract, not human output)
  ADMIN_PASS=$(juju run keystone/leader get-admin-password -m openstack --format json | python3 -c "
import json,sys
d=json.load(sys.stdin)
def f(o):
    if isinstance(o,dict):
        for k in ('admin-password','password','Stdout'):
            if k in o and o[k]: return str(o[k]).strip()
        for v in o.values():
            r=f(v)
            if r: return r
    elif isinstance(o,list):
        for v in o:
            r=f(v)
            if r: return r
    return ''
print(f(d))
")
  [ -n "$ADMIN_PASS" ] || { echo "FATAL: password extract failed"; exit 1; }
  # 3. PROJECT LOOKUP: first candidate that issues a SCOPED token wins (DOCFIX-022)
  export OS_AUTH_URL="https://${KEYSTONE_VIP}:5000/v3" OS_USERNAME=admin OS_PASSWORD="$ADMIN_PASS"
  export OS_USER_DOMAIN_NAME="$ADMIN_DOMAIN" OS_PROJECT_DOMAIN_NAME="$ADMIN_DOMAIN"
  export OS_IDENTITY_API_VERSION=3 OS_REGION_NAME=RegionOne OS_CACERT="$CA"
  ADMIN_PROJECT=""
  for P in $PROJECT_CANDIDATES; do
    if OS_PROJECT_NAME="$P" openstack token issue >/dev/null 2>&1; then ADMIN_PROJECT="$P"; break; fi
  done
  [ -n "$ADMIN_PROJECT" ] || { echo "FATAL: no candidate project scoped (tried: $PROJECT_CANDIDATES)"; exit 1; }
  echo "[OK] admin project = $ADMIN_PROJECT ; password len ${#ADMIN_PASS}"
  # 4. Write ~/admin-openrc (backs up any existing one first)
  [ -f "$RC" ] && mv "$RC" "$RC.pre-$(date -u +%Y%m%dT%H%M%SZ)"
  cat > "$RC" <<EOF
export OS_AUTH_URL=https://${KEYSTONE_VIP}:5000/v3
export OS_USERNAME=admin
export OS_PASSWORD='$ADMIN_PASS'
export OS_PROJECT_NAME=$ADMIN_PROJECT
export OS_USER_DOMAIN_NAME=$ADMIN_DOMAIN
export OS_PROJECT_DOMAIN_NAME=$ADMIN_DOMAIN
export OS_IDENTITY_API_VERSION=3
export OS_REGION_NAME=RegionOne
export OS_CACERT=$CA
EOF
  chmod 600 "$RC"
)
# 5. Verify from the written file (password stayed inside the subshell above)
( source "$RC"; echo "auth -> $OS_AUTH_URL  project=$OS_PROJECT_NAME"; openstack token issue 2>&1 | head -6 )
( source "$RC"; openstack endpoint list -f value -c "Service Name" -c Interface -c URL 2>&1 | sort )

GATE: token issue returns a SCOPED token; endpoint list is IP-only across all services (public on the provider VIP .5x, internal+admin on the metal VIP .8.5x, keystone admin on :35357). Two non-blocking notes for later: s3/swift is registered on the radosgw VIP .60:443 (re-check vs the radosgw :80 listener during any Swift/S3 smoke); the gss image-stream is HTTP on metal 10.12.8.172.

Step 3.3 -- Horizon access via the external nginx reverse proxy

# RUN: operator (outside the Juju model) + jumphost Horizon is fronted by an operator-managed nginx reverse proxy. On each rebuild / VIP relocation: (1) repoint the upstream to the CURRENT dashboard provider VIP (now https://10.12.4.58, was .234 pre-R14), and (2) reapply the Horizon Secure-cookie override (DOCFIX-030 / D-044, PER-REBUILD -- below). Two interplays:

  • ALLOWED_HOSTS: Horizon (bundle B5) must permit the Host header that reaches it, else HTTP 400 DisallowedHost. As-built keeps the client Host (proxy_set_header Host $http_host); rewriting it to the VIP would emit redirects corporate clients may not route.
  • Upstream TLS name-match: the dashboard cert is vault-signed and embeds the unit HOSTNAME as a DNS SAN (e.g. juju-ffe3b8-2-lxd-2) alongside IP SANs. nginx upstream verification is DNS-only (X509_check_host), so the proxy_pass IP NEVER matches -- proxy_ssl_verify on requires proxy_ssl_name set to the cert's DNS SAN. (B5 is IP-only for ENDPOINTS, not certs.)

Proxy topology (as-executed 2026-06-12 -- confirm/refresh per site):

  • Proxy host nginx 10.12.4.7 (Ubuntu 24.04, nginx 1.24.0 native/systemd); also fronts MAAS (listen 80 -> 10.12.4.10:5240). Horizon vhost /etc/nginx/sites-available/openstack (symlinked into sites-enabled), listen 81; corporate clients reach it via 10.17.11.246:81.

As-executed change set (gate every edit -- sed -i exits 0 on zero matches, so grep-assert the expected line after any mutation):

# RUN: jumphost -- ship the vault root CA to the proxy
scp ~/vault-init/vault-ca-root.pem jessea123@10.12.4.7:/tmp/
# RUN: operator ON 10.12.4.7 -- install CA, back up + edit the Horizon vhost, validate, restart.
sudo install -o root -g root -m 644 /tmp/vault-ca-root.pem /etc/nginx/vault-ca-root.pem && rm -f /tmp/vault-ca-root.pem
sudo cp -a /etc/nginx/sites-available/openstack "/etc/nginx/sites-available/openstack.bak-$(date -u +%Y%m%dT%H%M%SZ)"
# Set in the Horizon server block (then `grep` to confirm each landed):
#   proxy_pass https://10.12.4.58:443;
#   proxy_ssl_trusted_certificate /etc/nginx/vault-ca-root.pem;
#   proxy_ssl_verify on;
#   proxy_ssl_name juju-ffe3b8-2-lxd-2;   # the dashboard cert's DNS SAN -- per site (discover: openssl s_client -connect 10.12.4.58:443 </dev/null 2>/dev/null | openssl x509 -noout -ext subjectAltName)
#   proxy_redirect https://$http_host/ http://$http_host/;   # unwind the scheme-mismatch redirect loop (Horizon emits absolute https:// on the client Host -> browser then speaks TLS to the :81 plaintext listener)
sudo nginx -t                       # GATE: configuration ok
sudo systemctl restart nginx        # prefer restart over reload for a definitive cutover (a curl ~2s after `reload` can be served by a draining old worker; ~2s blip incl. the co-hosted MAAS proxy)

GATE (on the proxy): curl -sI http://127.0.0.1:81/horizon/ -> 302 to .../auth/login; no TLS errors in error.log.

The charm renders CSRF_COOKIE_SECURE/SESSION_COOKIE_SECURE = True (vault:certificates). On the plain-HTTP client leg the browser drops the Secure csrftoken and login fails with "CSRF cookie not set" -- so a clean follow of 3.3 otherwise stalls at the browser login. Drop an ASCII-only post-load override on the dashboard unit, then graceful-reload apache2:

# RUN: jumphost -- D-044 cookie override on the dashboard unit (ASCII-only; PER-REBUILD)
juju ssh -m openstack openstack-dashboard/leader -- "printf 'CSRF_COOKIE_SECURE = False\nSESSION_COOKIE_SECURE = False\n' | sudo tee /usr/share/openstack-dashboard/openstack_dashboard/local/local_settings.d/_99_internal_http_cookies.py >/dev/null && sudo systemctl reload apache2" </dev/null

Verify the csrftoken Set-Cookie carries NO Secure attribute (over the VIP, vault CA):

# RUN: jumphost
CK=$(curl -s -o /dev/null -D - --cacert ~/vault-init/vault-ca-root.pem https://10.12.4.58/horizon/auth/login/ | grep -i 'set-cookie:.*csrftoken')
[ -n "$CK" ] || echo "WARN: no csrftoken Set-Cookie on this GET -- confirm via the browser login"
printf '%s\n' "$CK" | grep -iq 'secure' && echo "FAIL: csrftoken still Secure" || echo "OK: csrftoken not Secure"

Then confirm an external browser login over the proxy succeeds. PER-REBUILD: teardown wipes the unit; reapply each rebuild until edge TLS (the Roosevelt access/DNS workstream) makes the override unnecessary. The upstream stays PLAIN HTTP (as-built); the abandoned upstream-TLS and self-signed-client-TLS approaches are NOT part of v1 (D-044 rationale). Diagnostic lessons (reload race, proxy_ssl_name DNS-SAN, sed no-op, scheme-mismatch redirect loop) are in appendix-A.


EXIT GATE (phase-03 complete)

  • Cloud settled: acceptance walk shows only the expected block(s) (octavia; maybe gss).
  • Every haproxy backend UP/L7OK by the functional sweep (DOCFIX-031), not merely juju active/idle.
  • ~/admin-openrc (0600) authenticates and returns a SCOPED token; endpoint list IP-only.
  • Vault root CA at ~/vault-init/vault-ca-root.pem validates TLS to the keystone VIP.
  • Horizon reachable through the repointed reverse proxy AND login works (D-044 cookie override applied). Dashboard webroot is the charm default /horizon (the root path 404s) -- probe /horizon/auth/login/. Two VIPs per bundle B1: 10.12.4.58 (provider) + 10.12.8.58 (metal).

As-built reference (2026-06-03 run -- audit trail)

  • Cascade settled ~04:15Z: all five Vault consumers active/idle; only expected non-active/idle = octavia (blocked, D-021) + gss (pre-run). mysql primary on mysql-innodb-cluster/1 (R/W), /0+/2 R/O (normal innodb-cluster).
  • admin-openrc IP-only: OS_AUTH_URL=https://10.12.4.50:5000/v3, OS_USERNAME=admin, OS_PROJECT_NAME=admin (scoped; project_id 65ce73e6798e4d1e8dd066609b7033ef), domains admin_domain, OS_CACERT=~/vault-init/vault-ca-root.pem.
  • Vault root CA: subject "Vault Root Certificate Authority (charm-pki-local)", notBefore 2026-06-03, notAfter 2036-05-31; TLS to 10.12.4.50:5000 OK (B5 IP-SAN holds).
  • Dashboard VIP 10.12.4.58 (provider) + 10.12.8.58 (metal); nginx upstream repoint + D-044 cookie override captured in Step 3.3 (as-executed 2026-06-12).
  • gss image-stream public endpoint is HTTP on the unit IP this rebuild (10.12.8.196; was .172 the 06-03 snapshot -- a rebuild-variable container primary, not a defect; off the critical path, no jumphost route to the container space). Refresh the snapshot per rebuild.
  • Exit-gate re-confirm (2026-06-16, read-only, all green): settle (only octavia D-021 + gss between-runs); admin-openrc scoped + Nova compute service list up (D-045 holds end-to-end); vault-CA -> keystone VIP TLS verify rc 0; haproxy backend sweep zero DOWN cloud-wide.

Next

phase-04 -- network carve (external provider network).