
Environment - Omega Cloud (VR0 DC0 testcloud)

Facts here are ANCHORS, not command inputs. Anything marked (verify) must be re-measured or re-read from the repo/live cloud before use in a command - hard rule 2 applies. Snapshot date: 2026-07. The repo is fresher than this file.

The two deployments

  • Testcloud (now): VR0 DC0, four KVM host VMs (openstack0-3) on a single hypervisor, managed by MAAS + Juju. Single-DC virtual rehearsal.
  • Roosevelt (future): bare-metal, multi-DC, commercial production (3310 Roosevelt Blvd, Eugene OR). Dedicated node roles (gateway/controller/compute split) - unlike the hyperconverged testcloud. Every design choice is judged by its transfer to Roosevelt.

Stack (verify against appendix-B for pins)

Charmed OpenStack Caracal 2024.1 - Juju 3.6, MAAS 3.7.2, Vault TLS (charm-pki root CA), OVN 24.03, Ceph Squid, Octavia (amphora), Barbican, mysql-innodb-cluster, RabbitMQ, Magnum + magnum-capi-helm driver + azimuth capi-helm-charts (kubeadm engine), in-cloud single-homed CAPI mgmt VM (capi-mgmt-v2, k8s-snap, D-035). NetBox is the IPAM apex: never hand-edit downstream MAAS or overlays for network values.

Control points

  • Jumphost: vopenstack-jesse - all live commands run here. Has juju, the openstack CLI (SNAP - cannot read /tmp; use $HOME), jq, kubectl.
  • Repo: https://git.baldurkeep.com/OpenStack/openstack-caracal-ipv4 (web) / .../git/OpenStack/openstack-caracal-ipv4.git (clone). Operator commits from Windows (PowerShell / GitHub Desktop - strips exec bits; .gitattributes pins LF); the jumphost only pulls.
  • Juju model: openstack. MAAS profile: admin (call maas admin ... directly; NEVER maas list - it prints the API key).
  • Management substrate (verify; NEVER touch in teardown): the MAAS machines hosting juju, lxd, and tailscale are hard-excluded from teardown scripts. Resolve system_ids live via scripts/lib-hosts.sh - system_ids are re-minted on every re-enrollment (DOCFIX-040).
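
Because the snap-packaged openstack CLI on the jumphost cannot read /tmp, any file handed to it has to be staged under $HOME. A minimal Python sketch of that habit follows; the helper name `home_scratch_file` and the `scratch` directory are assumptions for illustration, not something the repo defines.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def home_scratch_file(suffix: str = ".yaml",
                      subdir: str = "scratch",
                      home: Optional[Path] = None) -> Path:
    """Create an empty scratch file under the home directory, not /tmp.

    Hypothetical helper: 'scratch' is an assumed directory name. The 'home'
    parameter exists only so the function can be exercised against a
    throwaway directory; in normal use it defaults to Path.home().
    """
    base = (home if home is not None else Path.home()) / subdir
    base.mkdir(parents=True, exist_ok=True)
    # mkstemp creates the file atomically and returns an open descriptor.
    fd, name = tempfile.mkstemp(suffix=suffix, dir=str(base))
    os.close(fd)  # the caller reopens the path when it needs to write
    return Path(name)
```

The returned path can then be written to and passed to the CLI in place of anything under /tmp.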

The six network planes (D-052 / D-053; verify against scripts/lib-net.sh)

| Plane | CIDR | Carries | Notes |
|---|---|---|---|
| provider-public | 10.12.4.0/22 | Public API VIPs + tenant FIPs (Pattern A, D-060) | gw .4.1; untagged |
| metal-admin | 10.12.8.0/22 | MAAS PXE, operator/admin endpoint, default binding | gw .8.1; DC-local |
| metal-internal | 10.12.12.0/22 | ALL service-to-service control (internal API, DB, MQ, Vault, peers) | tagged VID 103 via br-internal; no gw |
| data-tenant | 10.12.16.0/22 | Tenant Geneve overlay | no gw |
| storage | 10.12.32.0/22 | Ceph public | no gw |
| replication | 10.12.36.0/22 | Ceph cluster (OSD replication) | no gw |
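
As a sanity check on the plane table, the CIDRs can be parsed and tested for overlap with Python's standard `ipaddress` module. This is a sketch only: the values are copied from this snapshot and still need re-verifying against scripts/lib-net.sh, and `check_planes` is a hypothetical name.

```python
import ipaddress
from itertools import combinations

# Copied from the plane table in this file (snapshot values, verify before use).
PLANES = {
    "provider-public": "10.12.4.0/22",
    "metal-admin": "10.12.8.0/22",
    "metal-internal": "10.12.12.0/22",
    "data-tenant": "10.12.16.0/22",
    "storage": "10.12.32.0/22",
    "replication": "10.12.36.0/22",
}


def check_planes(planes):
    """Parse every CIDR strictly and raise ValueError if any two overlap."""
    # ip_network() is strict by default: a CIDR with host bits set raises.
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in planes.items()}
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            raise ValueError(f"{name_a} {net_a} overlaps {name_b} {net_b}")
    return nets


nets = check_planes(PLANES)
# The table's "gw .4.1" / "gw .8.1" notes are the first host of each /22.
provider_gw = nets["provider-public"].network_address + 1
admin_gw = nets["metal-admin"].network_address + 1
```
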
  • API VIPs: triple per clustered charm (provider/admin/internal), matching last octet in the .50-.60 band, 11 clustered charms (verify count live).
  • Tenant pool: 10.20.0.0/16 (hybrid model D-016 - pool in NetBox, per-project /24s Neutron-managed). Avoid collisions with capi-mgmt (10.20.0.0/24) and existing tenant /24s - list live before allocating.
  • Provider NIC rule (D-057/D-060): the provider uplink must land in OVS br-ex, never enslaved to a Linux bridge, and br-ex carries no L3 config.
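
The VIP-triple and tenant-pool rules above lend themselves to small offline checks. The sketch below assumes that the provider, admin, and internal VIPs live in provider-public, metal-admin, and metal-internal respectively (inferred from the plane table, so verify it), and both function names are hypothetical. The list of in-use tenant networks must come from the live cloud; nothing here queries it.

```python
import ipaddress

# Assumed mapping of VIP kind to plane CIDR (snapshot values, verify).
VIP_PLANES = {
    "provider": "10.12.4.0/22",
    "admin": "10.12.8.0/22",
    "internal": "10.12.12.0/22",
}
VIP_BAND = range(50, 61)  # last octets .50 through .60 inclusive


def check_vip_triple(provider, admin, internal):
    """Return the shared last octet, or raise ValueError if the triple is off."""
    given = {"provider": provider, "admin": admin, "internal": internal}
    octets = set()
    for kind, addr in given.items():
        ip = ipaddress.ip_address(addr)
        if ip not in ipaddress.ip_network(VIP_PLANES[kind]):
            raise ValueError(f"{kind} VIP {ip} is outside {VIP_PLANES[kind]}")
        octets.add(ip.packed[-1])  # last byte of the IPv4 address
    if len(octets) != 1:
        raise ValueError(f"last octets differ: {sorted(octets)}")
    octet = octets.pop()
    if octet not in VIP_BAND:
        raise ValueError(f"last octet {octet} is outside the .50-.60 band")
    return octet


def next_free_tenant_net(in_use, pool="10.20.0.0/16",
                         reserved=("10.20.0.0/24",), new_prefix=24):
    """Return the first /24 in the pool that overlaps nothing reserved or in use.

    'reserved' defaults to the capi-mgmt network named above; 'in_use' is
    whatever was listed live before allocating.
    """
    taken = [ipaddress.ip_network(c) for c in (*reserved, *in_use)]
    for candidate in ipaddress.ip_network(pool).subnets(new_prefix=new_prefix):
        if not any(candidate.overlaps(t) for t in taken):
            return candidate
    raise RuntimeError(f"no free /{new_prefix} left in {pool}")
```

For instance, with 10.20.1.0/24 and 10.20.2.0/24 already listed as in use, `next_free_tenant_net` skips the reserved capi-mgmt range and both existing networks and proposes 10.20.3.0/24.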

Repo map (what lives where)

  • bundle.yaml - the canonical bundle; VIPs/units baked in for testcloud.
  • runbooks/phase-00..08-*.md - the gated deploy sequence, in order, each ending in a hard gate. runbooks/README.md has the label conventions.
  • runbooks/appendix-A-troubleshooting.md - symptom->cause->fix index keyed by D-NNN/DOCFIX-NNN. First stop for any known-looking failure.
  • runbooks/appendix-B - version lock. appendix-C - identity/RBAC. appendix-D - Magnum trust model. ops-capi-recovery.md - CAPI/Magnum post-deploy operations.
  • docs/design-decisions.md - the D-NNN architectural record (append-only discipline; superseded entries stay, marked).
  • scripts/ - phase scripts + lib-net.sh / lib-hosts.sh (pinned values) + tenant onboarding/acceptance. tests/<script>/ - offline fakebin regression harnesses.
  • policies/domain-manager-policy.yaml + policies/overrides.zip - the SCS Domain Manager RBAC override (D-051/D-064); the zip ships IN the bundle (keystone resources, DOCFIX-071) and provider-bundle-check drift-guards it.
  • Operational tooling (2026-07 hardening set): scripts/preflight.sh (single pre-deploy gate: lint -> bundle invariants -> Charmhub channel assert -> live MAAS pre-flight), scripts/repo-lint.sh/repo_lint.py (static hygiene, L1-L6), scripts/cloud-assert.sh (behavioral verifier + --capture BOM to asbuilt/<date>/), scripts/run-logged.sh (as-executed session logger), scripts/channel_assert.py. runbooks/ops-restart-procedure.md (full-cloud restart). docs/security-ledger.md (exposure/obligation rows). logs/as-executed-index.md (committed index; log content stays jumphost-only).
  • No KVM snapshot restore path exists (D-070 superseded D-012): rebuild-from-runbooks IS the restore path; baselines come from cloud-assert --capture.

Identity / tenancy model (see appendix-C/D and D-051, D-064, D-066)

Domain-per-client. Operator provisions: domain + a domain manager (SCS Domain Manager persona - the plain admin role is NOT domain-confinable) + quotas. The tenant self-services everything inside: projects, users, roles (only member + load-balancer_member assignable - never admin/manager), app credentials, networks, templates, clusters. Magnum mints per-cluster trust app-creds carrying the trustor's roles frozen at mint time (D-039: trustor needs load-balancer_member or CAPO 403s on Octavia). Cluster create must run as a password identity, not an app-cred (trust-creation block, D-066). Every identity command is DOMAIN-QUALIFIED (--domain, --user-domain, --project-domain) - scope-default resolution silently lands in the wrong domain and 404s misleadingly.
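
To make the domain-qualification and assignable-role rules concrete, here is a minimal Python guard that builds an `openstack role add` argument list and refuses anything unqualified or outside the two assignable roles. It is an illustration, not repo tooling: `role_add_argv` is a hypothetical name, and it assumes the user and the project sit in the same client domain.

```python
# Roles a tenant may assign, per the tenancy model above.
ASSIGNABLE_ROLES = frozenset({"member", "load-balancer_member"})


def role_add_argv(role, user, project, domain):
    """Build a fully domain-qualified 'openstack role add' argv list.

    Hypothetical helper. Assumes user and project share one client domain
    (domain-per-client), so a single domain value qualifies both.
    """
    if role not in ASSIGNABLE_ROLES:
        raise ValueError(f"role {role!r} is not assignable by a tenant")
    if not domain:
        raise ValueError("domain is required; never rely on scope defaults")
    return [
        "openstack", "role", "add",
        "--user", user, "--user-domain", domain,
        "--project", project, "--project-domain", domain,
        role,
    ]
```

The list can be passed to `subprocess.run` on the jumphost; building it as a list rather than a shell string also keeps names containing spaces intact.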