---
name: openstack-cloud-ops
description: "Operate, install, extend, and troubleshoot the Omega Cloud - a commercial multi-tenant Charmed OpenStack (Caracal 2024.1) deployment managed with Juju and MAAS, with Vault TLS, OVN, Ceph, Octavia, and Magnum/CAPI tenant Kubernetes. Use this skill for ANY work touching OpenStack, Juju, MAAS, Magnum, CAPI, Ceph, OVN, Octavia, Keystone, Vault-for-OpenStack, tenant onboarding, or the openstack-caracal-ipv4 repository - including writing or reviewing bash/python operational scripts, debugging failed deploys or cluster creates, runbook work, design-decision (D-NNN) discussion, and incident triage. Use it even for seemingly simple OpenStack questions: this deployment has strict operating discipline and known charm traps that make generic answers wrong."
---

# openstack-cloud-ops

Operating skill for the Omega Cloud: a commercial, multi-tenant,
tenant-self-administered OpenStack cloud. Current phase: single-DC virtual rehearsal
("testcloud", VR0 DC0) on four KVM hosts, rehearsing a future bare-metal
multi-datacenter deployment ("Roosevelt"). The governing design constraint is
MINIMIZE DELTA TO ROOSEVELT: the runbooks and scripts are primary deliverables
alongside the running cloud, so transferable answers beat quick fixes.

## Step 0 - locate the source of truth

The repository `openstack-caracal-ipv4` (GitBucket, git.baldurkeep.com) is
authoritative for everything: bundle, runbooks, scripts, design decisions,
as-built values. This skill is a discipline-and-routing layer OVER that repo,
not a substitute for it.

1. Look for a local clone (common paths: `~/openstack-caracal-ipv4`, a repo
   dir in the working tree, `/home/claude/repo`). If found, `git log -1` to
   note HEAD and work from it.
2. No clone and you have shell + network: ask before cloning
   (`https://git.baldurkeep.com/git/OpenStack/openstack-caracal-ipv4.git`).
   The repo may be private; if the clone fails, ask the operator to provide
   access or the relevant files.
3. No clone obtainable (e.g. chat without sandbox network): say so, ask the
   operator to paste the relevant runbook/script, and proceed only on what is
   actually in front of you.
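
The clone lookup in item 1 can be sketched as a small bash helper. This is a minimal sketch, not a script from the repo: `find_repo_clone` is a hypothetical name, and the candidate paths are simply the ones listed above.

```shell
# Hypothetical helper (not part of the repo): print the first candidate
# path that is inside a git work tree, or return non-zero if none is.
find_repo_clone() {
  local candidate
  for candidate in "$@"; do
    if git -C "$candidate" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Usage (illustrative):
#   if repo=$(find_repo_clone "$HOME/openstack-caracal-ipv4" /home/claude/repo); then
#     git -C "$repo" log -1      # note HEAD, then work from this clone
#   else
#     echo "No local clone found; ask the operator before cloning." >&2
#   fi
```

The helper only reports what exists on disk. It never clones, so the ask-before-cloning rule in item 2 stays with the operator.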

**Divergence rule:** if this skill and repo HEAD disagree, the repo wins -
but FLAG the divergence to the operator rather than silently following either.
The repo is a living draft. This skill's invariants (discipline, hardening)
change slowly; its facts (IPs, versions, phase status) go stale fast.

## Step 1 - detect the environment

- **Live shell to the jumphost / infra** (Claude Code on `vopenstack-jesse` or
  similar): you may RUN read-only audits directly. Every mutation remains
  individually human-gated - present the command, state what it changes, wait
  for approval. A live shell relaxes the transport, never the discipline.
- **Chat / no infra shell**: operate the gated copy-paste model - prepare
  labeled blocks, the operator runs them and pastes output back. Never assume
  a block ran or succeeded; wait for the pasted evidence.

Read `references/operating-discipline.md` before doing either.
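
The per-mutation gate from the live-shell bullet (present the command, state what it changes, wait for approval) could look roughly like this. It is a sketch under assumptions: `gated_run` is a hypothetical name, and the conventions in `references/operating-discipline.md` take precedence over it.

```shell
# Hypothetical helper (not part of the repo): show one mutation, state its
# effect, and run it only when the operator types exactly "yes".
gated_run() {
  local effect=$1
  shift
  printf 'COMMAND:'
  printf ' %q' "$@"      # shell-quoted, so the operator sees the exact argv
  printf '\n'
  printf 'CHANGES: %s\n' "$effect"
  local answer
  read -r -p 'Type "yes" to approve: ' answer || answer=''
  if [[ $answer == yes ]]; then
    "$@"
  else
    echo 'Not approved; nothing was run.' >&2
    return 1
  fi
}

# Usage (illustrative; one mutation per call, never a batch):
#   gated_run "creates an empty marker file in /tmp" touch /tmp/example-marker
```

Anything other than an exact `yes`, including end of input, is treated as a refusal, so an unattended run cannot mutate anything.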

## The three hard operating rules (non-negotiable)

1. **Execute only the current runbook step, exactly as written.** No added
   scope, no adjacent improvements, no live re-architecture mid-step. Findings
   and improvement ideas are LOGGED (changelog / D-NNN proposal), never
   executed live mid-step.
2. **Never use an inferred value.** No IP, ID, name, or scope goes into a
   command unless it was measured this session or carried from confirmed
   as-built. If a value would be inferred: stop and measure it. Never run a
   destructive or session-altering command from memory without confirming it
   is the minimal correct action for the current live state.
3. **Prefer dynamic lookups over hardcoded literals.** Discover VIPs, project
   names, IDs, and version sets at runtime. Where a literal is unavoidable it
   is tagged and centralized (`scripts/lib-net.sh`, `lib-hosts.sh`), keyed by
   stable identity (CIDR, hostname) - never by drifting IDs.

Corollary that governs everything: **verify before mutate**. A read-only audit
precedes every mutation; destructive and secret-handling steps are gated
individually, never batched.
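
Rule 2 can be backed by a guard that stops a block when a value was never measured. A minimal sketch, assuming bash: `require_measured` is a hypothetical name, and the commented lookup shows only one possible way to measure a value at runtime.

```shell
# Hypothetical guard (not part of the repo): fail unless the named variable
# holds a non-empty value, i.e. something was actually measured.
require_measured() {
  local name=$1
  if [[ -z ${!name:-} ]]; then
    echo "STOP: $name is unset or empty; measure it this session before use." >&2
    return 1
  fi
}

# Usage (illustrative): the value comes from a runtime lookup, not from memory.
#   keystone_url=$(openstack endpoint list --service identity \
#                    --interface public -f value -c URL)
#   require_measured keystone_url || exit 1
```

The guard proves only that a value is present, not that it is correct. The read-only audit that precedes the mutation still has to confirm it against live state.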

## Routing - where to go for what

| Task | Read first |
|---|---|
| Any command block, script, or paste block you are about to write | `references/script-authoring.md` |
| Deploy / redeploy / teardown | repo `runbooks/README.md`, then the phase-NN runbook; conventions in `references/operating-discipline.md` |
| Something is broken (triage, incidents) | `references/troubleshooting.md`, then repo `runbooks/appendix-A-troubleshooting.md` |
| CAPI / Magnum / mgmt-VM recovery | repo `runbooks/ops-capi-recovery.md` |
| Deliver ANY repo change (script, runbook, doc) | run `bash scripts/repo-lint.sh` + the touched script's `tests/<name>/run-tests.sh` BEFORE handing it over |
| Pre-deploy gate (before add-model) | `bash scripts/preflight.sh` -- THE single entry; do not run the sub-gates piecemeal |
| Is the cloud actually healthy? (post-deploy, post-restart, pre-change baseline, incident) | `bash scripts/cloud-assert.sh` (add `--capture` at deploy completion for the committed BOM) |
| Full-cloud restart after outage/maintenance | repo `runbooks/ops-restart-procedure.md` |
| Starting any consequential live session | `bash scripts/run-logged.sh <label>` first (as-executed log; `docs/as-executed-log-convention.md`) |
| Credential exposures / security TODOs | repo `docs/security-ledger.md` -- add a row, never only a script comment |
| Tenant onboarding / tenant self-service | repo `scripts/tenant-onboard.sh` + `runbooks/tenant-onboarding-v2-DRAFT.md` + `appendix-C/D` |
| Network / plane / IPAM questions | `references/environment.md`, repo `scripts/lib-net.sh`, NetBox (the IPAM apex) |
| ANY change request to a built surface | grep repo `docs/design-decisions.md` for the governing D-NNN FIRST - PROPOSED/OPEN means the operator has not ruled: present options, do not implement |
| Why is it built this way? / proposing changes | repo `docs/design-decisions.md` (D-NNN); grep before assigning a new number |
| Versions, channels, pins | repo `runbooks/appendix-B-asbuilt-version-lock.md` |
| Environment facts (hosts, repo, planes) | `references/environment.md` |

## Standard loops (repeatable session shapes)

**Session bootstrap (jumphost):** `git -C ~/openstack-caracal-ipv4 pull` ->
`bash scripts/repo-lint.sh` (0 fail expected) -> if touching the live cloud,
`bash scripts/run-logged.sh <label>` to open the logged shell. Repo HEAD and a
clean lint are the preconditions for everything else.

**Change-delivery loop:** grep for prior art (zeroth decision) -> grep
design-decisions for the governing D-NNN -> edit -> `bash scripts/repo-lint.sh`
-> run the touched script's harness (create one if missing -- no script change
ships without its harness) -> deliver as repo-relative ZIP + a changelog entry
with a per-item revert. Under blanket approval, the changelog IS the review
surface: every item states what, why (evidence), and how to revert.
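
The lint-then-harness part of the change-delivery loop could be wrapped as below. This is a sketch, not a repo script: `pre_delivery_checks` is a hypothetical name, and it assumes the harness directory is passed in as the `<name>` in `tests/<name>/run-tests.sh`.

```shell
# Hypothetical wrapper (not part of the repo): run the repo lint, then the
# touched script's harness; fail if either fails or the harness is missing.
pre_delivery_checks() {
  local repo=$1 name=$2
  local harness="tests/$name/run-tests.sh"
  (
    cd "$repo" || exit 1
    bash scripts/repo-lint.sh || { echo "repo-lint failed." >&2; exit 1; }
    if [[ ! -f $harness ]]; then
      echo "No harness at $harness; create one before delivering." >&2
      exit 1
    fi
    bash "$harness"
  )
}

# Usage (illustrative; the harness name is whatever the touched script uses):
#   pre_delivery_checks ~/openstack-caracal-ipv4 <name>
```

A missing harness counts as a failure rather than a skip, matching the rule that no script change ships without one.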

**Deploy loop:** phase-00 runbook (D-061 destroy path) -> `bash
scripts/preflight.sh` PASS -> phase-01..08 gated -> `bash
scripts/cloud-assert.sh --capture` -> commit the asbuilt/ BOM.

**Incident loop:** capture the verbatim error -> `bash scripts/cloud-assert.sh`
(the service-own-verdict sweep localizes the layer) -> appendix-A by exact
message -> recorded fix, gated -> log the finding (new root causes become
appendix-A/DOCFIX material).
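
The "appendix-A by exact message" step can be done with a fixed-string search, so brackets, dots, and parentheses in the captured error are matched literally. A minimal sketch: `find_recorded_fix` is a hypothetical name, and the runbook path in the usage line is the one from the routing table.

```shell
# Hypothetical helper (not part of the repo): search a runbook for the
# verbatim error text. -F treats the message as a literal string, -n prints
# line numbers, and -- guards against messages that start with a dash.
find_recorded_fix() {
  local message=$1 runbook=$2
  grep -n -F -- "$message" "$runbook"
}

# Usage (illustrative; paste the captured error verbatim):
#   find_recorded_fix "$captured_error" runbooks/appendix-A-troubleshooting.md
```

A non-zero exit means the message is not recorded yet, which is the cue to log it as a new finding.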

## Posture

- This is a commercial multi-tenant cloud with HARD tenant isolation (SCS
  Domain Manager persona). Treat tenant-visible surfaces and cross-domain
  boundaries as security-relevant in every change.
- The operator community here values debate and industry best practice over
  quick fixes. Push back with sources when you disagree; own mistakes plainly
  and concisely. Fabricated flags, values, or version numbers are the cardinal
  sin - if you have not verified an option name or version, say so and verify.
- Responses stay concise. Decisions get explicit rationale.
