
name: openstack-cloud-ops

description: "Operate, install, extend, and troubleshoot the Omega Cloud - a commercial multi-tenant Charmed OpenStack (Caracal 2024.1) deployment managed with Juju and MAAS, with Vault TLS, OVN, Ceph, Octavia, and Magnum/CAPI tenant Kubernetes. Use this skill for ANY work touching OpenStack, Juju, MAAS, Magnum, CAPI, Ceph, OVN, Octavia, Keystone, Vault-for-OpenStack, tenant onboarding, or the openstack-caracal-ipv4 repository - including writing or reviewing bash/python operational scripts, debugging failed deploys or cluster creates, runbook work, design-decision (D-NNN) discussion, and incident triage. Use it even for seemingly simple OpenStack questions: this deployment has strict operating discipline and known charm traps that make generic answers wrong."

openstack-cloud-ops

Operating skill for the Omega Cloud: a commercial, multi-tenant, tenant self-administered OpenStack cloud. Current phase: single-DC virtual rehearsal ("testcloud", VR0 DC0) on four KVM hosts, rehearsing a future bare-metal multi-datacenter deployment ("Roosevelt"). The governing design constraint is MINIMIZE DELTA TO ROOSEVELT: the runbooks and scripts are primary deliverables alongside the running cloud, so transferable answers beat quick fixes.

Step 0 - locate the source of truth

The repository openstack-caracal-ipv4 (GitBucket, git.baldurkeep.com) is authoritative for everything: bundle, runbooks, scripts, design decisions, as-built values. This skill is a discipline-and-routing layer OVER that repo, not a substitute for it.

  1. Look for a local clone (common paths: ~/openstack-caracal-ipv4, a repo dir in the working tree, /home/claude/repo). If found, run git log -1 to note HEAD, then work from that clone.
  2. No clone and you have shell + network: ask before cloning (https://git.baldurkeep.com/git/OpenStack/openstack-caracal-ipv4.git). The repo may be private; if the clone fails, ask the operator to provide access or the relevant files.
  3. No clone obtainable (e.g. chat without sandbox network): say so, ask the operator to paste the relevant runbook/script, and proceed only on what is actually in front of you.
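The lookup in item 1 can be sketched as follows. This is a minimal illustration, not a script from the repo: find_clone is a hypothetical helper, and the candidate paths are only the common locations listed above. Note that it accepts any git work tree, so confirm the remote before trusting the result.

```shell
#!/usr/bin/env bash
# Sketch: find a local clone among candidate directories.
# find_clone is a hypothetical helper; it only proves "this is a git
# work tree", not "this is openstack-caracal-ipv4".

find_clone() {
  local dir
  for dir in "$@"; do
    # Accept the directory only if git itself confirms a work tree.
    if git -C "$dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1   # nothing found: ask the operator before cloning
}

# Usage (paths are the common locations named in item 1):
#   clone=$(find_clone "$HOME/openstack-caracal-ipv4" "$PWD" /home/claude/repo) &&
#     git -C "$clone" log -1 --format='HEAD %h %cI %s'
```

A non-zero return maps to items 2 and 3: ask before cloning, or ask for the files to be pasted.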

Divergence rule: if this skill and repo HEAD disagree, the repo wins - but FLAG the divergence to the operator rather than silently following either. The repo is a living draft. This skill's invariants (discipline, hardening) change slowly; its facts (IPs, versions, phase status) go stale fast.

Step 1 - detect the environment

  • Live shell to the jumphost / infra (Claude Code on vopenstack-jesse or similar): you may RUN read-only audits directly. Every mutation remains individually human-gated - present the command, state what it changes, wait for approval. A live shell relaxes the transport, never the discipline.
  • Chat / no infra shell: operate the gated copy-paste model - prepare labeled blocks, the operator runs them and pastes output back. Never assume a block ran or succeeded; wait for the pasted evidence.

Read references/operating-discipline.md before doing either.

The three hard operating rules (non-negotiable)

  1. Execute only the current runbook step, exactly as written. No added scope, no adjacent improvements, no live re-architecture mid-step. Findings and improvement ideas are LOGGED (changelog / D-NNN proposal), never executed live mid-step.
  2. Never use an inferred value. No IP, ID, name, or scope goes into a command unless it was measured this session or carried from confirmed as-built. If a value would be inferred: stop and measure it. Never run a destructive or session-altering command from memory without confirming it is the minimal correct action for the current live state.
  3. Prefer dynamic lookups over hardcoded literals. Discover VIPs, project names, IDs, and version sets at runtime. Where a literal is unavoidable it is tagged and centralized (scripts/lib-net.sh, lib-hosts.sh), keyed by stable identity (CIDR, hostname) - never by drifting IDs.

Corollary that governs everything: verify before mutate. A read-only audit precedes every mutation; destructive and secret-handling steps are gated individually, never batched.
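One way to picture the corollary is a small wrapper that runs a read-only audit, prints the exact mutation, and proceeds only on typed approval. gated_run is a hypothetical helper written for this illustration; the repo may implement gating differently.

```shell
#!/usr/bin/env bash
# Sketch of "verify before mutate". gated_run is hypothetical.
# Arg 1 is a read-only audit command line (a string); the remaining
# args are the single mutation to run if the operator approves.

gated_run() {
  local audit=$1 answer
  shift
  echo "AUDIT : $audit"
  # The audit must succeed before the mutation is even offered.
  bash -c "$audit" </dev/null || { echo "audit failed; not mutating" >&2; return 2; }

  echo "MUTATE: $*"
  printf '%s' "Type 'approve' to run exactly this command: " >&2
  read -r answer
  if [ "$answer" != "approve" ]; then
    echo "not approved; nothing changed" >&2
    return 1
  fi
  "$@"   # one mutation per gate, never batched
}
```

Passing the mutation as separate arguments, rather than as a string, means what is printed is what runs, with no second round of shell expansion.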

Routing - where to go for what

| Task | Read first |
|---|---|
| Any command block, script, or paste block you are about to write | references/script-authoring.md |
| Deploy / redeploy / teardown | repo runbooks/README.md, then the phase-NN runbook; conventions in references/operating-discipline.md |
| Something is broken (triage, incidents) | references/troubleshooting.md, then repo runbooks/appendix-A-troubleshooting.md |
| CAPI / Magnum / mgmt-VM recovery | repo runbooks/ops-capi-recovery.md |
| Deliver ANY repo change (script, runbook, doc) | run bash scripts/repo-lint.sh + the touched script's tests/<name>/run-tests.sh BEFORE handing it over |
| Pre-deploy gate (before add-model) | bash scripts/preflight.sh -- THE single entry; do not run the sub-gates piecemeal |
| Is the cloud actually healthy? (post-deploy, post-restart, pre-change baseline, incident) | bash scripts/cloud-assert.sh (add --capture at deploy completion for the committed BOM) |
| Full-cloud restart after outage/maintenance | repo runbooks/ops-restart-procedure.md |
| Starting any consequential live session | bash scripts/run-logged.sh <label> first (as-executed log; docs/as-executed-log-convention.md) |
| Credential exposures / security TODOs | repo docs/security-ledger.md -- add a row, never only a script comment |
| Tenant onboarding / tenant self-service | repo scripts/tenant-onboard.sh + runbooks/tenant-onboarding-v2-DRAFT.md + appendix-C/D |
| Network / plane / IPAM questions | references/environment.md, repo scripts/lib-net.sh, NetBox (the IPAM apex) |
| ANY change request to a built surface | grep repo docs/design-decisions.md for the governing D-NNN FIRST - PROPOSED/OPEN means the operator has not ruled: present options, do not implement |
| Why is it built this way? / proposing changes | repo docs/design-decisions.md (D-NNN); grep before assigning a new number |
| Versions, channels, pins | repo runbooks/appendix-B-asbuilt-version-lock.md |
| Environment facts (hosts, repo, planes) | references/environment.md |
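
The two design-decision rows in the routing table both start with a grep. A minimal sketch, assuming decision IDs appear literally as D-NNN (for example D-061) in docs/design-decisions.md and that the status word sits on or near the matching line; both function names are hypothetical, and the real file layout should be checked before relying on the output.

```shell
#!/usr/bin/env bash
# Sketch: inspect a decision and find the highest decision number in use.
# decision_lines and highest_decision are hypothetical helpers.

decision_lines() {    # decision_lines <file> <D-NNN>
  # -F: literal match; -w: whole word, so D-06 does not match D-061.
  grep -n -w -F -- "$2" "$1"
}

highest_decision() {  # highest_decision <file>
  # Extract every D-<digits> token, sort numerically on the digits,
  # and keep the largest.
  grep -o -E 'D-[0-9]+' "$1" | sort -t '-' -k2,2n | tail -n 1
}

# Usage, from the repo root:
#   decision_lines docs/design-decisions.md D-061
#   highest_decision docs/design-decisions.md
```

If the lines returned for the governing decision show PROPOSED or OPEN, the operator has not ruled: present options and stop.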

Standard loops (repeatable session shapes)

Session bootstrap (jumphost): git -C ~/openstack-caracal-ipv4 pull -> bash scripts/repo-lint.sh (0 fail expected) -> if touching the live cloud, bash scripts/run-logged.sh <label> to open the logged shell. Repo HEAD and a clean lint are the preconditions for everything else.
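
The bootstrap order can be sketched as a function that stops at the first failed precondition. session_bootstrap is a hypothetical wrapper; the script names are the ones quoted above, and running them from the repo root is an assumption.

```shell
#!/usr/bin/env bash
# Sketch of the bootstrap order: pull, lint, then the logged shell.
# session_bootstrap is hypothetical. Pass a label only when the
# session will touch the live cloud.

session_bootstrap() {  # session_bootstrap <repo-dir> [label]
  local repo=$1 label=${2:-}
  git -C "$repo" pull || { echo "pull failed; stop" >&2; return 1; }
  # A clean lint is a precondition for everything that follows.
  ( cd "$repo" && bash scripts/repo-lint.sh ) \
    || { echo "lint not clean; stop" >&2; return 1; }
  if [ -n "$label" ]; then
    ( cd "$repo" && bash scripts/run-logged.sh "$label" )
  fi
}

# Usage:
#   session_bootstrap "$HOME/openstack-caracal-ipv4"            # repo work only
#   session_bootstrap "$HOME/openstack-caracal-ipv4" my-label   # live session
```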

Change-delivery loop: grep for prior art (zeroth decision) -> grep design-decisions for the governing D-NNN -> edit -> bash scripts/repo-lint.sh -> run the touched script's harness (create one if missing -- no script change ships without its harness) -> deliver as repo-relative ZIP + a changelog entry with a per-item revert. Under blanket approval, the changelog IS the review surface: every item states what, why (evidence), and how to revert.
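
The lint-plus-harness step of this loop might look like the sketch below. pre_delivery_check is hypothetical; mapping scripts/foo.sh to tests/foo/run-tests.sh follows the tests/<name>/run-tests.sh pattern named in the routing table, but the exact naming rule and the working directory the harness expects are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of the pre-delivery check for one touched script.
# pre_delivery_check is hypothetical.

pre_delivery_check() {  # pre_delivery_check <repo-dir> <scripts/foo.sh>
  local repo=$1 script=$2 name harness
  name=$(basename "$script" .sh)
  harness="tests/$name/run-tests.sh"
  if [ ! -f "$repo/$harness" ]; then
    # No harness means the change is not ready to ship.
    echo "missing harness $harness: create it before shipping" >&2
    return 3
  fi
  # Lint first, then the harness; either failure blocks delivery.
  ( cd "$repo" && bash scripts/repo-lint.sh && bash "$harness" )
}

# Usage:
#   pre_delivery_check "$HOME/openstack-caracal-ipv4" scripts/tenant-onboard.sh
```

Returning a distinct code for the missing-harness case keeps "no tests exist" visibly separate from "tests failed".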

Deploy loop: phase-00 runbook (D-061 destroy path) -> bash scripts/preflight.sh PASS -> phase-01..08 gated -> bash scripts/cloud-assert.sh --capture -> commit the asbuilt/ BOM.

Incident loop: capture the verbatim error -> bash scripts/cloud-assert.sh (the service-own-verdict sweep localizes the layer) -> appendix-A by exact message -> recorded fix, gated -> log the finding (new root causes become appendix-A/DOCFIX material).
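
Searching appendix-A "by exact message" is easiest with a fixed-string grep, since real error text often contains characters that a regex would interpret. A minimal sketch; lookup_error is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: search the troubleshooting appendix for the verbatim error.
# lookup_error is hypothetical.

lookup_error() {  # lookup_error <appendix-file> <verbatim message>
  # -F treats the message as a literal string, so brackets, dots, and
  # asterisks in the error text are not read as regex syntax.
  # -- guards against messages that begin with a dash.
  grep -n -F -- "$2" "$1"
}

# Usage, from the repo root, with the error pasted verbatim:
#   lookup_error runbooks/appendix-A-troubleshooting.md "$captured_error"
```

No match (exit status 1) suggests a new root cause, which the loop above says to log as appendix-A/DOCFIX material.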

Posture

  • This is a commercial multi-tenant cloud with HARD tenant isolation (SCS Domain Manager persona). Treat tenant-visible surfaces and cross-domain boundaries as security-relevant in every change.
  • The operator community here values debate and industry best practice over quick fixes. Push back with sources when you disagree; own mistakes plainly and concisely. Fabricated flags, values, or version numbers are the cardinal sin - if you have not verified an option name or version, say so and verify.
  • Responses stay concise. Decisions get explicit rationale.