diff --git a/skills/openstack-cloud-ops/SKILL.md b/skills/openstack-cloud-ops/SKILL.md new file mode 100644 index 0000000..c7c78ec --- /dev/null +++ b/skills/openstack-cloud-ops/SKILL.md @@ -0,0 +1,123 @@ +--- +name: openstack-cloud-ops +description: "Operate, install, extend, and troubleshoot the Omega Cloud - a commercial multi-tenant Charmed OpenStack (Caracal 2024.1) deployment managed with Juju and MAAS, with Vault TLS, OVN, Ceph, Octavia, and Magnum/CAPI tenant Kubernetes. Use this skill for ANY work touching OpenStack, Juju, MAAS, Magnum, CAPI, Ceph, OVN, Octavia, Keystone, Vault-for-OpenStack, tenant onboarding, or the openstack-caracal-ipv4 repository - including writing or reviewing bash/python operational scripts, debugging failed deploys or cluster creates, runbook work, design-decision (D-NNN) discussion, and incident triage. Use it even for seemingly simple OpenStack questions: this deployment has strict operating discipline and known charm traps that make generic answers wrong." +--- + +# openstack-cloud-ops + +Operating skill for the Omega Cloud: a commercial, multi-tenant, tenant +self-administered OpenStack cloud. Current phase: single-DC virtual rehearsal +("testcloud", VR0 DC0) on four KVM hosts, rehearsing a future bare-metal +multi-datacenter deployment ("Roosevelt"). The governing design constraint is +MINIMIZE DELTA TO ROOSEVELT: the runbooks and scripts are primary deliverables +alongside the running cloud, so transferable answers beat quick fixes. + +## Step 0 - locate the source of truth + +The repository `openstack-caracal-ipv4` (GitBucket, git.baldurkeep.com) is +authoritative for everything: bundle, runbooks, scripts, design decisions, +as-built values. This skill is a discipline-and-routing layer OVER that repo, +not a substitute for it. + +1. Look for a local clone (common paths: `~/openstack-caracal-ipv4`, a repo + dir in the working tree, `/home/claude/repo`). If found, `git log -1` to + note HEAD and work from it. +2. No clone and you have shell + network: ask before cloning + (`https://git.baldurkeep.com/git/OpenStack/openstack-caracal-ipv4.git`). + The repo may be private; if the clone fails, ask the operator to provide + access or the relevant files. +3. No clone obtainable (e.g. chat without sandbox network): say so, ask the + operator to paste the relevant runbook/script, and proceed only on what is + actually in front of you. + +**Divergence rule:** if this skill and repo HEAD disagree, the repo wins - +but FLAG the divergence to the operator rather than silently following either. +The repo is a living draft; this skill's invariants (discipline, hardening) +change slowly, its facts (IPs, versions, phase status) go stale fast. + +## Step 1 - detect the environment + +- **Live shell to the jumphost / infra** (Claude Code on `vopenstack-jesse` or + similar): you may RUN read-only audits directly. Every mutation remains + individually human-gated - present the command, state what it changes, wait + for approval. A live shell relaxes the transport, never the discipline. +- **Chat / no infra shell**: operate the gated copy-paste model - prepare + labeled blocks, the operator runs them and pastes output back. Never assume + a block ran or succeeded; wait for the pasted evidence. + +Read `references/operating-discipline.md` before doing either. + +## The three hard operating rules (non-negotiable) + +1. **Execute only the current runbook step, exactly as written.** No added + scope, no adjacent improvements, no live re-architecture mid-step. Findings + and improvement ideas are LOGGED (changelog / D-NNN proposal), never + executed live mid-step. +2. **Never use an inferred value.** No IP, ID, name, or scope goes into a + command unless it was measured this session or carried from confirmed + as-built. If a value would be inferred: stop and measure it. Never run a + destructive or session-altering command from memory without confirming it + is the minimal correct action for the current live state. +3. **Prefer dynamic lookups over hardcoded literals.** Discover VIPs, project + names, IDs, and version sets at runtime. Where a literal is unavoidable it + is tagged and centralized (`scripts/lib-net.sh`, `lib-hosts.sh`), keyed by + stable identity (CIDR, hostname) - never by drifting IDs. + +Corollary that governs everything: **verify before mutate**. A read-only audit +precedes every mutation; destructive and secret-handling steps are gated +individually, never batched. + +## Routing - where to go for what + +| Task | Read first | +|---|---| +| Any command block, script, or paste block you are about to write | `references/script-authoring.md` | +| Deploy / redeploy / teardown | repo `runbooks/README.md`, then the phase-NN runbook; conventions in `references/operating-discipline.md` | +| Something is broken (triage, incidents) | `references/troubleshooting.md`, then repo `runbooks/appendix-A-troubleshooting.md` | +| CAPI / Magnum / mgmt-VM recovery | repo `runbooks/ops-capi-recovery.md` | +| Deliver ANY repo change (script, runbook, doc) | run `bash scripts/repo-lint.sh` + the touched script's `tests//run-tests.sh` BEFORE handing it over | +| Pre-deploy gate (before add-model) | `bash scripts/preflight.sh` -- THE single entry; do not run the sub-gates piecemeal | +| Is the cloud actually healthy? (post-deploy, post-restart, pre-change baseline, incident) | `bash scripts/cloud-assert.sh` (add `--capture` at deploy completion for the committed BOM) | +| Full-cloud restart after outage/maintenance | repo `runbooks/ops-restart-procedure.md` | +| Starting any consequential live session | `bash scripts/run-logged.sh