diff --git a/docs/design-decisions.md b/docs/design-decisions.md
index 995e473..5cef1c9 100644
--- a/docs/design-decisions.md
+++ b/docs/design-decisions.md
@@ -573,3 +573,56 @@
 | 2026-06-08 | D-034 (CAPI constellation pinned to dependencies.json; supersedes D-022), D-035 (in-cloud single-homed mgmt VM; supersedes D-033), D-036 (driver/chart/CAPO coherence resolved), D-037 ([capi_helm] via /etc/default DAEMON_ARGS) added. | In-cloud mgmt pivot |
 | 2026-06-09 | D-040 (reserved-host-memory 8192), D-041 (non-HA manual-start policy), D-042 (driver<->core contract coherence; 1.4.0 pin) added. | OOM incident + driver fix |
 | 2026-06-09 | D-019..D-042 consolidated into this document (15 decisions). Existing D-001..D-018 left intact (em-dash style preserved); the new entries are ASCII. | Repo sanitation / doc refresh |
+
+<!-- patchset-20260610-decisions-addendum -->
+
+---
+
+## D-042 -- AMENDMENT (2026-06-10): mechanism evidence + signature taxonomy
+
+Evidence gathered during the 2026-06-10 recovery session strengthens and
+sharpens D-042:
+
+1. Steady-state cosmetic signature, verbatim: health_status=UNHEALTHY with
+   reason {'cluster': 'Ready', 'infrastructure': 'Infrastructure resource not
+   found.', 'controlplane': 'Ready', 'nodegroup': 'Ready'}. "Resource NOT
+   FOUND" (a lookup/contract miss) rather than "resource unhealthy"
+   corroborates the driver-vs-CAPO API-contract coherence diagnosis.
+2. The driver health maps to the cluster-level v1beta2 Available aggregation.
+   This session also observed a REAL Available=False (InfrastructureReady
+   failed: kube-api LB reconcile timeout during a mgmt-VM cold start, LB left
+   provisioning=ERROR). Two distinct mechanisms therefore share the UNHEALTHY
+   presentation -- the operator cannot distinguish real from cosmetic without
+   reading the reason field -- which strengthens the case for the staged
+   forward-fix (magnum-capi-helm-driver-fix runbook).
+3. Signature taxonomy for operators: reason EMPTY = conductor cannot reach the
+   mgmt API (VM parked/down; NOT D-042); reason populated all-'Ready' with
+   infrastructure 'Infrastructure resource not found.' = D-042 cosmetic;
+   reason citing an LB/infrastructure failure = real, check Octavia first
+   (ops-capi-recovery Step 4/5).
+
+## D-043 -- PROPOSED: tenant-VM auto-resume policy (resume-guests-state-on-host-boot)
+
+STATUS: PROPOSED (decision pending). Recorded 2026-06-10.
+
+QUESTION: should nova-compute `resume-guests-state-on-host-boot=true` be set,
+so tenant VMs (including the in-cloud CAPI management VM, D-035) return
+automatically after host reboots?
+
+TENSION: D-041 ("a non-HA component staying down is a signal to investigate,
+not a nuisance to auto-mask") was written for control-plane charm services.
+Tenant VMs are a different class: on Roosevelt, customers will expect their
+VMs back after host maintenance, and auto-resume is the industry norm for
+that class. Cost of the manual policy observed 2026-06-10: a deliberately
+parked mgmt VM was mistaken for an outage and consumed roughly two hours of
+diagnosis before the stop was traced to an API action.
+
+OPTIONS:
+  (a) Enable auto-resume cloud-wide + monitoring/alerting on VM state, so
+      "down" remains a signal without being an outage. RECOMMENDED for
+      Roosevelt; candidate for v1 redeploy as well.
+  (b) Keep manual start for v1 (preserves D-041 discipline on the testcloud),
+      explicitly record auto-resume as the Roosevelt setting.
+Note: the restart procedure's failure-mode table already references the config
+key for SHUTOFF guests; whichever option is chosen, align that table, this
+decision, and the bundle/runbook with each other.
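
Note (outside the hunk above, not part of docs/design-decisions.md): the signature taxonomy in item 3 of the D-042 amendment can be sketched as a small triage helper. This is a minimal sketch, not an existing tool. The function names `contains` and `classify_magnum_health` and the four output labels are hypothetical. It assumes the reason field prints in the single-quoted form quoted verbatim in item 1, and that an EMPTY reason shows up as an empty string, `{}`, or `None`.

```shell
# contains HAYSTACK NEEDLE: succeed when NEEDLE is a literal substring.
contains() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# classify_magnum_health STATUS REASON
#   STATUS = health_status, REASON = health_status_reason (as printed).
# Output labels are hypothetical names for the three signatures in item 3.
classify_magnum_health() {
  if [ "$1" != "UNHEALTHY" ]; then
    echo "not-unhealthy"
  # ASSUMPTION: an EMPTY reason prints as "", {} or None.
  elif [ -z "$2" ] || [ "$2" = "{}" ] || [ "$2" = "None" ]; then
    echo "mgmt-api-unreachable"          # VM parked/down; NOT D-042
  # D-042 cosmetic: infrastructure lookup miss, other three components Ready.
  elif contains "$2" "'infrastructure': 'Infrastructure resource not found.'" &&
       contains "$2" "'cluster': 'Ready'" &&
       contains "$2" "'controlplane': 'Ready'" &&
       contains "$2" "'nodegroup': 'Ready'"; then
    echo "d042-cosmetic"
  else
    echo "possibly-real-check-octavia"   # conservative default
  fi
}
```

It could be fed from the two `openstack coe cluster show capi-test-1 -f value -c ...` calls in ops-capi-recovery Section 6. Any reason that is neither empty nor the exact cosmetic signature lands in the conservative bucket, so an unrecognized format is never reported as cosmetic.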
diff --git a/runbooks/README.md b/runbooks/README.md
index 2195bde..89f4cd9 100644
--- a/runbooks/README.md
+++ b/runbooks/README.md
@@ -45,3 +45,5 @@
 This `phase-NN` set supersedes the earlier `v1-do-doc-NN-*` execution documents (and the
 older `NN-*.md` set and the `deprecated/` folder), which were removed in the repo
 sanitation sweep. Git history preserves them.
+
+- ops-capi-recovery.md -- parking, restart, and LB repair for the CAPI/Magnum stack (post-deploy operations companion; not a deploy phase). Added 2026-06-10.
diff --git a/runbooks/appendix-A-troubleshooting.md b/runbooks/appendix-A-troubleshooting.md
index 3b3a9ee..6638d6c 100644
--- a/runbooks/appendix-A-troubleshooting.md
+++ b/runbooks/appendix-A-troubleshooting.md
@@ -364,3 +364,66 @@
   with a "phase NN" back-reference, and decision rationale left to design-decisions.md.
 - memcached track drift is recorded in appendix-B (B.1), not here (it is a
   version-lock note, not a troubleshooting entry).
+
+<!-- patchset-20260610-appendix-addendum -->
+
+---
+
+## Addendum 2026-06-10 -- CAPI/Magnum operations findings
+
+Five entries from the 2026-06-10 recovery session. Full procedures with
+verified blocks: runbooks/ops-capi-recovery.md.
+
+### Parked-state signatures (mgmt VM deliberately stopped)
+While capi-mgmt-v2 is stopped: Magnum reports UNHEALTHY with an EMPTY
+health_status_reason (distinct from the D-042 cosmetic, which has a populated
+reason); the Horizon Container Infra panel may 504 through the jumphost nginx
+proxy and `coe` CLI calls may stall; the workload cluster keeps serving (no
+runtime dependency on the mgmt cluster). If jumphost secrets were filed during
+parking, the convention is ~/sweep-YYYYMMDD/secrets/. See ops-capi-recovery
+Section 0 (expectations table) and Section 1 (parking block).
+
+### Amphora orphan/zombie sweep after host-pressure events
+Causal chain (traced live 2026-06-10): host CPU/memory pressure -> amphora
+heartbeats go stale -> Octavia health-manager marks amphorae ERROR and launches
+auto-failovers -> failovers fail NoValidHost (no placement headroom) -> amphora
+servers accumulate with NO Octavia DB row. Two variants: an ERROR server
+(failed spawn) and an ACTIVE heartbeating zombie (health-manager logs "missing
+from the DB ... An operator must manually delete it" every 10 s). Remedy:
+verify-then-delete by SERVER UUID under admin scope -- the
+`loadbalancer amphora list` output is the DB truth; Nova name lookup is
+project-scoped (amphorae live in the Octavia services project). Procedure:
+ops-capi-recovery 5a. Do NOT retry failover against the same blocker; each
+attempt mints another zombie.
+
+### Octavia failover requires +1 amphora placement headroom
+STANDALONE failover builds the replacement amphora BEFORE reaping the old one,
+so it transiently needs one extra amphora slot (charm-octavia: 1024 MB / 1 vCPU
+/ 8 GB). Scheduler ceiling per host =
+`physical_MB * ram_allocation_ratio (1.5) - reserved_host_memory (8192 per D-040)`.
+A cloud allocated to that ceiling cannot heal its own load balancers: the
+failover fast-fails to ERROR in ~15 seconds on NoValidHost. Verified to the megabyte 2026-06-10. Roosevelt
+sizing requirement: reserve at least one amphora slot per concurrent failover
+on top of workload allocation (feeds the node-role/rebalancing recommendation).
+
+### juju ssh `</dev/null` vs an expired macaroon (DOCFIX-021 interaction)
+DOCFIX-021's `</dev/null` on juju ssh assumes valid macaroon auth. When the
+jumphost macaroon goes stale, juju falls back to an interactive password
+prompt; `</dev/null` feeds that prompt EOF and the symptom is the misleading
+"cannot get discharge from https://<controller>:17070/auth: EOF". Triage: run
+`juju status` interactively -- if it succeeds after a password prompt, the
+controller is healthy and only the credential cache is stale. Workaround for
+the session: stdin from `</dev/tty`. Fix at a calm moment: `juju logout` then
+`juju login`.
+
+### Horizon visibility of CAPI instances, LBs, and amphorae
+CAPI/Magnum VMs are owned by the capi-mgmt project; an empty Project ->
+Compute -> Instances page under admin scope is correct, not a defect. Map:
+tenant VMs -> Instances in the OWNING project's scope (use the header project
+switcher; admin holds member on capi-mgmt per phase-06 6.0-BOOT); LB objects ->
+Project -> Network -> Load Balancers in the owning project's scope; amphora
+VMs -> Admin -> Compute -> Instances ONLY (they belong to the Octavia services
+project); everything at once -> CLI `openstack server list --all-projects`.
+Warning about the asymmetry: the Container Infra panel lists clusters
+cross-project under admin policy, which makes the strictly-scoped Nova panel
+look broken when it is not.
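
Note (outside the hunk above, not part of appendix-A): the headroom rule from the failover entry can be worked through with integer arithmetic. This is a minimal sketch that applies the ceiling formula exactly as the entry states it. The function names are hypothetical, and the 65536 MB host in the usage lines is an invented example, not a measured value from this cloud. Which "used" figure to feed in depends on how the deployment reports usage; the sketch takes it as given.

```shell
# Scheduler ceiling per host, using the formula as stated in the entry:
#   physical_MB * ram_allocation_ratio - reserved_host_memory
# The 1.5 ratio is expressed as 3/2 so the shell can stay in integers;
# 8192 is the D-040 reserve.
scheduler_ceiling_mb() {
  # $1 = physical MB of the host
  echo $(( $1 * 3 / 2 - 8192 ))
}

# Succeeds when the host can take one more amphora (1024 MB per the
# charm-octavia flavor quoted in the entry).
# $1 = physical MB, $2 = MB already used on that host
has_amphora_headroom() {
  [ $(( $2 + 1024 )) -le "$(scheduler_ceiling_mb "$1")" ]
}

# Hypothetical 65536 MB host:
#   scheduler_ceiling_mb 65536          prints 90112
#   has_amphora_headroom 65536 89088    succeeds (exactly at the ceiling)
#   has_amphora_headroom 65536 89089    fails (one MB over)
```

A host that fails this check for every candidate is the NoValidHost case described above: free a slot first instead of retrying the failover.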
diff --git a/runbooks/ops-capi-recovery.md b/runbooks/ops-capi-recovery.md
new file mode 100644
index 0000000..01e6886
--- /dev/null
+++ b/runbooks/ops-capi-recovery.md
@@ -0,0 +1,241 @@
+# v1 ops -- CAPI/Magnum stack recovery procedure (parking, restart, LB repair)
+
+Status: blocks below are AS-EXECUTED-VERIFIED 2026-06-10 (this is their first
+formal consolidation). Destination: runbooks/ as an ops companion to the
+phase-NN deploy runbook, cross-referenced from appendix-A and from
+OpenStack_Test_Deployment-restart-procedure.md.
+
+Applies when: capi-mgmt-v2 has been stopped (parking, host event, OOM) and the
+CAPI/Magnum stack must be returned to service. ORDER MATTERS: repair from the
+bottom up (VM -> k8s -> CAPI controllers -> Octavia LB -> CAPO conditions ->
+Magnum health). Everything upstream stays red until the layer below is green.
+
+Scope-hygiene preambles are the canonical ones from the 2026-06-09 as-executed
+log. ENV literals: project capi-mgmt 674171fd28d446d3a37073b6a761e910; mgmt FIP
+10.12.7.40; kube-api LB 0f968008-...; regenerate per site on rebuild.
+
+---
+
+## 0. Expectations table (read FIRST; saves an hour of false alarms)
+
+| Observation | Meaning |
+|---|---|
+| Magnum UNHEALTHY, reason EMPTY | Conductor cannot reach the mgmt API (VM down / booting). Not D-042. |
+| Magnum UNHEALTHY, reason populated, all components 'Ready', infrastructure 'Infrastructure resource not found.' | D-042 cosmetic false-negative. Known good. |
+| Horizon Container Infra 504 right after mgmt VM start | Conductor stalled mid-reconnect; nginx proxy timeout. Retry after Step 3. |
+| k8sd control.socket deadline / apiserver TLS handshake timeout / mount failures during first ~20 min after boot | Cold-start convergence noise on gp.mid (2 vCPU). Judge by load trend + `k8s status`, not by these. |
+| Cluster Available=False with InfrastructureReady LB-timeout message after a cold start | CAPO reconcile raced the storm. Check the LB (Step 4) BEFORE blaming CAPI. |
+| LB provisioning ERROR, operating ONLINE | Control-plane op failed; dataplane fine. Needs admin failover (Step 5). No urgency. |
+| Instances list empty (Horizon Instances page or `openstack server list`) | Wrong project scope. CAPI VMs live in capi-mgmt. |
+| juju ssh: "cannot get discharge ... EOF" | Stale macaroon + `</dev/null` ate the password prompt. Use `</dev/tty` or re-login. NOT a controller outage if `juju status` works interactively. |
+
+## 1. Parking (deliberate stop) -- forward procedure
+
+```
+------------------------------------------------------------------------
+BEGIN runbook block: capi-mgmt parking (pre-maintenance / pre-teardown)
+------------------------------------------------------------------------
+# capi-mgmt scope
+source ~/admin-openrc
+unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID
+export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910
+unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID
+openstack server stop capi-mgmt-v2
+# NOTE: Nova ACPI stop does NOT produce a clean guest shutdown on this VM
+# (no wtmp shutdown entry; verified 2026-06-10). Accepted for this VM class.
+# If filing jumphost secrets, record the destination IN THIS LOG, e.g.:
+#   ~/sweep-YYYYMMDD/secrets/{capi-mgmt.kubeconfig, capi-test-1-kc/config}
+# EXPECT while parked: Magnum UNHEALTHY with EMPTY reason; Container Infra
+# panel may 504; workload cluster keeps running (no runtime dependency).
+------------------------------------------------------------------------
+END runbook block
+------------------------------------------------------------------------
+```
+
+## 2. Start + boot gate
+
+```
+------------------------------------------------------------------------
+BEGIN runbook block: capi-mgmt-v2 start + ssh-port gate (D-041 manual start)
+------------------------------------------------------------------------
+( {
+  source ~/admin-openrc
+  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID
+  export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910
+  unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID
+  openstack server start capi-mgmt-v2
+  for i in $(seq 1 20); do
+    ST=$(openstack server show capi-mgmt-v2 -f value -c status 2>/dev/null)
+    echo "[$i] status=$ST"
+    [ "$ST" = ACTIVE ] && break
+    sleep 10
+  done
+  echo "=== TCP probe loop: FIP :22 (sshd lags ACTIVE by ~3 min) ==="
+  for i in $(seq 1 18); do
+    timeout 5 bash -c 'exec 3<>/dev/tcp/10.12.7.40/22' 2>/dev/null \
+      && { echo "[$i] SSH-PORT-OK"; break; } || echo "[$i] not yet"
+    sleep 10
+  done
+} )
+------------------------------------------------------------------------
+END runbook block
+------------------------------------------------------------------------
+```
+GATE: SSH-PORT-OK. Timing (verified, gp.mid): ACTIVE ~20 s; sshd ~3.5 min.
+
+## 3. k8s-snap readiness (PATIENCE GATE)
+
+```
+------------------------------------------------------------------------
+BEGIN runbook block: mgmt k8s readiness poll (cold-start aware)
+------------------------------------------------------------------------
+( {
+  for i in $(seq 1 15); do
+    echo "--- [$i] $(date -u +%H:%M:%S) ---"
+    ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \
+        -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \
+        'uptime; sudo k8s status 2>&1 </dev/null | head -4'
+    sleep 120
+  done
+} )
+------------------------------------------------------------------------
+END runbook block
+------------------------------------------------------------------------
+```
+GATE: `cluster status: ready`. Verified convergence on gp.mid: ~20-21 min from
+boot, load peak >100 on 2 vCPUs. Do NOT restart services or re-bootstrap inside
+this window; the Section-0 noise is expected. (On the phase-06-spec gp.large,
+expect substantially faster.)
+
+## 4. CAPI stack + LB verification (read-only; decides Step 5)
+
+```
+------------------------------------------------------------------------
+BEGIN runbook block: post-start CAPI + LB verify
+------------------------------------------------------------------------
+( {
+  export KUBECONFIG="$HOME/capi-mgmt.kubeconfig"
+  kubectl get nodes -o wide
+  kubectl get pods -A | grep -E 'capi-|capo-|cert-manager|orc-system|janitor|addon'
+  NS=magnum-674171fd28d446d3a37073b6a761e910
+  kubectl -n "$NS" get cluster,openstackcluster,machines
+} )
+# kubeconfig missing? Re-emit (phase-06 Step 6.5, verbatim):
+#   ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes -o StrictHostKeyChecking=no \
+#       -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 ubuntu@10.12.7.40 \
+#       "sudo k8s config server=https://10.12.7.40:6443 </dev/null" > ~/capi-mgmt.kubeconfig
+( {
+  source ~/admin-openrc
+  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID
+  export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910
+  unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID
+  openstack loadbalancer list -f yaml
+} )
+------------------------------------------------------------------------
+END runbook block
+------------------------------------------------------------------------
+```
+DECISION: controllers Running + Machines Running + every LB provisioning=ACTIVE
+-> skip to Step 6. Any LB provisioning=ERROR (operating ONLINE is typical)
+-> Step 5. Cluster Available=False with an LB-timeout message -> the LB is the
+cause; fix it first, the condition clears itself afterward.
+
+## 5. LB repair: zombie sweep, headroom, sequential failover
+
+5a. ZOMBIE/ORPHAN SWEEP (admin scope). Confirmed pattern, twice in one day:
+failed failovers leave amphora servers with no Octavia DB row. Two variants:
+ERROR server (failed spawn) and ACTIVE heartbeating zombie (health-manager logs
+"missing from the DB ... An operator must manually delete it" every 10 s).
+
+```
+------------------------------------------------------------------------
+BEGIN runbook block: amphora orphan/zombie sweep (admin scope; verify-then-delete)
+------------------------------------------------------------------------
+( {
+  source ~/admin-openrc
+  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME
+  echo "=== octavia's amphora inventory (the DB truth) ==="
+  openstack loadbalancer amphora list -f yaml
+  echo "=== nova's amphora servers (compare; extras are orphans) ==="
+  openstack server list --all-projects --long -f yaml \
+    | grep -B6 -A4 'amphora-haproxy' | grep -E '^(- |  )(ID|Name|Status):'
+} )
+# For each server whose name-embedded amphora uuid (amphora-<uuid>) is ABSENT from the amphora list:
+#   1) re-grep the amphora list for the uuid (ABORT if present)
+#   2) openstack server delete <SERVER-UUID>   # by UUID; name lookup is project-scoped
+# Each deletion frees one amphora slot (charm-octavia: 1024 MB / 1 vCPU / 8 GB).
+------------------------------------------------------------------------
+END runbook block
+------------------------------------------------------------------------
+```
+
+5b. HEADROOM CHECK. Failover transiently needs +1 amphora placement (replacement
+is built BEFORE the old one is reaped). Scheduler ceiling per host =
+physical_MB * ram_allocation_ratio(1.5) - reserved_host_memory(8192, D-040).
+Verify at least one host clears Used + 1024 <= ceiling:
+`openstack hypervisor list --long -f yaml | grep -E 'Hostname|Memory MB'`.
+If no host clears: free 1024+ MB first (zombie sweep usually suffices; else
+power off a disposable VM, e.g. a backend-* test instance). DO NOT retry
+failover against NoValidHost -- each attempt mints another zombie.
+
+5c. FAILOVER, STRICTLY SEQUENTIAL (one slot of headroom = one failover at a
+time; completion of each reaps its old amphora and re-frees the slot).
+
+```
+------------------------------------------------------------------------
+BEGIN runbook block: LB failover + poll (admin scope; v4 Arc D pattern)
+------------------------------------------------------------------------
+( {
+  source ~/admin-openrc
+  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME
+  LB=<LB-ID>
+  openstack loadbalancer failover "$LB"
+  sleep 2
+  for i in $(seq 1 60); do
+    prov=$(openstack loadbalancer show "$LB" -f value -c provisioning_status 2>/dev/null)
+    op=$(  openstack loadbalancer show "$LB" -f value -c operating_status    2>/dev/null)
+    printf '%s  prov=%s  op=%s\n' "$(date +%T)" "${prov:-?}" "${op:-?}"
+    case "$prov" in
+      ACTIVE) echo "failover succeeded"; break ;;
+      ERROR)  echo "failover FAILED -- read octavia-worker.log; do NOT retry blind"; break ;;
+    esac
+    sleep 10
+  done
+} )
+------------------------------------------------------------------------
+END runbook block
+------------------------------------------------------------------------
+```
+Verified timing: ~108 s to ACTIVE; op holds ONLINE; VIP+FIP preserved (VIP port
+is Octavia-owned). A 10-20 s fast-fail to ERROR = early-flow failure (usually
+NoValidHost; see 5b). STANDALONE amphora = brief kube-api endpoint blip
+mid-failover; nodes/pods unaffected.
+
+## 6. Top-of-stack verification
+
+```
+------------------------------------------------------------------------
+BEGIN runbook block: final verify (amphorae, CAPO condition, magnum health)
+------------------------------------------------------------------------
+( {
+  source ~/admin-openrc
+  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME
+  openstack loadbalancer amphora list -f yaml          # all ALLOCATED
+  export KUBECONFIG="$HOME/capi-mgmt.kubeconfig"
+  NS=magnum-674171fd28d446d3a37073b6a761e910
+  kubectl -n "$NS" get cluster,openstackcluster        # Available=True (allow ~10 min post-failover for CAPO resync)
+  source ~/admin-openrc
+  unset OS_PROJECT_ID OS_TENANT_ID OS_TENANT_NAME OS_PROJECT_DOMAIN_ID
+  export OS_PROJECT_ID=674171fd28d446d3a37073b6a761e910
+  unset OS_PROJECT_NAME OS_PROJECT_DOMAIN_NAME OS_TENANT_NAME OS_TENANT_ID
+  openstack coe cluster show capi-test-1 -f value -c health_status
+  openstack coe cluster show capi-test-1 -f value -c health_status_reason
+} )
+------------------------------------------------------------------------
+END runbook block
+------------------------------------------------------------------------
+```
+SUCCESS = amphorae ALLOCATED; Cluster Available=True; Magnum reason POPULATED
+with the D-042 cosmetic signature (or HEALTHY post-D-042-fix). Reload Horizon
+Container Infra last. Workload check if desired:
+`KUBECONFIG=~/capi-test-1-kc/config kubectl get nodes -o wide`.
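
Note (outside the hunk above, not part of ops-capi-recovery.md): the compare step of the 5a zombie sweep is a set difference, which can be sketched as below. This is a minimal sketch under stated assumptions, not the verified procedure. The function name and the two input file formats are hypothetical, and it assumes Nova names each amphora server `amphora-<amphora id>`. It only prints candidates; it deletes nothing.

```shell
# find_orphan_amphora_servers IDS_FILE SERVERS_FILE
#   IDS_FILE     one Octavia amphora ID per line (the DB truth)
#   SERVERS_FILE one "SERVER_UUID SERVER_NAME" pair per line (Nova view)
# Prints the SERVER_UUID of every amphora-named server whose embedded ID
# is absent from IDS_FILE.
# ASSUMPTION: amphora servers are named amphora-<amphora id>.
find_orphan_amphora_servers() {
  # Guard: an empty ID list (e.g. a failed CLI call) would flag everything.
  if [ ! -s "$1" ]; then
    echo "amphora ID list is empty; refusing to flag every server" >&2
    return 1
  fi
  while read -r uuid name; do
    case "$name" in
      amphora-*) ;;          # candidate: keep going
      *) continue ;;         # not an amphora server: skip
    esac
    # -F literal, -x whole line, -q quiet: exact ID match only.
    grep -Fxq -- "${name#amphora-}" "$1" || printf '%s\n' "$uuid"
  done < "$2"
}

# Possible inputs (admin scope, as in 5a; column names may need adjusting):
#   openstack loadbalancer amphora list -f value -c id > ids.txt
#   openstack server list --all-projects -f value -c ID -c Name > servers.txt
#   find_orphan_amphora_servers ids.txt servers.txt
```

Each UUID it prints is only a candidate. The re-grep in step 1 of the 5a block still applies before any `openstack server delete <SERVER-UUID>`.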
diff --git a/runbooks/phase-06-incloud-mgmt-cluster.md b/runbooks/phase-06-incloud-mgmt-cluster.md
index da84bd5..010e4b6 100644
--- a/runbooks/phase-06-incloud-mgmt-cluster.md
+++ b/runbooks/phase-06-incloud-mgmt-cluster.md
@@ -470,7 +470,7 @@
 - Proceed to phase-07 (conductor graft).
 
 ## As-built reference (2026-06-08/09 run -- audit trail; values are run-specific)
-- VM `capi-mgmt-v2`: gp.large, ubuntu-24.04-noble; tenant IP 10.20.0.45 (ens3); FIP 10.12.7.40.
+- VM `capi-mgmt-v2`: gp.large per the 6.2 spec; v1 AS-BUILT DEVIATION: ran gp.mid (8192/2/40) -- measured ~20 min cold-start convergence, load >100 on 2 vCPUs (see runbooks/ops-capi-recovery.md Section 3); redeploy at spec. Image ubuntu-24.04-noble; tenant IP 10.20.0.45 (ens3); FIP 10.12.7.40.
 - Net `capi-mgmt-net` / subnet `capi-mgmt-subnet` 10.20.0.0/24; router `capi-mgmt-router`.
 - k8s-snap: 1.32-classic/stable, rev 5326, v1.32.13 (classic confinement); CNI Cilium 1.17.12-ck0.
 - pod CIDR 10.1.0.0/16; svc CIDR 10.152.183.0/24; cluster DNS 10.152.183.31.