diff --git a/docs/design-decisions.md b/docs/design-decisions.md
index a97091f..f621e95 100644
--- a/docs/design-decisions.md
+++ b/docs/design-decisions.md
@@ -1119,6 +1119,58 @@ Apply is a SEPARATE gated mutation with its own verify-before/after block; this amendment records the decision, not the change. **Status: ADOPTED (a-amended) on commit; apply pending.**
+### D-063 -- RESOLVED / CLOSED (2026-07-03): applied and functionally verified
+
+APPLIED via scripts/d063-apply.sh --apply (DOCFIX-084/085): four ingress rules created and
+read-back-gated BEFORE the two wide 0.0.0.0/0 rules were removed by measured ID (no window
+where the conductor source was unauthorized). Final SG state (measured): 6443 from
+10.12.4.154/32 (conductor), 6443 from 10.12.4.1/32 (operator, coarse), 6443 from 10.20.0.0/24
+(hairpin), 22 from 10.12.4.1/32; wide 22/6443 ingress gone (only default egress remains).
+
+FLOW MODEL (corrected + doc-backed). Earlier in-session I twice mischaracterized the
+conductor->mgmt traffic as a "continuous health poll"; both were WRONG. As-built (phase-07:
+/etc/magnum/kubeconfig server = mgmt FIP 10.12.7.222:6443, client-cert auth) plus upstream
+docs (StackHPC magnum-capi-helm deep-dive; openstack magnum-capi-helm install guide) establish
+the true model: magnum-conductor is a Kubernetes CLIENT of the mgmt apiserver, acting ONLY on
+cluster lifecycle events (create/update/delete -> helm install/upgrade + watch). Connections
+are per-operation and short-lived (client-cert kubeconfig, no held session), so the path is
+IDLE at rest and busy only during ops. This fully explains why idle established-state and SYN
+snapshots repeatedly showed nothing -- expected, not a fault. Required routes per StackHPC:
+conductor -> mgmt cluster; mgmt cluster -> OpenStack external net; mgmt cluster -> workload API
+LBs. Only the first traverses capi-mgmt-sg ingress; the rule set covers it.
+
+VERIFICATION (proof by exercise, through the tightened SG):
+- POSITIVE, load-bearing: R1 -- a live TCP handshake from magnum/0 to 10.12.7.222:6443
+  (phase-07 Step 7.1 primitive: exec 3<>/dev/tcp) returned TCP-OK. A handshake cannot complete
+  if the SG drops the source; conductor source measured = 10.12.4.154 (= the applied /32).
+- POSITIVE, functional (gold standard): C3 -- forced a real op (openstack coe cluster resize
+  beta-cluster 1->2 as the tenant); it reached UPDATE_COMPLETE / HEALTHY / node_count=2. A
+  magnum-capi-helm resize CANNOT complete unless the conductor helm-upgraded the manifest on
+  the mgmt apiserver (10.12.7.222:6443) through the SG and CAPI built the worker. The
+  successful state transition is dispositive proof the path carries real helm ops post-tighten.
+- NEGATIVE (isolation value): a beta tenant POD -> 10.12.7.222:6443 TIMED OUT (blocked). Tenant
+  workloads egress as their router's own 10.12.4.x SNAT, which is NOT in the allow set. This is
+  the isolation the ruling exists to provide.
+- OPERATOR: kubectl get --raw /readyz via the mgmt kubeconfig returned ok (10.12.4.1 coarse
+  rule, as designed).
+- NOT RELIED ON: SYN packet captures (V4/R5) came back empty -- the transient per-op flow is
+  the wrong target for snapshot/window SYN sniffing (learned 3x this thread; established-state
+  and functional-outcome are the correct instruments). A conductor-log corroboration probe (C2)
+  was empty because it queried journald; charmed magnum logs to
+  /var/log/magnum/magnum-conductor.log. Neither is needed: C3's functional success supersedes
+  wire-level corroboration (industry practice closes on functional verification of the flow).
+
+BETA left at node_count=2 (UPDATE_COMPLETE/HEALTHY) -- the resize doubles as previously-untested
+resize acceptance coverage; a change from the count=1 acceptance baseline, retained deliberately.
+
+Roosevelt carry-forward: the rule set is per-UNIT and per-ENVIRONMENT -- 3-unit magnum yields 3
+conductor /32s; the 10.12.4.1 operator collapse is a VR0 masquerade artifact; re-derive with
+d063-apply.sh (live per-unit derivation) in every environment, never copy rules forward.
+
+**Status:** RESOLVED / CLOSED (applied + functionally verified 2026-07-03). Supersedes the
+"apply pending" state above. **Related:** DOCFIX-084/085 (the applier + its stderr-capture
+fix), phase-07 Step 7.1 (the TCP primitive reused for R1), D-035/D-036 (mgmt-cluster model).
+
 ---
diff --git a/docs/v1-redeploy-changelog.md b/docs/v1-redeploy-changelog.md
index bee2c69..9c74d25 100644
--- a/docs/v1-redeploy-changelog.md
+++ b/docs/v1-redeploy-changelog.md
@@ -1580,3 +1580,28 @@ REVERT: git checkout both files at this commit's parent. Next-free: D-071, DOCFIX-086, BUNDLEFIX-009.
+
+
+### 2026-07-03 (addendum 7) -- D-063 RESOLVED/CLOSED: applied + functionally verified
+
+d063-apply.sh --apply executed: 4 tight rules added (readback-gated) then 2 wide 0.0.0.0/0 rules
+removed by measured ID; final SG = conductor /32 + operator /32 + hairpin /24 on 6443, operator
+/32 on 22. VERIFIED through the tightened SG: (R1) TCP handshake magnum/0 -> mgmt 10.12.7.222:6443
+= TCP-OK, source measured 10.12.4.154; (C3, gold standard) forced beta resize 1->2 reached
+UPDATE_COMPLETE/HEALTHY/2 -- a magnum-capi-helm resize cannot complete without the conductor
+helm-upgrading on the mgmt apiserver through the SG; (negative) beta POD -> mgmt:6443 TIMED OUT
+(tenant SNAT not in allow set = the intended isolation); (operator) readyz ok via 10.12.4.1.
+Doc-backed by StackHPC magnum-capi-helm deep-dive + upstream install guide (conductor is an
+event-driven kube CLIENT of the mgmt apiserver, not a continuous poller -- correcting my earlier
+mischaracterization; the idle path is why SYN/established snapshots showed nothing).
+
+Process notes (best-practice, for the record): I mischaracterized the conductor->mgmt flow as a
+continuous poll TWICE before as-built + upstream docs corrected it to event-driven client traffic;
+the correction is what made the empty idle-captures obviously benign. Packet-capture (SYN) was the
+wrong instrument for a bursty per-op flow 3x running -- functional outcome (a successful resize)
+and TCP-from-source are the right proofs. Repeated hand-built one-liner parse bugs this thread (ss
+column-drop under state filters; cut on mapped-v6; grep greedy over `src IP.`; journald vs
+/var/log/magnum) reinforce the standing rule: fixture-test probe parsing like script code, even
+for throwaway measurement. Beta left at 2 workers (bonus resize acceptance coverage).
+
+Next-free: D-071, DOCFIX-086, BUNDLEFIX-009.