diff --git a/docs/design-decisions.md b/docs/design-decisions.md
index 77f69bf..a6671af 100644
--- a/docs/design-decisions.md
+++ b/docs/design-decisions.md
@@ -856,3 +856,48 @@
 lib-net.sh + bundle.yaml carry D-058; the foundation scripts/runbooks are swept as each
 phase is executed (living-draft discipline). NetBox apex (netbox/ipv4-prefixes-import.py)
 is stale pre-D-052 and must be de-staled to D-052/053 before it can carry D-058.
+
+## D-059: host NIC budget -- build on five physical NICs; four-NIC collapse parked (2026-06-29)
+
+**Decision:** Build and validate v1 on FIVE active physical NICs per host
+(openstack0-3) -- the live as-built reality. Defer any reduction to a four-NIC
+budget until the Roosevelt inbound-server hardware sheets are confirmed.
+
+**Live inventory (measured 2026-06-29; all four hosts identical):**
+
+| NIC     | plane(s)                                        | sharing      |
+| ------- | ----------------------------------------------- | ------------ |
+| enp1s0  | provider-public + provider-vip (VID 104)        | VLAN-shared  |
+| enp7s0  | metal-admin + metal-internal (VID 103)          | VLAN-shared  |
+| enp8s0  | data-tenant                                     | dedicated    |
+| enp9s0  | storage (Ceph public)                           | dedicated    |
+| enp10s0 | replication (Ceph cluster)                      | dedicated    |
+| enp11s0 | ex-lbaas -- idle, REMOVED during this MAAS work | -            |
+
+Provider and metal already carry two planes each via VLAN tagging; data, storage,
+and replication each hold a dedicated NIC -- that trio is the five-vs-four overage.
+NOTE: enp7s0 reads "unconfigured" in a physical-only interface query because its L3
+lives on the br-metal / br-internal bridges built atop it, not on the raw NIC -- it
+is load-bearing, not spare. The genuinely idle NIC is enp11s0 (ex-lbaas), and
+removing it leaves five active, not four.
+
+**Rationale:** Five is the measured truth; building on it avoids guessing a NIC count
+before the hardware is known. The four-NIC fold is a contained, well-understood change
+that can be made later without reworking the whole model.
+
+**Four-NIC collapse path (PARKED; execute only if the Roosevelt sheets confirm a
+four-NIC budget):** fold storage + replication onto ONE physical NIC -- Ceph public
+untagged + Ceph cluster on a tagged VLAN -- mirroring the provider/metal pattern.
+Result: enp1s0 (provider+vip) / enp7s0 (metal+internal) / enp8s0 (data) / enp9s0
+(storage+replication) = four. Touches three places: carve-host-interfaces.sh
+(storage/replication NIC wiring + a new tagged VLAN), the bundle's Ceph public/cluster
+network bindings, and the MAAS VLAN layout (a replication VLAN on the storage fabric).
+
+**Trade-off (explicit):** separating Ceph public from Ceph cluster onto distinct
+physical NICs is a performance best-practice (replication traffic does not contend with
+client traffic). The collapse keeps them logically separate via VLAN but shares one
+physical NIC's bandwidth -- acceptable for a test rig; a Roosevelt judgement call. Hence
+parked pending the sheets, with NIC bonding as an alternative if the hardware offers it.
+
+**Status:** Adopted 2026-06-29. Pending input: Roosevelt inbound-server NIC count.
+**Related:** D-057 (provider-vip plane), D-058 (plane renumber), D-052 (space inversion).
\ No newline at end of file
diff --git a/docs/v1-redeploy-changelog.md b/docs/v1-redeploy-changelog.md
index 2887155..0e8e8ec 100644
--- a/docs/v1-redeploy-changelog.md
+++ b/docs/v1-redeploy-changelog.md
@@ -978,5 +978,46 @@
 6.0-BOOT / 6.0 / 6.1 / 6.2 remain as-built from the prior session: capi-mgmt-v2 ACTIVE,
 FIP 10.12.7.107, tenant 10.20.0.84, persisted to ~/capi-mgmt-net.env.)
+### CORRECTION -- real project state: rebuild after destructive delete; NIC-enslavement is the target (2026-06-29)
+
+Supersedes the framing of the 2026-06-27 "Phase-06 6.3 BLOCKED ... D-057" entry
+below. That entry cast D-057 as an OPEN phase-06 FIP blocker on the running cloud;
+that is NOT the project state and it misled later work (including a stale-summary
+assistant). Accurate history:
+
+- The cloud was fully deployed and FUNCTIONAL. Additional network spaces were created
+  and several subnets segregated (D-052 / D-053); the live deployment existed to
+  validate those network-segregation changes -- and they were validated.
+- A destructive command (provided by Claude) DELETED openstack0-3. The hosts were
+  rebuilt from scratch.
+- The rebuild surfaced the real defect: the hosts came up with INCORRECT NIC
+  configuration -- the D-057-class enslavement cascade (provider NIC enp1s0 untagged +
+  enslaved to a Linux bridge carrying the host provider static L3, starving OVS
+  br-ex of carrier, killing OVN's gateway ARP responder, darkening the provider FIP
+  plane). This is a REBUILD defect, not a blocker on the original cloud. D-057's root
+  cause is understood; the open work is making a rebuild produce CORRECT host NICs.
+
+What the current MAAS-reconfigure work targets (remediation in scope):
+  - D-058 -- full plane renumber (clean fabric-grouped /22 scheme; Roosevelt fidelity).
+  - provider-vip plane (VID 104) -- public API/VIP plane rides a tagged sub-interface so
+    enp1s0 stays raw + L3-less for OVS br-ex (defeats the enslavement cascade).
+  - carve-host-interfaces.sh -- produces the correct per-host NIC tree on rebuild
+    (enp1s0 raw via ensure_raw_unlinked; enp1s0.104 -> br-prov-api; metal stack).
+  - D-059 -- host NIC budget: build on FIVE physical NICs now; four-NIC collapse
+    (fold storage+replication) PARKED pending Roosevelt hardware sheets.
+
+Tooling built for the reconfigure (tested, committed):
+  - scripts/phase-00-maas-standup.sh -- single MAAS-address authority (topology +
+    VIP/FIP/mgmt reserves); audit + create-if-absent.
+  - scripts/phase-00-maas-recidr.sh -- gated D-052/053 -> D-058 re-CIDR, reuse-in-place.
+  - runbooks/phase-00-maas-reconfigure.md -- teardown -> re-CIDR -> standup -> verify ->
+    jumphost bridges -> deploy sequence.
+  - phase-00-maas-carve.sh RETIRED (folded into the standup).
+
+Live state at correction time (measured 2026-06-29): the openstack model is DEPLOYED
+and running (28 machines, 63 units, capi-mgmt Ready) on the pre-D-058 D-052/053 scheme;
+the standup dry-run reports the three expected DRIFT planes (.8/.12/.16). Teardown has
+NOT been run -- the cloud was kept up for information-gathering, now complete.
+
 ### Next-free numbers
 Design decision: D-058 (D-057 coined above).
 Doc fix: DOCFIX-056.