docs: add handoff plan for cluster + codec revamp session

Captures the full commit list, current cluster state, the 172.18 LAN topology gotcha, the remaining recorders.html UI rewrite, and how the next agent should pick it up.
This commit is contained in:
Zac Gaetano 2026-05-21 13:13:42 +00:00
parent 4a3a672cbe
commit 8403355ba9

View file

@ -0,0 +1,57 @@
# Cluster Hardening + Codec Settings Revamp
> Status: **mostly shipped 2026-05-21**. One follow-up remains: the recorders.html UI rewrite. See "Pending" below.
## Goal
Four user-driven asks from 2026-05-20:
1. Fix cluster page: workers were registering with docker bridge IPs, and three duplicate "zampp2" rows kept appearing.
2. Expand recorder codec settings: per-recorder control over bitrate, framerate, audio channels, container format.
3. Better DeckLink port picker: "BM1/BM2" dropdown was unusable -- diagram the card so operators pick a port visually.
4. Validate the cluster end-to-end now that GPUs are in place.
## What shipped (commit list)
| Commit | Area | Summary |
|---|---|---|
| `a39c983` | mam-api | Migration 007 -- dedupe `cluster_nodes` rows + unique index on `hostname`. |
| `049beb8` | mam-api | Migration 008 -- expanded codec columns on `recorders` (video/audio bitrate, framerate, audio channels, container, plus `node_id` / `device_index` pinning). |
| `3b4af6e` | node-agent | Prefer `NODE_IP` env override; skip docker bridge / veth / cni interfaces when auto-detecting. Version bumped to 1.1.0. |
| `0efef0d` | mam-api | `routes/cluster.js`: `pickIp()` fallback to request source IP. New `GET /api/v1/cluster/devices/blackmagic` flattens every node's DeckLink capabilities. |
| `40a66ba` | compose | `docker-compose.worker.yml`: `network_mode: host` for node-agent so it inherits host hostname + LAN IP. |
| `0ebb3cf` | deploy | `onboard-node.sh`: auto-detect host LAN IP and write `NODE_IP` + `BMD_MODEL` to `.env.worker`. |
| `f4a83ee` | capture | `capture-manager.js`: dynamic ffmpeg args. Exports `VIDEO_CODECS`, `AUDIO_CODECS`, `CONTAINER_FMT`, `CONTAINER_EXT`. |
| `485af25` | capture | `index.js` bootstrap forwards every codec env var to `captureManager.start()`. |
| `4c65753` | mam-api | `routes/recorders.js`: full codec field whitelist; `/start` passes settings to the capture sidecar. |
| `d39f86d` | web-ui | `services/web-ui/public/js/bmd-card.js` -- SVG renderer for DeckLink port selection. Models: Duo 2, Quad 2, Mini Recorder 4K, Mini Monitor 4K, UltraStudio 4K Mini. |
| `8aa3783` | deploy | `deploy/test-cluster.sh` cluster smoke test. |
| `4a3a672` | cluster | `mam-api` self-heartbeat reads `NODE_HOSTNAME` (otherwise every restart spawns a new primary row). Smoke test rewritten with `jq` after Python f-strings were found to silently false-pass the docker-bridge check. Bridge alarm narrowed to 172.17.x since this LAN occupies 172.18.0.0/16. |
## Verified cluster state (post-deploy, 2026-05-21)
```
$ MAM_API_URL=http://localhost:47432 bash deploy/test-cluster.sh
6 pass 0 fail
```
Two nodes registered, no duplicate hostnames, real LAN IPs (zampp1=172.18.91.216 primary, zampp2=172.18.91.217 worker), fresh heartbeats, 3 NVIDIA GPUs visible on zampp1, DeckLink Duo 2 reporting all 4 ports on zampp2.
## Deploy state
- **zampp1**: at `4a3a672`, rebuilt `mam-api`/`web-ui`/`worker`/`capture`, migrations 007+008 applied at startup. `.env` has `NODE_HOSTNAME=zampp1`, `NODE_IP=172.18.91.216`.
- **zampp2**: at `4a3a672`, rebuilt `node-agent` + `worker`. `.env` has `NODE_IP=172.18.91.217`, `BMD_COUNT=4`, `BMD_MODEL="DeckLink Duo 2"`, `BMD_DEVICE_0..3` populated.
- **Forgejo PAT** is at `/root/.git-credentials` on zampp1 (mode 600). Pushes from zampp1 need `HOME=/root`.
## LAN topology gotcha
The user's LAN is **172.18.91.0/24** -- inside Docker's reserved 172.16.0.0/12 range. Any heuristic that flags all of 172.16-172.31 as "docker bridge" will produce false positives. The smoke test now alarms only on 172.17.x (default docker0). The server-side `pickIp()` in `routes/cluster.js` has the same vulnerability but the node-agent's `NODE_IP` env-var override masks it in practice.
## Pending
- [ ] **`services/web-ui/public/recorders.html` rewrite.** The supporting pieces are in `main` but the HTML wiring was lost to a context-compaction event mid-session. Required UI:
- Tabbed codec settings (Video / Audio / Container) for both master and proxy.
- SDI source picker: node dropdown + inline `BMDCards.render(...)` SVG with click-to-select.
- Load BMD card data from `GET /api/v1/cluster/devices/blackmagic`.
- `<script src="js/bmd-card.js?v=1"></script>` in the head.
- SVG styles (`.bmd-card-svg`, `.bmd-port-ring`, `.bmd-port-group.is-selected`, ...) inlined or split into a CSS file.
- [ ] **Visual polish pass** with flyonui MCP -- the user noted the current UI "still looks AI-designed". Should happen AFTER the recorders.html rewrite.
## How to pick this up
1. `cd /opt/wild-dragon && git pull` on zampp1 (or zampp2).
2. Read this file end-to-end. Then `services/web-ui/public/js/bmd-card.js` (top JSDoc explains the API) and `services/capture/src/capture-manager.js` (codec catalogs).
3. Inspect `recorders.html` -- it still has the pre-revamp "BM1/BM2" dropdown and flat codec fields. Compare against the `recorders` table columns in `008-codec-settings.sql` for the full field set the UI should drive.
4. Iterate against a live deployment: `bash deploy/test-cluster.sh` for regression check, plus the actual `/recorders.html` page in a browser (web-ui on port 8080, mam-api on 47432).
5. Commit through Forgejo MCP if the diff is small; otherwise push from zampp1 (see Deploy state above for creds location). **Cloudflare WAF blocks large MCP uploads** (the blocked domain is `anthropic.com`, not Forgejo) -- pushing from a host with creds is faster for anything over ~3 KB.