Commit graph

9 commits

Author SHA1 Message Date
a6f045b3d7 fix(node-agent): probe GPU via Docker API async at startup, cache result
Replaced sync execFileSync('docker') approach (no docker CLI in container)
with async Docker socket HTTP API calls:
- POST /containers/create with nvidia runtime + DeviceRequests
- POST /containers/:id/start
- Poll inspect until not running
- GET /containers/:id/logs, strip 8-byte frame headers, parse csv
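
The frame-stripping step can be sketched as follows. This is a minimal illustration, assuming the container was created without a TTY, in which case the Docker Engine API multiplexes stdout and stderr into frames with an 8-byte header (stream type, three zero bytes, 4-byte big-endian payload length). The names demuxDockerLogs and frame are hypothetical and do not come from the commit.

```javascript
'use strict';

// Split a multiplexed Docker log buffer into stdout and stderr text.
// Frame layout: byte 0 = stream type (1 stdout, 2 stderr), bytes 1-3 = zero,
// bytes 4-7 = payload length (big-endian uint32), then the payload.
function demuxDockerLogs(buf) {
  const parts = { 1: [], 2: [] };
  let offset = 0;
  while (offset + 8 <= buf.length) {
    const streamType = buf.readUInt8(offset);
    const size = buf.readUInt32BE(offset + 4);
    const start = offset + 8;
    const end = Math.min(start + size, buf.length); // tolerate a truncated last frame
    if (parts[streamType]) parts[streamType].push(buf.subarray(start, end));
    offset = start + size;
  }
  return {
    stdout: Buffer.concat(parts[1]).toString('utf8'),
    stderr: Buffer.concat(parts[2]).toString('utf8'),
  };
}

// Example-only helper: wrap text in a single frame, as the daemon would.
function frame(streamType, text) {
  const payload = Buffer.from(text, 'utf8');
  const header = Buffer.alloc(8);
  header.writeUInt8(streamType, 0);
  header.writeUInt32BE(payload.length, 4);
  return Buffer.concat([header, payload]);
}

const sample = Buffer.concat([frame(1, '0, NVIDIA A'), frame(2, 'warn'), frame(1, '100\n')]);
console.log(demuxDockerLogs(sample));
```

Payloads are concatenated as buffers before decoding so that a multi-byte character split across two frames is not corrupted.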

probeGpusViaSmi() runs once at startup before the first heartbeat.
Result cached in _gpuCache; detectHardware() reads cache on every heartbeat.
Falls back to /dev/nvidia* scan if probe fails or runtime unavailable.
2026-05-26 18:28:03 +00:00
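
The probe-once-then-cache behaviour described in a6f045b3d7 might look roughly like the sketch below. Only _gpuCache and detectHardware() are named in the commit message. initGpuCache, scanNvidiaDevices, the injected probe function and the shape of the cached objects are assumptions made for illustration.

```javascript
'use strict';
const fs = require('node:fs');

let _gpuCache = null; // filled once at startup, read on every heartbeat

// Fallback: list /dev/nvidia0, /dev/nvidia1, ... (no name or VRAM available).
function scanNvidiaDevices(devDir = '/dev') {
  let entries;
  try {
    entries = fs.readdirSync(devDir);
  } catch {
    return [];
  }
  return entries
    .filter((name) => /^nvidia\d+$/.test(name))
    .sort()
    .map((name) => ({ device: `${devDir}/${name}` }));
}

// Run the slow async probe once; if it throws, use the device scan instead.
async function initGpuCache(probe, scan = scanNvidiaDevices) {
  try {
    _gpuCache = await probe();
  } catch (err) {
    _gpuCache = scan();
  }
  return _gpuCache;
}

// Cheap and synchronous: never touches Docker, only reads the cache.
function detectHardware() {
  return { gpus: _gpuCache || [] };
}
```

Keeping detectHardware() synchronous means a slow or hung probe cannot delay heartbeats after startup.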
558c18e417 fix(node-agent): detect GPUs via docker run --gpus all ubuntu:22.04
nsenter approach failed (requires SYS_ADMIN in container).
nvidia-smi bind-mount failed (Alpine vs Ubuntu glibc incompatibility).

Working solution: spawn 'docker run --rm --gpus all ubuntu:22.04 nvidia-smi'
via the Docker socket. The NVIDIA Container Runtime injects nvidia-smi and
driver libs into any container with --gpus all, regardless of the base image.
ubuntu:22.04 is already cached on GPU nodes.

Result: GPU reported with name, memory_mb, driver_version — shows as BOUND
in the cluster UI.
2026-05-26 18:25:44 +00:00
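
For reference, `--gpus all` corresponds roughly to a DeviceRequests entry in the Docker Engine API create-container body. The sketch below builds such a body and sends it over the Unix socket with Node's built-in http module. buildGpuProbeSpec and dockerRequest are hypothetical names, the --format flag is an assumption (the commits only mention CSV parsing), and the socket path is the conventional default.

```javascript
'use strict';
const http = require('node:http');

// Create-container body for a one-shot nvidia-smi run.
// Count: -1 requests all GPUs, which is what `--gpus all` asks for.
function buildGpuProbeSpec(image = 'ubuntu:22.04') {
  return {
    Image: image,
    Cmd: [
      'nvidia-smi',
      '--query-gpu=index,name,memory.total,driver_version',
      '--format=csv,noheader,nounits', // assumed output format
    ],
    Tty: false,
    HostConfig: {
      DeviceRequests: [{ Driver: 'nvidia', Count: -1, Capabilities: [['gpu']] }],
    },
  };
}

// Minimal JSON request against the Docker socket; resolves with status and raw body.
function dockerRequest(method, path, body, socketPath = '/var/run/docker.sock') {
  return new Promise((resolve, reject) => {
    const payload = body ? JSON.stringify(body) : null;
    const headers = payload
      ? { 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(payload) }
      : {};
    const req = http.request({ socketPath, method, path, headers }, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => resolve({ status: res.statusCode, body: Buffer.concat(chunks) }));
    });
    req.on('error', reject);
    if (payload) req.write(payload);
    req.end();
  });
}

// Usage (requires a running Docker daemon with the NVIDIA runtime):
//   const res = await dockerRequest('POST', '/containers/create', buildGpuProbeSpec());
```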
5ff507b81b fix(node-agent): use nsenter to run nvidia-smi in host mount namespace
nvidia-smi bind-mount failed due to Alpine vs Ubuntu glibc incompatibility.
Fix: nsenter --mount=/proc/1/ns/mnt -- nvidia-smi runs in the host's mount
namespace where glibc and all NVIDIA driver libs are present.

Requires pid: host in docker-compose.worker.yml (network_mode: host is already set).
nsenter is provided by util-linux in Alpine — already in the image.

Falls back to direct nvidia-smi call (for glibc-based containers), then
to /dev/nvidia* file scan if all attempts fail.
2026-05-26 18:22:11 +00:00
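
The fallback order in 5ff507b81b (nsenter first, then a direct call, then the device scan) can be expressed as an ordered list of attempts. This is a sketch only: firstSuccessful, ATTEMPTS and defaultRun are hypothetical names, and the --format flag is an assumption.

```javascript
'use strict';
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

const SMI_ARGS = [
  '--query-gpu=index,name,memory.total,driver_version',
  '--format=csv,noheader,nounits', // assumed output format
];

// Tried in order; the first command that exits successfully wins.
const ATTEMPTS = [
  ['nsenter', ['--mount=/proc/1/ns/mnt', '--', 'nvidia-smi', ...SMI_ARGS]],
  ['nvidia-smi', SMI_ARGS],
];

// Default runner: execute the command and return its stdout.
async function defaultRun(cmd, args) {
  const { stdout } = await execFileAsync(cmd, args, { timeout: 10000 });
  return stdout;
}

// Returns the output of the first attempt that succeeds, or null if all fail,
// in which case the caller falls back to scanning /dev/nvidia*.
async function firstSuccessful(attempts = ATTEMPTS, run = defaultRun) {
  for (const [cmd, args] of attempts) {
    try {
      return await run(cmd, args);
    } catch {
      // Missing binary, permission error or non-zero exit: try the next one.
    }
  }
  return null;
}
```

The runner is injectable so the ordering logic can be exercised without nsenter or nvidia-smi being installed.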
726343db96 fix(node-agent): bind nvidia-smi for full GPU info (name, VRAM, driver)
index.js:
- detectGpusViaSmi(): runs nvidia-smi --query-gpu=index,name,memory.total,
  driver_version and parses the output into structured GPU objects with
  name, memory_mb, driver, device — the same fields the cluster UI uses
  to determine BOUND status
- Falls back to /dev/nvidia* file scan if nvidia-smi isn't available
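
A parser along the lines described above might look like this. It is a sketch that assumes nvidia-smi was run with --format=csv,noheader,nounits (so memory.total is a bare MiB figure) and that GPU names contain no commas. parseSmiCsv is a hypothetical name, and mapping the index to /dev/nvidia<index> is an assumption, not something the commit states.

```javascript
'use strict';

// Turn "index, name, memory.total, driver_version" lines into GPU objects
// with the fields listed in the commit: name, memory_mb, driver, device.
function parseSmiCsv(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean)
    .map((line) => {
      const [index, name, memory, driver] = line.split(',').map((s) => s.trim());
      return {
        index: Number(index),
        name,
        memory_mb: Number(memory),
        driver,
        device: `/dev/nvidia${index}`, // assumed index-to-device mapping
      };
    })
    .filter((gpu) => Number.isInteger(gpu.index) && Boolean(gpu.name));
}

console.log(parseSmiCsv('0, NVIDIA GeForce RTX 3090, 24576, 535.129.03\n'));
```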

docker-compose.worker.yml:
- Bind-mount /usr/bin/nvidia-smi and libnvidia-ml.so.1 from host into
  node-agent container (read-only). These are the minimum binaries needed
  for nvidia-smi to execute inside the container.
- Mounts are optional — Docker ignores them silently if paths don't exist
  (e.g. on nodes without NVIDIA hardware)
2026-05-26 18:19:23 +00:00
8186b181cc fix(decklink): mount /dev/blackmagic in sidecar + remote node routing via node-agent
Two bugs fixed:
1. SDI capture sidecar never had /dev/blackmagic bound — ffmpeg opened the
   decklink input inside a container with no device nodes, so frame=0.
   Fix: local spawns now push '/dev/blackmagic:/dev/blackmagic' onto Binds
   when source_type='sdi'.

2. recorders.js always spawned sidecars against the local Docker socket
   (zampp1), even when a recorder's node_id pointed at zampp2 (where the
   card is). Fix: resolveNodeTarget() looks up the recorder's cluster node;
   if it's a different hostname the sidecar is spawned via a new
   POST /sidecar/start endpoint on the remote node-agent.

node-agent gains three new routes (all talk to the local Docker socket):
  POST   /sidecar/start         — create + start container (host network,
                                   privileged, /dev/blackmagic bind for sdi)
  DELETE /sidecar/:id           — stop + remove
  GET    /sidecar/:id/status    — inspect + poll capture service

docker-compose.worker.yml: add /var/run/docker.sock and LIVE_DIR to
node-agent so it can spawn sidecars, and document build-capture prerequisite.
2026-05-21 18:51:09 -04:00
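
The two fixes in 8186b181cc could be sketched as below. resolveNodeTarget, source_type, node_id and the POST /sidecar/start route come from the commit message. buildBinds and the node fields (id, hostname, ip, agent_port) are hypothetical and chosen only for illustration.

```javascript
'use strict';
const os = require('node:os');

// Fix 1: SDI sources need the DeckLink device nodes inside the sidecar.
function buildBinds(sourceType, baseBinds = []) {
  const binds = [...baseBinds];
  if (sourceType === 'sdi') binds.push('/dev/blackmagic:/dev/blackmagic');
  return binds;
}

// Fix 2: spawn locally unless the recorder is pinned to a different node,
// in which case the remote node-agent's sidecar endpoint is used.
function resolveNodeTarget(recorder, nodes, localHostname = os.hostname()) {
  const node = nodes.find((n) => n.id === recorder.node_id);
  if (!node || node.hostname === localHostname) return { remote: false };
  return {
    remote: true,
    url: `http://${node.ip}:${node.agent_port}/sidecar/start`,
  };
}

const nodes = [
  { id: 1, hostname: 'zampp1', ip: '192.168.1.10', agent_port: 9100 },
  { id: 2, hostname: 'zampp2', ip: '192.168.1.11', agent_port: 9100 },
];
console.log(resolveNodeTarget({ node_id: 2 }, nodes, 'zampp1'));
```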
3b4af6ef11 node-agent: prefer NODE_IP and skip docker bridge interfaces
In bridge mode the agent was reporting the container's 172.x address
because the first non-internal interface in os.networkInterfaces() was
docker0. Now honours NODE_IP, skips lo/docker*/br-*/veth*/etc, and
down-ranks the 172.16-31 range so real LAN IPs win. Also exposes the
detected IP on /health for the onboarding script to print.
2026-05-21 00:15:03 -04:00
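
The selection rules in 3b4af6ef11 can be illustrated with a small pure function. This is a sketch, not the agent's actual code: pickNodeIp and the exact skip pattern are assumptions, and only the interface prefixes the commit names explicitly (lo, docker*, br-*, veth*) are filtered here.

```javascript
'use strict';
const os = require('node:os');

// Interface name prefixes that never carry the node's LAN address.
const SKIP_IFACE = /^(lo|docker|br-|veth)/;

// 172.16.0.0 through 172.31.255.255, the range Docker bridges usually use.
function isDownRanked(address) {
  const [a, b] = address.split('.').map(Number);
  return a === 172 && b >= 16 && b <= 31;
}

function pickNodeIp(interfaces = os.networkInterfaces(), env = process.env) {
  if (env.NODE_IP) return env.NODE_IP; // an explicit override always wins
  const candidates = [];
  for (const [name, addrs] of Object.entries(interfaces)) {
    if (SKIP_IFACE.test(name)) continue;
    for (const addr of addrs || []) {
      const isV4 = addr.family === 'IPv4' || addr.family === 4;
      if (isV4 && !addr.internal) candidates.push(addr.address);
    }
  }
  // Stable sort: addresses outside 172.16-31 come first, original order otherwise.
  candidates.sort((x, y) => Number(isDownRanked(x)) - Number(isDownRanked(y)));
  return candidates[0] || null;
}
```

Down-ranking rather than excluding 172.16-31 keeps the agent usable on LANs that genuinely use that range.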
cc8ee63639 fix(node-agent): replace express with built-in http — no external deps needed 2026-05-20 22:59:03 -04:00
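
A dependency-free server of the kind cc8ee63639 describes needs only Node's built-in http module. The sketch below is illustrative: the /health route is mentioned elsewhere in this log, but the response body, handleRequest and startServer are assumptions.

```javascript
'use strict';
const http = require('node:http');

// Send a JSON response with the given status code.
function sendJson(res, status, payload) {
  const body = JSON.stringify(payload);
  res.writeHead(status, {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(body),
  });
  res.end(body);
}

// Plain request handler: manual routing replaces the express router.
function handleRequest(req, res) {
  if (req.method === 'GET' && req.url === '/health') {
    sendJson(res, 200, { status: 'ok' });
    return;
  }
  sendJson(res, 404, { error: 'not found' });
}

function startServer(port) {
  return http.createServer(handleRequest).listen(port);
}
```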
a941f609f0 feat: node-agent detects NVIDIA GPUs and Blackmagic DeckLink cards, reports in heartbeat 2026-05-20 14:18:07 -04:00
c5a358888b feat(node-agent): heartbeat agent — CPU/mem stats, health endpoint, bearer token auth 2026-05-20 13:48:18 -04:00
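
The heartbeat stats and bearer-token check from c5a358888b might be sketched like this. The field names in collectStats and the function names are hypothetical. The constant-time comparison is a design suggestion, not something the commit states.

```javascript
'use strict';
const os = require('node:os');
const crypto = require('node:crypto');

const MIB = 1024 * 1024;

// Snapshot of CPU and memory figures for a heartbeat payload.
function collectStats() {
  const total = os.totalmem();
  const free = os.freemem();
  return {
    hostname: os.hostname(),
    cpu_count: os.cpus().length,
    load_avg_1m: os.loadavg()[0],
    mem_total_mb: Math.round(total / MIB),
    mem_used_mb: Math.round((total - free) / MIB),
  };
}

// Check an "Authorization: Bearer <token>" header against the expected token.
function isAuthorized(authHeader, expectedToken) {
  if (typeof authHeader !== 'string' || !authHeader.startsWith('Bearer ')) return false;
  const given = Buffer.from(authHeader.slice('Bearer '.length));
  const expected = Buffer.from(expectedToken);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}
```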