Zac Gaetano 0f6c715a30 docs: All-Intra HEVC (NVENC) growing-file ingest design

Captures the current working system (capture sidecar, finalize flow, live monitor, capability-routed GPU worker pool, deploy gotchas) and the target design: GPU All-Intra HEVC master to offload the ProRes CPU wall while keeping edit-while-record, scaling to 8 signals + multi-vendor (Blackmagic/Deltacast/AJA). Includes a validation gate (prove Premiere growing-HEVC edit on one channel first).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 04:16:17 +00:00


All-Intra HEVC (NVENC) Growing-File Ingest

Date: 2026-05-29 | Status: design, pending validation gate (see §8) | Authors: Zac + Claude

1. Purpose

Replace the CPU-bound ProRes capture encode with All-Intra HEVC on NVENC as the growing-file master codec, so we can:

Offload ingest encode from CPU to GPU (the current scaling wall), and
Keep edit-while-record (all-intra => growing file stays editable), and
Scale to up to 8 simultaneous signals per machine, across Blackmagic today and Deltacast + AJA later.

This doc captures the target design AND the current working system it builds on, so it is self-contained for whoever implements it.

2. Why this codec

Growing-file editing (Premiere/Avid mounting a still-recording file over SMB) requires two things: intra-only coding (every frame is a keyframe, so a partial file is decodable up to the last complete frame) and a container whose index is not deferred to EOF. ProRes/DNxHR satisfy both but are CPU-only here (NVENC has no ProRes or DNxHR encoder). Long-GOP H.264/HEVC/AV1 do NOT work for edit-while-record.

All-Intra HEVC (-g 1 -bf 0) via hevc_nvenc is the one path that is both GPU-accelerated AND all-intra: it breaks the "ProRes must be CPU" constraint without losing edit-while-record. Trade-off: All-Intra bitrate approaches ProRes, so the win is CPU offload, not storage. AV1 is rejected (no NLE edit support; av1_nvenc absent from our ffmpeg builds).

3. Current working system (what we build on)

Topology

zampp1 (172.18.91.200): primary. Runs db (postgres), queue (redis), mam-api (:47432), web-ui (:47434), and the GPU worker pool. GPUs: Tesla P4 + 2x Quadro P400. Repo at /opt/wild-dragon (its own clone).
zampp2 (172.18.91.216): worker/capture node. 12-vCPU QEMU VM, NVIDIA L4, 4x Blackmagic DeckLink (exposed as /dev/blackmagic/io0..io3). Runs node-agent (:7436). Repo at /opt/wild-dragon (separate clone).
The repo is checked out independently on BOTH nodes; node-specific files (node-agent, capture, worker overlay) are edited on the node that runs them.

Capture (current)

mam-api POST /recorders/:id/start pre-creates a live asset and dispatches POST /sidecar/start to the recorder's node-agent, which spawns a wild-dragon-capture:latest container (host network, privileged, /dev/blackmagic bound). The capture ffmpeg:

input: -f decklink -i "DeckLink Duo (N)"
filter: yadif (CPU deinterlace)
output 0 (master): prores_ks (CPU) -> S3 (pipe) or growing SMB file
output 1 (preview): libx264 veryfast HLS -> /live/{assetId} (CPU)

The DeckLink card does the capture itself (cheap); BOTH encodes run on the CPU. At ~5 vCPU per 1080i signal, ~2 signals saturate the 12-vCPU VM. The GPUs sit idle during capture.
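The CPU wall quoted above is simple division. A minimal sketch of that arithmetic, using only the figures from this section (the function name is illustrative, not something in the repo):

```python
def max_cpu_bound_signals(vcpus: int, vcpu_per_signal: float) -> int:
    """Whole signals that fit when every signal needs vcpu_per_signal cores."""
    if vcpu_per_signal <= 0:
        raise ValueError("vcpu_per_signal must be positive")
    return int(vcpus // vcpu_per_signal)


# zampp2 today: 12 vCPUs, ~5 vCPU per 1080i signal (ProRes + libx264 + yadif).
print(max_cpu_bound_signals(12, 5))  # 2
```

This ignores headroom for node-agent and the OS, so the practical limit is at most this number.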

Stop / finalize (working)

node-agent stops the sidecar with a 180s grace (was 10s -> SIGKILL bug). Capture's SIGTERM handler finalises the session and calls POST /assets/:id/finalize (the live asset id passed as ASSET_ID), which flips the asset out of live, records duration + S3 keys, and kicks the proxy -> thumbnail -> filmstrip chain. (Earlier 409 bug: it used to POST a new asset and collide with the live row.)
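The essence of the 409 fix is that capture finalises the existing live asset instead of creating a new one. A minimal Python sketch of building that request, assuming a JSON body; the field names `duration_s` and `s3_keys` and the `api_base` value are hypothetical, and only the route and the use of ASSET_ID come from the flow above:

```python
import json


def build_finalize_request(api_base: str, asset_id: str,
                           duration_s: float, s3_keys: list[str]) -> tuple[str, str, bytes]:
    """Return (method, url, body) for finalising an already-existing live asset."""
    if not asset_id:
        # Without the live asset id the only option would be to create a new
        # asset, which is exactly the collision the earlier bug produced.
        raise ValueError("ASSET_ID must be the pre-created live asset id")
    url = f"{api_base.rstrip('/')}/assets/{asset_id}/finalize"
    # Hypothetical payload shape; the real field names live in mam-api.
    body = json.dumps({"duration_s": duration_s, "s3_keys": s3_keys}).encode()
    return "POST", url, body
```

The request has to complete inside the 180s stop grace, so any retry policy around it should stay well under that budget.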

Live monitor (working)

SDI HLS preview is a 2nd output of the capture ffmpeg (one DeckLink read -> split -> ProRes + H.264 HLS), written to /live/{assetId} on the capture node. node-agent serves GET /live/* over HTTP; mam-api proxies GET /api/v1/recorders/:id/live/* to the recorder's node-agent; the web-ui HlsPreview loads the proxied URL. Browser auth is the session cookie (same-origin).

GPU worker pool (working, post-capture)

BullMQ on shared Redis; queues are type-named (proxy/thumbnail/filmstrip/conform/trim). Workers are capability-routed by WORKER_QUEUES, one GPU-pinned container per card (NVIDIA_VISIBLE_DEVICES by UUID):

HEAVY (proxy/conform/trim): Tesla P4 (zampp1) + L4 (zampp2), h264_nvenc.
LIGHT (thumbnail/filmstrip): 2x Quadro P400 (zampp1).

The DB settings gpu_transcode_enabled=true and gpu_codec=h264_nvenc enable NVENC. Each worker stamps WORKER_LABEL onto job data -> Jobs UI "Node" column. RUN_PROMOTION=true on exactly one worker runs the growing-files->S3 scan. The worker GPU image is built from services/worker/Dockerfile.gpu (CUDA base + Ubuntu ffmpeg with h264/hevc_nvenc; NO av1_nvenc).
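Capability routing comes down to each worker subscribing only to the queues named in WORKER_QUEUES. The sketch below shows that idea in Python for illustration; it assumes WORKER_QUEUES is a comma-separated list, which this document does not state, and the helper names are hypothetical:

```python
KNOWN_QUEUES = ("proxy", "thumbnail", "filmstrip", "conform", "trim")
HEAVY_QUEUES = {"proxy", "conform", "trim"}  # the HEAVY class from the list above


def parse_worker_queues(raw: str) -> list[str]:
    """Split a WORKER_QUEUES value (assumed comma-separated) and validate names."""
    queues = [q.strip() for q in raw.split(",") if q.strip()]
    unknown = [q for q in queues if q not in KNOWN_QUEUES]
    if unknown:
        raise ValueError(f"unknown queue(s): {unknown}")
    return queues


def is_heavy_worker(queues: list[str]) -> bool:
    """True if the worker takes any HEAVY job type and so needs a capable GPU."""
    return any(q in HEAVY_QUEUES for q in queues)
```

For example, a P400 container would carry `thumbnail,filmstrip` and never see a proxy job, while the P4 and L4 containers would carry `proxy,conform,trim`.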

Deploy gotchas (learned)

Service source is BAKED into images, so edits need a rebuild + recreate. The GPU image rebuild reuses cached layers, so only the final COPY layer changes and the rebuild is fast.
The capture image can only build on zampp2 (DeckLink SDK present there).
Per-node .env: zampp2's REDIS_URL/DATABASE_URL/S3_* now point at zampp1 (.200); secrets live only in .env, never in committed compose.
Clear all containers on both nodes before a full redeploy (user preference).

4. Target design

4.1 Capture ffmpeg gains NVENC

The capture image's custom FFmpeg 7.1 is currently built WITHOUT nvenc (only prores_ks/dnxhd/libx264). Rebuild services/capture/Dockerfile ffmpeg with: --enable-cuda-nvcc --enable-libnpp --enable-nvenc --enable-cuvid plus nv-codec-headers (ffnvcodec) installed before configure. Keep --enable-decklink and the existing codecs (ProRes stays available as a selectable fallback). Verify ffmpeg -encoders | grep nvenc shows hevc_nvenc/h264_nvenc afterwards.
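The post-build check can be scripted so it fails loudly in CI or at container start. A minimal sketch, assuming the usual `ffmpeg -encoders` layout in which the encoder name is the second whitespace-separated column; the function names are illustrative:

```python
import subprocess

REQUIRED_NVENC = {"hevc_nvenc", "h264_nvenc"}


def nvenc_encoders(encoders_output: str) -> set[str]:
    """Extract *_nvenc encoder names from the text of `ffmpeg -encoders`."""
    names = set()
    for line in encoders_output.splitlines():
        parts = line.split()
        # Encoder rows look like: " V....D hevc_nvenc   NVIDIA NVENC hevc encoder"
        if len(parts) >= 2 and parts[1].endswith("_nvenc"):
            names.add(parts[1])
    return names


def missing_nvenc(encoders_output: str) -> set[str]:
    """Required NVENC encoders that the build does not list."""
    return REQUIRED_NVENC - nvenc_encoders(encoders_output)


def ffmpeg_encoders_output(binary: str = "ffmpeg") -> str:
    """Run the capture image's ffmpeg and return its encoder listing."""
    result = subprocess.run([binary, "-hide_banner", "-encoders"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `missing_nvenc(ffmpeg_encoders_output())` inside the rebuilt image should return an empty set. Note that a listed encoder only proves the build flags; it does not prove a GPU is visible at runtime (§4.2).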

4.2 Capture sidecar gets a GPU

node-agent handleSidecarStart currently spawns the capture container with no GPU. Add NVIDIA runtime + device pinning to the sidecar create spec: HostConfig.Runtime='nvidia' (or DeviceRequests with the node's GPU) and env NVIDIA_VISIBLE_DEVICES=<uuid> + NVIDIA_DRIVER_CAPABILITIES=video,compute,utility. The capture node's GPU is shared with its worker-l4 (see capacity, §5).
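As an illustration of the create-spec change, the sketch below builds a Docker Engine API container-create payload as a plain dict. It is written in Python for consistency with the other sketches, not as node-agent's actual code; the function name and the `use_device_requests` switch are hypothetical, while the image name, host networking, privileged mode and the two NVIDIA_* variables come from this document:

```python
def sidecar_gpu_spec(gpu_uuid: str, use_device_requests: bool = False) -> dict:
    """Container-create payload for a capture sidecar pinned to one GPU."""
    if not gpu_uuid:
        raise ValueError("gpu_uuid is required to pin the sidecar to a card")
    host_config = {"NetworkMode": "host", "Privileged": True}
    if use_device_requests:
        # Equivalent of `docker run --gpus device=<uuid>`.
        host_config["DeviceRequests"] = [
            {"Driver": "nvidia", "DeviceIDs": [gpu_uuid], "Capabilities": [["gpu"]]}
        ]
    else:
        # Requires the nvidia runtime to be registered with the Docker daemon.
        host_config["Runtime"] = "nvidia"
    return {
        "Image": "wild-dragon-capture:latest",
        "Env": [
            f"NVIDIA_VISIBLE_DEVICES={gpu_uuid}",
            # "video" is the capability that exposes NVENC inside the container.
            "NVIDIA_DRIVER_CAPABILITIES=video,compute,utility",
        ],
        "HostConfig": host_config,
    }
```

How the environment variables interact with DeviceRequests depends on the NVIDIA container toolkit version, so it is worth confirming that `nvidia-smi` and `hevc_nvenc` both work inside the sidecar for whichever variant is chosen.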

4.3 Encode parameters (master)

All-Intra HEVC on NVENC: -c:v hevc_nvenc -preset p4 -rc vbr -g 1 -bf 0 -profile:v main10 -pix_fmt p010le. -g 1 -bf 0 => every frame IDR (all-intra). Bitrate target is tuned per format (1080i59.94: ~100-160 Mbps to rival ProRes HQ).

Chroma caveat: 10-bit 4:2:2 is not NVENC-native; NVENC HEVC on the GPUs in this stack (P4, L4) is 4:2:0 8/10-bit. If a 4:2:2 mezzanine is required, that is a HARD blocker for NVENC and we stay on ProRes for those feeds (see §7).
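The master encode flags can be kept in one place so the per-format bitrate is the only variable. A minimal sketch; the function name is illustrative, and passing the target via `-b:v` is an assumption about how the bitrate would be supplied:

```python
def master_video_args(bitrate_mbps: int) -> list[str]:
    """ffmpeg video args for the All-Intra HEVC master on NVENC."""
    if bitrate_mbps <= 0:
        raise ValueError("bitrate_mbps must be positive")
    return [
        "-c:v", "hevc_nvenc",
        "-preset", "p4",
        "-rc", "vbr",
        "-b:v", f"{bitrate_mbps}M",  # assumption: target set with -b:v
        "-g", "1",                   # GOP length 1: every frame is a keyframe
        "-bf", "0",                  # no B-frames
        "-profile:v", "main10",
        "-pix_fmt", "p010le",        # 10-bit 4:2:0
    ]
```

For a 1080i59.94 feed, `master_video_args(120)` sits in the middle of the 100-160 Mbps range quoted above; the right value per format is something the quality comparison against ProRes should decide.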

4.4 Container (growing-file)

Write the master to a growing file on the SMB share (GROWING_PATH), same path the promotion worker already uploads on EOF. Container candidates, in order of preference for Premiere growing-file mounts:

MXF OP1a (-f mxf): broadcast standard, designed for growing/edit-while-ingest; best Avid/Premiere support. HEVC-in-MXF support in Premiere is the key unknown to validate (§8).
Fragmented MOV/MP4 (-movflags +frag_keyframe+empty_moov+default_base_moof): no moov-at-EOF, readable while growing; fallback if MXF+HEVC is unsupported.

The HLS preview path is unchanged except it can also move to h264_nvenc now that capture has NVENC (frees the last libx264 CPU cost).
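Since the container is still to be chosen, it helps to make it a single switch in the capture arg builder. A minimal sketch, assuming the master file is named after the asset id under GROWING_PATH; that naming scheme, the `"fmov"` key and the function name are hypothetical:

```python
import posixpath

# Muxer args per container candidate; flags are the ones listed above.
CONTAINER_ARGS = {
    "mxf": (".mxf", ["-f", "mxf"]),
    "fmov": (".mov", ["-f", "mov", "-movflags",
                      "+frag_keyframe+empty_moov+default_base_moof"]),
}


def master_output_args(container: str, growing_path: str, asset_id: str) -> list[str]:
    """ffmpeg muxer args plus output path for the growing master file."""
    try:
        extension, mux_args = CONTAINER_ARGS[container]
    except KeyError:
        raise ValueError(f"unsupported container: {container!r}") from None
    return [*mux_args, posixpath.join(growing_path, asset_id + extension)]
```

Whether the capture image's FFmpeg will actually mux HEVC into MXF should be confirmed early in the §8 gate, before any Premiere testing; if it refuses, the fragmented MOV branch is the only candidate left.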

5. Capacity & scaling (8 signals/machine)

After the move, per-signal CPU is just: DeckLink capture + yadif + mux + frame upload to the GPU. The heavy HEVC encode is on NVENC. The constraint shifts from CPU to NVENC throughput + GPU memory + PCIe/host bandwidth:

The L4 is a datacenter card => unlimited NVENC sessions (no consumer 3-session cap). 8x 1080i HEVC-I encode sessions are well within an L4.
GPU memory: ~8 concurrent 1080 NVENC sessions + frame buffers fit in 24 GB.
The capture node's L4 is shared between capture (per-signal HEVC-I) and the worker-l4 proxy jobs. Under 8-signal load, give capture priority; consider moving worker-l4 (post-record proxies) to zampp1's P4 only, or gate worker-l4 intake while signals are live.
yadif on CPU is still ~0.5-1 vCPU/signal; consider yadif_cuda/bwdif_cuda (GPU deinterlace) once frames are uploaded to the GPU, keeping CPU near-idle.

Node sizing: the 12-vCPU VM was the ProRes wall. With GPU encode the same VM should carry more signals, but for 8x SDI + GPU + card passthrough prefer a larger VM or bare metal with proper PCIe passthrough. Alternatively, spread signals across multiple capture nodes (the node-agent model already supports N nodes; mam-api routes each recorder to its node).
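The host-bandwidth side of the constraint can be estimated from the §4.3 bitrate range. A rough sketch of the arithmetic, counting video essence only (audio, container overhead and the HLS preview are ignored); the function names are illustrative:

```python
def aggregate_mbps(signals: int, mbps_per_signal: float) -> float:
    """Total master write rate to the SMB share, in megabits per second."""
    return signals * mbps_per_signal


def gb_per_hour(mbps: float) -> float:
    """Storage consumed per hour at a given bitrate, in decimal gigabytes."""
    return mbps * 3600 / 8 / 1000


for mbps in (100, 160):
    total = aggregate_mbps(8, mbps)
    print(f"{mbps} Mbps/signal -> {total} Mbps total, "
          f"{gb_per_hour(total)} GB/hour for 8 signals")
```

At the top of the range, 8 signals come to 1280 Mbps, which is more than a single 1 GbE link can carry, so the path from the capture node to the SMB share needs checking alongside NVENC throughput.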

6. Multi-vendor capture (Blackmagic / Deltacast / AJA)

Today capture is hard-wired to -f decklink. Before three vendors accrue special-cases, introduce a source-backend abstraction in capture-manager: each backend returns ffmpeg input args + device discovery.

Blackmagic: -f decklink -i "<name>" (current). Devices via ffmpeg -sources decklink.
Deltacast: VideoMaster SDK. No native ffmpeg demuxer upstream — needs an SDK-backed capture (their SDK -> pipe to ffmpeg, or a small grabber). Plan a deltacast backend that shells their tool into ffmpeg stdin (rawvideo).
AJA: libajantv2. Also no upstream ffmpeg input; AJA ships ntv2 capture tools. Plan an aja backend feeding rawvideo into ffmpeg.

All backends converge on the SAME encode/output stage (HEVC-I NVENC + HLS), so only the input differs. node-agent already binds the right /dev nodes per sourceType (decklink/deltacast); extend for AJA.
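One possible shape for the source-backend abstraction is sketched below in Python for illustration; capture-manager's real language and interfaces may differ. The class and method names are hypothetical, and the raw-pipe backend assumes the vendor grabber writes packed frames of a known size and rate to ffmpeg's stdin:

```python
from dataclasses import dataclass
from typing import Protocol


class SourceBackend(Protocol):
    """What the shared encode/output stage needs from any vendor backend."""

    def input_args(self, device: str) -> list[str]: ...


@dataclass(frozen=True)
class DecklinkBackend:
    def input_args(self, device: str) -> list[str]:
        return ["-f", "decklink", "-i", device]

    def list_devices_cmd(self) -> list[str]:
        return ["ffmpeg", "-hide_banner", "-sources", "decklink"]


@dataclass(frozen=True)
class RawPipeBackend:
    """Deltacast/AJA style: an SDK grabber pipes rawvideo into ffmpeg stdin."""
    pixel_format: str  # e.g. "uyvy422"; depends on what the grabber emits
    width: int
    height: int
    framerate: str     # e.g. "30000/1001"

    def input_args(self, device: str) -> list[str]:
        # The grabber process selects the device; ffmpeg only reads the pipe.
        return ["-f", "rawvideo", "-pixel_format", self.pixel_format,
                "-video_size", f"{self.width}x{self.height}",
                "-framerate", self.framerate, "-i", "pipe:0"]


def capture_argv(backend: SourceBackend, device: str, output_args: list[str]) -> list[str]:
    """Full ffmpeg argv: vendor-specific input, shared output stage."""
    return ["ffmpeg", "-hide_banner", *backend.input_args(device), *output_args]
```

With this split, adding a vendor means adding one backend class plus its device discovery, while the HEVC-I and HLS arguments stay in a single shared builder.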

7. Risks

4:2:2 / 10-bit chroma: NVENC HEVC is 4:2:0 (8/10-bit). ProRes HQ is 4:2:2 10-bit. If a workflow REQUIRES 4:2:2 mezzanine, NVENC HEVC cannot match it and those feeds stay on ProRes (CPU). Decide per-workflow.
Premiere growing HEVC support: edit-while-record for HEVC-in-MXF (or frag MOV) is unproven in our stack — this is the make-or-break validation (§8).
GPU contention between live capture and post-record proxies on the same L4; mitigate by prioritising capture / relocating proxy load.
Storage: All-Intra HEVC bitrate ~ ProRes; expect similar disk usage.
Editor performance: HEVC-I decode in Premiere is heavier than ProRes on the edit workstation (decode cost moves to the editor). Validate scrubbing.
NVENC quality at all-intra vs ProRes for archival; tune bitrate/preset.

8. Validation gate (do FIRST, before building the pipeline)

Prove the editor story on ONE channel before wiring 8:

Rebuild capture ffmpeg with NVENC; give the sidecar the L4.
Capture one DeckLink feed to All-Intra HEVC, writing a GROWING file to the SMB share in (a) MXF OP1a, then (b) fragmented MOV.
While still recording, mount it in Premiere over SMB and confirm: edit-while-record works, scrubbing is acceptable, audio in sync, file remains valid after stop. Pick the container that works; if neither does, HEVC-I is capture-only (no growing edit) and we keep ProRes for growing workflows.
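Before involving Premiere, a cheap pre-check is to confirm that the partial file is decodable at all and that its decodable frame count rises while recording continues. A minimal sketch using ffprobe's frame counting; the function names are illustrative and the sampling schedule is left to the caller:

```python
import subprocess


def ffprobe_frame_count_cmd(path: str) -> list[str]:
    """ffprobe argv that decodes the first video stream and prints its frame count."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-count_frames", "-show_entries", "stream=nb_read_frames",
            "-of", "csv=p=0", path]


def count_frames(path: str) -> int:
    """Frames ffprobe can currently read from a (possibly still growing) file."""
    result = subprocess.run(ffprobe_frame_count_cmd(path),
                            capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


def is_growing(samples: list[int]) -> bool:
    """True if successive frame-count samples are strictly increasing."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))
```

Sampling `count_frames` a few times some seconds apart over the SMB mount and passing the results to `is_growing` shows whether each container is readable mid-record. A pass here is necessary but not sufficient: Premiere's own growing-file handling still has to be tested by hand.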

9. Rollout

Validation gate (§8) on one channel.
Make capture codec/container a recorder setting; default growing feeds to HEVC-I NVENC, keep ProRes selectable.
Move HLS preview to h264_nvenc.
Source-backend abstraction (§6) — land before Deltacast/AJA hardware.
GPU deinterlace + capacity test to 8 signals; finalise node sizing.
