Commit graph

1131 commits

Author SHA1 Message Date
d3e520e3b1 fix(capture+gui): kill audio-drift regression + fix elapsed/signal status
A/V REGRESSION (no audio + start stutter): capture-manager.js dropped the
-use_wallclock_as_timestamps 1 flag on the audio FIFO input (re-added by
d6b0b3a). Wallclock stamped audio by arrival time while video is CFR
frame-count, so audio ran 3-18% longer and master aresample padded seconds
of LEADING SILENCE → silent head, late video start, apparent 'no audio'.
Removing it restores the sample-count PTS baseline (8e5405c/55a72af):
audio shares the SDI clock domain, no drift, no pad.

GUI BUG A (elapsed showed 1hr+ on standby/just-started): frontend seeded
elapsed from recorder.started_at = the standby CONTAINER boot time (hours
old). Now seeds ONLY from the sidecar session duration (liveStatus.duration
when live.recording), shows nothing when idle. Backend /status now returns
session-scoped duration + recording flag, not container uptime.

GUI BUG B (false 'stopped' signal on idle ports): backend inferred signal
from container Running state (running->receiving, down->stopped) — so idle
standby ports with down sidecars showed red 'stopped'. Now signal comes
from the sidecar session (live.recording); standby = neutral 'idle', never
a false 'stopped'/'receiving'.
2026-06-04 13:21:30 +00:00
727bdaae80 feat(growing): auto-promotion scanner + hours-based delay setting
The growing_promote_after_seconds setting was stored but NEVER read — no
scanner existed, so growing clips only left the SMB share on a manual
right-click 'Move to S3'. This adds the missing automation:

- promotion-scanner.js: every 60s, finds pending_migration assets idle
  (updated_at) longer than settings.growing_promote_after_seconds and
  enqueues a promotion job. Idempotent (status guard + stable jobId) so
  it's safe on every promotion worker. 12h default fallback.
- worker/index.js: starts the scanner on promotion-capable workers.
- Settings UI: the delay field is now 'Auto-promote to S3 after (hours)'
  (converts hours<->seconds; storage stays seconds). Notes the manual
  Library right-click 'Move to S3' option too.

Manual promotion (right-click Move to S3) and the safe HLS-segment live
thumbnail were already implemented and working.
2026-06-04 13:14:03 +00:00
0c405ae7d4 fix(growing): read GROWING_ENABLED from env at record time + drop dead const
Second half of the growing-never-engages bug. start() decided growing via the module-level const GROWING_ENABLED (captured false at standby boot) and referenced the now-removed GROWING_SMB_MOUNT const (ReferenceError, silently swallowed). Both made growingActive=false, so every growing record produced HEVC/S3 instead of XDCAM HD422 MXF. Now reads process.env.GROWING_ENABLED + growingSmbConfig().mount fresh at record start.
2026-06-04 13:02:23 +00:00
b27b9f6909 fix(s3): keep-alive agents + long timeouts to end socket starvation
Root cause of stuck 'processing', failed deletes, and dead playback:

The mam-api proxies media (/video, /hls pipe the full S3 body through
Express), holding long-lived streaming sockets. With the SDK's default
http agents (no keep-alive, unbounded but unpooled) those streams starved
control-plane calls — DeleteObject and the proxy worker's master download
— which timed out (10s connectionTimeout) in bursts.

Fixes:
- mam-api S3 client: dedicated keep-alive http/https Agents (maxSockets 256)
  + requestTimeout raised 30s→300s so large master GETs finish.
- worker S3 client: previously had NO handler config at all (SDK defaults).
  Added keep-alive agents + 600s requestTimeout so proxy/conform master
  downloads (hundreds of MB) don't stall and leave assets in 'processing'.
2026-06-04 12:53:28 +00:00
ac1d7e1e1f fix(growing): read SMB params from env at mount time, not module load
Root cause of growing producing .mov instead of XDCAM HD422 .mxf:

mountGrowingShare() used module-level consts (GROWING_SMB_MOUNT etc.)
captured from process.env at IMPORT time. Standby capture containers boot
with these unset and receive the SMB mount/credentials per-session over
/capture/start (capture.js sets process.env right before start()). Because
the consts were already frozen empty, mountGrowingShare() saw no mount
source, returned false, and growing silently fell back to S3 streaming —
producing an HEVC .mov while the asset key said .mxf.

Fix: growingSmbConfig() reads process.env fresh at mount time. Also drop
the stale const guard in unmountGrowingShare().
2026-06-04 12:51:32 +00:00
cb25711ec6 fix(growing): inline CIFS creds + capture caps + storage probe timeout
Three fixes to restore growing-files (XDCAM HD422 MXF) recording:

1. capture-manager mountGrowingShare: pass username=/password= inline
   instead of a credentials= file. TrueNAS SMB3 rejects the creds-file form
   with EACCES (-13, 'cannot mount read-only') while the identical inline
   creds mount fine. This was causing every growing record to silently fall
   back to the HEVC/S3 path (producing .mov, not .mxf).

2. docker-compose capture: add cap_add SYS_ADMIN + DAC_READ_SEARCH and
   apparmor:unconfined so mount.cifs can run inside the container.

3. storage /overview: wrap S3 HeadBucket/ListObjects probe in a 5s timeout
   so the admin 'Mount health' card stops hanging on 'Probing…' forever
   when S3 is slow.
2026-06-04 12:42:39 +00:00
2812705d1c feat(recorders): expose XDCAM HD422 bitrate field in growing mode
Growing mode now shows an editable 'XDCAM HD422 bitrate (Mbps)' input (codec stays fixed to the growing MXF path). Default seeds to 50 Mbps for growing, 25 for GPU master. Backend already honored recording_video_bitrate via _buildGrowingOrchestrator -b:v/minrate/maxrate; this surfaces the control in the config modal.
2026-06-04 11:52:07 +00:00
91d0d755a5 fix(node-agent): proper stdcopy demux for container logs (clean line starts) 2026-06-04 05:29:56 +00:00
179a740453 feat(admin): cluster-wide Logs page + fix container log demux + poll containers
- mam-api: dockerLogs() + demuxDockerStream() — the local container-log path
  JSON.parsed Docker's raw multiplexed stream and always returned '(no logs)';
  now strips stdcopy framing and returns readable text (tail configurable).
- web-ui: new Logs admin page — every container across every node grouped by
  node in a left rail, live-follow log viewer with filter + copy on the right.
  Reuses the now-working /cluster/containers/:node/:id/logs endpoint.
- web-ui: Containers screen now polls every 5s (was load-once) so the
  cross-cluster view stays live without manual refresh.
- icons: add server + file glyphs (were referenced but missing -> blank).
- nav: Logs wired into the Admin sidebar section + routes + breadcrumbs.
2026-06-04 05:28:17 +00:00
1348db8f33 fix(node-agent): import crypto — auth was ALWAYS failing on remote nodes
THE root cause of 'container view only shows the primary': checkAgentAuth used
crypto.timingSafeEqual but 'crypto' was never imported (ES module). The call
threw ReferenceError, the try/catch swallowed it, _bearerEq returned false, so
EVERY bearer-token check on a node-agent failed. The primary's own containers
showed only because the local node-agent has no NODE_TOKEN (auth skipped).

Adding 'import crypto from crypto' makes token comparison work, so the primary
mam-api can now read containers + logs from every node.
2026-06-04 05:21:33 +00:00
4ad145f00a debug(node-agent): log token prefix/suffix on auth reject 2026-06-04 05:20:13 +00:00
90bd82f49a debug(node-agent): log auth reject token lengths 2026-06-04 05:18:34 +00:00
70c873ae95 fix(cluster): shared CLUSTER_READ_TOKEN so mam-api sees containers on ALL nodes
/cluster/containers only returned the primary's containers: mam-api fanned out
to each node-agent with a single NODE_AGENT_TOKEN, but each node-agent only
accepted its own bound NODE_TOKEN, so remote nodes returned 401 and were
silently dropped (UI showed 'only zampp1').

node-agent now ALSO accepts a shared CLUSTER_READ_TOKEN (= mam-api's
NODE_AGENT_TOKEN) for the read-only container/log endpoints, so the aggregate
container view + per-container logs work across the whole cluster.
2026-06-04 05:14:44 +00:00
d6b0b3a9a6 fix(capture): restore proven-clean wallclock audio (match de509c6 baseline)
Removing wallclock made A/V length drift far worse (audio 11.8% long). The
known-clean config used wallclock + master aresample=async=1; the leading
silence is a standby backlog artifact addressed by the bridge live-edge flush +
record-start audio FIFO drain, not by changing the timestamp source.
2026-06-04 05:06:40 +00:00
55a72af905 fix(capture): derive audio PTS from sample count (kill 2.5s leading silence)
The persistent ~2.5s of leading silence was the master aresample=async=1 PADDING
the audio to reconcile a PTS-origin mismatch: video PTS starts at frame 0
(-framerate), but -use_wallclock_as_timestamps stamped the first audio chunk at
its wall-clock arrival time (~2.5s after the ffmpeg graph opened). aresample
filled the gap with silence.

Drop wallclock: audio PTS now comes from the 48kHz sample count starting at 0 —
the same origin as video frame 0 — so the streams align with no pad. The bridge
already hands live audio (backlog flushed on attach), so no rate reference is
needed from wallclock.
2026-06-04 05:01:05 +00:00
e9e883d06e fix(deltacast-bridge): flush queued audio backlog to live edge on reader attach
The ~2.5s of leading silence at record start was the VHD audio slot QUEUE: while
the recorder is idle (no FIFO reader), the bridge blocks on open(O_WRONLY) but the
board keeps buffering audio slots. When the record ffmpeg attaches, the bridge
streamed that stale backlog first — heard as leading silence and pushing audio
out of alignment with the live video.

On each reader attach, drain slots that lock FAST (already-queued backlog) and
stop at the first lock that takes ~a frame period (= waiting on a live slot), so
the reader is handed the live edge, A/V aligned.
2026-06-04 04:54:32 +00:00
b1a2249f36 fix(capture): align A/V at record start (kill leading silence + length drift)
Root cause of 'silent first ~1s then clean' + ~0.5% audio-too-long: in standby
the bridge keeps filling the audio FIFO while the idle-preview consumes only
video, so when recording starts ffmpeg reads a ~0.5s backlog of stale audio,
AND the video-only pre-roll discards video frames the audio never had.

Fix: (1) skip the video-only pre-roll in standby (warm slot = no unstable
frames), (2) drain the audio FIFO non-blocking immediately before ffmpeg opens
it, so audio starts at the live edge aligned with the first real video frame.
2026-06-04 04:49:53 +00:00
fffb6b63b5 fix(capture): revert 16ch audio to clean 2ch — fixes pitch/rate regression
The 16ch interleave in the deltacast bridge produced audio at HALF the correct
sample rate (measured 24224 vs 48000 samples/s/ch), which broke A/V sync and
pitch. Per the working baseline (audio was clean before the channel selector),
revert the bridge audio thread to the original single-group 2ch extraction and
the capture-manager audio input to -ac 2 + wallclock + aresample.

KEPT the good fixes: long-GOP HEVC for non-growing (NVENC realtime, no frame
drops) and GPU-only codec list. 16ch/channel-select is shelved for a separate,
properly-validated change.
2026-06-04 04:33:34 +00:00
b28393eb76 Revert "fix(capture): skip video-only pre-roll in standby to stop A/V pitch drift"
This reverts commit 51b66d882f.
2026-06-04 04:28:11 +00:00
51b66d882f fix(capture): skip video-only pre-roll in standby to stop A/V pitch drift
The pre-roll drained only the video pipe (fc_pipe) while the audio FIFO kept
buffering, so ffmpeg read ~PRE_ROLL_SECONDS of surplus pre-roll audio — making
audio longer than video, which when synced compresses audio ~0.5% (pitch-up,
measured: 2591573 audio samples vs 2579395 expected for the video duration).

In standby the framecache slot is already warm (no unstable startup frames), so
the drain is unnecessary; skipping it lets ffmpeg open video and audio together
from the same instant. Cold on-demand spawns keep the brief drain.
2026-06-04 04:24:08 +00:00
07eea02109 fix(capture): restore audio wallclock (throughput) + remove CPU codec options
- restore -use_wallclock_as_timestamps on audio input: without it ffmpeg's raw
  s16le reader stalled the graph (NVENC idle at 9%, ~half frames dropped). With
  it + long-GOP HEVC the encoder runs realtime and A/V length stays locked.
- remove all CPU codec options (prores*, dnxh*, libx264/265) from recorder UI;
  GPU NVENC only (hevc_nvenc / h264_nvenc). 3x L4 cluster, no reason for CPU.
- GPU codec defaults in env builders + proxy default h264_nvenc.
2026-06-04 04:14:59 +00:00
0ea22e1e53 fix(capture): gate all-intra HEVC on growing-files; normal record uses long-GOP
The hevc_nvenc codec was hardcoded to all-intra (-force_key_frames expr:1), which
is ~4x the NVENC load. Applied to every recording it exceeded the L4's realtime
budget at 1080p59.94 10-bit -> fc_pipe dropped ~half the frames -> video came out
shorter than the (correct) audio -> A/V drift + pitch-up on playback.

Now all-intra is used ONLY when growing-files is on (where it's required for the
editable head). Normal recordings use efficient long-GOP HEVC (2s GOP, 2 B-frames)
which NVENC sustains in realtime with zero drops.
2026-06-04 04:09:14 +00:00
8e5405c3f9 fix(capture): derive deltacast audio PTS from sample count, not wall-clock
Removing -use_wallclock_as_timestamps on the SDI audio input. The bridge writes
SDI-clock-paced samples, so PTS from the 48kHz sample count shares the video's
clock domain and the audio length tracks the video length exactly. Wall-clock
timestamps made audio length = real elapsed time, which drifted ~1% longer than
the frame-count video when the encoder dipped under realtime (pitch-up).
2026-06-04 04:01:54 +00:00
51f939b1fe fix(deltacast-bridge): use group-0 sample count as authoritative audio length
Taking the MAX sample count across the 4 audio groups could emit more audio
frames per slot than group 0 (the SDI-clock reference), drifting the audio
stream slightly longer than video — heard as a ~1% pitch-up. Group 0 paces the
timeline exactly as the original 2ch path did; shorter groups are silence-padded
to its length, never extending it.
2026-06-04 04:01:25 +00:00
095306d9cf feat(recorders): 16ch SDI audio capture + per-recorder channel select + menu redesign
Audio:
- deltacast-bridge: always extract all 4 SDI audio groups (16ch), interleave to
  one 16ch s16le stream per port FIFO; format JSON reports audio_channels:16
- capture-manager: declare FIFO as 16ch input; keep first N discrete channels
  (2/8/16) via pan channelmap on the master (no downmix); HLS preview stays
  stereo. effAudioChannels drives -ac on the master container.
- config modal: Audio channels select (2/8/16)
- channel count already flows mam-api->node-agent->capture via RECORDING_AUDIO_CHANNELS

UI redesign (production craft):
- recorders grouped into per-node hardware 'rack' cards (online/offline state)
- lifecycle accent rail: grey DISABLED / green ENABLED / pulsing-red RECORDING
- promoted capture-port chip, monospaced metadata, Enable as primary CTA
- dedicated recorder CSS block; built on existing design tokens
2026-06-04 03:34:41 +00:00
de509c66ab feat(recorders): hardware-identity model with Enable/Disable lifecycle
Recorders are now physical capture ports, not user-created rows:
- migration 036: label, enabled, auto_provisioned + UNIQUE(node_id,device_index)
  (the structural fix that makes two recorders sharing a port impossible)
- mam-api: auto-provision one recorder row per port from heartbeat capabilities
  (reconcileRecordersForNode); create-once, never overwrites operator config
- mam-api: POST /:id/enable + /:id/disable (provision/teardown standby sidecar);
  PATCH accepts label; config persists across enable/disable
- node-agent: freeCapturePort() force-removes any container on a capture port
  before standby/start — eliminates the EADDRINUSE collisions
- web-ui: recorder menu grouped by node (online/offline), Enable/Disable toggle,
  per-recorder config modal (codec/bitrate/growing/label/project), friendly
  label over hardware name, no destructive delete

Fixes the delete/recreate churn that orphaned standby sidecars and collided on
capture ports during this session's outage.
2026-06-04 03:14:43 +00:00
9f2eac7b61 merge: capture cleanup + standby reconcile helper (base for recorder redesign) 2026-06-04 03:05:06 +00:00
bf4632b911 feat(mam-api): extract ensureStandbySidecar + add POST /recorders/reconcile-standby
Re-provisions the persistent standby sidecar for SDI/deltacast recorders that
lost theirs (manual cleanup, node redeploy, wiped /dev/shm). Without this the
recorder falls back to slow on-demand spawn on /start, which can collide on the
capture port (EADDRINUSE). Idempotent; { force:true } recreates even when a
container_id is already set.
2026-06-04 03:05:00 +00:00
5668c03615 chore(capture): remove stale legacy FIFO path + pin capture profile
- capture-manager: remove dead legacy deltacast FIFO video path (FC_SLOT_ID
  is now always set by node-agent, framecache mandatory on all SDI nodes)
- node-agent: correct stale comment about legacy FIFO fallback
- onboard-node.sh: harden detect_sdi (device-node checks, not just lspci) and
  persist COMPOSE_PROFILES so framecache survives every redeploy on SDI nodes
- remove committed capture.js.bak

Root cause of this session's outage: zampp3 came up without the capture
compose profile, so framecache never started; the bridge published to shm
with no consumer and recorders showed 'receiving' with no real capture.
2026-06-04 02:50:57 +00:00
Wild Dragon Dev
4045e30cd2 fix(node-agent): make http server handler async 2026-06-04 01:54:38 +00:00
Wild Dragon Dev
df6ca084ff feat(web-ui): add Node column to Containers screen + integrated log viewer 2026-06-04 01:48:44 +00:00
Wild Dragon Dev
2f13c8d8b1 feat(mam-api): aggregate containers from all nodes + proxy logs 2026-06-04 01:42:13 +00:00
Wild Dragon Dev
a90adb5b52 feat(node-agent): add /containers and /sidecar/:id/logs endpoints 2026-06-04 01:40:44 +00:00
Wild Dragon Dev
8efcf5c545 feat(capture): remove build-with-decklink.sh script 2026-06-04 01:27:41 +00:00
Wild Dragon Dev
e5abbede43 debug(fc_writer): add trace logs for GET slots path 2026-06-04 01:13:19 +00:00
Wild Dragon Dev
cc489f7774 fix(fc_writer): handle 409 Conflict by fetching existing slot details via GET 2026-06-04 01:12:06 +00:00
Wild Dragon Dev
5b72ee167d fix(decklink-bridge): prevent redundant fc_writer_open loops via last_format tracking 2026-06-04 01:10:47 +00:00
Wild Dragon Dev
d957ce74ae fix(decklink-bridge): avoid redundant fc_writer_open calls in reopen_slot 2026-06-04 01:09:08 +00:00
Wild Dragon Dev
58c058b10c fix(framecache): bind port 7435 to 0.0.0.0 so remote bridges can register slots 2026-06-04 01:00:54 +00:00
Wild Dragon Dev
e715af158d fix(node-agent): pass FRAMECACHE_IP to node-agent env 2026-06-04 00:58:51 +00:00
Wild Dragon Dev
21ba7595b3 fix(node-agent): await async cleanup + fix syntax 2026-06-04 00:57:22 +00:00
Wild Dragon Dev
315b31a68b fix(node-agent): await stopDecklinkBridge and clean up stale occurrences 2026-06-04 00:54:29 +00:00
Wild Dragon Dev
d1b40f5303 fix(node-agent): pass correct FC_URL and Cmd to containerized decklink-bridge 2026-06-04 00:51:14 +00:00
Wild Dragon Dev
6ee8dd5694 feat(node-agent): containerized decklink-bridge + async bridge management 2026-06-04 00:46:19 +00:00
Wild Dragon Dev
8ca7c79acd fix(node-agent): mount decklink-bridge wrapper script as file (not dir) 2026-06-04 00:43:19 +00:00
Wild Dragon Dev
fb0ce320a5 build(node-agent): mount host /usr/local/bin to expose decklink-bridge wrapper 2026-06-04 00:42:31 +00:00
Wild Dragon Dev
6481760dff revert(capture): Dockerfile copy paths to root-relative for compose build 2026-06-04 00:39:24 +00:00
Wild Dragon Dev
650a100d17 build(capture): include decklink-bridge in runtime image 2026-06-04 00:37:49 +00:00
Wild Dragon Dev
400cb786ab fix(decklink-bridge): use IDeckLinkVideoBuffer QueryInterface to get raw bytes 2026-06-04 00:35:16 +00:00
Wild Dragon Dev
74055e79f8 fix(decklink-bridge): use GetFrameInternalBufferBytes instead of GetBytes 2026-06-04 00:28:19 +00:00