Root cause of stuck 'processing', failed deletes, and dead playback:
The mam-api proxies media (/video, /hls pipe the full S3 body through
Express), holding long-lived streaming sockets. With the SDK's default
http agents (no keep-alive, unbounded but unpooled) those streams starved
control-plane calls — DeleteObject and the proxy worker's master download
— which timed out (10s connectionTimeout) in bursts.
Fixes:
- mam-api S3 client: dedicated keep-alive http/https Agents (maxSockets 256)
+ requestTimeout raised 30s→300s so large master GETs finish.
- worker S3 client: previously had NO handler config at all (SDK defaults).
Added keep-alive agents + 600s requestTimeout so proxy/conform master
downloads (hundreds of MB) don't stall and leave assets in 'processing'.
Three fixes to restore growing-files (XDCAM HD422 MXF) recording:
1. capture-manager mountGrowingShare: pass username=/password= inline
instead of a credentials= file. TrueNAS SMB3 rejects the creds-file form
with EACCES (-13, 'cannot mount read-only') while the identical inline
creds mount fine. This was causing every growing record to silently fall
back to the HEVC/S3 path (producing .mov, not .mxf).
2. docker-compose capture: add cap_add SYS_ADMIN + DAC_READ_SEARCH and
apparmor:unconfined so mount.cifs can run inside the container.
3. storage /overview: wrap S3 HeadBucket/ListObjects probe in a 5s timeout
so the admin 'Mount health' card stops hanging on 'Probing…' forever
when S3 is slow.
- mam-api: dockerLogs() + demuxDockerStream() — the local container-log path
JSON.parsed Docker's raw multiplexed stream and always returned '(no logs)';
now strips stdcopy framing and returns readable text (tail configurable).
- web-ui: new Logs admin page — every container across every node grouped by
node in a left rail, live-follow log viewer with filter + copy on the right.
Reuses the now-working /cluster/containers/:node/:id/logs endpoint.
- web-ui: Containers screen now polls every 5s (was load-once) so the
cross-cluster view stays live without manual refresh.
- icons: add server + file glyphs (were referenced but missing -> blank).
- nav: Logs wired into the Admin sidebar section + routes + breadcrumbs.
- restore -use_wallclock_as_timestamps on audio input: without it ffmpeg's raw
s16le reader stalled the graph (NVENC idle at 9%, ~half frames dropped). With
it + long-GOP HEVC the encoder runs realtime and A/V length stays locked.
- remove all CPU codec options (prores*, dnxh*, libx264/265) from recorder UI;
GPU NVENC only (hevc_nvenc / h264_nvenc). 3x L4 cluster, no reason for CPU.
- GPU codec defaults in env builders + proxy default h264_nvenc.
Taking the MAX sample count across the 4 audio groups could emit more audio
frames per slot than group 0 (the SDI-clock reference), drifting the audio
stream slightly longer than video — heard as a ~1% pitch-up. Group 0 paces the
timeline exactly as the original 2ch path did; shorter groups are silence-padded
to its length, never extending it.
Recorders are now physical capture ports, not user-created rows:
- migration 036: label, enabled, auto_provisioned + UNIQUE(node_id,device_index)
(the structural fix that makes two recorders sharing a port impossible)
- mam-api: auto-provision one recorder row per port from heartbeat capabilities
(reconcileRecordersForNode); create-once, never overwrites operator config
- mam-api: POST /:id/enable + /:id/disable (provision/teardown standby sidecar);
PATCH accepts label; config persists across enable/disable
- node-agent: freeCapturePort() force-removes any container on a capture port
before standby/start — eliminates the EADDRINUSE collisions
- web-ui: recorder menu grouped by node (online/offline), Enable/Disable toggle,
per-recorder config modal (codec/bitrate/growing/label/project), friendly
label over hardware name, no destructive delete
Fixes the delete/recreate churn that orphaned standby sidecars and collided on
capture ports during this session's outage.
Re-provisions the persistent standby sidecar for SDI/deltacast recorders that
lost theirs (manual cleanup, node redeploy, wiped /dev/shm). Without this the
recorder falls back to slow on-demand spawn on /start, which can collide on the
capture port (EADDRINUSE). Idempotent; { force:true } recreates even when a
container_id is already set.
Sidecars now spawn at recorder CREATE time instead of /start time.
The container boots in STANDBY=1 mode (idle preview only, no ffmpeg master).
On /start, mam-api sends per-session params (CLIP_NAME, ASSET_ID, PROJECT_ID)
to the running sidecar via HTTP POST /capture/start — ffmpeg starts in <1s.
On /stop, mam-api calls HTTP POST /capture/stop — container stays alive in
standby, ready for the next take immediately.
Container is only killed on recorder DELETE.
This eliminates: Docker create/start overhead (~1-2s), bridge startup (~2-5s),
and pre-roll wait (~5s). Latency from 'record' click to first encoded frame
drops from ~10s to ~1s.
Changes:
- capture/src/index.js: boot in standby when STANDBY=1 env is set; still
start idle preview (live thumbnail visible before recording)
- capture/src/routes/capture.js: POST /start accepts full codec params and
asset_id in body (skips mam-api asset creation when asset_id provided)
- node-agent/index.js: handleSidecarStandby() + POST /sidecar/standby route;
warms bridge at recorder create time
- recorders.js POST /: spawn standby sidecar after DB insert (non-fatal)
- recorders.js POST /:id/start: HTTP fast-path to standby sidecar; falls
back to on-demand spawn if standby not available
- recorders.js POST /:id/stop: HTTP /capture/stop, keep container in standby
- recorders.js GET /:id/status: use port-based URL for local capture status
The scheduler tick loop updates a schedule's status to 'starting' and 'stopping'
in the database while it initiates the API calls to the recorder container. The
original CHECK constraint in recorder_schedules rejected these two statuses,
causing the scheduler to crash on constraint violation and never start the job.
Replace the HLS 'connecting…' player in the library with a real frame grabbed
from the start of the recording, while the recording is still live.
Flow:
- recorders.js already pre-creates the asset as status='live' + ASSET_ID env
- capture-manager.start() fires _publishLiveThumbnail() (non-blocking): polls
/live/<id> for the first seg-*.ts, extracts frame 0 via ffmpeg (scaled JPEG,
yuvj420p), uploads to S3 thumbnails/<id>.jpg, then POSTs the key to mam-api
- new mam-api POST /assets/:id/live-thumbnail sets thumbnail_s3_key on the still
-live row (status untouched); idempotent no-op once finalized
- visuals.jsx AssetThumb: for live assets, show the static poster once the key /
signed URL is available, else fall back to the live HLS preview. Pulsing LIVE
border kept either way
- POST /assets gains an optional status param (default 'processing'); 'live'
skips the proxy/thumbnail queue
- capture /stop route now finalizes the pre-created asset by id (guarded) instead
of POSTing a duplicate
🤖 Generated with Claude Code
When a capture sidecar crashes before finalize() runs (e.g. wrong node,
filter error, hardware fault), the asset stays 'live' indefinitely — library
shows 'Recording' badge for up to 120 minutes until the stale-timeout fires.
Add an orphan check that runs every scheduler tick: if an asset is 'live'
and its recorder is 'stopped', mark it 'error' immediately. This runs before
the 120-minute staleness guard so the library clears within 15 seconds.
🤖 Generated with Claude Code
ROOT CAUSE of 'connecting' hangs and intermittent port failures:
The DELTA-12G-e-h 8c is a bidirectional card. Without calling
VHD_SetBiDirCfg(board_index, VHD_BIDIR_80) before streaming, the
board remains in its default bi-dir config (likely 4RX/4TX) — so
RX stream opens fail with VHDERR_RESOURCEUNAVAILABLE on channels
configured as TX, causing random 'connecting' hangs per the SDK docs.
Per SDK Tools.cpp SetNbChannels() pattern:
1. Open temporary board handle
2. Check IS_BIDIR + channel counts
3. Call VHD_SetBiDirCfg(board_index, VHD_BIDIR_80) for 8ch bidir
4. Close temp handle, then open real board handle for streaming
Also add VHD_SetChannelProperty(VHD_CHANNEL_MODE_SDI) for ASI-type
channels per Sample_RX.cpp — required for 12G-ASI/3G-ASI channel
types to correctly detect incoming video standard.
🤖 Generated with Claude Code
Growing root cause (4th attempt): Premiere doesn't import H.264-in-.ts
("unsupported compression type"); its growing-file support is MXF OP1a.
Prior MXF/DNxHR failed because DNxHR is VBR and never flushes the incremental
index — XDCAM HD422 (mpeg2video, CBR) DOES write index segments into body
partitions mid-record (proven live via SIGKILL: 5 index segments, readable,
no footer). Growing master is now MXF OP1a / XDCAM HD422 4:2:2 CBR + PCM s16le,
operator bitrate as CBR (default 50M). live-path returns .mxf to match.
GUI: bitrate input is now always editable in growing mode (was hidden for
ProRes-selected codecs); codec menu shown disabled-with-explanation under
growing (it had only looked "missing" due to a stale served bundle).
Requires Premiere prefs: Media > "Automatically refresh growing files" ON,
and disable the two XMP-write-on-import options.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
duration_ms/file_size are int8; node-postgres returned them as strings,
a footgun for any consumer doing arithmetic/sorting/comparison (already
hand-patched once in playout totals). Register a global int8 type parser
so the API emits real numbers. All such values are < 2^53 (no precision loss).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop in the redesigned timeline-centric Playout (PGM monitor, transport,
SCTE-35 card, as-run drawer) from the on-node redesign, fully wired to the
real playout API (channels/transport/HLS preview w/ error-recovery/as-run);
no mock data. In-page ConfirmModal for destructive actions.
SCTE-35: new playout_scte_breaks table (migration 033), endpoints to
schedule/trigger/list/cancel breaks (POST/GET/DELETE /channels/:id/scte[/trigger]),
scheduler due-break sweep, engine triggerScte + auto-return + as-run 'scte'
rows + on-air SCTE-BREAK state and timeline AD markers. In-stream SCTE-35
cue injection is a documented stub (CasparCG FFMPEG consumer exposes no
scte35 muxer) — scheduling/triggering/countdown/as-run are functional.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Root cause: MXF OP1a writes its index/duration only in the footer partition
on finalize, so a growing MXF has no footer and VLC/Premiere/ffmpeg-strict
refuse it ("Unable to open file on disk"). Separately the proxy job pointed
at a .mov S3 key that never existed (promotion worker watched a local empty
disk, not the SMB share), so stop -> instant proxy failure.
Fix: growing master is now MPEG-TS (H.264 high422 all-intra + AAC), which is
readable from the first PAT/PMT while still growing (verified mid-write decode).
hiresKey derives from the actual produced extension. Capture skips finalize for
growing recorders (leaves asset live for promotion). Promotion worker CIFS-
mounts the same growing_smb share before scanning; worker image gets cifs-utils
and worker-p4 runs privileged (local /growing bind removed). /live-path uses .ts.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
#162 local-spawn stop now uses /stop?t=180 + waits for asset to leave 'live'
before removing the container (no more SIGKILL-corrupted masters / stuck-live).
#163 validateRecorderConfig guard (PCM!=MP4, HEVC!=MXF, NVENC needs GPU) on
create+PATCH; codec presets in new-recorder modal.
#159 container list reads Docker /stats memory (N/A when null) + UI render.
#160 primary node self-populates version + uptime on the Cluster screen.
#145 asset-detail Download original gated by dismissable size warning.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per-node "Capture Drivers / SDKs" panel installs Blackmagic / AJA / Deltacast
/ NDI drivers without SSH. node-agent gains NODE_TOKEN-gated /driver/install
+ /driver/status (spawns a one-shot privileged ubuntu container that bind-
mounts host kernel paths + the repo and runs deploy/install-driver.sh);
mam-api adds admin-gated /cluster/:id/install-driver + /driver-status.
Driver files live in-repo under sdk/<vendor>/ (private repo); binaries are
admin-supplied per each sdk/<vendor>/README.md. Vendor allowlist throughout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cluster: AddNodeModal on Admin->Cluster mints a node token via /auth/tokens
and emits a ready-to-paste curl|bash onboarding command. New admin-only
GET /cluster/onboard-info returns apiUrl/scriptUrl/branch. Role->PROFILES
mapping (worker/capture/gpu); gate worker-l4 behind compose profile [gpu].
Home: restore "Let's Create" kicker + one-line "Media Asset Management &
Production Platform" tagline; animated accent pulse behind the dragon logo
(reduced-motion safe); move Settings tile to a centered bottom row.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Dashboard (screens-home.jsx): rebuild to new design, fully live-wired.
Dropped fabricated figures per "real data" rule (object-store %, uptime,
storage breakdown); repurposed ingest cell to real Assets-24h count.
Fixed undefined refs and double-rendered Resources section.
Playout: as-run writer in scheduler.js writeAsRun() off the health-tick
/status poll; AsRunPanel UI + missing CSS in styles-playout.css.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Root cause of the persistent black preview, fully isolated: ZAMPP1's nginx
serves the live .m3u8 fresh on every request (no-store works there), but
the PUBLIC reverse proxy (159.112.211.103 -> ZAMPP1) caches the static
.m3u8 by path with a multi-second TTL, ignoring both the origin's no-store
and query params. hls.js reloads the playlist ~every second, always landing
inside that TTL, so it sees the live playlist as never advancing
("live playlist MISSED" forever), never establishes the timeline, and never
loads a fragment -> readyState 0 (black). Proven: rapid reads via ZAMPP1
localhost advance (404->405); the same rapid reads via the public URL are
stuck; query-busting doesn't help (proxy caches by path).
Fix: serve the playlist through GET /api/v1/playout/channels/:id/hls/index.m3u8
instead of the static /media/live path. /api/ is not proxy-cached (the live
status poll already updates fine through it), so hls.js always gets the fresh
live edge. Segment (.ts) lines are rewritten to absolute /media/live/<id>/
URLs so they still load from the static path (immutable; caching them is
correct). ProgramMonitor points hls.js at the /api playlist and sends the
session cookie (xhrSetup withCredentials) since /api is auth-gated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Growing-files mode became per-recorder (recorders.growing_enabled); the
global growing_enabled setting was removed. GET /assets/:id/live-path —
which the Premiere plugin calls to mount a still-growing master — still
required growing_enabled==='true', so Mount Live would 409 "Growing-files
mode is disabled" on any deploy where that stale key isn't set. Drop the
global gate: a status='live' asset already proves a growing recorder is
producing the file; only the editor-facing growing_smb_url is required.
Response contract is unchanged, so the plugin needs no update.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements docs/superpowers/specs/2026-05-31-storage-settings-growing-smb-design.md.
1. Storage warning banner at the top of Settings → Storage (set-once /
path-change-corrupts-data warning).
2. Growing-files SMB credentials + system CIFS mount (Approach A):
- settings.js: new global keys growing_smb_mount / growing_smb_username /
growing_smb_vers; growing_smb_password is write-only (GET returns only
growing_smb_password_exists; growing_smb_password_clear:true removes it).
- GrowingSettingsCard: SMB mount/username/password (masked, "saved" state) +
CIFS version fields.
- capture Dockerfile: add cifs-utils + util-linux.
- capture-manager: on growing start, mount //host/share at /growing using a
root-only credentials file (creds never on the command line); unmount on
stop; mount failure falls back to S3 streaming so a recording is never lost.
- recorders.js: pass GROWING_SMB_* env; don't host-bind /growing when a CIFS
mount is configured (an empty mountpoint is required).
3. Per-recorder growing mode (global toggle removed):
- Removed the global "capture writes to local SMB share first" checkbox; the
growing card is now SMB-infrastructure-only.
- recorders.js reads the per-recorder recorders.growing_enabled column
(already present from migration 014) instead of the global setting;
RECORDER_FIELDS += growing_enabled.
- New-recorder modal: "Growing-files mode" toggle.
- storage.js overview: "enabled" now means the SMB landing zone is configured
(mount source set), surfaced as smb_mount; health strip labels updated.
No DB migration required (recorders.growing_enabled exists; new settings are
key/value rows).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- spawnChannelSidecar: set last_heartbeat_at = NOW() when flipping
channel to 'running'. Without this, last_heartbeat_at is NULL so
the first scheduler tick sees ageMs = (now - epoch) >> TIMEOUT_MS
and triggers failover before the sidecar has had a single chance
to respond.
- scheduler playoutHealthTick: when last_heartbeat_at is NULL fall
back to updated_at as the baseline (belt-and-suspenders with the
spawnChannelSidecar fix). Also include updated_at in the query.
- POST /channels/:id/play: catch callSidecar errors explicitly and
return 502 Bad Gateway instead of delegating to next(err) which
the error middleware maps to 409 Conflict.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- jobs.js: add playout-stage BullMQ queue to QUEUES; asset_id from
job data is already resolved to a name by attachAssetNames
- screens-jobs.jsx: map type 'playout-stage' -> kind 'Stage' with
monitor icon
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this, a freshly-spawned channel with NULL last_heartbeat_at was
instantly failover-killed by the playoutHealthTick because `0` was used as
the lastSeen timestamp, making ageMs huge on the very first tick.
Playout failover queries cluster_nodes.last_seen_at to find healthy nodes
for channel re-placement. Column missing from original cluster schema.
Migration 031 adds column + backfills existing nodes to NOW().
Fixes scheduler error: column "last_seen_at" does not exist
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Backend (routes/users.js):
- GET / now returns totp_enabled so the UI can show 2FA status
- GET /:id/access — admin-only effective per-project access (MAX over direct +
group grants), labels via=direct|group:<name>; admins report all/edit
- POST /:id/totp/disable — admin clears a locked-out user's 2FA without their
password (self-service disable still requires it); dev user blocked
- role validated against {admin,editor,viewer} on create + PATCH (was unchecked)
Frontend:
- Users>Policies tab: static prose replaced with interactive per-user matrix —
inline role select, 2FA badge, Reset-2FA action, lazy per-user access expander
- Home "Premiere panel" tile -> "Downloads"; modal renamed, adds Teams ISO row
(disabled "coming soon" until the .exe is supplied); UXP .ccx link unchanged
- data.jsx: window.TEAMS_ISO placeholder ({available:false})
Not runtime-tested in browser yet. Teams ISO .exe still pending from user.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Post-review fixes for the 8-commit playout-mcr drop:
- Scheduler self-calls (callSelf -> /recorders, /playout) carried no auth, so
under AUTH_ENABLED=true requireUiHeader 403'd every mutating POST. This broke
playout failover AND scheduled recordings. Add a per-boot in-process service
token (x-internal-token) the scheduler attaches; requireAuth/requireUiHeader
treat it as the seeded admin. No env/compose config needed.
- Failover deadlocked: restartChannel set status='starting' then the scheduler
called the guarded /start route, which 409s on 'starting'. Extract the spawn
body into spawnChannelSidecar() shared by /start and restartChannel; failover
now spawns directly with no self-call.
- Phase A playlist stalled after 2 clips: _scheduleAdvance cued the next clip
via LOADBG AUTO but never advanced the pointer. Pass asset_duration_ms in the
/play payload and arm a duration-based timer that advances currentIndex and
cues subsequent clips, keeping as-run in sync for arbitrary-length playlists.
- CasparCG consumer syntax was invalid: "ADD <ch> FFMPEG" is the producer name,
not a consumer keyword, and old -vcodec/-acodec short args are rejected. Use
STREAM/FILE with -codec:v / -codec:a / -preset:v / -tune:v and a format=yuv420p
filter ahead of libx264 (channel output is RGBA).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>