Post-review fixes for the 8-commit playout-mcr drop:
- Scheduler self-calls (callSelf -> /recorders, /playout) carried no auth, so
under AUTH_ENABLED=true requireUiHeader 403'd every mutating POST. This broke
playout failover AND scheduled recordings. Add a per-boot in-process service
token (x-internal-token) the scheduler attaches; requireAuth/requireUiHeader
treat it as the seeded admin. No env/compose config needed.
- Failover deadlocked: restartChannel set status='starting' then the scheduler
called the guarded /start route, which 409s on 'starting'. Extract the spawn
body into spawnChannelSidecar() shared by /start and restartChannel; failover
now spawns directly with no self-call.
- Phase A playlist stalled after 2 clips: _scheduleAdvance cued the next clip
via LOADBG AUTO but never advanced the pointer. Pass asset_duration_ms in the
/play payload and arm a duration-based timer that advances currentIndex and
cues subsequent clips, keeping as-run in sync for arbitrary-length playlists.
- CasparCG consumer syntax was invalid: "ADD <ch> FFMPEG" is the producer name,
not a consumer keyword, and old -vcodec/-acodec short args are rejected. Use
STREAM/FILE with -codec:v / -codec:a / -preset:v / -tune:v and a format=yuv420p
filter ahead of libx264 (channel output is RGBA).
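A minimal sketch of the per-boot service token described above, assuming a random value generated at process start and a constant-time comparison. The helper names `isInternalToken` and `selfCallHeaders` are hypothetical; only the `x-internal-token` header name comes from this change.

```javascript
const crypto = require('node:crypto');

// Generated once per process start; never persisted and never read from env.
const INTERNAL_TOKEN = crypto.randomBytes(32).toString('hex');

// Constant-time comparison. Lengths are checked first because
// crypto.timingSafeEqual throws when the buffers differ in length.
function isInternalToken(headerValue) {
  if (typeof headerValue !== 'string') return false;
  const given = Buffer.from(headerValue);
  const expected = Buffer.from(INTERNAL_TOKEN);
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}

// Hypothetical helper: headers the scheduler would attach to a self-call.
function selfCallHeaders() {
  return { 'x-internal-token': INTERNAL_TOKEN, 'content-type': 'application/json' };
}
```

Because the token lives only in process memory, it changes on every restart and needs no env or compose wiring, which matches the "no config needed" note above.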
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Routes: channel + playlist CRUD, start/stop/play/pause/skip transport, as-run
log. RBAC via assertProjectAccess on channel.project_id; null project ⇒
admin-only (recorder convention).
Sidecar orchestration mirrors recorders.js: Docker socket for local node,
node-agent /sidecar/start for remote. Channel start passes CHANNEL_ID env so
the sidecar can write HLS preview to /media/live/<id>.
DeckLink port-contention guard: blocks starting a decklink channel when a
recorder or another channel on the same node+device_index is active.
restartChannel(id) helper picks another healthy cluster node and re-places
non-decklink channels; decklink is alert-only. Exposed for the scheduler.
Scheduler tick adds step 6: poll each running channel's sidecar /status,
update last_heartbeat_at, and after ~3 misses trigger restartChannel +
self-call /start. Reuses the existing PG advisory lock so multi-replica
deploys don't double-fire failovers.
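The miss counting in step 6 could look roughly like the sketch below. The threshold of exactly 3 and the `recordHeartbeat` helper are assumptions (the message only says "~3 misses"); the advisory lock and the actual /status poll are left out.

```javascript
const MISS_THRESHOLD = 3; // assumption: the commit only says "~3 misses"

// misses: Map<channelId, number>, kept by the caller between scheduler ticks.
// Returns true when this tick should trigger a failover for the channel.
function recordHeartbeat(misses, channelId, ok) {
  if (ok) {
    misses.delete(channelId); // any successful poll clears the streak
    return false;
  }
  const n = (misses.get(channelId) || 0) + 1;
  if (n >= MISS_THRESHOLD) {
    misses.delete(channelId); // reset so the restarted channel starts clean
    return true;
  }
  misses.set(channelId, n);
  return false;
}
```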
Six tables: channels, playlists, items, sidecars (sidecar registry for
health-check), schedule (Phase B), as-run log.
- video_format default 1080p5994 (house standard, capture cadence)
- restart_count / last_restart_at / last_heartbeat_at on channels for
auto-failover bookkeeping
- audio_normalized flag on items so re-stages skip the loudnorm pass
- unique partial index on (channel_id) for running sidecars
Review of the v2 auth landing turned up four weak spots in the MFA path.
All four are now fixed; behaviour is unchanged for the password-correct
+ correct-TOTP happy path.
1. TOTP brute-force gate (the big one). /login was calling
ipBackoff.recordSuccess(ip) the instant the password hashed correctly,
*before* the second factor was proven. That cleared the per-IP failure
counter, so each /login retry let an attacker with a known password
hammer the 6-digit /login/totp space (10^6) at full speed.
Now recordSuccess fires only inside establishSession() — i.e. after
every required factor has actually passed (password [+TOTP] or
OAuth [+TOTP]).
2. MFA ticket binding. Tickets issued by /login (and the Google callback)
were unbound — a stolen ticket replayed from a different origin still
worked. Tickets now carry SHA-256 hashes of the issuing request's IP
and User-Agent; redeemTicket rejects on mismatch. The ticket is burned
even on mismatch so a wrong-binding probe can't be retried.
3. TOTP replay within the same 30s step (RFC 6238 §5.2). The verifier
accepted the same code as many times as you submitted it. Now
verifyToken returns the matched counter, and /login/totp does a CAS
UPDATE on users.totp_last_counter — codes at counters <= the last
accepted value are rejected. New migration 030 adds totp_last_counter,
seeded on /totp/enable so the enrollment code itself can't be reused
at first login, and zeroed on /totp/disable.
4. Google OAuth domain check no longer falls back to the email suffix
when the hd (hosted-domain) claim is missing. Email-suffix matching
let consumer (non-Workspace) Google accounts whose email happens to
end in the allowed domain through; if GOOGLE_ALLOWED_DOMAIN is set,
the operator means "only this Workspace", so accounts without a
verified hd must be rejected.
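The replay guard in item 3 can be sketched as a compare-and-set on the stored counter. This uses a plain object in place of the users row; the SQL in the comment is an assumed shape for the CAS UPDATE, not the actual statement.

```javascript
// RFC 6238 time-step counter for a 30-second step.
const totpCounter = (nowMs = Date.now()) => Math.floor(nowMs / 1000 / 30);

// In-memory stand-in for something like (assumed shape):
//   UPDATE users SET totp_last_counter = $1
//   WHERE id = $2 AND totp_last_counter < $1
// matchedCounter is what verifyToken returns: a counter on success, null on failure.
function acceptCounter(user, matchedCounter) {
  if (matchedCounter === null || matchedCounter <= user.totp_last_counter) {
    return false; // bad code, or a code at or before the last accepted step
  }
  user.totp_last_counter = matchedCounter;
  return true;
}
```

Seeding `totp_last_counter` with the enrollment counter means the enrollment code falls into the `<=` branch at first login.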
Tests: new mfa-tickets.test.js covers ip/UA binding, single-use on
mismatch, and bindings-absent back-compat. totp.test.js updated for the
new verifyToken return shape (counter on success, null on failure;
truthiness still works at call sites) and adds an explicit
matched-counter check.
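For illustration, the ticket binding that mfa-tickets.test.js exercises might be shaped like this. It is a sketch with an in-memory Map as the store and a hypothetical `issueTicket` helper; the bindings-absent back-compat path is omitted.

```javascript
const crypto = require('node:crypto');

const sha256 = (s) => crypto.createHash('sha256').update(String(s)).digest('hex');
const tickets = new Map(); // in-memory stand-in for the real ticket store

function issueTicket(userId, ip, userAgent) {
  const id = crypto.randomBytes(16).toString('hex');
  tickets.set(id, { userId, ipHash: sha256(ip), uaHash: sha256(userAgent) });
  return id;
}

function redeemTicket(id, ip, userAgent) {
  const t = tickets.get(id);
  if (!t) return null;
  tickets.delete(id); // burn before checking, so a mismatch cannot be retried
  if (t.ipHash !== sha256(ip) || t.uaHash !== sha256(userAgent)) return null;
  return t.userId;
}
```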
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Review of the v2 auth landing found several routes where the per-project RBAC
helpers weren't applied to destination/source projects, letting a scoped
editor write into (or read from) projects they don't have access to:
- assets PATCH /:id: bin_id moved with no check, so an editor in project A
could stuff their asset into a bin in project B. Now validates the bin's
project_id matches the asset's own project (assets don't change project).
- assets POST /:id/copy: body's projectId/binId never checked, so any
reachable asset could be cloned into an arbitrary project. Now asserts
edit on the destination project and validates binId belongs there.
- bins POST /:id/assets: requireBinEdit checks edit on the bin's project but
not on the source asset's project, so an asset from project B could be
pulled into A's bin tree (and surfaced in A's views). Now the asset must
belong to the bin's own project.
- jobs POST /conform: project_id from body never gated, so any logged-in
user could enqueue conform jobs against any project. Now asserts edit.
- upload POST /init, POST /simple: projectId/binId from body never gated,
same class of bug. Now asserts edit on projectId and validates binId.
- upload GET /: returned every in-progress upload globally, leaking
filenames across projects. Now scoped via accessibleProjectIds.
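The bin checks above share one shape: load the bin, then compare its project to the project the caller is allowed to write. A sketch, assuming `{ id, project_id }` row shapes; the helper name and the 404/403 status codes are assumptions, not the routes' actual behaviour.

```javascript
// Hypothetical helper: reject a bin that is missing or lives in another project.
function assertBinInProject(bin, projectId) {
  if (!bin) {
    const err = new Error('bin not found');
    err.status = 404;
    throw err;
  }
  if (bin.project_id !== projectId) {
    const err = new Error('bin belongs to a different project');
    err.status = 403;
    throw err;
  }
}
```

For PATCH /:id the second argument would be the asset's own project_id; for copy and upload it would be the destination projectId after the edit check has passed.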
These are the same pattern as the holes 2615143 closed on recorders/
sequences/imports/comments — these routes existed before the RBAC commit
landed and were never marked TODO(authz), so the broad sweep missed them.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1 scoped only projects/assets/bins and left recorders, sequences,
imports, comments carrying TODO(authz) markers. A scoped editor/viewer could
still read and mutate those across every project. This closes the gap using
the existing authz.js helpers — no open TODO(authz) markers remain.
- recorders: param('id') resolves project + view baseline, requireRecorderEdit
on PATCH/DELETE/start/stop, GET / filtered by accessibleProjectIds, POST /
asserts edit on the target project (null project = admin-only)
- sequences: same param pattern + requireSequenceEdit on PUT /:id, /clips, conform
and DELETE; GET / and POST / assert on the query/body project
- imports: POST /youtube asserts edit on the body projectId
- comments: router.use guard resolves project via the asset (view to read, edit
to write); also fixes the author bug (req.session.userId -> req.user.id, which
was always NULL so comments had no recorded author)
- capture: intentionally any-logged-in (shared hardware, asset scoped on
registration) — TODO replaced with a rationale note
Security fixes from review of this change:
- recorders POST /:id/start: a per-take projectId override could route a live
asset into a project the caller lacks edit on — now asserts edit on the
override target
- sequences PUT /:id/clips: spliced asset_ids weren't checked, so an editor
could pull in (and via GET /:id leak signed proxy URLs for) assets from a
project they can't access — now every clip asset must belong to the
sequence's project; pre-transaction queries moved inside try/catch so a DB
error returns 500 instead of hanging the request
- tests: recorders-access, sequences-access (incl. cross-project clip guard),
comments-access (incl. author-id regression)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Optional "Sign in with Google" with auto-provisioning, fully config-gated:
without GOOGLE_CLIENT_ID/SECRET and OAUTH_REDIRECT_URL the routes 404 and the
button is hidden, so deployments without SSO are unaffected.
- migration 028: users.google_sub (unique) + email; password_hash nullable
for OAuth-only accounts
- src/auth/google-oauth.js: lazy google-auth-library, ID-token verify,
GOOGLE_ALLOWED_DOMAIN enforcement, requires email_verified === true
- auth routes: /auth/google (state-CSRF redirect), /auth/google/callback,
/auth/google/enabled; reuses establishSession
- web-ui: "Sign in with Google" on the login screen (shown only when enabled),
friendly callback error handling
- .env.example documents all new vars
Security hardening (from review of this + the TOTP work):
- resolveGoogleUser links ONLY by google_sub, never by email — a Google login
can never seize a pre-existing local account (account-takeover fix)
- a Google-linked account with TOTP still requires the second factor (ticket
in session, /?mfa=1 step) instead of bypassing it
- /login/totp now applies the per-IP login backoff
- recovery-code consumption is atomic (WHERE used_at IS NULL + rowCount)
- concurrent first-login race on google_sub is caught and re-resolved
- tests: google-oauth config helpers + google-link takeover/dedup regression
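The link-by-sub rule can be illustrated with an in-memory user list. The body below is a hypothetical stand-in for resolveGoogleUser, not its real implementation; the point is only that the email claim is never used as a lookup key.

```javascript
// users: array of rows; claims: verified Google ID-token claims ({ sub, email }).
function resolveGoogleUser(users, claims) {
  const linked = users.find((u) => u.google_sub === claims.sub);
  if (linked) return linked;
  // Deliberately no lookup by email: a local account that happens to share
  // the address is left untouched and a separate OAuth-only row is created.
  const created = {
    id: users.length + 1,
    google_sub: claims.sub,
    email: claims.email,
    password_hash: null, // OAuth-only account
  };
  users.push(created);
  return created;
}
```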
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The HLS-VOD work made GET /assets/:id/stream return the HLS playlist URL as
`url` whenever hls_s3_key was set. The Premiere plugin's "Import Proxy"
downloads `url` to a file and imports it — so it was saving an .m3u8 playlist
as .mp4, and Premiere rejected it ("unsupported compression type"). This hit
every YouTube asset (all get HLS generated), regardless of codec.
/stream now returns the directly-downloadable MP4 proxy as `url` (type mp4)
and the HLS playlist as a separate `hls_url`. The web player prefers `hls_url`
(so in-browser HLS playback is unchanged), while the already-installed plugin
gets a real MP4 again — no plugin reinstall needed.
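A sketch of the resulting response shape and of how each client picks a source. `proxy_s3_key` and the `sign` callback are assumptions; `url`, `type`, `hls_url`, and `hls_s3_key` come from the description above.

```javascript
// sign: function turning an S3 key into a fetchable URL (assumed to exist).
function buildStreamResponse(asset, sign) {
  const body = { url: sign(asset.proxy_s3_key), type: 'mp4' };
  if (asset.hls_s3_key) body.hls_url = sign(asset.hls_s3_key);
  return body;
}

// Web player: prefer HLS when present. The Premiere plugin only ever reads `url`.
const pickPlayerSource = (resp) => resp.hls_url || resp.url;
```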
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The s3 client request-timeout fix (the original browser playback-hang fix)
was applied directly on zampp1 but never committed to main. Without it a
stalled RustFS GET hangs /video and /hls indefinitely. Landing it so a clean
deploy from main no longer regresses playback.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 0.2 of the NVENC All-Intra HEVC ingest plan.
node-agent/handleSidecarStart:
- Accept useGpu: true in the sidecar start body
- When useGpu: adds Runtime=nvidia, DeviceRequests=[gpu], and injects
NVIDIA_VISIBLE_DEVICES=all + NVIDIA_DRIVER_CAPABILITIES=video,compute,utility
into the container env. CPU-codec recorders are unaffected (useGpu defaults false).
mam-api/recorders (start endpoint):
- Derive useGpu from recorder.recording_codec — true for hevc_nvenc/h264_nvenc
- Pass useGpu to remote sidecar start body
- Apply same Runtime/DeviceRequests to the local Docker spawn path
capture/capture-manager:
- Update hevc_nvenc codec entry with all-intra flags:
-g 1 -bf 0 (every frame IDR, no B-frames — required for growing-file
edit-while-record), -rc vbr, -profile:v main10, pixFmt p010le (10-bit 4:2:0)
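The GPU wiring might be applied to a Docker create payload roughly as follows. The `Driver`/`Count`/`Capabilities` fields follow the Docker Engine API's DeviceRequest shape, but the exact values used by node-agent are an assumption, as are the helper names.

```javascript
const NVENC_CODECS = new Set(['hevc_nvenc', 'h264_nvenc']);
const usesGpu = (recordingCodec) => NVENC_CODECS.has(recordingCodec);

// Returns a new container-create payload; the input is not mutated.
function applyGpu(createOpts, useGpu = false) {
  if (!useGpu) return createOpts; // CPU-codec recorders are left as-is
  return {
    ...createOpts,
    Env: [
      ...(createOpts.Env || []),
      'NVIDIA_VISIBLE_DEVICES=all',
      'NVIDIA_DRIVER_CAPABILITIES=video,compute,utility',
    ],
    HostConfig: {
      ...(createOpts.HostConfig || {}),
      Runtime: 'nvidia',
      // Assumed request: all GPUs (Count -1) with the "gpu" capability.
      DeviceRequests: [{ Driver: 'nvidia', Count: -1, Capabilities: [['gpu']] }],
    },
  };
}
```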
Next: validation gate (§8) — test MXF OP1a then fragmented MOV on one
DeckLink channel, mount in Premiere while recording.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stuck-live fix: capture sidecar now finalises the pre-created live asset by id (new POST /assets/:id/finalize) instead of POSTing a new asset (409 collision); node-agent gives the sidecar a 180s stop grace so the S3 upload + callback complete; node-agent logs sidecar start/stop for diagnostics.
Live SDI monitor: HLS preview is now a 2nd output of the hires ffmpeg (single DeckLink read, split to ProRes/S3 + H.264/HLS); node-agent serves /live over HTTP; mam-api proxies GET /recorders/:id/live/* to the recorder node; web-ui HlsPreview loads from the proxied URL.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- capture/src/index.js: read MAM_API_TOKEN from env; include
Authorization: Bearer header in shutdown callback fetches to mam-api
(POST /assets and POST /assets/:id/mark-empty). Without this, mam-api with
AUTH_ENABLED=true rejects the callback with 401, leaving assets stuck in live status
- recorders.js: pass MAM_API_TOKEN=${CAPTURE_TOKEN} in sidecar env so the
capture container receives the token at boot
- api_tokens: inserted capture-sidecar token (unbound, prefix b3d3d3c4)
- recorders.js: when isRemote=true, replace MAM_API_URL in sidecar env with
http://<NODE_IP>:<PORT_MAM_API> so capture containers on worker host network
can reach mam-api (fixes assets stuck in live status after recorder stop)
- cluster.js: add GET /api/v1/cluster/metrics endpoint returning per-node
cpu/ram/gpu utilization; update heartbeat handler to persist metrics JSONB
- web-ui: add Resources panel to dashboard with live CPU/RAM/GPU bars per node,
polling /api/v1/cluster/metrics every 5s
## capture service
- capture-manager.js: add 'deltacast' source_type to _buildInputArgs.
Uses 'deltacast://<index>' with ffmpeg deltacast demuxer when
/dev/deltacast<N> exists; falls back to lavfi testsrc2 + sine test card
(matching deltacast-sdi-recorder standalone app) when hardware absent.
- routes/capture.js: add GET /devices/deltacast endpoint (enumerates
/dev/deltacast* + DELTACAST_PORT_COUNT env fallback). Extend /probe to
handle source_type=deltacast.
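The enumeration with its env fallback could be sketched like this. The return shape and the `testCard` flag are assumptions; the /dev/deltacast<N> naming and DELTACAST_PORT_COUNT come from the notes above.

```javascript
// devEntries: names found under /dev, e.g. from fs.readdirSync('/dev').
function listDeltacastPorts(devEntries, env = {}) {
  const found = devEntries
    .map((name) => /^deltacast(\d+)$/.exec(name))
    .filter(Boolean)
    .map((m) => ({ index: Number(m[1]), path: `/dev/${m[0]}`, testCard: false }))
    .sort((a, b) => a.index - b.index);
  if (found.length > 0) return found;

  // No device nodes: fall back to the configured port count (test-card mode).
  const n = Number.parseInt(env.DELTACAST_PORT_COUNT || '0', 10) || 0;
  return Array.from({ length: n }, (_, index) => ({ index, path: null, testCard: true }));
}
```

A caller would pass `fs.readdirSync('/dev')` and `process.env`; taking both as arguments keeps the function testable on machines without the hardware.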
## node-agent
- detectHardware(): add 'deltacast' array to capabilities payload.
Enumerates /dev/deltacast* nodes; falls back to DELTACAST_PORT_COUNT env.
Adds DELTACAST_MODEL env support. Logs dc= count in heartbeat line.
- sidecar /start: bind /dev/deltacast* device nodes into capture containers
when sourceType='deltacast'.
## mam-api
- cluster.js: add GET /cluster/devices/deltacast and
GET /cluster/devices/deltacast/signal endpoints — same shape as
blackmagic equivalents for UI parity.
- recorders.js /start: pass DELTACAST_PORT_COUNT env to capture container;
bind /dev/deltacast* device nodes on local spawn.
- migration 024: ALTER TYPE source_type ADD VALUE 'deltacast' (idempotent).
- schema.sql: add 'deltacast' to source_type ENUM for fresh installs.
## web-ui
- modal-new-recorder.jsx: add 'Deltacast' source type card; fetch
/cluster/devices/deltacast on selection; port picker with TEST CARD
badge when hardware absent; falls through to manual index entry if
no devices detected.
- capture-manager.js, routes/capture.js: fix ffmpeg -sources decklink
parse regex from v4l2 hex-address format (never matched DeckLink output)
to correct indented-line format. Port 2+ (index 1+) was falling through
to a wrong model-name fallback, causing ffmpeg to open the wrong input
and produce black frames. Now logs the detected device list and the
selected name at start.
- recorders.js (/start): accept per-take projectId override in request
body. If provided, clips go to that project instead of the recorder's
default project_id. Used for both the live-asset INSERT and the
PROJECT_ID env var passed to the capture container.
- screens-ingest.jsx (RecorderRow): add project dropdown shown when
recorder is stopped. Defaults to the recorder's configured project;
operator can change it before hitting Record without editing the
recorder config.
The conform worker's final step INSERTs the rendered output into the
assets table:
INSERT INTO assets (project_id, filename, display_name, …)
VALUES ($1, …)
-- project_id NOT NULL
It reads projectId from job.data, but the /sequences/:id/conform
endpoint never set it. Render finished cleanly, ffmpeg ran, output
uploaded to S3, then the final asset row INSERT failed:
null value in column "project_id" of relation "assets"
Pass seq.project_id from the loaded sequence row. The rendered output
lands as an asset under the same project as its source sequence —
the natural target.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two cooperating bugs left Export Timeline stuck at "Rendering Hi-Res"
forever:
A. worker emitted "Invalid FCP XML: no sequence element" because
Timeline.generateFcpXml produced fcpxml (FCP X schema:
<fcpxml><resources>/<library>/...) while the worker's parseFcpXml
expects xmeml (FCP 7 schema: <xmeml><sequence>...). Two completely
different formats.
Rewrite generateFcpXml to emit xmeml v5 with the structure the
parser walks:
xmeml/sequence/{name,duration,rate{timebase,ntsc},
media/video/{format/samplecharacteristics,
track[@currentExplodedTrackIndex]
/clipitem/{name,duration,rate,in,out,
start,end,file/{name,pathurl}}}}
Clipitem in/out are SOURCE frames (the underlying media in/out);
start/end are TIMELINE frames (the cut position). The worker uses
the rate timebase to parse them.
B. /api/v1/jobs/:id rejected the panel's polls with
"Invalid id — must be a UUID". The handlers below correctly parse
BullMQ-prefixed ids ("conform:42"), but router.param('id',
validateUuid('id')) ran first and 400'd everything that wasn't a
UUID. The panel's pollConform swallows the resulting fetch error
silently and polls forever.
Drop the validator. Comment in the file explains why.
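To make the in/out versus start/end distinction from item A concrete, here is a sketch of a single xmeml clipitem. The clip field names and the fixed `<ntsc>FALSE</ntsc>` are assumptions; the element names follow the structure listed above.

```javascript
const esc = (s) =>
  String(s).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

// All frame values are integers counted at `timebase` frames per second.
function clipitemXml(clip, timebase) {
  const length = clip.sourceOut - clip.sourceIn; // frames actually used
  return [
    '<clipitem>',
    `  <name>${esc(clip.name)}</name>`,
    `  <duration>${clip.mediaDuration}</duration>`,
    `  <rate><timebase>${timebase}</timebase><ntsc>FALSE</ntsc></rate>`,
    `  <in>${clip.sourceIn}</in>`, // SOURCE frames: position in the media
    `  <out>${clip.sourceOut}</out>`,
    `  <start>${clip.timelineStart}</start>`, // TIMELINE frames: cut position
    `  <end>${clip.timelineStart + length}</end>`,
    `  <file><name>${esc(clip.name)}</name><pathurl>${esc(clip.pathurl)}</pathurl></file>`,
    '</clipitem>',
  ].join('\n');
}
```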
Bumps panel to v2.2.2.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User reported infinite login loop on dragonflight.live. Root cause: openresty
fronts both http:// and https:// without redirecting, and a user landing on
http:// gets the Set-Cookie response silently dropped — cookies are Secure-only
when TRUST_PROXY=true, and the CORS allowlist refuses the http:// origin.
Result: login appears to succeed, next request has no session cookie, AuthGate
bounces back to login.
Two defensive layers (the openresty box is not in our reach):
- web-ui index.html: tiny inline redirect; if location is http://dragonflight.live,
rewrite to https:// before anything else runs. Bounded to that exact hostname
so local / LAN access on http://172.18.91.x stays as-is.
- mam-api: emit Strict-Transport-Security on HTTPS responses when AUTH_ENABLED=true.
After one successful HTTPS visit, browsers auto-upgrade future http:// requests
on their own — closes the loophole even if someone bypasses the index.html JS.
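The HSTS layer might look like the Express-style middleware below. The `max-age` value and the factory name are assumptions; `req.secure` is Express's own HTTPS flag, which honours the trust proxy setting.

```javascript
// Returns middleware that adds HSTS only on HTTPS responses when auth is on.
function hstsMiddleware({ authEnabled, maxAgeSeconds = 15552000 }) {
  return (req, res, next) => {
    if (authEnabled && req.secure) {
      res.setHeader('Strict-Transport-Security', `max-age=${maxAgeSeconds}`);
    }
    next();
  };
}
```

Sending the header only over HTTPS matters here: browsers ignore HSTS received over plain http://, so the first visit still depends on the index.html redirect.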
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- requireAuth bearer path now selects api_tokens.bound_hostname and users.role,
populates req.tokenBoundHostname and req.user.role. /cluster/heartbeat can
now authenticate via a bound api_token (issued via POST /auth/tokens with
bound_hostname).
- routes/tokens.js POST accepts bound_hostname; GET returns it so users can
see which tokens are bound.
- Remove /cluster/heartbeat from SERVICE_PATHS so requireAuth runs on it (the
bearer auth handles the gate; the heartbeat handler still enforces the
body.hostname === bound match).
- /auth/me now returns role (final-review I2). Closes the gap where every
signed-in user appeared as 'viewer' in the UI regardless of actual role.
- loadUser SELECTs role for session auth.
- Backend tests still 37/15/0/22 — no test changes needed; existing token
CRUD tests stay passing since bound_hostname is optional.
Migration 023 was fixed in 9dc572b to use '00000000-0000-4000-8000-000000000000'
because 'v' isn't a valid hex digit, but the DEV_USER_ID constant in
middleware/auth.js still referenced the original '...000000000dev'. Every
route that passes DEV_USER_ID as a query parameter (users list, login lookup,
setup-required count) was throwing 22P02 invalid input syntax for type uuid.
The errors were swallowed by Promise.allSettled in the SPA's data load so the
app appeared to work in dev mode, but enabling AUTH_ENABLED=true would have
broken login entirely.
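A quick way to catch this class of bug is to validate the constant against the textual UUID form at boot. The sketch below uses a generic hex-only pattern; the second test value is an illustrative id ending in "dev", not the original constant.

```javascript
// Textual UUID form: 8-4-4-4-12 hex digits.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function assertUuid(name, value) {
  if (!UUID_RE.test(value)) {
    throw new Error(`${name} is not a valid UUID: ${value}`);
  }
  return value;
}

const DEV_USER_ID = assertUuid('DEV_USER_ID', '00000000-0000-4000-8000-000000000000');
```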
Final-review findings:
- Mount usersRouter at /api/v1/users in addition to /api/v1/auth/users so the
existing SPA Users page works; add PATCH /:id for inline edits (display_name,
role, password).
- Add X-Requested-With: dragonflight-ui to raw XHR/fetch paths that bypass
apiFetch (file uploads, SDK uploads, EDL export) — without it, requireUiHeader
403s before reaching the route.
- Exempt SERVICE_PATHS (/cluster/heartbeat) from requireUiHeader so node-agent
heartbeats keep working when NODE_TOKEN is unset.
- Remove stale auth.js.bak.
Fixes three issues in the authentication system:
C1: Add boot-time warning when AUTH_ENABLED=true but TRUST_PROXY!=true.
Without TRUST_PROXY=true behind nginx, req.ip becomes the proxy IP for all
clients, collapsing per-IP rate limiting into a shared pool. Operators must
explicitly set TRUST_PROXY=true to make per-IP rate limiting effective.
C2: Mount requireUiHeader middleware in test helpers (auth.test.js,
users.test.js, tokens.test.js). The CSRF header validation was not being
exercised in the test suite. Tests now send X-Requested-With: dragonflight-ui
headers that are actually validated by the middleware.
I1: Implement bounded rate-limit Map with MAX_ENTRIES=10000 and LRU eviction.
Unbounded Maps are vulnerable to spray attacks: attackers can force memory
exhaustion by requesting with distinct IPs. Now we evict the oldest entry
(by insertion order) when the map reaches capacity.
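A sketch of the bounded map from I1, relying on the fact that a JavaScript Map iterates keys in insertion order. The entry shape and the `touch` helper are assumptions.

```javascript
const MAX_ENTRIES = 10000;

// Returns the entry for `ip`, creating it (and evicting the oldest
// inserted entry if the map is full) when it does not exist yet.
function touch(map, ip, now = Date.now(), max = MAX_ENTRIES) {
  let entry = map.get(ip);
  if (!entry) {
    if (map.size >= max) {
      const oldest = map.keys().next().value; // first key = oldest insertion
      map.delete(oldest);
    }
    entry = { failures: 0, firstSeen: now };
    map.set(ip, entry);
  }
  return entry;
}
```

As written this evicts by insertion order only. Deleting and re-inserting an entry on every access would turn it into recency-based eviction, if that is what is wanted.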
Code-review feedback:
- Dummy hash for user-enumeration-defense timing was 63 chars (bcrypt strings
are 60 chars). Worked by accident because bcrypt 5.x is lenient about
trailing chars; a future tightening would silently regress the timing
defense. Replaced with a real pre-computed bcrypt hash.
- last_login_at UPDATE now logs errors instead of silently swallowing them,
matching the pattern in requireAuth for api_tokens.last_used_at.
- Removed dead import of comparePassword from auth.test.js.
Code-review feedback: startsWith('/cluster') was a prefix match that exposed
destructive operator endpoints (POST /containers/:id/restart, DELETE /:id,
GET /devices/blackmagic/*) unauthenticated. Only POST /heartbeat is genuine
node-agent traffic; everything else in cluster.js is operator/UI surface
that should go through requireAuth. Long-term: issue node-agent a bound
api_token and drop the carve-out entirely.
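The narrowed carve-out amounts to matching on method plus exact path instead of a prefix. A sketch, assuming the path is seen without the /api/v1 mount prefix; the set and helper names are hypothetical.

```javascript
// Exact method + path pairs that may skip requireAuth.
const SERVICE_ROUTES = new Set(['POST /cluster/heartbeat']);

const isServiceRequest = (method, path) =>
  SERVICE_ROUTES.has(`${String(method).toUpperCase()} ${path}`);
```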
Code-review feedback:
- Hard-fail boot when AUTH_ENABLED=true and SESSION_SECRET is unset, so
express-session can't silently use an in-memory random secret that
invalidates sessions on restart and breaks multi-node clusters.
- CORS rejection now returns cb(null, false) instead of cb(new Error)
so misconfigured origins surface as clean CORS errors in the browser
instead of HTTP 500s. Log a warn line for operator visibility.
- Add a units comment for pruneSessionInterval.
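The CORS change can be sketched as an origin callback in the (origin, callback) form the cors package accepts. Allowing requests with no Origin header is an assumption, as is the factory name.

```javascript
function makeOriginCheck(allowlist, warn = console.warn) {
  const allowed = new Set(allowlist);
  return (origin, cb) => {
    // No Origin header (curl, same-origin navigation): allowed here by assumption.
    if (!origin || allowed.has(origin)) return cb(null, true);
    warn(`CORS: rejected origin ${origin}`);
    // cb(null, false) omits the CORS headers; cb(new Error(...)) would
    // instead reach the error handler and surface as an HTTP 500.
    return cb(null, false);
  };
}
```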
Code-review feedback: writing last_seen_at = now before loadUser() lets
the stamp persist if the lookup throws (resave:false still writes when
modified), extending the idle window without confirming the user exists.
Also clarify DEV_USER_ID is a specific placeholder, not a generic sentinel.
Code-review feedback: ON CONFLICT (id) only catches id collisions; a pre-existing
'dev' username would trigger a unique_violation on the username index and roll
back the migration, hard-failing the mam-api boot. Switch to bare ON CONFLICT
DO NOTHING so any unique conflict is no-op-safe.