Dragonflight User Authentication — Design
Status: Approved, ready for implementation planning
Date: 2026-05-27
Brainstormed with: Zac
Problem
Dragonflight has the skeleton of an auth system spread across the codebase:
- users table (id, username, password_hash, display_name, role)
- sessions table (sid, sess, expire) for connect-pg-simple
- groups, user_groups, api_tokens tables
- SESSION_SECRET env var
- AUTH_ENABLED env flag with boot-log toggle
- PR #26 frontend handler that bounces to /login.html on 401
- Issue #94 "session security fixes" deployed 2026-05-26 (commit 3ebe5d6)
But the actual express-session middleware was never mounted in services/mam-api/src/index.js. There is no /api/v1/auth/* router. There is no requireAuth middleware. As a result, when AUTH_ENABLED=true was tried:
- User submits login, server returns 200 OK from a stub endpoint.
- No Set-Cookie is ever sent (no session middleware mounted).
- The next request to a protected route returns 401.
- Frontend bounces to /login.html.
- Infinite redirect loop.
The prior attempts failed because auth was being built reactively in pieces, with no single source of truth for what "logged in" means.
Goals
- One coherent, readable auth code path.
- Web UI logins survive page reloads and container restarts.
- Premiere panel can authenticate via long-lived bearer tokens.
- First-run setup works on a fresh install with no env var or CLI gymnastics.
- The whole auth flow can be exercised by automated tests, including a regression test for the redirect-loop failure mode.
Non-goals (v1)
- MFA / TOTP.
- OAuth / OIDC delegation (Forgejo, Google, etc.).
- Per-project or per-recorder permissions. Flat access: logged in = full access.
- Email-based "forgot password" (no SMTP assumed; admin-reset only).
- Audit log of who-did-what (the last_login_at column is the minimum).
- Service-to-service auth for node-agent; it keeps the existing 019-node-token-binding mechanism.
Decisions
| Decision | Choice | Reasoning |
|---|---|---|
| Client surface | Web UI + Premiere panel | Two transports (cookies + bearer), one identity backend |
| Permission model | Flat (logged in = full access) | Small homogeneous operator population. groups / user_groups schemas stay inert. |
| Identity provider | Local username/password | On-prem broadcast operators won't tolerate OIDC roundtrips. Matches existing schema. |
| First-user bootstrap | First-run setup page | Hardest to mis-configure. No env vars to leak. No CLI to remember. |
| Session lifetime | 8h absolute + 1h sliding idle | Operator security posture, tighter than typical SaaS. |
| Auth library | Hand-rolled (express-session + connect-pg-simple) | Explicit, debuggable. Rejected JWT and Passport for this codebase. |
Architecture
Single source of truth
"Logged in" means exactly one of two things:
- The request carries a valid dragonflight.sid cookie whose row in sessions hasn't expired and isn't past its 1h-idle or 8h-absolute window, OR
- The request carries Authorization: Bearer <token> whose SHA-256 matches an api_tokens row that hasn't been revoked or expired.
Nothing else counts. No localStorage flags, no JWT, no client-side "I think I'm logged in" hints.
One middleware, one check
services/mam-api/src/middleware/auth.js exposes a single requireAuth function:
export async function requireAuth(req, res, next) {
  // Dev mode preserved. The 'dev' user is a real row in `users` seeded at
  // boot when AUTH_ENABLED !== 'true', so FK-bearing routes (api_tokens,
  // future comments, audit fields) keep working without conditional logic.
  if (process.env.AUTH_ENABLED !== 'true') {
    req.user = DEV_USER; // { id: <UUID of seeded 'dev' user>, username: 'dev' }
    return next();
  }

  // 1. Session check
  if (req.session?.user_id) {
    const now = Date.now();
    if (now - req.session.first_seen_at > 8 * 3600 * 1000) return destroyAnd401(req, res);
    if (now - req.session.last_seen_at > 1 * 3600 * 1000) return destroyAnd401(req, res);
    req.session.last_seen_at = now;
    req.user = await loadUser(req.session.user_id);
    if (!req.user) return destroyAnd401(req, res);
    return next();
  }

  // 2. Bearer check
  const bearer = parseBearer(req.headers.authorization);
  if (bearer) {
    const hash = sha256hex(bearer);
    const row = await pool.query(
      `SELECT t.id, t.user_id, t.expires_at, u.username
         FROM api_tokens t JOIN users u ON u.id = t.user_id
        WHERE t.token_hash = $1`, [hash]);
    if (row.rows.length && (!row.rows[0].expires_at || row.rows[0].expires_at > new Date())) {
      // Fire-and-forget: a failed last_used_at update must not fail the request.
      pool.query(`UPDATE api_tokens SET last_used_at = NOW() WHERE id = $1`, [row.rows[0].id]).catch(() => {});
      req.user = { id: row.rows[0].user_id, username: row.rows[0].username };
      return next();
    }
  }

  // 3. Otherwise
  return res.status(401).json({ error: 'unauthorized' });
}
Mounted at the /api/v1 level in services/mam-api/src/index.js, before the individual route mounts, with an allowlist for the three pre-login auth paths:
app.use('/api/v1', (req, res, next) => {
  const unauth = ['/auth/login', '/auth/setup', '/auth/setup-required'];
  if (unauth.some(p => req.path === p)) return next();
  return requireAuth(req, res, next);
});
// then: app.use('/api/v1/assets', assetsRouter), etc.
/health lives at the root, outside the /api/v1 mount, so it's naturally unaffected. /api/v1/cluster/* needs a decision, because under this mount requireAuth runs before the cluster routes' own 019-node-token-binding check on request bodies and returns 401 for an unauthenticated request. There are two options. Option 1: leave the cluster routes behind requireAuth, so node-agent traffic must also carry a valid user session OR an api_token. That is a change, since node-agent would need to be issued an api_token at install time. Option 2: carve /api/v1/cluster/* out of the requireAuth gate too, and keep node-agent on its existing binding token alone. Implementer should pick; this is flagged in the implementation order.
Session middleware (actually wired this time)
In services/mam-api/src/index.js, before any route:
import session from 'express-session';
import connectPgSimple from 'connect-pg-simple';

const PgStore = connectPgSimple(session);

if (process.env.TRUST_PROXY === 'true') app.set('trust proxy', 1);

app.use(session({
  store: new PgStore({ pool, tableName: 'sessions', pruneSessionInterval: 60 * 15 }),
  secret: process.env.SESSION_SECRET,
  name: 'dragonflight.sid',
  cookie: {
    httpOnly: true,
    sameSite: 'lax',
    secure: process.env.TRUST_PROXY === 'true',
    path: '/',
    maxAge: 8 * 3600 * 1000,
  },
  rolling: false, // sliding renewal handled in requireAuth so we can enforce idle + absolute separately
  resave: false,
  saveUninitialized: false,
}));
Auth router
services/mam-api/src/routes/auth.js:
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /api/v1/auth/setup-required | none | { required: bool }. Cheap, no auth. |
| POST | /api/v1/auth/setup | none | Only succeeds if users is empty. Creates first user, logs them in. |
| POST | /api/v1/auth/login | none | { username, password } -> 200 + cookie or 401 |
| POST | /api/v1/auth/logout | required | Destroys session row, clears cookie |
| GET | /api/v1/auth/me | required | { id, username, display_name } |
| POST | /api/v1/auth/password | required | Change own password (requires current) |
| GET/POST/DELETE | /api/v1/auth/users[/:id] | required | User CRUD |
| GET/POST/DELETE | /api/v1/auth/tokens[/:id] | required | Current user's API tokens |
Data model
Existing schema is almost right. One small migration:
-- services/mam-api/src/db/migrations/023-auth-session-timestamps.sql
ALTER TABLE users ADD COLUMN IF NOT EXISTS password_updated_at TIMESTAMPTZ DEFAULT NOW();
ALTER TABLE users ADD COLUMN IF NOT EXISTS last_login_at TIMESTAMPTZ;
-- idle / absolute timestamps live inside session.sess JSONB; no schema change needed
groups and user_groups stay as-is, unused for v1. api_tokens is already correctly shaped.
Flows
Browser login (the one that broke last time)
- SPA boots, <AuthGate> calls GET /api/v1/auth/me.
- requireAuth returns 401.
- AuthGate calls GET /api/v1/auth/setup-required. If true, render Setup screen. Otherwise, render Login screen.
- User submits POST /api/v1/auth/login. Server runs bcrypt.compare, then sets req.session.user_id, first_seen_at, last_seen_at. Critical: await new Promise(r => req.session.save(r)) before responding, so the session row is persisted to Postgres before the next request can arrive.
- AuthGate re-calls /api/v1/auth/me, gets 200, renders the app.
Why this doesn't loop: waiting on the explicit req.session.save() callback before responding guarantees the session row exists in Postgres before the SPA can fire its next request. requireAuth returns a clean 401 (not a redirect), so the SPA decides what to render. The static /login.html is deleted; there is no HTML bounce.
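The save-before-respond step can be wrapped in a small helper. This is a minimal sketch, assuming the standard express-session req.session.save(callback) API; the helper name saveSession is hypothetical and not part of the existing codebase. Unlike the inline new Promise(r => req.session.save(r)), it rejects when the store reports an error, so the login handler can fail loudly instead of returning 200 without a persisted session.

```javascript
// Hypothetical helper: promisify express-session's req.session.save(cb).
// Resolves once the store has written the session; rejects on a store error.
function saveSession(session) {
  return new Promise((resolve, reject) => {
    session.save((err) => (err ? reject(err) : resolve()));
  });
}

// Sketch of use inside the login handler (variable names are assumptions):
//   req.session.user_id = user.id;
//   req.session.first_seen_at = req.session.last_seen_at = Date.now();
//   await saveSession(req.session);   // session row exists before we respond
//   res.json({ id: user.id, username: user.username });
```

Because the helper only depends on an object with a save(cb) method, it can be unit-tested with a stub session and no database.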
Premiere panel bearer
- Web UI -> Settings -> API Tokens -> "New token" named "Premiere panel".
- POST /api/v1/auth/tokens returns { token: 'dfl_<32 hex>', prefix: 'dfl_a3f2', id } exactly once.
- Premiere panel sends Authorization: Bearer dfl_<...> on every request. requireAuth SHA-256s it, looks up api_tokens.token_hash, updates last_used_at.
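The token shape and hash lookup in this flow could be sketched as follows. The document names sha256hex and parseBearer in requireAuth but does not define them, so the bodies below are assumptions, as is the issueToken helper and its field names. The sketch uses CommonJS require so it runs standalone; the service itself uses ESM imports.

```javascript
const nodeCrypto = require('node:crypto');

// Assumed implementation of the sha256hex helper referenced by requireAuth.
function sha256hex(value) {
  return nodeCrypto.createHash('sha256').update(value, 'utf8').digest('hex');
}

// Hypothetical helper: mint a token in the 'dfl_<32 hex>' shape.
function issueToken() {
  const token = 'dfl_' + nodeCrypto.randomBytes(16).toString('hex'); // 16 bytes = 32 hex chars
  return {
    token,                       // returned to the caller exactly once, never stored
    prefix: token.slice(0, 8),   // non-secret label in the 'dfl_a3f2' shape
    tokenHash: sha256hex(token), // the value compared against api_tokens.token_hash
  };
}

// Assumed implementation of parseBearer: returns the token or null.
function parseBearer(header) {
  const match = /^Bearer\s+(\S+)$/i.exec(header ?? '');
  return match ? match[1] : null;
}
```

Storing only the hash means a leaked api_tokens table does not reveal usable tokens, and the prefix gives the Settings list something recognizable to display.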
Idle + absolute timeout (inside requireAuth)
if session present:
    if now - session.first_seen_at > 8h  -> destroy session, 401
    if now - session.last_seen_at  > 1h  -> destroy session, 401
    session.last_seen_at = now
    req.user = lookup(session.user_id)
    next()
Bearer tokens have their own optional expires_at (NULL = never expires); checked the same way.
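The two windows can be pulled into a pure function, which makes unit-test cases (c) and (d) easy to write without a real session store. This is a sketch only: checkSessionWindow and its return values are hypothetical names, while the 8h and 1h limits and the first_seen_at / last_seen_at fields come from the design above.

```javascript
const ABSOLUTE_MS = 8 * 3600 * 1000; // 8h since login, never extended
const IDLE_MS = 1 * 3600 * 1000;     // 1h since the last authenticated request

// Hypothetical pure helper: classifies a session at time `now` (ms since epoch).
// The absolute window is checked first, mirroring the order in requireAuth.
function checkSessionWindow(session, now) {
  if (now - session.first_seen_at > ABSOLUTE_MS) return 'absolute-expired';
  if (now - session.last_seen_at > IDLE_MS) return 'idle-expired';
  return 'ok';
}

// Worked example: login at t = 0, times in ms.
const HOUR = 3600 * 1000;
const active = { first_seen_at: 0, last_seen_at: 7.5 * HOUR };
checkSessionWindow(active, 7.75 * HOUR); // 'ok': 15 min idle, under 8h total
checkSessionWindow(active, 8 * HOUR + 1); // 'absolute-expired' despite recent activity
```

In requireAuth, anything other than 'ok' would map to destroyAnd401, and 'ok' would be followed by the last_seen_at update.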
Frontend
- services/web-ui/src/auth-gate.jsx: new component that wraps the SPA. On mount: GET /me. On 401: check setup-required, render either Setup or Login. On 200: render the app shell.
- Login screen: layout B from brainstorm, a 22px wordmark over "WILD DRAGON BROADCAST" tagline above a --bg-1 card containing username, password, "Sign in" button. Matches DESIGN.md tokens.
- Setup screen: same chrome; fields = username, password, confirm password; button = "Create admin".
- Settings -> Account section: change password.
- Settings -> API Tokens section: list / create / revoke. New token shown exactly once with a copy affordance.
- Fetch wrapper: the central ZAMPP_API.fetch (already exists) gains a 401 handler that re-mounts AuthGate's Login state with the current path saved as last_path, restored after re-auth.
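The 401 handling in the fetch wrapper might look roughly like the sketch below. The real ZAMPP_API.fetch signature is not shown in this document, so createAuthFetch, getPath, and onUnauthorized are hypothetical names, and headers are assumed to be passed as a plain object. The sketch also attaches the X-Requested-With: dragonflight-ui header that the CSRF row under Error handling requires on mutating requests.

```javascript
const MUTATING = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);

// Hypothetical factory: wraps a fetch implementation with auth-aware behavior.
//   getPath        - returns the SPA's current path (saved as last_path)
//   onUnauthorized - re-mounts AuthGate's Login state
function createAuthFetch(fetchImpl, { getPath, onUnauthorized }) {
  return async function authFetch(url, options = {}) {
    const method = (options.method ?? 'GET').toUpperCase();
    const headers = { ...(options.headers ?? {}) };
    // CSRF belt-and-suspenders: mark mutating requests as coming from the UI.
    if (MUTATING.has(method)) headers['X-Requested-With'] = 'dragonflight-ui';

    const res = await fetchImpl(url, { ...options, method, headers });
    // A 401 is reported to the gate; no redirect, no page reload.
    if (res.status === 401) onUnauthorized({ last_path: getPath() });
    return res;
  };
}
```

The response is still returned to the caller on 401, so individual call sites can decide whether to surface an error while AuthGate switches to the Login screen.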
Removed
- The static /login.html page (PR #26's bounce target) is deleted. SPA handles login internally; no full-page reload.
Error handling
| Case | Behavior |
|---|---|
| Wrong username or password | 401 { error: 'invalid credentials' }. Same message either way, no user enumeration. |
| Login rate limiting | Per-IP exponential backoff (1s, 2s, 4s, 8s, max 30s). In-memory Map. Single-instance limitation documented. |
| Idle / absolute expiry | 401 -> AuthGate Login. Last path saved, restored on re-auth. |
| Setup after first user exists | 409 { error: 'setup already complete' }. Permanently disabled. |
| Token revoke | DELETE /api/v1/auth/tokens/:id — only owner can revoke. Subsequent bearer requests 401. |
| Delete-self when only user | 409 { error: 'cannot delete last user' }. |
| Forgot password | No self-serve. Any logged-in user can reset another via POST /api/v1/auth/users/:id/password. Documented as the recovery path. |
| Password rules | Min 12 chars, no max, no character class requirements (NIST SP 800-63B). bcrypt cost 12. |
| CSRF | SameSite=Lax + same origin + required X-Requested-With: dragonflight-ui header on mutating requests (belt-and-suspenders). |
| Session table growth | connect-pg-simple pruneSessionInterval: 60 * 15 (every 15 min). |
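The login rate-limiting row above can be illustrated with a small in-memory limiter. This is a sketch under assumptions: createLoginBackoff and its method names are hypothetical, the delay is assumed to keep doubling after 8s (16s, then capped at 30s), and clearing the counter on a successful login is a design choice the document does not specify.

```javascript
const BASE_MS = 1000;     // first failure: 1s
const MAX_MS = 30 * 1000; // cap: 30s

// Hypothetical in-memory limiter. State lives in one process only, which is
// the single-instance limitation noted in the table.
function createLoginBackoff(now = Date.now) {
  const failures = new Map(); // ip -> { count, blockedUntil }

  return {
    // Milliseconds the client must still wait before another attempt (0 = allowed).
    retryAfterMs(ip) {
      const entry = failures.get(ip);
      return entry ? Math.max(0, entry.blockedUntil - now()) : 0;
    },
    // Record a failed login; the delay doubles per consecutive failure, capped at MAX_MS.
    recordFailure(ip) {
      const count = (failures.get(ip)?.count ?? 0) + 1;
      const delay = Math.min(BASE_MS * 2 ** (count - 1), MAX_MS);
      failures.set(ip, { count, blockedUntil: now() + delay });
      return delay;
    },
    // Assumption: a successful login resets the counter for that IP.
    recordSuccess(ip) {
      failures.delete(ip);
    },
  };
}
```

The login handler would call retryAfterMs before checking credentials and reject the attempt while the value is non-zero. Entries are never evicted here, so a production version would also need to prune old IPs.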
Testing
- Unit (services/mam-api/test/middleware/auth.test.js): requireAuth with (a) no creds, (b) valid session, (c) idle-expired session, (d) absolute-expired session, (e) valid bearer, (f) invalid bearer, (g) bearer matching a deleted user.
- Integration (services/mam-api/test/auth.integration.test.js): spin up Express + test Postgres. Walks: setup -> login -> /me -> mutating call -> logout -> /me 401. Second pass: idle timeout simulated by mutating last_seen_at in DB. Third pass: bearer issue -> use -> revoke -> 401.
- Regression test for the redirect-loop bug: explicit test that after POST /auth/login returns 200, a subsequent GET /auth/me with the returned cookie returns 200 in the same test client. This is the test that would have caught the original failure.
- Manual smoke (documented in PR): fresh install -> setup -> create admin -> land on dashboard -> reload (stays logged in) -> wait 1h idle -> reload -> bounce to login.
Implementation order
Suggested sequencing for the implementation plan (writing-plans will refine):
- Migration 023-auth-session-timestamps.sql. Add idempotent seed of the dev user (INSERT ... ON CONFLICT DO NOTHING with a fixed UUID) so dev mode FK-bearing routes work out of the box.
- express-session + connect-pg-simple wiring in index.js.
- requireAuth middleware (with DEV_USER constant resolved from the seeded row).
- Auth router (setup, login, logout, me, password).
- Apply requireAuth to API router with allowlist. Decide cluster carve-out (see Architecture).
- Auth tests (unit + integration + regression).
- Frontend <AuthGate> + Login screen + Setup screen.
- Frontend Settings -> Account + API Tokens.
- Delete /login.html.
- User CRUD + token CRUD routes.
- Rate limiting + CSRF header.
- Documentation: README updates, AUTH_ENABLED transition notes.
Out-of-band notes for the implementer
- The current cors({ origin: true, credentials: true }) in index.js is too permissive once cookies start carrying authority. Tighten to a specific origin list (driven by an ALLOWED_ORIGINS env var) at the same time as wiring the session middleware; otherwise we're undoing the SameSite=Lax protection from the other side.
- node-agent -> mam-api traffic on /api/v1/cluster/* must keep working. Add a route-level carve-out comment that this path uses the existing 019-node-token-binding token, not the user-auth path.
- The boot log currently says Authentication: ENABLED/DISABLED (set AUTH_ENABLED=true for production). Once this lands, the recommended default flips: AUTH_ENABLED=true becomes the documented default in .env.example and the README, and AUTH_ENABLED=false is documented as a dev-only escape hatch.
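The CORS tightening in the first note could be sketched as below. It assumes ALLOWED_ORIGINS is a comma-separated list, which the document does not specify, and the helper names parseAllowedOrigins and makeOriginCheck are hypothetical. The returned function follows the (origin, callback) shape that the cors package accepts for its origin option.

```javascript
// Hypothetical helper: parse a comma-separated ALLOWED_ORIGINS value into a Set.
function parseAllowedOrigins(value) {
  return new Set((value ?? '').split(',').map((s) => s.trim()).filter(Boolean));
}

// Builds an origin check in the (origin, callback) form used by the cors package.
function makeOriginCheck(allowed) {
  return (origin, callback) => {
    // Assumption: requests with no Origin header (curl, node-agent, same-origin
    // navigation) pass the CORS layer; they are still subject to requireAuth.
    if (!origin || allowed.has(origin)) return callback(null, true);
    return callback(null, false); // unknown origin: no CORS headers are sent
  };
}

// Sketch of wiring in index.js, replacing cors({ origin: true, credentials: true }):
//   const allowed = parseAllowedOrigins(process.env.ALLOWED_ORIGINS);
//   app.use(cors({ origin: makeOriginCheck(allowed), credentials: true }));
```

With an empty or unset ALLOWED_ORIGINS the set is empty, so every cross-origin browser request is refused CORS headers, which is a safer failure mode than reflecting any origin.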