dragonflight/docs/superpowers/specs/2026-05-27-auth-system-design.md

Dragonflight User Authentication — Design

Status: Approved, ready for implementation planning
Date: 2026-05-27
Brainstormed with: Zac

Problem

Dragonflight has the skeleton of an auth system spread across the codebase:

  • users table (id, username, password_hash, display_name, role)
  • sessions table (sid, sess, expire) for connect-pg-simple
  • groups, user_groups, api_tokens tables
  • SESSION_SECRET env var
  • AUTH_ENABLED env flag with boot-log toggle
  • PR #26 frontend handler that bounces to /login.html on 401
  • Issue #94 "session security fixes" deployed 2026-05-26 (commit 3ebe5d6)

But the actual express-session middleware was never mounted in services/mam-api/src/index.js. There is no /api/v1/auth/* router. There is no requireAuth middleware. As a result, when AUTH_ENABLED=true was tried:

  1. User submits login, server returns 200 OK from a stub endpoint.
  2. No Set-Cookie is ever sent (no session middleware mounted).
  3. The next request to a protected route returns 401.
  4. Frontend bounces to /login.html.
  5. Infinite redirect loop.

The prior attempts failed because auth was being built reactively in pieces, with no single source of truth for what "logged in" means.

Goals

  • One coherent, readable auth code path.
  • Web UI logins survive page reloads and container restarts.
  • Premiere panel can authenticate via long-lived bearer tokens.
  • First-run setup works on a fresh install with no env var or CLI gymnastics.
  • The whole auth flow can be exercised by automated tests, including a regression test for the redirect-loop failure mode.

Non-goals (v1)

  • MFA / TOTP.
  • OAuth / OIDC delegation (Forgejo, Google, etc.).
  • Per-project or per-recorder permissions. Flat access: logged in = full access.
  • Email-based "forgot password" (no SMTP assumed; admin-reset only).
  • Audit log of who-did-what (the last_login_at column is the minimum).
  • Service-to-service auth for node-agent — keeps existing 019-node-token-binding mechanism.

Decisions

| Decision | Choice | Reasoning |
| --- | --- | --- |
| Client surface | Web UI + Premiere panel | Two transports (cookies + bearer), one identity backend |
| Permission model | Flat (logged in = full access) | Small homogeneous operator population. groups / user_groups schemas stay inert. |
| Identity provider | Local username/password | On-prem broadcast operators won't tolerate OIDC roundtrips. Matches existing schema. |
| First-user bootstrap | First-run setup page | Hardest to mis-configure. No env vars to leak. No CLI to remember. |
| Session lifetime | 8h absolute + 1h sliding idle | Operator security posture, tighter than typical SaaS. |
| Auth library | Hand-rolled (express-session + connect-pg-simple) | Explicit, debuggable. Rejected JWT and Passport for this codebase. |

Architecture

Single source of truth

"Logged in" means exactly one of two things:

  1. The request carries a valid dragonflight.sid cookie whose row in sessions hasn't expired and isn't past its 1h-idle or 8h-absolute window, OR
  2. The request carries Authorization: Bearer <token> whose SHA-256 matches an api_tokens row that hasn't been revoked or expired.

Nothing else counts. No localStorage flags, no JWT, no client-side "I think I'm logged in" hints.

One middleware, one check

services/mam-api/src/middleware/auth.js exposes a single requireAuth function:

export async function requireAuth(req, res, next) {
  // Dev mode preserved. The 'dev' user is a real row in `users` seeded at
  // boot when AUTH_ENABLED !== 'true', so FK-bearing routes (api_tokens,
  // future comments, audit fields) keep working without conditional logic.
  if (process.env.AUTH_ENABLED !== 'true') {
    req.user = DEV_USER; // { id: <UUID of seeded 'dev' user>, username: 'dev' }
    return next();
  }

  try {
    // 1. Session check: absolute window first, then idle window.
    if (req.session?.user_id) {
      const now = Date.now();
      if (now - req.session.first_seen_at > 8 * 3600 * 1000) return destroyAnd401(req, res);
      if (now - req.session.last_seen_at  > 1 * 3600 * 1000) return destroyAnd401(req, res);
      req.session.last_seen_at = now; // slide the idle window
      req.user = await loadUser(req.session.user_id);
      if (!req.user) return destroyAnd401(req, res); // user deleted since login
      return next();
    }

    // 2. Bearer check: look the token up by its SHA-256 hash.
    const bearer = parseBearer(req.headers.authorization);
    if (bearer) {
      const hash = sha256hex(bearer);
      const { rows } = await pool.query(
        `SELECT t.id, t.user_id, t.expires_at, u.username
           FROM api_tokens t JOIN users u ON u.id = t.user_id
          WHERE t.token_hash = $1`, [hash]);
      const token = rows[0];
      if (token && (!token.expires_at || token.expires_at > new Date())) {
        // Best-effort bookkeeping; a failed update must not fail the request.
        pool.query(`UPDATE api_tokens SET last_used_at = NOW() WHERE id = $1`, [token.id]).catch(() => {});
        req.user = { id: token.user_id, username: token.username };
        return next();
      }
    }

    // 3. Otherwise
    return res.status(401).json({ error: 'unauthorized' });
  } catch (err) {
    // Express 4 does not catch rejections from async middleware; hand
    // database errors to the error handler instead of hanging the request.
    return next(err);
  }
}
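
requireAuth leans on two small helpers, parseBearer and sha256hex, that this spec names but does not define. Below is a minimal sketch of how they might look, assuming Node's built-in crypto module. The exact behavior (case-insensitive scheme, null for anything malformed) is an assumption rather than something the spec prescribes. The sketch uses require so it runs as a standalone script; the service itself uses ES module imports.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: extract the token from an Authorization header.
// Returns null for a missing header, a non-Bearer scheme, or an empty token.
function parseBearer(header) {
  if (typeof header !== 'string') return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match ? match[1] : null;
}

// Hypothetical helper: hex-encoded SHA-256 of the raw token, the form
// that is compared against api_tokens.token_hash.
function sha256hex(value) {
  return createHash('sha256').update(value, 'utf8').digest('hex');
}
```

Returning null instead of throwing lets requireAuth fall through to its final 401 for any malformed header.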

Mounted at the /api/v1 level in services/mam-api/src/index.js, before the individual route mounts, with an allowlist for the three pre-login auth paths:

app.use('/api/v1', (req, res, next) => {
  const unauth = ['/auth/login', '/auth/setup', '/auth/setup-required'];
  if (unauth.some(p => req.path === p)) return next();
  return requireAuth(req, res, next);
});
// then: app.use('/api/v1/assets', assetsRouter), etc.

/health lives at the root, outside the /api/v1 mount, so it is unaffected. /api/v1/cluster/* is an open question. It currently relies on the existing 019-node-token-binding service auth, with the cluster routes doing their own token check on request bodies. With the gate mounted as shown, requireAuth runs first and returns 401 before those routes see an unauthenticated request, which leaves two options:

  1. Keep /api/v1/cluster/* behind requireAuth. node-agent traffic must then carry a valid user session or an api_token in addition to its binding token, which means issuing node-agent an api_token at install time.
  2. Carve /api/v1/cluster/* out of the requireAuth gate and keep node-agent on its existing binding token alone.

The implementer should pick one; the decision is flagged in the implementation order.
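
If the carve-out alternative is chosen (exempting /api/v1/cluster/* from the gate), the allowlist decision could be pulled into one pure function that is easy to unit-test. This is only a sketch: needsUserAuth, PRE_LOGIN_PATHS, and SERVICE_AUTH_PREFIX are hypothetical names, and it assumes every cluster route sits under the /cluster/ prefix.

```javascript
// Paths under /api/v1 that must be reachable before login.
const PRE_LOGIN_PATHS = new Set(['/auth/login', '/auth/setup', '/auth/setup-required']);

// Assumption: cluster routes keep their own 019-node-token-binding check.
const SERVICE_AUTH_PREFIX = '/cluster/';

// Returns true when the request must pass requireAuth.
// `path` is req.path as seen inside the /api/v1 mount (mount prefix stripped).
function needsUserAuth(path) {
  if (PRE_LOGIN_PATHS.has(path)) return false;            // exact match only
  if (path.startsWith(SERVICE_AUTH_PREFIX)) return false; // service-auth carve-out
  return true;
}
```

The gate would then become a single line: call requireAuth when needsUserAuth(req.path) is true, otherwise call next(). Matching pre-login paths exactly, rather than by prefix, keeps a path such as /auth/login/anything behind the gate.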

Session middleware (actually wired this time)

In services/mam-api/src/index.js, before any route:

import session from 'express-session';
import connectPgSimple from 'connect-pg-simple';
const PgStore = connectPgSimple(session);

if (process.env.TRUST_PROXY === 'true') app.set('trust proxy', 1);

app.use(session({
  store: new PgStore({ pool, tableName: 'sessions', pruneSessionInterval: 60 * 15 }),
  secret: process.env.SESSION_SECRET,
  name:   'dragonflight.sid',
  cookie: {
    httpOnly: true,
    sameSite: 'lax',
    secure:   process.env.TRUST_PROXY === 'true',
    path:     '/',
    maxAge:   8 * 3600 * 1000,
  },
  rolling: false,         // sliding renewal handled in requireAuth so we can enforce idle + absolute separately
  resave:  false,
  saveUninitialized: false,
}));

Auth router

services/mam-api/src/routes/auth.js:

| Method | Path | Auth | Description |
| --- | --- | --- | --- |
| GET | /api/v1/auth/setup-required | none | { required: bool }. Cheap, no auth. |
| POST | /api/v1/auth/setup | none | Only succeeds if users is empty. Creates first user, logs them in. |
| POST | /api/v1/auth/login | none | { username, password } -> 200 + cookie or 401 |
| POST | /api/v1/auth/logout | required | Destroys session row, clears cookie |
| GET | /api/v1/auth/me | required | { id, username, display_name } |
| POST | /api/v1/auth/password | required | Change own password (requires current) |
| GET/POST/DELETE | /api/v1/auth/users[/:id] | required | User CRUD |
| GET/POST/DELETE | /api/v1/auth/tokens[/:id] | required | Current user's API tokens |

Data model

Existing schema is almost right. One small migration:

-- services/mam-api/src/db/migrations/023-auth-session-timestamps.sql
ALTER TABLE users ADD COLUMN IF NOT EXISTS password_updated_at TIMESTAMPTZ DEFAULT NOW();
ALTER TABLE users ADD COLUMN IF NOT EXISTS last_login_at TIMESTAMPTZ;
-- idle / absolute timestamps live inside session.sess JSONB; no schema change needed

groups and user_groups stay as-is, unused for v1. api_tokens is already correctly shaped.

Flows

Browser login (the one that broke last time)

  1. SPA boots, <AuthGate> calls GET /api/v1/auth/me.
  2. requireAuth returns 401.
  3. AuthGate calls GET /api/v1/auth/setup-required. If true, render Setup screen. Otherwise, render Login screen.
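  4. User submits POST /api/v1/auth/login. Server checks the password with bcrypt.compare, then sets req.session.user_id, first_seen_at, and last_seen_at. Critical: await new Promise((resolve, reject) => req.session.save(err => (err ? reject(err) : resolve()))) before responding, so the session row is persisted to Postgres before the next request can arrive and a store error fails the login instead of being silently swallowed.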
  5. AuthGate re-calls /api/v1/auth/me, gets 200, renders the app.

Why this doesn't loop: awaiting the req.session.save() callback before responding guarantees the session row exists in Postgres before the SPA can fire its next request. requireAuth returns a clean 401 (not a redirect), so the SPA decides what to render. The static /login.html is deleted; there is no HTML bounce.
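
The save-before-respond step in the login flow can be sketched as two small helpers. saveSession and establishSession are hypothetical names; the only thing assumed about the session object is the save(callback) method that express-session provides on req.session.

```javascript
// Promisified req.session.save(): resolves once the store has written the
// row, rejects if the store reports an error.
function saveSession(session) {
  return new Promise((resolve, reject) => {
    session.save((err) => (err ? reject(err) : resolve()));
  });
}

// Stamp the fields requireAuth reads, then wait for the write to finish.
// The login handler should send its 200 only after this promise resolves.
async function establishSession(session, userId, now = Date.now()) {
  session.user_id = userId;
  session.first_seen_at = now;
  session.last_seen_at = now;
  await saveSession(session);
}
```

In the login route this would be called as await establishSession(req.session, user.id) after the password check succeeds. A rejection should become a 500 rather than a 200 with no session behind it.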

Premiere panel bearer

  1. Web UI -> Settings -> API Tokens -> "New token" named "Premiere panel".
  2. POST /api/v1/auth/tokens returns { token: 'dfl_<32 hex>', prefix: 'dfl_a3f2', id } exactly once.
  3. Premiere panel sends Authorization: Bearer dfl_<...> on every request. requireAuth SHA-256s it, looks up api_tokens.token_hash, updates last_used_at.
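
A minimal sketch of how a token in the dfl_<32 hex> shape might be minted, using Node's built-in crypto module. issueToken and the tokenHash field name are hypothetical. That only the hash is stored follows from the lookup on api_tokens.token_hash; that the prefix is kept so the UI can label a token is an assumption.

```javascript
const { randomBytes, createHash } = require('node:crypto');

// Mint a new API token. Only tokenHash (and, by assumption, prefix) would be
// stored; the raw token is returned to the caller once and then discarded.
function issueToken() {
  const token = 'dfl_' + randomBytes(16).toString('hex'); // 16 bytes -> 32 hex chars
  return {
    token,
    prefix: token.slice(0, 8), // 'dfl_' plus the first 4 hex chars, e.g. 'dfl_a3f2'
    tokenHash: createHash('sha256').update(token, 'utf8').digest('hex'),
  };
}
```

Because the token carries 128 bits of randomness, a plain unsalted SHA-256 is enough for the lookup hash. A slow password hash such as bcrypt would add latency to every bearer request without a matching security gain.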

Idle + absolute timeout (inside requireAuth)

if session present:
  if now - session.first_seen_at > 8h  -> destroy session, 401
  if now - session.last_seen_at  > 1h  -> destroy session, 401
  session.last_seen_at = now
  req.user = lookup(session.user_id)
  next()

Bearer tokens have their own optional expires_at (NULL = never expires). requireAuth compares it against the current time on every bearer request and returns 401 for an expired token.
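
For unit tests it may help to isolate the timeout rules as pure functions of the stored timestamps and the current time. The sketch below is one way to do that; sessionStatus and tokenExpired are hypothetical names, and the thresholds are the 8h and 1h values from the Decisions table.

```javascript
const ABSOLUTE_MS = 8 * 3600 * 1000; // 8h since login
const IDLE_MS = 1 * 3600 * 1000;     // 1h since the last authenticated request

// Classify a session at time `now` (ms since epoch). Absolute expiry is
// checked first, mirroring the order used in requireAuth.
function sessionStatus(session, now) {
  if (now - session.first_seen_at > ABSOLUTE_MS) return 'absolute-expired';
  if (now - session.last_seen_at > IDLE_MS) return 'idle-expired';
  return 'ok';
}

// Bearer tokens: a null expires_at never expires; otherwise the token is
// valid only while expires_at is strictly in the future.
function tokenExpired(expiresAt, now) {
  return expiresAt != null && expiresAt.getTime() <= now;
}
```

Passing now in as a parameter lets the idle and absolute cases in the test plan run without fake timers or sleeping.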

Frontend

  • services/web-ui/src/auth-gate.jsx — new component that wraps the SPA. On mount: GET /me. On 401: check setup-required, render either Setup or Login. On 200: render the app shell.
  • Login screen — layout B from brainstorm: 22px wordmark over "WILD DRAGON BROADCAST" tagline above a --bg-1 card containing username, password, "Sign in" button. Matches DESIGN.md tokens.
  • Setup screen — same chrome; fields = username, password, confirm password; button = "Create admin".
  • Settings -> Account section — change password.
  • Settings -> API Tokens section — list / create / revoke. New token shown exactly once with a copy affordance.
  • Fetch wrapper — the central ZAMPP_API.fetch (already exists) gains a 401 handler that re-mounts AuthGate's Login state with the current path saved as last_path, restored after re-auth.

Removed

  • The static /login.html page (PR #26's bounce target) is deleted. SPA handles login internally; no full-page reload.

Error handling

| Case | Behavior |
| --- | --- |
| Wrong username or password | 401 { error: 'invalid credentials' }. Same message either way, no user enumeration. |
| Login rate limiting | Per-IP exponential backoff (1s, 2s, 4s, 8s, max 30s). In-memory Map. Single-instance limitation documented. |
| Idle / absolute expiry | 401 -> AuthGate Login. Last path saved, restored on re-auth. |
| Setup after first user exists | 409 { error: 'setup already complete' }. Permanently disabled. |
| Token revoke | DELETE /api/v1/auth/tokens/:id. Only the owner can revoke. Subsequent bearer requests 401. |
| Delete-self when only user | 409 { error: 'cannot delete last user' }. |
| Forgot password | No self-serve. Any logged-in user can reset another via POST /api/v1/auth/users/:id/password. Documented as the recovery path. |
| Password rules | Min 12 chars, no max, no character class requirements (NIST SP 800-63B). bcrypt cost 12. |
| CSRF | SameSite=Lax + same origin + required X-Requested-With: dragonflight-ui header on mutating requests (belt-and-suspenders). |
| Session table growth | connect-pg-simple pruneSessionInterval: 60 * 15 (every 15 min). |
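
The per-IP login backoff could be sketched as below. All names are hypothetical. The 16s step between 8s and the 30s cap is an assumption, since the spec lists only 1s, 2s, 4s, 8s and the maximum; resetting the counter on a successful login is also an assumption.

```javascript
const MAX_DELAY_MS = 30_000;
const failures = new Map(); // ip -> { count, blockedUntil }

// Delay after the nth consecutive failure: 1s, 2s, 4s, 8s, 16s, then capped at 30s.
function backoffMs(count) {
  return Math.min(MAX_DELAY_MS, 1000 * 2 ** (count - 1));
}

// True while the IP is still inside its backoff window.
function isBlocked(ip, now) {
  const entry = failures.get(ip);
  return entry !== undefined && now < entry.blockedUntil;
}

// Call after a failed login: bump the counter and extend the window.
function recordFailure(ip, now) {
  const count = (failures.get(ip)?.count ?? 0) + 1;
  failures.set(ip, { count, blockedUntil: now + backoffMs(count) });
}

// Call after a successful login: forget the IP entirely.
function recordSuccess(ip) {
  failures.delete(ip);
}
```

The login route would check isBlocked before touching bcrypt and reject while it is true. Entries are removed only on success here, so a real implementation would also want periodic pruning to bound the Map's size.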

Testing

  • Unit — services/mam-api/test/middleware/auth.test.js: requireAuth with (a) no creds, (b) valid session, (c) idle-expired session, (d) absolute-expired session, (e) valid bearer, (f) invalid bearer, (g) bearer matching a deleted user.
  • Integration — services/mam-api/test/auth.integration.test.js: spin up Express + test Postgres. Walks: setup -> login -> /me -> mutating call -> logout -> /me 401. Second pass: idle timeout simulated by mutating last_seen_at in DB. Third pass: bearer issue -> use -> revoke -> 401.
  • Regression test for the redirect-loop bug: explicit test that after POST /auth/login returns 200, a subsequent GET /auth/me with the returned cookie returns 200 in the same test client. This is the test that would have caught the original failure.
  • Manual smoke (documented in PR): fresh install -> setup -> create admin -> land on dashboard -> reload (stays logged in) -> wait 1h idle -> reload -> bounce to login.

Implementation order

Suggested sequencing for the implementation plan (writing-plans will refine):

  1. Migration 023-auth-session-timestamps.sql. Add idempotent seed of the dev user (INSERT ... ON CONFLICT DO NOTHING with a fixed UUID) so dev mode FK-bearing routes work out of the box.
  2. express-session + connect-pg-simple wiring in index.js.
  3. requireAuth middleware (with DEV_USER constant resolved from the seeded row).
  4. Auth router (setup, login, logout, me, password).
  5. Apply requireAuth to API router with allowlist. Decide cluster carve-out (see Architecture).
  6. Auth tests (unit + integration + regression).
  7. Frontend <AuthGate> + Login screen + Setup screen.
  8. Frontend Settings -> Account + API Tokens.
  9. Delete /login.html.
  10. User CRUD + token CRUD routes.
  11. Rate limiting + CSRF header.
  12. Documentation: README updates, AUTH_ENABLED transition notes.

Out-of-band notes for the implementer

  • The current cors({ origin: true, credentials: true }) in index.js is too permissive once cookies start carrying authority. Tighten to a specific origin list (driven by an ALLOWED_ORIGINS env var) at the same time as wiring the session middleware — otherwise we're undoing the SameSite=Lax protection from the other side.
  • node-agent -> mam-api traffic on /api/v1/cluster/* must keep working. Add a route-level carve-out comment that this path uses the existing 019-node-token-binding token, not the user-auth path.
  • The boot log currently says Authentication: ENABLED / DISABLED (set AUTH_ENABLED=true for production). Once this lands, the recommended default flips: AUTH_ENABLED=true becomes the documented default in .env.example and the README, and AUTH_ENABLED=false is documented as a dev-only escape hatch.
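
For the CORS note above, one possible shape for the ALLOWED_ORIGINS handling is sketched here. The comma-separated format and both function names are assumptions. The (origin, callback) signature is the one the cors package accepts for its origin option.

```javascript
// Parse a comma-separated ALLOWED_ORIGINS value into a Set of exact origins.
function parseAllowedOrigins(raw) {
  return new Set((raw ?? '').split(',').map((s) => s.trim()).filter(Boolean));
}

// Build an origin callback in the (origin, callback) shape that cors accepts.
function makeOriginCheck(allowed) {
  return (origin, callback) => {
    // Requests without an Origin header (curl, same-origin navigation) pass through.
    if (!origin || allowed.has(origin)) return callback(null, true);
    // Unknown origin: no CORS headers are sent, so the browser blocks the response.
    return callback(null, false);
  };
}
```

It would be wired in as cors({ origin: makeOriginCheck(parseAllowedOrigins(process.env.ALLOWED_ORIGINS)), credentials: true }). An unset variable yields an empty Set, so no cross-origin caller is allowed by default.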