From 183e10f8e67231039eeb9d213e7b658e3feea3b0 Mon Sep 17 00:00:00 2001
From: Zac Gaetano
Date: Wed, 27 May 2026 11:57:44 -0400
Subject: [PATCH] docs(auth): spec for user auth system, brainstormed 2026-05-27

---
 .../specs/2026-05-27-auth-system-design.md | 256 ++++++++++++++++++
 1 file changed, 256 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-27-auth-system-design.md

diff --git a/docs/superpowers/specs/2026-05-27-auth-system-design.md b/docs/superpowers/specs/2026-05-27-auth-system-design.md
new file mode 100644
index 0000000..20ca23e
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-27-auth-system-design.md
@@ -0,0 +1,256 @@
+# Dragonflight User Authentication: Design
+
+**Status:** Approved, ready for implementation planning
+**Date:** 2026-05-27
+**Brainstormed with:** Zac
+
+## Problem
+
+Dragonflight has the skeleton of an auth system spread across the codebase:
+
+- `users` table (`id`, `username`, `password_hash`, `display_name`, `role`)
+- `sessions` table (`sid`, `sess`, `expire`) for `connect-pg-simple`
+- `groups`, `user_groups`, `api_tokens` tables
+- `SESSION_SECRET` env var
+- `AUTH_ENABLED` env flag with boot-log toggle
+- PR #26 frontend handler that bounces to `/login.html` on 401
+- Issue #94 "session security fixes" deployed 2026-05-26 (commit `3ebe5d6`)
+
+But the actual `express-session` middleware was never mounted in `services/mam-api/src/index.js`. There is no `/api/v1/auth/*` router. There is no `requireAuth` middleware. As a result, when `AUTH_ENABLED=true` was tried:
+
+1. User submits login, server returns 200 OK from a stub endpoint.
+2. No `Set-Cookie` is ever sent (no session middleware mounted).
+3. The next request to a protected route returns 401.
+4. Frontend bounces to `/login.html`.
+5. **Infinite redirect loop.**
+
+The prior attempts failed because auth was being built reactively in pieces, with no single source of truth for what "logged in" means.
+
+## Goals
+
+- One coherent, readable auth code path.
+- Web UI logins survive page reloads and container restarts.
+- Premiere panel can authenticate via long-lived bearer tokens.
+- First-run setup works on a fresh install with no env var or CLI gymnastics.
+- The whole auth flow can be exercised by automated tests, including a regression test for the redirect-loop failure mode.
+
+## Non-goals (v1)
+
+- MFA / TOTP.
+- OAuth / OIDC delegation (Forgejo, Google, etc.).
+- Per-project or per-recorder permissions. Flat access: logged in = full access.
+- Email-based "forgot password" (no SMTP assumed; admin-reset only).
+- Audit log of who-did-what (the `last_login_at` column is the minimum).
+- Service-to-service auth for `node-agent`: keeps existing `019-node-token-binding` mechanism.
+
+## Decisions
+
+| Decision | Choice | Reasoning |
+|---|---|---|
+| Client surface | Web UI + Premiere panel | Two transports (cookies + bearer), one identity backend |
+| Permission model | Flat (logged in = full access) | Small homogeneous operator population. `groups` / `user_groups` schemas stay inert. |
+| Identity provider | Local username/password | On-prem broadcast operators won't tolerate OIDC roundtrips. Matches existing schema. |
+| First-user bootstrap | First-run setup page | Hardest to mis-configure. No env vars to leak. No CLI to remember. |
+| Session lifetime | 8h absolute + 1h sliding idle | Operator security posture, tighter than typical SaaS. |
+| Auth library | Hand-rolled (`express-session` + `connect-pg-simple`) | Explicit, debuggable. Rejected JWT and Passport for this codebase. |
+
+## Architecture
+
+### Single source of truth
+
+"Logged in" means exactly one of two things:
+
+1. The request carries a valid `dragonflight.sid` cookie whose row in `sessions` hasn't expired and isn't past its 1h-idle or 8h-absolute window, OR
+2. The request carries `Authorization: Bearer <token>` whose SHA-256 matches an `api_tokens` row that hasn't been revoked or expired.
+
+Nothing else counts. No `localStorage` flags, no JWT, no client-side "I think I'm logged in" hints.
+
+### One middleware, one check
+
+`services/mam-api/src/middleware/auth.js` exposes a single `requireAuth` function:
+
+```js
+export async function requireAuth(req, res, next) {
+  // Dev mode preserved
+  if (process.env.AUTH_ENABLED !== 'true') {
+    req.user = { id: 'dev', username: 'dev' };
+    return next();
+  }
+
+  // 1. Session check
+  if (req.session?.user_id) {
+    const now = Date.now();
+    if (now - req.session.first_seen_at > 8 * 3600 * 1000) return destroyAnd401(req, res);
+    if (now - req.session.last_seen_at > 1 * 3600 * 1000) return destroyAnd401(req, res);
+    req.session.last_seen_at = now;
+    req.user = await loadUser(req.session.user_id);
+    if (!req.user) return destroyAnd401(req, res);
+    return next();
+  }
+
+  // 2. Bearer check
+  const bearer = parseBearer(req.headers.authorization);
+  if (bearer) {
+    const hash = sha256hex(bearer);
+    const row = await pool.query(
+      `SELECT t.id, t.user_id, t.expires_at, u.username
+       FROM api_tokens t JOIN users u ON u.id = t.user_id
+       WHERE t.token_hash = $1`, [hash]);
+    if (row.rows.length && (!row.rows[0].expires_at || row.rows[0].expires_at > new Date())) {
+      pool.query(`UPDATE api_tokens SET last_used_at = NOW() WHERE id = $1`, [row.rows[0].id]).catch(() => {});
+      req.user = { id: row.rows[0].user_id, username: row.rows[0].username };
+      return next();
+    }
+  }
+
+  // 3. Otherwise
+  return res.status(401).json({ error: 'unauthorized' });
+}
+```
+
+Mounted at the `/api/v1` level in `services/mam-api/src/index.js`, with an allowlist for `/api/v1/auth/login`, `/api/v1/auth/setup`, `/api/v1/auth/setup-required`, and `/health`.
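The allowlist mount and the two bearer helpers that `requireAuth` calls are not spelled out in the spec. The following is a minimal sketch of one way they could look, not the planned implementation. `PUBLIC_PATHS` and `withAllowlist` are hypothetical names, and `parseBearer` / `sha256hex` are assumed shapes for the helpers referenced in the middleware. The hex encoding of the hash is also an assumption, inferred from the helper's name. The sketch uses CommonJS `require` so it runs as a standalone script; the service's own modules use `import`.

```javascript
const crypto = require('node:crypto');

// Paths that must stay reachable without credentials (taken from the spec's allowlist).
const PUBLIC_PATHS = new Set([
  '/api/v1/auth/login',
  '/api/v1/auth/setup',
  '/api/v1/auth/setup-required',
  '/health',
]);

// Assumed helper: extract the token from an "Authorization: Bearer <token>" header.
// Returns null for a missing header or any other scheme.
function parseBearer(header) {
  if (typeof header !== 'string') return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match ? match[1] : null;
}

// Assumed helper: hex-encoded SHA-256 of the raw token, compared against api_tokens.token_hash.
function sha256hex(value) {
  return crypto.createHash('sha256').update(value, 'utf8').digest('hex');
}

// Hypothetical wrapper: let allowlisted paths through, hand everything else to requireAuth.
function withAllowlist(requireAuth, publicPaths = PUBLIC_PATHS) {
  return function authGate(req, res, next) {
    // originalUrl keeps the mount prefix; drop the query string before matching.
    const path = (req.originalUrl || req.url || '').split('?')[0];
    if (publicPaths.has(path)) return next();
    return requireAuth(req, res, next);
  };
}
```

Under these assumptions the mount in `index.js` would be a single line such as `app.use('/api/v1', withAllowlist(requireAuth))`. Matching on exact paths rather than prefixes keeps a new route from becoming public by accident.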
+
+### Session middleware (actually wired this time)
+
+In `services/mam-api/src/index.js`, **before any route**:
+
+```js
+import session from 'express-session';
+import connectPgSimple from 'connect-pg-simple';
+const PgStore = connectPgSimple(session);
+
+if (process.env.TRUST_PROXY === 'true') app.set('trust proxy', 1);
+
+app.use(session({
+  store: new PgStore({ pool, tableName: 'sessions', pruneSessionInterval: 60 * 15 }),
+  secret: process.env.SESSION_SECRET,
+  name: 'dragonflight.sid',
+  cookie: {
+    httpOnly: true,
+    sameSite: 'lax',
+    secure: process.env.TRUST_PROXY === 'true',
+    path: '/',
+    maxAge: 8 * 3600 * 1000,
+  },
+  rolling: false, // sliding renewal handled in requireAuth so we can enforce idle + absolute separately
+  resave: false,
+  saveUninitialized: false,
+}));
+```
+
+### Auth router
+
+`services/mam-api/src/routes/auth.js`:
+
+| Method | Path | Auth | Description |
+|---|---|---|---|
+| `GET` | `/api/v1/auth/setup-required` | none | `{ required: bool }`. Cheap, no auth. |
+| `POST` | `/api/v1/auth/setup` | none | Only succeeds if `users` is empty. Creates first user, logs them in. |
+| `POST` | `/api/v1/auth/login` | none | `{ username, password }` -> 200 + cookie or 401 |
+| `POST` | `/api/v1/auth/logout` | required | Destroys session row, clears cookie |
+| `GET` | `/api/v1/auth/me` | required | `{ id, username, display_name }` |
+| `POST` | `/api/v1/auth/password` | required | Change own password (requires current) |
+| `GET/POST/DELETE` | `/api/v1/auth/users[/:id]` | required | User CRUD |
+| `GET/POST/DELETE` | `/api/v1/auth/tokens[/:id]` | required | Current user's API tokens |
+
+### Data model
+
+Existing schema is almost right. One small migration:
+
+```sql
+-- services/mam-api/src/db/migrations/023-auth-session-timestamps.sql
+ALTER TABLE users ADD COLUMN IF NOT EXISTS password_updated_at TIMESTAMPTZ DEFAULT NOW();
+ALTER TABLE users ADD COLUMN IF NOT EXISTS last_login_at TIMESTAMPTZ;
+-- idle / absolute timestamps live inside session.sess JSONB; no schema change needed
+```
+
+`groups` and `user_groups` stay as-is, unused for v1. `api_tokens` is already correctly shaped.
+
+## Flows
+
+### Browser login (the one that broke last time)
+
+1. SPA boots, `<AuthGate>` calls `GET /api/v1/auth/me`.
+2. `requireAuth` returns 401.
+3. AuthGate calls `GET /api/v1/auth/setup-required`. If `true`, render Setup screen. Otherwise, render Login screen.
+4. User submits `POST /api/v1/auth/login`. Server `bcrypt.compare`s, sets `req.session.user_id`, `first_seen_at`, `last_seen_at`. **Critical:** `await new Promise(r => req.session.save(r))` before responding, so the session row is persisted to Postgres before the next request can arrive.
+5. AuthGate re-calls `/api/v1/auth/me`, gets 200, renders the app.
+
+**Why this doesn't loop:** the explicit `req.session.save()` callback before response guarantees the session row exists before the SPA can fire its next request. `requireAuth` returns a clean 401 (not a redirect) so the SPA decides what to render. The static `/login.html` is deleted; there is no HTML bounce.
+
+### Premiere panel bearer
+
+1. Web UI -> Settings -> API Tokens -> "New token" named "Premiere panel".
+2. `POST /api/v1/auth/tokens` returns `{ token: 'dfl_<32 hex>', prefix: 'dfl_a3f2', id }` **exactly once**.
+3. Premiere panel sends `Authorization: Bearer dfl_<...>` on every request. `requireAuth` SHA-256s it, looks up `api_tokens.token_hash`, updates `last_used_at`.
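The token shape in step 2 can be sketched as below. This is an illustration under stated assumptions rather than the route's actual code. `issueApiToken` and `hashToken` are hypothetical names. The 8-character prefix (`dfl_` plus four hex digits) is inferred from the `dfl_a3f2` example, and storing a hex-encoded SHA-256 is inferred from the `sha256hex` call in `requireAuth`.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: hex-encoded SHA-256, the value assumed to live in api_tokens.token_hash.
function hashToken(token) {
  return crypto.createHash('sha256').update(token, 'utf8').digest('hex');
}

// Hypothetical helper: mint a token in the spec's `dfl_<32 hex>` shape.
function issueApiToken() {
  // 16 random bytes encode to exactly 32 hex characters.
  const token = 'dfl_' + crypto.randomBytes(16).toString('hex');
  return {
    token,                        // shown to the user exactly once, never stored
    prefix: token.slice(0, 8),    // e.g. 'dfl_a3f2', safe to display in the token list
    tokenHash: hashToken(token),  // the only form of the token that is persisted
  };
}
```

Because only the hash is stored, a leaked database row cannot be replayed as a bearer token. The prefix gives the Settings screen something to show without keeping the secret around.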
+
+### Idle + absolute timeout (inside `requireAuth`)
+
+```
+if session present:
+    if now - session.first_seen_at > 8h -> destroy session, 401
+    if now - session.last_seen_at > 1h -> destroy session, 401
+    session.last_seen_at = now
+    req.user = lookup(session.user_id)
+    next()
+```
+
+Bearer tokens have their own optional `expires_at` (`NULL` = never expires); checked the same way.
+
+## Frontend
+
+- **`services/web-ui/src/auth-gate.jsx`**: new component that wraps the SPA. On mount: `GET /me`. On 401: check `setup-required`, render either Setup or Login. On 200: render the app shell.
+- **Login screen**: layout B from brainstorm: 22px wordmark over "WILD DRAGON BROADCAST" tagline above a `--bg-1` card containing username, password, "Sign in" button. Matches DESIGN.md tokens.
+- **Setup screen**: same chrome; fields = username, password, confirm password; button = "Create admin".
+- **Settings -> Account section**: change password.
+- **Settings -> API Tokens section**: list / create / revoke. New token shown exactly once with a copy affordance.
+- **Fetch wrapper**: the central `ZAMPP_API.fetch` (already exists) gains a 401 handler that re-mounts AuthGate's Login state with the current path saved as `last_path`, restored after re-auth.
+
+### Removed
+
+- The static `/login.html` page (PR #26's bounce target) is deleted. SPA handles login internally; no full-page reload.
+
+## Error handling
+
+| Case | Behavior |
+|---|---|
+| Wrong username or password | `401 { error: 'invalid credentials' }`. Same message either way, no user enumeration. |
+| Login rate limiting | Per-IP exponential backoff (1s, 2s, 4s, 8s, max 30s). In-memory `Map`. Single-instance limitation documented. |
+| Idle / absolute expiry | 401 -> AuthGate Login. Last path saved, restored on re-auth. |
+| Setup after first user exists | `409 { error: 'setup already complete' }`. Permanently disabled. |
+| Token revoke | `DELETE /api/v1/auth/tokens/:id`. Only the owner can revoke. Subsequent bearer requests 401. |
+| Delete-self when only user | `409 { error: 'cannot delete last user' }`. |
+| Forgot password | No self-serve. Any logged-in user can reset another via `POST /api/v1/auth/users/:id/password`. Documented as the recovery path. |
+| Password rules | Min 12 chars, no max, no character class requirements (NIST SP 800-63B). `bcrypt` cost 12. |
+| CSRF | `SameSite=Lax` + same origin + required `X-Requested-With: dragonflight-ui` header on mutating requests (belt-and-suspenders). |
+| Session table growth | `connect-pg-simple` `pruneSessionInterval: 60 * 15` (every 15 min). |
+
+## Testing
+
+- **Unit (`services/mam-api/test/middleware/auth.test.js`)**: requireAuth with (a) no creds, (b) valid session, (c) idle-expired session, (d) absolute-expired session, (e) valid bearer, (f) invalid bearer, (g) bearer matching a deleted user.
+- **Integration (`services/mam-api/test/auth.integration.test.js`)**: spin up Express + test Postgres. Walks: setup -> login -> /me -> mutating call -> logout -> /me 401. Second pass: idle timeout simulated by mutating `last_seen_at` in DB. Third pass: bearer issue -> use -> revoke -> 401.
+- **Regression test for the redirect-loop bug:** explicit test that after `POST /auth/login` returns 200, a subsequent `GET /auth/me` with the returned cookie returns 200 in the same test client. This is the test that would have caught the original failure.
+- **Manual smoke (documented in PR):** fresh install -> setup -> create admin -> land on dashboard -> reload (stays logged in) -> wait 1h idle -> reload -> bounce to login.
+
+## Implementation order
+
+Suggested sequencing for the implementation plan (writing-plans will refine):
+
+1. Migration `023-auth-session-timestamps.sql`.
+2. `express-session` + `connect-pg-simple` wiring in `index.js`.
+3. `requireAuth` middleware.
+4. Auth router (setup, login, logout, me, password).
+5. Apply `requireAuth` to API router with allowlist.
+6. Auth tests (unit + integration + regression).
+7. Frontend `<AuthGate>` + Login screen + Setup screen.
+8. Frontend Settings -> Account + API Tokens.
+9. Delete `/login.html`.
+10. User CRUD + token CRUD routes.
+11. Rate limiting + CSRF header.
+12. Documentation: README updates, `AUTH_ENABLED` transition notes.
+
+## Out-of-band notes for the implementer
+
+- The current `cors({ origin: true, credentials: true })` in `index.js` is too permissive once cookies start carrying authority. Tighten to a specific origin list (driven by an `ALLOWED_ORIGINS` env var) at the same time as wiring the session middleware; otherwise we're undoing the `SameSite=Lax` protection from the other side.
+- node-agent -> mam-api traffic on `/api/v1/cluster/*` must keep working. Add a route-level carve-out comment that this path uses the existing `019-node-token-binding` token, not the user-auth path.
+- The boot log currently says `Authentication: ENABLED` / `DISABLED (set AUTH_ENABLED=true for production)`. Once this lands, the recommended default flips: `AUTH_ENABLED=true` becomes the documented default in `.env.example` and the README, and `AUTH_ENABLED=false` is documented as a dev-only escape hatch.
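The CORS tightening in the first note could be sketched as follows. The spec names the `ALLOWED_ORIGINS` env var but not its format, so the comma-separated format here is an assumption. `parseAllowedOrigins` and `makeOriginCheck` are hypothetical names. The callback follows the `(origin, callback)` shape that the `cors` package accepts for its `origin` option.

```javascript
// Hypothetical helper: parse a comma-separated ALLOWED_ORIGINS value (format is an assumption).
function parseAllowedOrigins(raw) {
  return (raw || '')
    .split(',')
    .map((entry) => entry.trim())
    .filter(Boolean);
}

// Hypothetical helper: build an origin check in the (origin, callback) shape used by `cors`.
function makeOriginCheck(allowedOrigins) {
  const allowed = new Set(allowedOrigins);
  return function originCheck(origin, callback) {
    // Requests with no Origin header (curl, server-to-server) are not cross-site browser calls.
    if (!origin) return callback(null, true);
    // false means no CORS headers are emitted, so the browser blocks the cross-origin read.
    return callback(null, allowed.has(origin));
  };
}
```

With these helpers the existing call would become something like `cors({ origin: makeOriginCheck(parseAllowedOrigins(process.env.ALLOWED_ORIGINS)), credentials: true })`. An empty or unset variable then allows no cross-origin browser callers, which fails closed.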