dragonflight/docs/superpowers/specs/2026-05-27-auth-system-design.md

270 lines
15 KiB
Markdown
Raw Normal View History

# Dragonflight User Authentication — Design
**Status:** Approved, ready for implementation planning
**Date:** 2026-05-27
**Brainstormed with:** Zac
## Problem
Dragonflight has the skeleton of an auth system spread across the codebase:
- `users` table (`id`, `username`, `password_hash`, `display_name`, `role`)
- `sessions` table (`sid`, `sess`, `expire`) for `connect-pg-simple`
- `groups`, `user_groups`, `api_tokens` tables
- `SESSION_SECRET` env var
- `AUTH_ENABLED` env flag with boot-log toggle
- PR #26 frontend handler that bounces to `/login.html` on 401
- Issue #94 "session security fixes" deployed 2026-05-26 (commit `3ebe5d6`)
But the actual `express-session` middleware was never mounted in `services/mam-api/src/index.js`. There is no `/api/v1/auth/*` router. There is no `requireAuth` middleware. As a result, when `AUTH_ENABLED=true` was tried:
1. User submits login, server returns 200 OK from a stub endpoint.
2. No `Set-Cookie` is ever sent (no session middleware mounted).
3. The next request to a protected route returns 401.
4. Frontend bounces to `/login.html`.
5. **Infinite redirect loop.**
The prior attempts failed because auth was being built reactively in pieces, with no single source of truth for what "logged in" means.
## Goals
- One coherent, readable auth code path.
- Web UI logins survive page reloads and container restarts.
- Premiere panel can authenticate via long-lived bearer tokens.
- First-run setup works on a fresh install with no env var or CLI gymnastics.
- The whole auth flow can be exercised by automated tests, including a regression test for the redirect-loop failure mode.
## Non-goals (v1)
- MFA / TOTP.
- OAuth / OIDC delegation (Forgejo, Google, etc.).
- Per-project or per-recorder permissions. Flat access: logged in = full access.
- Email-based "forgot password" (no SMTP assumed; admin-reset only).
- Audit log of who-did-what (the `last_login_at` column is the minimum).
- Service-to-service auth for `node-agent` — keeps existing `019-node-token-binding` mechanism.
## Decisions
| Decision | Choice | Reasoning |
|---|---|---|
| Client surface | Web UI + Premiere panel | Two transports (cookies + bearer), one identity backend |
| Permission model | Flat (logged in = full access) | Small homogeneous operator population. `groups` / `user_groups` schemas stay inert. |
| Identity provider | Local username/password | On-prem broadcast operators won't tolerate OIDC roundtrips. Matches existing schema. |
| First-user bootstrap | First-run setup page | Hardest to mis-configure. No env vars to leak. No CLI to remember. |
| Session lifetime | 8h absolute + 1h sliding idle | Operator security posture, tighter than typical SaaS. |
| Auth library | Hand-rolled (`express-session` + `connect-pg-simple`) | Explicit, debuggable. Rejected JWT and Passport for this codebase. |
## Architecture
### Single source of truth
"Logged in" means exactly one of two things:
1. The request carries a valid `dragonflight.sid` cookie whose row in `sessions` hasn't expired and isn't past its 1h-idle or 8h-absolute window, OR
2. The request carries `Authorization: Bearer <token>` whose SHA-256 matches an `api_tokens` row that hasn't been revoked or expired.
Nothing else counts. No `localStorage` flags, no JWT, no client-side "I think I'm logged in" hints.
### One middleware, one check
`services/mam-api/src/middleware/auth.js` exposes a single `requireAuth` function:
```js
export async function requireAuth(req, res, next) {
// Dev mode preserved. The 'dev' user is a real row in `users` seeded at
// boot when AUTH_ENABLED !== 'true', so FK-bearing routes (api_tokens,
// future comments, audit fields) keep working without conditional logic.
if (process.env.AUTH_ENABLED !== 'true') {
req.user = DEV_USER; // { id: <UUID of seeded 'dev' user>, username: 'dev' }
return next();
}
// 1. Session check
if (req.session?.user_id) {
const now = Date.now();
if (now - req.session.first_seen_at > 8 * 3600 * 1000) return destroyAnd401(req, res);
if (now - req.session.last_seen_at > 1 * 3600 * 1000) return destroyAnd401(req, res);
req.session.last_seen_at = now;
req.user = await loadUser(req.session.user_id);
if (!req.user) return destroyAnd401(req, res);
return next();
}
// 2. Bearer check
const bearer = parseBearer(req.headers.authorization);
if (bearer) {
const hash = sha256hex(bearer);
const row = await pool.query(
`SELECT t.id, t.user_id, t.expires_at, u.username
FROM api_tokens t JOIN users u ON u.id = t.user_id
WHERE t.token_hash = $1`, [hash]);
if (row.rows.length && (!row.rows[0].expires_at || row.rows[0].expires_at > new Date())) {
pool.query(`UPDATE api_tokens SET last_used_at = NOW() WHERE id = $1`, [row.rows[0].id]).catch(() => {});
req.user = { id: row.rows[0].user_id, username: row.rows[0].username };
return next();
}
}
// 3. Otherwise
return res.status(401).json({ error: 'unauthorized' });
}
```
Mounted at the `/api/v1` level in `services/mam-api/src/index.js`, **before** the individual route mounts, with an allowlist for the three pre-login auth paths:
```js
app.use('/api/v1', (req, res, next) => {
const unauth = ['/auth/login', '/auth/setup', '/auth/setup-required'];
if (unauth.some(p => req.path === p)) return next();
return requireAuth(req, res, next);
});
// then: app.use('/api/v1/assets', assetsRouter), etc.
```
`/health` lives at the root, outside the `/api/v1` mount, so it's naturally unaffected. `/api/v1/cluster/*` keeps its existing `019-node-token-binding` service-auth path: requireAuth runs first, fails with 401 for an unauthenticated request, **but** the cluster routes themselves do their own token check on request bodies, so node-agent traffic must include a valid user session OR an api_token (which is the change — node-agent will need to be issued an api_token at install time). Alternative: carve `/api/v1/cluster/*` out of the requireAuth gate too, and keep node-agent on its existing binding token alone. Implementer should pick — flagged in the implementation order.
### Session middleware (actually wired this time)
In `services/mam-api/src/index.js`, **before any route**:
```js
import session from 'express-session';
import connectPgSimple from 'connect-pg-simple';
const PgStore = connectPgSimple(session);
if (process.env.TRUST_PROXY === 'true') app.set('trust proxy', 1);
app.use(session({
store: new PgStore({ pool, tableName: 'sessions', pruneSessionInterval: 60 * 15 }),
secret: process.env.SESSION_SECRET,
name: 'dragonflight.sid',
cookie: {
httpOnly: true,
sameSite: 'lax',
secure: process.env.TRUST_PROXY === 'true',
path: '/',
maxAge: 8 * 3600 * 1000,
},
rolling: false, // sliding renewal handled in requireAuth so we can enforce idle + absolute separately
resave: false,
saveUninitialized: false,
}));
```
### Auth router
`services/mam-api/src/routes/auth.js`:
| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/api/v1/auth/setup-required` | none | `{ required: bool }`. Cheap, no auth. |
| `POST` | `/api/v1/auth/setup` | none | Only succeeds if `users` is empty. Creates first user, logs them in. |
| `POST` | `/api/v1/auth/login` | none | `{ username, password }` -> 200 + cookie or 401 |
| `POST` | `/api/v1/auth/logout` | required | Destroys session row, clears cookie |
| `GET` | `/api/v1/auth/me` | required | `{ id, username, display_name }` |
| `POST` | `/api/v1/auth/password` | required | Change own password (requires current) |
| `GET/POST/DELETE` | `/api/v1/auth/users[/:id]` | required | User CRUD |
| `GET/POST/DELETE` | `/api/v1/auth/tokens[/:id]` | required | Current user's API tokens |
### Data model
Existing schema is almost right. One small migration:
```sql
-- services/mam-api/src/db/migrations/023-auth-session-timestamps.sql
ALTER TABLE users ADD COLUMN IF NOT EXISTS password_updated_at TIMESTAMPTZ DEFAULT NOW();
ALTER TABLE users ADD COLUMN IF NOT EXISTS last_login_at TIMESTAMPTZ;
-- idle / absolute timestamps live inside session.sess JSONB; no schema change needed
```
`groups` and `user_groups` stay as-is, unused for v1. `api_tokens` is already correctly shaped.
## Flows
### Browser login (the one that broke last time)
1. SPA boots, `<AuthGate>` calls `GET /api/v1/auth/me`.
2. `requireAuth` returns 401.
3. AuthGate calls `GET /api/v1/auth/setup-required`. If `true`, render Setup screen. Otherwise, render Login screen.
4. User submits `POST /api/v1/auth/login`. Server `bcrypt.compare`s, sets `req.session.user_id`, `first_seen_at`, `last_seen_at`. **Critical:** `await new Promise(r => req.session.save(r))` before responding, so the cookie is persisted to Postgres before the next request can arrive.
5. AuthGate re-calls `/api/v1/auth/me`, gets 200, renders the app.
**Why this doesn't loop:** the explicit `req.session.save()` callback before response guarantees the cookie row exists before the SPA can fire its next request. `requireAuth` returns a clean 401 (not a redirect) so the SPA decides what to render. The static `/login.html` is deleted; there is no HTML bounce.
### Premiere panel bearer
1. Web UI -> Settings -> API Tokens -> "New token" named "Premiere panel".
2. `POST /api/v1/auth/tokens` returns `{ token: 'dfl_<32 hex>', prefix: 'dfl_a3f2', id }` **exactly once**.
3. Premiere panel sends `Authorization: Bearer dfl_<...>` on every request. `requireAuth` SHA-256s it, looks up `api_tokens.token_hash`, updates `last_used_at`.
### Idle + absolute timeout (inside `requireAuth`)
```
if session present:
if now - session.first_seen_at > 8h -> destroy session, 401
if now - session.last_seen_at > 1h -> destroy session, 401
session.last_seen_at = now
req.user = lookup(session.user_id)
next()
```
Bearer tokens have their own optional `expires_at` (`NULL` = never expires); checked the same way.
## Frontend
- **`services/web-ui/src/auth-gate.jsx`** — new component that wraps the SPA. On mount: `GET /me`. On 401: check `setup-required`, render either Setup or Login. On 200: render the app shell.
- **Login screen** — layout B from brainstorm: 22px wordmark over "WILD DRAGON BROADCAST" tagline above a `--bg-1` card containing username, password, "Sign in" button. Matches DESIGN.md tokens.
- **Setup screen** — same chrome; fields = username, password, confirm password; button = "Create admin".
- **Settings -> Account section** — change password.
- **Settings -> API Tokens section** — list / create / revoke. New token shown exactly once with a copy affordance.
- **Fetch wrapper** — the central `ZAMPP_API.fetch` (already exists) gains a 401 handler that re-mounts AuthGate's Login state with the current path saved as `last_path`, restored after re-auth.
### Removed
- The static `/login.html` page (PR #26's bounce target) is deleted. SPA handles login internally; no full-page reload.
## Error handling
| Case | Behavior |
|---|---|
| Wrong username or password | `401 { error: 'invalid credentials' }`. Same message either way, no user enumeration. |
| Login rate limiting | Per-IP exponential backoff (1s, 2s, 4s, 8s, max 30s). In-memory `Map`. Single-instance limitation documented. |
| Idle / absolute expiry | 401 -> AuthGate Login. Last path saved, restored on re-auth. |
| Setup after first user exists | `409 { error: 'setup already complete' }`. Permanently disabled. |
| Token revoke | `DELETE /api/v1/auth/tokens/:id` — only owner can revoke. Subsequent bearer requests 401. |
| Delete-self when only user | `409 { error: 'cannot delete last user' }`. |
| Forgot password | No self-serve. Any logged-in user can reset another via `POST /api/v1/auth/users/:id/password`. Documented as the recovery path. |
| Password rules | Min 12 chars, no max, no character class requirements (NIST SP 800-63B). `bcrypt` cost 12. |
| CSRF | `SameSite=Lax` + same origin + required `X-Requested-With: dragonflight-ui` header on mutating requests (belt-and-suspenders). |
| Session table growth | `connect-pg-simple` `pruneSessionInterval: 60 * 15` (every 15 min). |
## Testing
- **Unit — `services/mam-api/test/middleware/auth.test.js`**: requireAuth with (a) no creds, (b) valid session, (c) idle-expired session, (d) absolute-expired session, (e) valid bearer, (f) invalid bearer, (g) bearer matching a deleted user.
- **Integration — `services/mam-api/test/auth.integration.test.js`**: spin up Express + test Postgres. Walks: setup -> login -> /me -> mutating call -> logout -> /me 401. Second pass: idle timeout simulated by mutating `last_seen_at` in DB. Third pass: bearer issue -> use -> revoke -> 401.
- **Regression test for the redirect-loop bug:** explicit test that after `POST /auth/login` returns 200, a subsequent `GET /auth/me` with the returned cookie returns 200 in the same test client. This is the test that would have caught the original failure.
- **Manual smoke (documented in PR):** fresh install -> setup -> create admin -> land on dashboard -> reload (stays logged in) -> wait 1h idle -> reload -> bounce to login.
## Implementation order
Suggested sequencing for the implementation plan (writing-plans will refine):
1. Migration `023-auth-session-timestamps.sql`. Add idempotent seed of the dev user (`INSERT ... ON CONFLICT DO NOTHING` with a fixed UUID) so dev mode FK-bearing routes work out of the box.
2. `express-session` + `connect-pg-simple` wiring in `index.js`.
3. `requireAuth` middleware (with `DEV_USER` constant resolved from the seeded row).
4. Auth router (setup, login, logout, me, password).
5. Apply `requireAuth` to API router with allowlist. Decide cluster carve-out (see Architecture).
6. Auth tests (unit + integration + regression).
7. Frontend `<AuthGate>` + Login screen + Setup screen.
8. Frontend Settings -> Account + API Tokens.
9. Delete `/login.html`.
10. User CRUD + token CRUD routes.
11. Rate limiting + CSRF header.
12. Documentation: README updates, `AUTH_ENABLED` transition notes.
## Out-of-band notes for the implementer
- The current `cors({ origin: true, credentials: true })` in `index.js` is too permissive once cookies start carrying authority. Tighten to a specific origin list (driven by an `ALLOWED_ORIGINS` env var) at the same time as wiring the session middleware — otherwise we're undoing the `SameSite=Lax` protection from the other side.
- node-agent -> mam-api traffic on `/api/v1/cluster/*` must keep working. Add a route-level carve-out comment that this path uses the existing `019-node-token-binding` token, not the user-auth path.
- The boot log currently says `Authentication: ENABLED` / `DISABLED (set AUTH_ENABLED=true for production)`. Once this lands, the recommended default flips: `AUTH_ENABLED=true` becomes the documented default in `.env.example` and the README, and `AUTH_ENABLED=false` is documented as a dev-only escape hatch.