docs(auth): spec for user auth system, brainstormed 2026-05-27
parent ad9e1ef5f1
commit 183e10f8e6
1 changed file with 256 additions and 0 deletions
docs/superpowers/specs/2026-05-27-auth-system-design.md
@@ -0,0 +1,256 @@

# Dragonflight User Authentication — Design

**Status:** Approved, ready for implementation planning
**Date:** 2026-05-27
**Brainstormed with:** Zac

## Problem

Dragonflight has the skeleton of an auth system spread across the codebase:

- `users` table (`id`, `username`, `password_hash`, `display_name`, `role`)
- `sessions` table (`sid`, `sess`, `expire`) for `connect-pg-simple`
- `groups`, `user_groups`, `api_tokens` tables
- `SESSION_SECRET` env var
- `AUTH_ENABLED` env flag with boot-log toggle
- PR #26 frontend handler that bounces to `/login.html` on 401
- Issue #94 "session security fixes" deployed 2026-05-26 (commit `3ebe5d6`)

But the actual `express-session` middleware was never mounted in `services/mam-api/src/index.js`. There is no `/api/v1/auth/*` router. There is no `requireAuth` middleware. As a result, when `AUTH_ENABLED=true` was tried:

1. User submits login, server returns 200 OK from a stub endpoint.
2. No `Set-Cookie` is ever sent (no session middleware mounted).
3. The next request to a protected route returns 401.
4. Frontend bounces to `/login.html`.
5. **Infinite redirect loop.**

The prior attempts failed because auth was being built reactively in pieces, with no single source of truth for what "logged in" means.

## Goals

- One coherent, readable auth code path.
- Web UI logins survive page reloads and container restarts.
- Premiere panel can authenticate via long-lived bearer tokens.
- First-run setup works on a fresh install with no env var or CLI gymnastics.
- The whole auth flow can be exercised by automated tests, including a regression test for the redirect-loop failure mode.

## Non-goals (v1)

- MFA / TOTP.
- OAuth / OIDC delegation (Forgejo, Google, etc.).
- Per-project or per-recorder permissions. Flat access: logged in = full access.
- Email-based "forgot password" (no SMTP assumed; admin-reset only).
- Audit log of who-did-what (the `last_login_at` column is the minimum).
- Service-to-service auth for `node-agent`, which keeps the existing `019-node-token-binding` mechanism.

## Decisions

| Decision | Choice | Reasoning |
|---|---|---|
| Client surface | Web UI + Premiere panel | Two transports (cookies + bearer), one identity backend |
| Permission model | Flat (logged in = full access) | Small homogeneous operator population. `groups` / `user_groups` schemas stay inert. |
| Identity provider | Local username/password | On-prem broadcast operators won't tolerate OIDC roundtrips. Matches existing schema. |
| First-user bootstrap | First-run setup page | Hardest to mis-configure. No env vars to leak. No CLI to remember. |
| Session lifetime | 8h absolute + 1h sliding idle | Operator security posture, tighter than typical SaaS. |
| Auth library | Hand-rolled (`express-session` + `connect-pg-simple`) | Explicit, debuggable. Rejected JWT and Passport for this codebase. |

## Architecture

### Single source of truth

"Logged in" means exactly one of two things:

1. The request carries a valid `dragonflight.sid` cookie whose row in `sessions` hasn't expired and isn't past its 1h-idle or 8h-absolute window, OR
2. The request carries `Authorization: Bearer <token>` whose SHA-256 matches an `api_tokens` row that hasn't been revoked or expired.

Nothing else counts. No `localStorage` flags, no JWT, no client-side "I think I'm logged in" hints.

### One middleware, one check

`services/mam-api/src/middleware/auth.js` exposes a single `requireAuth` function:

```js
// `pool`, `loadUser`, `destroyAnd401`, `parseBearer`, and `sha256hex` are
// module-level helpers in this file, not shown here.
const ABSOLUTE_MS = 8 * 3600 * 1000; // 8h since login
const IDLE_MS = 1 * 3600 * 1000;     // 1h since last request

export async function requireAuth(req, res, next) {
  // Dev mode preserved
  if (process.env.AUTH_ENABLED !== 'true') {
    req.user = { id: 'dev', username: 'dev' };
    return next();
  }

  // 1. Session check
  if (req.session?.user_id) {
    const now = Date.now();
    const { first_seen_at, last_seen_at } = req.session;
    // A missing timestamp would turn the comparisons below into NaN checks
    // (always false), so treat it as expired.
    if (!Number.isFinite(first_seen_at) || !Number.isFinite(last_seen_at)) return destroyAnd401(req, res);
    if (now - first_seen_at > ABSOLUTE_MS) return destroyAnd401(req, res);
    if (now - last_seen_at > IDLE_MS) return destroyAnd401(req, res);
    req.session.last_seen_at = now; // slide the idle window
    req.user = await loadUser(req.session.user_id);
    if (!req.user) return destroyAnd401(req, res); // user deleted since login
    return next();
  }

  // 2. Bearer check
  const bearer = parseBearer(req.headers.authorization);
  if (bearer) {
    const hash = sha256hex(bearer);
    const { rows } = await pool.query(
      `SELECT t.id, t.user_id, t.expires_at, u.username
         FROM api_tokens t JOIN users u ON u.id = t.user_id
        WHERE t.token_hash = $1`, [hash]);
    const token = rows[0];
    if (token && (!token.expires_at || token.expires_at > new Date())) {
      // Fire-and-forget: a failed last_used_at update must not fail the request.
      pool.query(`UPDATE api_tokens SET last_used_at = NOW() WHERE id = $1`, [token.id]).catch(() => {});
      req.user = { id: token.user_id, username: token.username };
      return next();
    }
  }

  // 3. Otherwise
  return res.status(401).json({ error: 'unauthorized' });
}
```

Mounted at the `/api/v1` level in `services/mam-api/src/index.js`, with an allowlist for `/api/v1/auth/login`, `/api/v1/auth/setup`, `/api/v1/auth/setup-required`, and `/health`.
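
One way to apply the allowlist is a thin wrapper that skips the auth middleware for public paths. This is a minimal sketch rather than the project's actual code: `PUBLIC_PATHS`, `isPublicPath`, and `authUnlessPublic` are hypothetical names. It relies on Express exposing the mount prefix as `req.baseUrl` and the remainder as `req.path`, so the check sees the full path whether the wrapper is mounted at `/api/v1` or at the app root.

```js
// Hypothetical names; the path list is the allowlist described above.
const PUBLIC_PATHS = new Set([
  '/api/v1/auth/login',
  '/api/v1/auth/setup',
  '/api/v1/auth/setup-required',
  '/health',
]);

// Exact match only (plus one optional trailing slash), so a longer path
// that merely starts with a public prefix is still protected.
function isPublicPath(fullPath) {
  const normalized =
    fullPath.length > 1 && fullPath.endsWith('/') ? fullPath.slice(0, -1) : fullPath;
  return PUBLIC_PATHS.has(normalized);
}

// Wraps an auth middleware (e.g. requireAuth) so public paths bypass it.
function authUnlessPublic(authMiddleware) {
  return function (req, res, next) {
    const fullPath = (req.baseUrl || '') + req.path;
    if (isPublicPath(fullPath)) return next();
    return authMiddleware(req, res, next);
  };
}
```

Under these assumptions the mount would look like `app.use('/api/v1', authUnlessPublic(requireAuth))`, placed after the session middleware and before the API routers.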

### Session middleware (actually wired this time)

In `services/mam-api/src/index.js`, **before any route**:

```js
import session from 'express-session';
import connectPgSimple from 'connect-pg-simple';
const PgStore = connectPgSimple(session);

if (process.env.TRUST_PROXY === 'true') app.set('trust proxy', 1);

app.use(session({
  store: new PgStore({ pool, tableName: 'sessions', pruneSessionInterval: 60 * 15 }),
  secret: process.env.SESSION_SECRET,
  name: 'dragonflight.sid',
  cookie: {
    httpOnly: true,
    sameSite: 'lax',
    secure: process.env.TRUST_PROXY === 'true',
    path: '/',
    maxAge: 8 * 3600 * 1000,
  },
  rolling: false, // sliding renewal handled in requireAuth so we can enforce idle + absolute separately
  resave: false,
  saveUninitialized: false,
}));
```

### Auth router

`services/mam-api/src/routes/auth.js`:

| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/api/v1/auth/setup-required` | none | `{ required: bool }`. Cheap, no auth. |
| `POST` | `/api/v1/auth/setup` | none | Only succeeds if `users` is empty. Creates first user, logs them in. |
| `POST` | `/api/v1/auth/login` | none | `{ username, password }` -> 200 + cookie or 401 |
| `POST` | `/api/v1/auth/logout` | required | Destroys session row, clears cookie |
| `GET` | `/api/v1/auth/me` | required | `{ id, username, display_name }` |
| `POST` | `/api/v1/auth/password` | required | Change own password (requires current) |
| `GET/POST/DELETE` | `/api/v1/auth/users[/:id]` | required | User CRUD |
| `GET/POST/DELETE` | `/api/v1/auth/tokens[/:id]` | required | Current user's API tokens |

### Data model

Existing schema is almost right. One small migration:

```sql
-- services/mam-api/src/db/migrations/023-auth-session-timestamps.sql
ALTER TABLE users ADD COLUMN IF NOT EXISTS password_updated_at TIMESTAMPTZ DEFAULT NOW();
ALTER TABLE users ADD COLUMN IF NOT EXISTS last_login_at TIMESTAMPTZ;
-- idle / absolute timestamps live inside sessions.sess JSONB; no schema change needed
```

`groups` and `user_groups` stay as-is, unused for v1. `api_tokens` is already correctly shaped.

## Flows

### Browser login (the one that broke last time)

1. SPA boots, `<AuthGate>` calls `GET /api/v1/auth/me`.
2. `requireAuth` returns 401.
3. AuthGate calls `GET /api/v1/auth/setup-required`. If `true`, render Setup screen. Otherwise, render Login screen.
4. User submits `POST /api/v1/auth/login`. Server `bcrypt.compare`s, sets `req.session.user_id`, `first_seen_at`, `last_seen_at`. **Critical:** `await new Promise((resolve, reject) => req.session.save(err => (err ? reject(err) : resolve())))` before responding, so the session row is persisted to Postgres before the next request can arrive, and a failed write surfaces as an error instead of a 200.
5. AuthGate re-calls `/api/v1/auth/me`, gets 200, renders the app.

**Why this doesn't loop:** the explicit `req.session.save()` callback before response guarantees the session row exists before the SPA can fire its next request. `requireAuth` returns a clean 401 (not a redirect) so the SPA decides what to render. The static `/login.html` is deleted; there is no HTML bounce.
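
Step 4 of the login flow can be sketched as follows. This is an illustration under stated assumptions, not the final handler: `saveSession` and `makeLoginHandler` are hypothetical names, `verifyCredentials(username, password)` is an assumed helper that resolves to a user row or `null` (it would wrap the `bcrypt.compare` call), and the 500 response body is invented for the example. The only library call used is `req.session.save(callback)` from `express-session`.

```js
// Promise wrapper around express-session's callback-style save().
// Rejects on store errors so a failed write never produces a 200.
function saveSession(session) {
  return new Promise((resolve, reject) => {
    session.save((err) => (err ? reject(err) : resolve()));
  });
}

// Hypothetical factory; `verifyCredentials` is injected so the handler
// can be exercised without a database.
function makeLoginHandler(verifyCredentials) {
  return async function login(req, res) {
    const { username, password } = req.body ?? {};
    const user = await verifyCredentials(username, password);
    if (!user) return res.status(401).json({ error: 'invalid credentials' });

    const now = Date.now();
    req.session.user_id = user.id;
    req.session.first_seen_at = now; // anchors the 8h absolute window
    req.session.last_seen_at = now;  // anchors the 1h idle window

    try {
      // The session row is in the store before the response is sent.
      await saveSession(req.session);
    } catch (err) {
      return res.status(500).json({ error: 'session store unavailable' });
    }
    return res.status(200).json({ id: user.id, username: user.username });
  };
}
```

Because the response is only sent after `saveSession` resolves, the SPA's follow-up `GET /api/v1/auth/me` cannot race the session write.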

### Premiere panel bearer

1. Web UI -> Settings -> API Tokens -> "New token" named "Premiere panel".
2. `POST /api/v1/auth/tokens` returns `{ token: 'dfl_<32 hex>', prefix: 'dfl_a3f2', id }` **exactly once**.
3. Premiere panel sends `Authorization: Bearer dfl_<...>` on every request. `requireAuth` SHA-256s it, looks up `api_tokens.token_hash`, updates `last_used_at`.
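
The token shape and hashing in the steps above could be implemented along these lines. Treat it as a sketch: `generateToken` is a hypothetical name, and the bodies of `sha256hex` and `parseBearer` are assumptions about helpers that `requireAuth` references but the spec does not define. Only Node's built-in `node:crypto` module is used.

```js
import { createHash, randomBytes } from 'node:crypto';

// Hex-encoded SHA-256; this is the only form of the token that is stored.
function sha256hex(value) {
  return createHash('sha256').update(value, 'utf8').digest('hex');
}

// 16 random bytes -> 32 hex chars, matching the `dfl_<32 hex>` shape.
function generateToken() {
  const token = 'dfl_' + randomBytes(16).toString('hex');
  return {
    token,                       // returned to the caller exactly once
    prefix: token.slice(0, 8),   // e.g. 'dfl_a3f2', safe to show in token lists
    tokenHash: sha256hex(token), // value for api_tokens.token_hash
  };
}

// Returns the token from an `Authorization: Bearer <token>` header, or null.
function parseBearer(header) {
  if (typeof header !== 'string') return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match ? match[1] : null;
}
```

Storing only the hash means a leaked `api_tokens` table does not reveal usable tokens, while the short prefix still lets a user recognise which token is which in the Settings list.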

### Idle + absolute timeout (inside `requireAuth`)

```
if session present:
  if now - session.first_seen_at > 8h -> destroy session, 401
  if now - session.last_seen_at > 1h -> destroy session, 401
  session.last_seen_at = now
  req.user = lookup(session.user_id)
  if req.user not found -> destroy session, 401
  next()
```

Bearer tokens have their own optional `expires_at` (`NULL` = never expires), which `requireAuth` compares against the current time on every request.
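
The two session windows can be isolated into a pure function, which makes the idle and absolute cases easy to unit-test with a fixed clock. A minimal sketch, assuming timestamps are stored as epoch milliseconds; `sessionWindowStatus` and its return labels are hypothetical.

```js
const ABSOLUTE_MS = 8 * 3600 * 1000; // 8h since login
const IDLE_MS = 1 * 3600 * 1000;     // 1h since last request

// Returns 'ok', 'absolute-expired', or 'idle-expired'.
// Missing or non-numeric timestamps count as expired: comparing against
// NaN is always false, which would otherwise keep the session alive forever.
function sessionWindowStatus(session, now = Date.now()) {
  const { first_seen_at: first, last_seen_at: last } = session;
  if (!Number.isFinite(first) || !Number.isFinite(last)) return 'absolute-expired';
  if (now - first > ABSOLUTE_MS) return 'absolute-expired';
  if (now - last > IDLE_MS) return 'idle-expired';
  return 'ok';
}
```

The absolute check runs first, so a session that is past both limits reports the absolute expiry.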

## Frontend

- **`services/web-ui/src/auth-gate.jsx`**: new component that wraps the SPA. On mount: `GET /me`. On 401: check `setup-required`, render either Setup or Login. On 200: render the app shell.
- **Login screen**: layout B from brainstorm: 22px wordmark over "WILD DRAGON BROADCAST" tagline above a `--bg-1` card containing username, password, "Sign in" button. Matches DESIGN.md tokens.
- **Setup screen**: same chrome; fields = username, password, confirm password; button = "Create admin".
- **Settings -> Account section**: change password.
- **Settings -> API Tokens section**: list / create / revoke. New token shown exactly once with a copy affordance.
- **Fetch wrapper**: the central `ZAMPP_API.fetch` (already exists) gains a 401 handler that re-mounts AuthGate's Login state with the current path saved as `last_path`, restored after re-auth.
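
The AuthGate decision on mount reduces to a small async function. The sketch below is illustrative only: `resolveAuthScreen` is a hypothetical name, and `fetchFn` is injected (it would be `window.fetch` or a thin adapter in the browser) because the signature of `ZAMPP_API.fetch` is not described here.

```js
// Decides which screen AuthGate should render: 'app', 'setup', or 'login'.
// `fetchFn` must behave like the Fetch API: resolve to an object with
// `ok`, `status`, and an async `json()` method.
async function resolveAuthScreen(fetchFn) {
  const me = await fetchFn('/api/v1/auth/me', { credentials: 'same-origin' });
  if (me.ok) return { screen: 'app', user: await me.json() };
  // Anything other than 401 is a real error, not "logged out".
  if (me.status !== 401) throw new Error(`unexpected /me status ${me.status}`);

  const setup = await fetchFn('/api/v1/auth/setup-required', { credentials: 'same-origin' });
  const { required } = await setup.json();
  return { screen: required ? 'setup' : 'login' };
}
```

Treating only 401 as "logged out" keeps a transient 500 from dropping an operator onto the Login screen.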

### Removed

- The static `/login.html` page (PR #26's bounce target) is deleted. SPA handles login internally; no full-page reload.

## Error handling

| Case | Behavior |
|---|---|
| Wrong username or password | `401 { error: 'invalid credentials' }`. Same message either way, no user enumeration. |
| Login rate limiting | Per-IP exponential backoff (1s, 2s, 4s, 8s, max 30s). In-memory `Map`. Single-instance limitation documented. |
| Idle / absolute expiry | 401 -> AuthGate Login. Last path saved, restored on re-auth. |
| Setup after first user exists | `409 { error: 'setup already complete' }`. Permanently disabled. |
| Token revoke | `DELETE /api/v1/auth/tokens/:id`; only the owner can revoke. Subsequent bearer requests 401. |
| Delete-self when only user | `409 { error: 'cannot delete last user' }`. |
| Forgot password | No self-serve. Any logged-in user can reset another via `POST /api/v1/auth/users/:id/password`. Documented as the recovery path. |
| Password rules | Min 12 chars, no max, no character class requirements (NIST SP 800-63B). `bcrypt` cost 12. Note that `bcrypt` only hashes the first 72 bytes of its input. |
| CSRF | `SameSite=Lax` + same origin + required `X-Requested-With: dragonflight-ui` header on mutating requests (belt-and-suspenders). |
| Session table growth | `connect-pg-simple` `pruneSessionInterval: 60 * 15` (every 15 min). |
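
The login rate-limiting row can be sketched with a plain `Map`. This is one possible reading, not a settled design: all function names are hypothetical, and it assumes the delay doubles on each consecutive failure, which puts a 16s step between the listed 8s and the 30s cap.

```js
const BASE_MS = 1000;
const MAX_MS = 30_000;
const failures = new Map(); // ip -> { count, blockedUntil }

// Delay after the nth consecutive failure: 1s, 2s, 4s, 8s, 16s, then 30s.
function backoffMs(count) {
  return Math.min(MAX_MS, BASE_MS * 2 ** (count - 1));
}

// Call after a failed login attempt from `ip`.
function recordFailure(ip, now = Date.now()) {
  const count = (failures.get(ip)?.count ?? 0) + 1;
  failures.set(ip, { count, blockedUntil: now + backoffMs(count) });
}

// Call after a successful login; clears the backoff for that IP.
function recordSuccess(ip) {
  failures.delete(ip);
}

// Milliseconds the caller must still wait, or 0 if an attempt is allowed.
function retryAfterMs(ip, now = Date.now()) {
  const entry = failures.get(ip);
  return entry ? Math.max(0, entry.blockedUntil - now) : 0;
}
```

The login route would check `retryAfterMs(req.ip)` before verifying credentials and reject the attempt while it is non-zero. Entries in this sketch are never pruned and live in one process only, which is the single-instance limitation noted in the table.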

## Testing

- **Unit (`services/mam-api/test/middleware/auth.test.js`):** `requireAuth` with (a) no creds, (b) valid session, (c) idle-expired session, (d) absolute-expired session, (e) valid bearer, (f) invalid bearer, (g) bearer matching a deleted user.
- **Integration (`services/mam-api/test/auth.integration.test.js`):** spin up Express + test Postgres. Walks: setup -> login -> /me -> mutating call -> logout -> /me 401. Second pass: idle timeout simulated by mutating `last_seen_at` in DB. Third pass: bearer issue -> use -> revoke -> 401.
- **Regression test for the redirect-loop bug:** explicit test that after `POST /api/v1/auth/login` returns 200, a subsequent `GET /api/v1/auth/me` with the returned cookie returns 200 in the same test client. This is the test that would have caught the original failure.
- **Manual smoke (documented in PR):** fresh install -> setup -> create admin -> land on dashboard -> reload (stays logged in) -> wait 1h idle -> reload -> bounce to login.

## Implementation order

Suggested sequencing for the implementation plan (writing-plans will refine):

1. Migration `023-auth-session-timestamps.sql`.
2. `express-session` + `connect-pg-simple` wiring in `index.js`.
3. `requireAuth` middleware.
4. Auth router (setup, login, logout, me, password).
5. Apply `requireAuth` to API router with allowlist.
6. Auth tests (unit + integration + regression).
7. Frontend `<AuthGate>` + Login screen + Setup screen.
8. Frontend Settings -> Account + API Tokens.
9. Delete `/login.html`.
10. User CRUD + token CRUD routes.
11. Rate limiting + CSRF header.
12. Documentation: README updates, `AUTH_ENABLED` transition notes.

## Out-of-band notes for the implementer

- The current `cors({ origin: true, credentials: true })` in `index.js` is too permissive once cookies start carrying authority. Tighten to a specific origin list (driven by an `ALLOWED_ORIGINS` env var) at the same time as wiring the session middleware; otherwise we're undoing the `SameSite=Lax` protection from the other side.
- node-agent -> mam-api traffic on `/api/v1/cluster/*` must keep working. Add a route-level carve-out comment that this path uses the existing `019-node-token-binding` token, not the user-auth path.
- The boot log currently says `Authentication: ENABLED` / `DISABLED (set AUTH_ENABLED=true for production)`. Once this lands, the recommended default flips: `AUTH_ENABLED=true` becomes the documented default in `.env.example` and the README, and `AUTH_ENABLED=false` is documented as a dev-only escape hatch.
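
For the CORS note, one way to turn `ALLOWED_ORIGINS` into options for the `cors` package is sketched below. The comma-separated format of the variable and both function names are assumptions; the `cors` package does accept an array of origin strings for its `origin` option.

```js
// Parses a comma-separated ALLOWED_ORIGINS value into exact origin strings,
// trimming whitespace and trailing slashes and dropping empty entries.
function parseAllowedOrigins(raw) {
  return (raw ?? '')
    .split(',')
    .map((origin) => origin.trim().replace(/\/+$/, ''))
    .filter(Boolean);
}

// Options object for cors(): only the listed origins are reflected, and
// credentials (the session cookie) are allowed for those origins only.
function buildCorsOptions(raw) {
  return { origin: parseAllowedOrigins(raw), credentials: true };
}
```

Usage would be `app.use(cors(buildCorsOptions(process.env.ALLOWED_ORIGINS)))`. With the variable unset the list is empty, so no cross-origin caller is granted access and only same-origin requests carry the cookie.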