diff --git a/docs/design/2026-05-03-datarhei-dragon-fork-webrtc-prometheus-metrics-design.md b/docs/design/2026-05-03-datarhei-dragon-fork-webrtc-prometheus-metrics-design.md
new file mode 100644
index 0000000..ae4b697
--- /dev/null
+++ b/docs/design/2026-05-03-datarhei-dragon-fork-webrtc-prometheus-metrics-design.md
@@ -0,0 +1,666 @@
+# Datarhei - Dragon Fork: WebRTC Prometheus Metrics
+
+**Status:** Draft for review
+**Author:** Zac (Wild Dragon)
+**Date:** 2026-05-03
+**Predecessors:**
+- [`2026-04-16-datarhei-dragon-fork-webrtc-design.md`](2026-04-16-datarhei-dragon-fork-webrtc-design.md)
+- [`2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`](2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md)
+- v0.1.0-dragonfork released 2026-05-03
+
+---
+
+## Summary
+
+Add Prometheus instrumentation to Dragon Fork's WebRTC subsystem and ship a
+collection-and-dashboard stack in the existing TrueNAS deploy bundle. Closes
+the v0.1 observability gap: the WHEP egress has been running in production
+since 2026-04-17 with zero per-subsystem signal.
+
+The deliverable is a RED-method dashboard ("rate, errors, duration") that
+answers a single operator question — _is the WebRTC stack healthy right now?_
+Eleven new metrics in the `dragonfork_webrtc_*` namespace, two new containers
+(Prometheus + Grafana) in `deploy/truenas/core/`, four pre-loaded alert rules,
+one pre-provisioned dashboard.
+
+## Goals
+
+- Operator can answer "is WebRTC healthy right now?" from a single Grafana
+  dashboard, without tailing logs or hitting the API.
+- Per-stream drill-down available when the dashboard goes red — labels carry
+  `stream_id` everywhere it's meaningful, never `peer_id`.
+- Deploy is one-command on a fresh TrueNAS box (`docker compose up -d`),
+  matching the existing v0.1 deploy ergonomics.
+- Backwards-compatible: zero changes to upstream's `/metrics` payload. New
+  metrics are purely additive.
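+- Bucket choices and label sets are tuned for the latency ranges the
+  histograms actually cover (WHEP request handling and seconds-scale ICE
+  establishment). The ≈240µs server-hop p95 observed in v0.1 sits below the
+  smallest bucket and stays out of Prometheus (see Non-Goals).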
+
+## Non-Goals
+
+- **Alertmanager bundling.** Alert rules are loaded into Prometheus but not
+  routed. Paging configuration is too opinionated to ship a default; separate
+  spec if/when paging is wanted.
+- **Per-peer metric labels.** Peer-level forensics (individual session
+  lifetimes, per-resource teardown reasons) is out of scope. `peer_id` is
+  unbounded under churn and risks cardinality bloat.
+- **Federated multi-Core scrape.** Single-deploy scrape config only. The
+  `core` label is set statically to `dragonfork-truenas`.
+- **Latency p95 CI gate via Prometheus.** Server-hop latency stays a Go
+  test gate (`-tags latency`); not a Prometheus histogram.
+- **Server-hop microsecond histogram.** The 240µs server-hop is well below
+  HTTP request scales and would need its own bucket set; it's already
+  covered by the latency CI test, no need to duplicate in Prom.
+- **Custom monitor/metric bus integration.** Upstream collectors pull from
+  `monitor/metric.Reader`. The WebRTC metrics do not; see "Why not pure
+  snapshot?" under Approach for the rationale.
+
+## Context
+
+v0.1 surface area:
+
+- WHEP HTTP routes: `POST /api/v3/whep/{id}`, `DELETE /api/v3/whep/{id}/{r}`,
+  `PATCH /api/v3/whep/{id}/{r}`, plus admin `GET /api/v3/webrtc/streams`
+  and `GET /api/v3/webrtc/streams/{id}/peers`.
+- Error matrix in v0.1: `406` codec mismatch, `503` cap reached (split into
+  global vs per-stream in response body), `504` ICE timeout, `204` DELETE
+  idempotent, `404` unknown stream.
+- Pion-mediated peer connection lifecycle in `app/webrtc/lifecycle.go` —
+  ICE state transitions are the natural hook for ICE timing/failure metrics.
+- FFmpeg RTP output legs supervised by the existing process supervisor;
+  silent leg failure is a known "quietly degrading" risk worth instrumenting.
+
+Existing Prometheus integration (upstream):
+
+- `prometheus/prometheus.go` exposes a `Metrics` interface with `Register`
+  and an `HTTPHandler()`. Single shared `prometheus.Registry`.
+- `prometheus/restream.go` is the reference collector — pulls from
+  `monitor/metric.Reader` via `metric.Pattern` queries, emits via
+  `prometheus.MustNewConstMetric`. All upstream collectors carry a `core`
+  label as the first dimension.
+- `/metrics` endpoint already exposed by Core; auth handled at the same
+  layer as the rest of the API.
+
+## Approach
+
+**Hybrid instrumentation, in two surfaces:**
+
+1. **Direct `prometheus/client_golang` instrumentation** in `app/webrtc/`
+   for hot-path counters and histograms (request rate, request duration,
+   ICE establishment duration, error counters by reason). Histograms can't
+   be reconstructed from a scrape-time snapshot, so this is non-negotiable
+   for RED-method.
+
+2. **Snapshot-style collector** in `prometheus/webrtc.go` for slow-changing
+   gauges (active streams, active peers per stream, UDP port pool usage).
+   Calls a new `Stats()` method on the WebRTC subsystem at scrape time.
+
+Both surfaces register against the same `prometheus.Registerer` exposed by
+`prometheus.Metrics`. No new HTTP endpoint, no new auth path. Both take a
+`core` first-label dimension to match upstream collector convention.
+
+### Why not pure snapshot?
+
+Upstream's `prometheus/restream.go` pulls from a `monitor/metric` bus that
+the FFmpeg supervision layer writes into. We could mirror that for WebRTC
+— have `app/webrtc/lifecycle.go` and `handler.go` push events onto the bus,
+have `prometheus/webrtc.go` pull them. Two reasons not to:
+
+- **Histograms don't fit the pattern.** The bus stores point-in-time values
+  (gauges and counters), not distributions. RED-method needs duration p50
+  and p95; you'd end up maintaining an in-process sliding-window quantile
+  estimator inside the WebRTC subsystem, which is more code than just using
+  `client_golang.Histogram` directly.
+- **The bus is FFmpeg-shaped.** `metric.Pattern` queries are designed for
+  process-state metrics (process IDs, FFmpeg states). Bolting WebRTC
+  semantics on requires defining new patterns the bus consumers all need
+  to know about, for a payload only the WebRTC collector cares about.
+
+The hybrid keeps each metric type on the cleanest path. The cost is two
+patterns in the codebase instead of one — accepted, with a comment in
+`prometheus/webrtc.go` pointing at this rationale so the next contributor
+doesn't try to "fix" the divergence.
+
+### Why not pure direct?
+
+Pure `client_golang` everywhere would mean the gauges (active streams,
+active peers, UDP ports) sit in `app/webrtc/` alongside histograms. Workable,
+but loses the "one collector file per subsystem in `prometheus/`" pattern
+that anyone reading the repo's existing structure would expect. Snapshot
+gauges are cheap to implement via the existing pattern, so we keep them
+where a casual reader would look.
+
+## Module Layout
+
+### New files
+
+```
+app/webrtc/metrics.go       (~150 LOC)
+app/webrtc/metrics_test.go  (~200 LOC)
+prometheus/webrtc.go        (~120 LOC)
+prometheus/webrtc_test.go   (~150 LOC)
+deploy/truenas/core/prom/prometheus.yml
+deploy/truenas/core/prom/rules/webrtc-alerts.yml
+deploy/truenas/core/grafana/provisioning/datasources/prometheus.yml
+deploy/truenas/core/grafana/provisioning/dashboards/webrtc.yml
+deploy/truenas/core/grafana/dashboards/dragonfork-webrtc-health.json
+```
+
+### Modified files
+
+```
+app/webrtc/handler.go       — add metric middleware around WHEP routes
+app/webrtc/lifecycle.go     — record ICE timing in OnConnectionStateChange
+app/webrtc/subsystem.go     — add Stats() method, instrument process hooks
+deploy/truenas/core/docker-compose.yml  — add prom + grafana services
+deploy/truenas/core/README.md           — document new env vars + ports
+README.md                   — quick-start mentions Grafana URL
+CHANGELOG.md                — v0.2.0-dragonfork section
+```
+
+### `app/webrtc/metrics.go` — direct instrumentation
+
+`promauto`-registered into the shared registry and held in a single
+package-level `*metrics` value so `handler.go` and `lifecycle.go` can
+increment without dependency injection. `Init(reg prometheus.Registerer,
+core string)` builds that value via `newMetrics` and is called once from
+`subsystem.New` after the registry is available.
+
+```go
+// Sketch — exact wire format finalized at implementation.
+package webrtc
+
+import (
+    "github.com/prometheus/client_golang/prometheus"
+    "github.com/prometheus/client_golang/prometheus/promauto"
+)
+
+var histBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
+
+type metrics struct {
+    whepRequests          *prometheus.CounterVec   // route, code, stream_id
+    whepRequestDuration   *prometheus.HistogramVec // route, stream_id
+    iceEstablishment      *prometheus.HistogramVec // stream_id, result
+    iceFailures           *prometheus.CounterVec   // stream_id, reason
+    codecMismatches       *prometheus.CounterVec   // stream_id, kind
+    capRejections         *prometheus.CounterVec   // stream_id, scope
+    ffmpegLegFailures     *prometheus.CounterVec   // stream_id, leg
+}
+
+func newMetrics(reg prometheus.Registerer, core string) *metrics {
+    factory := promauto.With(reg)
+    return &metrics{
+        whepRequests: factory.NewCounterVec(prometheus.CounterOpts{
+            Name:        "dragonfork_webrtc_whep_requests_total",
+            Help:        "Count of WHEP requests by route, status code, and stream.",
+            ConstLabels: prometheus.Labels{"core": core},
+        }, []string{"route", "code", "stream_id"}),
+        // ... etc
+    }
+}
+```
+
+The `core` label is a `ConstLabels` (set once at construction) rather than a
+per-request dimension — matches the upstream collector pattern and avoids
+threading it through every call site.
+
+### `prometheus/webrtc.go` — snapshot collector
+
+Standard `prometheus.Collector` interface (Describe / Collect). Keeps a
+reference to a `WebRTCStatsSource` interface, which the WebRTC subsystem
+implements via its `Stats()` method. Avoids importing `app/webrtc` from
+`prometheus/` — the dependency arrow points the right way.
+
+```go
+// Sketch.
+type WebRTCStatsSource interface {
+    Stats() WebRTCStats
+}
+
+type WebRTCStats struct {
+    StreamCount        int
+    PeersByStream      map[string]int
+    UDPPortsInUse      int
+    UDPPortsAvailable  int
+}
+
+type webrtcCollector struct {
+    core   string
+    source WebRTCStatsSource
+
+    activeStreamsDesc      *prometheus.Desc
+    activePeersDesc        *prometheus.Desc
+    udpPortsInUseDesc      *prometheus.Desc
+    udpPortsAvailableDesc  *prometheus.Desc
+}
+
+func NewWebRTCCollector(core string, source WebRTCStatsSource) prometheus.Collector { ... }
+```
+
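+The `WebRTCStats` type lives in `prometheus/webrtc.go` (not in `app/webrtc/`)
+so the dependency stays one-directional: `app/webrtc` imports `prometheus/`
+for the `WebRTCStats` return type, and `prometheus/` never imports
+`app/webrtc`. The subsystem satisfies `WebRTCStatsSource` implicitly by
+having a matching `Stats()` method.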
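+
+A minimal sketch of how the elided collector body might look, written as a
+standalone program rather than the fork's actual `prometheus/webrtc.go`. It
+assumes `core` is attached as a constant label on each `Desc` (upstream may
+pass it as the first variable label instead), and `fakeSource` is a
+hypothetical stand-in for the WebRTC subsystem.
+
+```go
+package main
+
+import (
+	"fmt"
+
+	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/testutil"
+)
+
+// WebRTCStats and WebRTCStatsSource mirror the shapes sketched in this section.
+type WebRTCStats struct {
+	StreamCount       int
+	PeersByStream     map[string]int
+	UDPPortsInUse     int
+	UDPPortsAvailable int
+}
+
+type WebRTCStatsSource interface {
+	Stats() WebRTCStats
+}
+
+type webrtcCollector struct {
+	source WebRTCStatsSource
+
+	activeStreamsDesc     *prometheus.Desc
+	activePeersDesc       *prometheus.Desc
+	udpPortsInUseDesc     *prometheus.Desc
+	udpPortsAvailableDesc *prometheus.Desc
+}
+
+func NewWebRTCCollector(core string, source WebRTCStatsSource) prometheus.Collector {
+	// Assumption: core is a constant label on every descriptor.
+	cl := prometheus.Labels{"core": core}
+	return &webrtcCollector{
+		source: source,
+		activeStreamsDesc: prometheus.NewDesc("dragonfork_webrtc_active_streams",
+			"Streams currently registered.", nil, cl),
+		activePeersDesc: prometheus.NewDesc("dragonfork_webrtc_active_peers",
+			"Currently subscribed WHEP peers per stream.", []string{"stream_id"}, cl),
+		udpPortsInUseDesc: prometheus.NewDesc("dragonfork_webrtc_udp_ports_in_use",
+			"UDP ports currently allocated from the pool.", nil, cl),
+		udpPortsAvailableDesc: prometheus.NewDesc("dragonfork_webrtc_udp_ports_available",
+			"Pool size minus in-use.", nil, cl),
+	}
+}
+
+func (c *webrtcCollector) Describe(ch chan<- *prometheus.Desc) {
+	ch <- c.activeStreamsDesc
+	ch <- c.activePeersDesc
+	ch <- c.udpPortsInUseDesc
+	ch <- c.udpPortsAvailableDesc
+}
+
+func (c *webrtcCollector) Collect(ch chan<- prometheus.Metric) {
+	st := c.source.Stats() // one snapshot per scrape
+	ch <- prometheus.MustNewConstMetric(c.activeStreamsDesc, prometheus.GaugeValue, float64(st.StreamCount))
+	for id, n := range st.PeersByStream {
+		ch <- prometheus.MustNewConstMetric(c.activePeersDesc, prometheus.GaugeValue, float64(n), id)
+	}
+	ch <- prometheus.MustNewConstMetric(c.udpPortsInUseDesc, prometheus.GaugeValue, float64(st.UDPPortsInUse))
+	ch <- prometheus.MustNewConstMetric(c.udpPortsAvailableDesc, prometheus.GaugeValue, float64(st.UDPPortsAvailable))
+}
+
+// fakeSource is a hypothetical stand-in for the WebRTC subsystem.
+type fakeSource struct{ stats WebRTCStats }
+
+func (f fakeSource) Stats() WebRTCStats { return f.stats }
+
+func main() {
+	c := NewWebRTCCollector("test", fakeSource{WebRTCStats{
+		StreamCount:       2,
+		PeersByStream:     map[string]int{"live": 3, "cam": 1},
+		UDPPortsInUse:     4,
+		UDPPortsAvailable: 96,
+	}})
+	// One sample per unlabelled gauge plus one per stream.
+	fmt.Println("samples collected:", testutil.CollectAndCount(c))
+}
+```
+
+Taking the snapshot once per `Collect` call keeps the four gauges mutually
+consistent within a scrape, which is the main reason to prefer this shape
+over four independent `GaugeFunc` callbacks.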
+
+### `app/webrtc/subsystem.go` — `Stats()` method
+
+```go
+// coreprom aliases github.com/datarhei/core/v16/prometheus (see Open Decisions).
+func (s *Subsystem) Stats() coreprom.WebRTCStats {
+    s.mu.Lock()
+    defer s.mu.Unlock()
+    peers := make(map[string]int, len(s.streams))
+    for id, st := range s.streams {
+        peers[id] = len(st.peers)  // assumes peers are tracked per stream
+    }
+    return coreprom.WebRTCStats{
+        StreamCount:       len(s.streams),
+        PeersByStream:     peers,
+        UDPPortsInUse:     s.portAlloc.InUse(),
+        UDPPortsAvailable: s.portAlloc.Available(),
+    }
+}
+```
+
+The existing subsystem tracks streams in `s.streams` under `s.mu`. Peer
+count per stream needs the per-stream peer index that already exists in
+`handler.go` — the `Stats()` method consults it via the existing teardown
+hook plumbing or a small new accessor on `Handler`. Pick whichever surface
+introduces the smaller blast radius at implementation time.
+
+## Metric Inventory
+
+Eleven metrics. Eight new label dimensions across them (plus the shared
+`core` label). Under 500 active series at 5-stream scale; see Cardinality
+budget.
+
+### Direct instrumentation (`app/webrtc/metrics.go`)
+
+| Name | Type | Labels | Description |
+|---|---|---|---|
+| `dragonfork_webrtc_whep_requests_total` | Counter | core, route, code, stream_id | Count of WHEP requests by route+status code. |
+| `dragonfork_webrtc_whep_request_duration_seconds` | Histogram | core, route, stream_id | Server-side WHEP request duration. Buckets: `[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]`. |
+| `dragonfork_webrtc_ice_establishment_duration_seconds` | Histogram | core, stream_id, result | Time from `SetLocalDescription` to first `connected` or `failed` ICE state. Same buckets. |
+| `dragonfork_webrtc_ice_failures_total` | Counter | core, stream_id, reason | ICE failure count. `reason` ∈ {timeout, disconnected, failed}. |
+| `dragonfork_webrtc_codec_mismatches_total` | Counter | core, stream_id, kind | 406 rejections by kind. `kind` ∈ {video, audio}. |
+| `dragonfork_webrtc_cap_rejections_total` | Counter | core, stream_id, scope | 503 rejections. `scope` ∈ {global, stream}. |
+| `dragonfork_webrtc_ffmpeg_leg_failures_total` | Counter | core, stream_id, leg | RTP output leg failures. `leg` ∈ {video, audio}. |
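+
+The ICE establishment histogram is meant to record one observation per
+attempt, at the first `connected` or `failed` state. A minimal standard-library
+sketch of that "first terminal state wins" rule follows. `iceTimer`,
+`newICETimer`, and the `observe` callback are hypothetical names; in
+`app/webrtc` the callback would presumably wrap
+`iceEstablishment.WithLabelValues(streamID, result).Observe(seconds)`.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+	"time"
+)
+
+// iceTimer measures one establishment attempt.
+type iceTimer struct {
+	start   time.Time
+	once    sync.Once
+	observe func(result string, seconds float64)
+}
+
+func newICETimer(start time.Time, observe func(string, float64)) *iceTimer {
+	return &iceTimer{start: start, observe: observe}
+}
+
+// onState is meant to be called from the connection state callback. Only the
+// first "connected" or "failed" state produces an observation.
+func (t *iceTimer) onState(state string, now time.Time) {
+	if state != "connected" && state != "failed" {
+		return
+	}
+	t.once.Do(func() {
+		t.observe(state, now.Sub(t.start).Seconds())
+	})
+}
+
+func main() {
+	start := time.Now()
+	timer := newICETimer(start, func(result string, seconds float64) {
+		fmt.Printf("result=%s seconds=%.1f\n", result, seconds)
+	})
+	timer.onState("checking", start.Add(500*time.Millisecond)) // ignored: not terminal
+	timer.onState("connected", start.Add(2*time.Second))       // recorded
+	timer.onState("failed", start.Add(9*time.Second))          // ignored: already recorded
+}
+```
+
+Guarding with `sync.Once` keeps a later `failed` transition on an already
+connected peer out of the establishment histogram, so the `result` label
+reflects how the attempt ended rather than how the session ended.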
+
+### Snapshot collector (`prometheus/webrtc.go`)
+
+| Name | Type | Labels | Description |
+|---|---|---|---|
+| `dragonfork_webrtc_active_streams` | Gauge | core | Streams currently registered (processes with `webrtc.enabled=true` running). |
+| `dragonfork_webrtc_active_peers` | Gauge | core, stream_id | Currently subscribed WHEP peers per stream. |
+| `dragonfork_webrtc_udp_ports_in_use` | Gauge | core | UDP ports currently allocated from the pool. |
+| `dragonfork_webrtc_udp_ports_available` | Gauge | core | Pool size minus in-use (explicit for alert friendliness). |
+
+### Label rationale
+
+- `whep_request_duration_seconds` deliberately omits `code` — separating
+  distributions per outcome makes p95 noisy, and per-route per-stream p95
+  is what an operator actually looks at. Errors get visibility through the
+  request-counter ratio.
+- `ice_establishment_duration_seconds` includes both `connected` and
+  `failed` results in the same histogram via the `result` label —
+  intentionally — so the dashboard can compare success latency to
+  failure-tail latency on the same axis.
+- `cap_rejections_total` keeps the `scope` label because v0.1's response
+  body already splits global vs per-stream rejections; metrics mirror that
+  distinction so the dashboard shows whether to raise `max_peers_total`
+  or just one stream's per-stream cap.
+- `ffmpeg_leg_failures_total` is the "quietly degrading" canary — a silent
+  RTP-output-leg failure (port bind, encoder crash) is exactly what the
+  "is it healthy?" framing is meant to catch.
+
+### Cardinality budget
+
+At typical scale (5 streams, 3 routes, ~6 status codes seen in practice):
+
+- `whep_requests_total`: 5 × 3 × 6 = 90 series (worst case)
+- `whep_request_duration_seconds`: 5 × 3 × (8 buckets + `+Inf` + sum + count) = 165 series
+- `ice_establishment_duration_seconds`: 5 × 2 × 11 = 110 series
+- All others: 5–15 series each
+- **Total: <500 active series at 5-stream sustained load**
+
+Well within Prometheus's comfort zone. At 15s scrape interval × 15-day
+retention, on-disk storage ~80MB.
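+
+The budget can be reproduced with a few lines of arithmetic. This sketch
+counts the implicit `+Inf` bucket that every Prometheus histogram exports in
+addition to its configured buckets. The 1-2 bytes per sample figure is the
+average Prometheus documents for its TSDB and is an assumption for this
+deployment, not a measurement.
+
+```go
+package main
+
+import "fmt"
+
+// histogramSeries returns the series one histogram exports: per label
+// combination, one _bucket series per configured bucket, the implicit +Inf
+// bucket, and the _sum and _count series.
+func histogramSeries(labelCombos, buckets int) int {
+	return labelCombos * (buckets + 1 + 2)
+}
+
+// retainedSamples returns how many samples the TSDB holds at steady state.
+func retainedSamples(series, scrapeSeconds, retentionDays int) int {
+	return series * (retentionDays * 86400 / scrapeSeconds)
+}
+
+func main() {
+	const streams, routes, codes, results, buckets = 5, 3, 6, 2, 8
+
+	fmt.Println("whep_requests_total:", streams*routes*codes)
+	fmt.Println("whep_request_duration_seconds:", histogramSeries(streams*routes, buckets))
+	fmt.Println("ice_establishment_duration_seconds:", histogramSeries(streams*results, buckets))
+
+	samples := retainedSamples(500, 15, 15) // 500-series ceiling, 15s scrape, 15d retention
+	fmt.Printf("retained samples: %d (about %d to %d MB at 1-2 bytes/sample)\n",
+		samples, samples/1_000_000, 2*samples/1_000_000)
+}
+```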
+
+### Specifically excluded metrics
+
+- **Per-peer session metrics.** Listed under non-goals.
+- **Bytes-out / bandwidth.** Pion exposes RTP write bytes via stats; would
+  be useful but pulls peer-level state. Defer to a future v0.3 spec
+  ("WebRTC bandwidth observability") if needed.
+- **Server-hop latency (FFmpeg → peer).** Microsecond scale, already
+  covered by `-tags latency` test gate, would need its own bucket set.
+
+## Deploy Bundle
+
+### `deploy/truenas/core/docker-compose.yml` additions
+
+Two new services on a new bridge network `dragonfork-mon`. Core continues
+on `network_mode: host` unchanged. Prometheus reaches Core via
+`host.docker.internal:${CORE_HTTP_PORT}` (Linux Docker resolves this when
+`extra_hosts: ["host.docker.internal:host-gateway"]` is set on the service).
+Grafana only talks to Prometheus over `dragonfork-mon`.
+
+```yaml
+services:
+  core:
+    # ... existing definition unchanged
+
+  prom:
+    image: prom/prometheus:v2.55.0
+    container_name: dragonfork-prom
+    restart: unless-stopped
+    networks: [dragonfork-mon]
+    extra_hosts:
+      - "host.docker.internal:host-gateway"
+    volumes:
+      - ./prom/prometheus.yml:/etc/prometheus/prometheus.yml:ro
+      - ./prom/rules:/etc/prometheus/rules:ro
+      - ./prom-data:/prometheus
+    command:
+      - --config.file=/etc/prometheus/prometheus.yml
+      - --storage.tsdb.retention.time=15d
+      - --storage.tsdb.path=/prometheus
+      - --web.console.libraries=/usr/share/prometheus/console_libraries
+      - --web.console.templates=/usr/share/prometheus/consoles
+    ports:
+      - "${PROM_PORT:-9090}:9090"
+
+  grafana:
+    image: grafana/grafana-oss:11.3.0
+    container_name: dragonfork-grafana
+    restart: unless-stopped
+    networks: [dragonfork-mon]
+    depends_on: [prom]
+    environment:
+      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:?set in .env}"
+      GF_USERS_ALLOW_SIGN_UP: "false"
+      GF_AUTH_ANONYMOUS_ENABLED: "false"
+    volumes:
+      - ./grafana/provisioning:/etc/grafana/provisioning:ro
+      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
+      - ./grafana-data:/var/lib/grafana
+    ports:
+      - "${GRAFANA_PORT:-3000}:3000"
+
+networks:
+  dragonfork-mon:
+    driver: bridge
+```
+
+### `prom/prometheus.yml`
+
+```yaml
+global:
+  scrape_interval: 15s
+  scrape_timeout: 10s
+  evaluation_interval: 15s
+  external_labels:
+    core: dragonfork-truenas
+
+rule_files:
+  - /etc/prometheus/rules/*.yml
+
+scrape_configs:
+  - job_name: dragonfork-core
+    static_configs:
+      # Keep in sync with CORE_HTTP_PORT; prometheus.yml does not expand env vars.
+      - targets: ["host.docker.internal:8080"]
+    metrics_path: /metrics
+    # If API auth is enabled on /metrics, uncomment and provide creds via
+    # env-substituted file. v0.1 leaves /metrics public by default.
+    # basic_auth:
+    #   username_file: /run/secrets/prom_basic_user
+    #   password_file: /run/secrets/prom_basic_pass
+```
+
+### `prom/rules/webrtc-alerts.yml`
+
+```yaml
+groups:
+  - name: dragonfork-webrtc
+    rules:
+      - alert: WebRTCWHEPErrorRateHigh
+        expr: |
+          sum by (stream_id) (
+            rate(dragonfork_webrtc_whep_requests_total{code=~"4..|5.."}[5m])
+          ) > 0.5
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "WHEP error rate high on stream {{ $labels.stream_id }}"
+          description: "Sustained 4xx/5xx rate >0.5/sec for 5m."
+
+      - alert: WebRTCICEEstablishmentSlow
+        expr: |
+          histogram_quantile(0.95,
+            sum by (le, stream_id) (
+              rate(dragonfork_webrtc_ice_establishment_duration_seconds_bucket[10m])
+            )
+          ) > 3
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "ICE establishment p95 >3s on {{ $labels.stream_id }}"
+
+      - alert: WebRTCICEFailureRateHigh
+        expr: |
+          sum by (stream_id) (rate(dragonfork_webrtc_ice_failures_total[5m])) > 0.2
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "ICE failures sustained on {{ $labels.stream_id }}"
+
+      - alert: WebRTCFFmpegLegFailure
+        expr: |
+          increase(dragonfork_webrtc_ffmpeg_leg_failures_total[5m]) > 0
+        labels:
+          severity: critical
+        annotations:
+          summary: "FFmpeg RTP leg failed on {{ $labels.stream_id }} ({{ $labels.leg }})"
+          description: "Silent degradation of RTP output. Check FFmpeg logs."
+```
+
+Alerts evaluate but route nowhere. Alertmanager bundling deferred — see
+non-goals.
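+
+One caveat for `WebRTCFFmpegLegFailure`: `increase()` needs at least two
+samples of a series inside the window. A `CounterVec` child that is created
+by its first `Inc()` first appears at value 1, so the very first failure on
+a stream may never satisfy `increase(...) > 0`. A possible mitigation,
+sketched here under the assumption that the subsystem has a hook when a
+stream registers, is to create the label sets at zero up front.
+`initLegSeries` is a hypothetical helper name.
+
+```go
+package main
+
+import (
+	"fmt"
+
+	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/testutil"
+)
+
+// initLegSeries is a hypothetical helper, called when a stream registers.
+// Add(0) creates the child series at zero without counting a failure.
+func initLegSeries(failures *prometheus.CounterVec, streamID string) {
+	for _, leg := range []string{"video", "audio"} {
+		failures.WithLabelValues(streamID, leg).Add(0)
+	}
+}
+
+func main() {
+	failures := prometheus.NewCounterVec(prometheus.CounterOpts{
+		Name: "dragonfork_webrtc_ffmpeg_leg_failures_total",
+		Help: "RTP output leg failures.",
+	}, []string{"stream_id", "leg"})
+
+	initLegSeries(failures, "live")
+	fmt.Println("series exported before any failure:", testutil.CollectAndCount(failures))
+
+	failures.WithLabelValues("live", "video").Inc()
+	fmt.Println("video failures:", testutil.ToFloat64(failures.WithLabelValues("live", "video")))
+}
+```
+
+The same pre-initialization would also make the labelled families visible
+on `/metrics` before any traffic arrives, at the cost of a few idle series
+per stream.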
+
+### Grafana provisioning
+
+Datasource provisioning at `grafana/provisioning/datasources/prometheus.yml`:
+
+```yaml
+apiVersion: 1
+datasources:
+  - name: Prometheus
+    type: prometheus
+    access: proxy
+    url: http://prom:9090
+    isDefault: true
+    editable: false
+```
+
+Dashboard provisioning at `grafana/provisioning/dashboards/webrtc.yml`:
+
+```yaml
+apiVersion: 1
+providers:
+  - name: dragonfork
+    orgId: 1
+    folder: "Dragon Fork"
+    type: file
+    disableDeletion: false
+    updateIntervalSeconds: 30
+    options:
+      path: /var/lib/grafana/dashboards
+```
+
+### Dashboard JSON: `dragonfork-webrtc-health.json`
+
+Single dashboard, five rows aligned to the questions from the metric
+inventory:
+
+1. **WHEP API health** — request rate by route (stat panel), error rate
+   stacked by code (timeseries), p95 request duration by route (timeseries).
+2. **ICE establishment** — success/failure rate (gauge), p50/p95
+   establishment duration (timeseries with a 3s threshold line for the
+   alert), failure breakdown by reason (table).
+3. **What's flowing** — `active_streams` (stat), `active_peers` per stream
+   (timeseries), top 5 streams by peer count (table).
+4. **Capacity headroom** — `udp_ports_available` (gauge with red-zone <10),
+   cap rejection rate by scope (timeseries).
+5. **Silent degradation** — FFmpeg leg failure timeline (timeseries with
+   annotations), codec mismatch counter (stat).
+
+Built in Grafana 11.3, exported as JSON, committed to the repo. Refresh
+default 30s.
+
+### `.env` template additions
+
+Append to `deploy/truenas/core/README.md`'s example `.env`:
+
+```sh
+# --- Observability (added in v0.2) ---
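+# Generate once with `openssl rand -base64 24` and paste the result here.
+# Compose reads .env literally and does not run $(...) command substitution.
+GRAFANA_ADMIN_PASSWORD=<paste generated value>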
+GRAFANA_PORT=3000
+PROM_PORT=9090
+```
+
+## Testing
+
+### Unit tests — `prometheus/webrtc_test.go`
+
+Mock `WebRTCStatsSource`. Drive the collector through three states (no
+streams, one stream with N peers, multiple streams). Use
+`testutil.CollectAndCompare` to assert exact metric/label/value output
+against a golden plaintext fixture.
+
+```go
+// Golden fixture (excerpt):
+// # HELP dragonfork_webrtc_active_streams ...
+// # TYPE dragonfork_webrtc_active_streams gauge
+// dragonfork_webrtc_active_streams{core="test"} 2
+// # HELP dragonfork_webrtc_active_peers ...
+// # TYPE dragonfork_webrtc_active_peers gauge
+// dragonfork_webrtc_active_peers{core="test",stream_id="live"} 3
+// dragonfork_webrtc_active_peers{core="test",stream_id="cam"} 1
+```
+
+### Unit tests — `app/webrtc/metrics_test.go`
+
+Reuse `handler_test.go` setup (fake registry, in-process Echo router).
+Hit each WHEP route, then assert the counter increment via
+`testutil.ToFloat64` and the histogram sample count via the gathered
+metric family (`ToFloat64` accepts only counters, gauges, and untyped
+metrics). Drive forced error paths (unknown stream → 404, codec-less SDP →
+406, cap exceeded → 503, ICE timeout → 504) and assert that the matching
+error counters increased.
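+
+A sketch of the two assertion styles, written as a standalone program rather
+than the real `metrics_test.go`. `testutil.ToFloat64` reads a single counter
+child, while the histogram's observation count is read from the gathered
+metric family. `histogramSampleCount` is a hypothetical helper, and the label
+values (`whep_post`, `missing`) are illustrative only.
+
+```go
+package main
+
+import (
+	"fmt"
+
+	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/testutil"
+)
+
+// histogramSampleCount sums observations across all children of one family.
+func histogramSampleCount(g prometheus.Gatherer, family string) (uint64, error) {
+	mfs, err := g.Gather()
+	if err != nil {
+		return 0, err
+	}
+	var n uint64
+	for _, mf := range mfs {
+		if mf.GetName() != family {
+			continue
+		}
+		for _, m := range mf.GetMetric() {
+			n += m.GetHistogram().GetSampleCount()
+		}
+	}
+	return n, nil
+}
+
+func main() {
+	reg := prometheus.NewRegistry()
+	requests := prometheus.NewCounterVec(prometheus.CounterOpts{
+		Name: "dragonfork_webrtc_whep_requests_total",
+		Help: "Count of WHEP requests by route, status code, and stream.",
+	}, []string{"route", "code", "stream_id"})
+	duration := prometheus.NewHistogramVec(prometheus.HistogramOpts{
+		Name:    "dragonfork_webrtc_whep_request_duration_seconds",
+		Help:    "Server-side WHEP request duration.",
+		Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
+	}, []string{"route", "stream_id"})
+	reg.MustRegister(requests, duration)
+
+	// Stand-in for one handled request against an unknown stream.
+	requests.WithLabelValues("whep_post", "404", "missing").Inc()
+	duration.WithLabelValues("whep_post", "missing").Observe(0.02)
+
+	got := testutil.ToFloat64(requests.WithLabelValues("whep_post", "404", "missing"))
+	n, err := histogramSampleCount(reg, "dragonfork_webrtc_whep_request_duration_seconds")
+	fmt.Println("counter:", got, "histogram samples:", n, "err:", err)
+}
+```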
+
+### Integration verification — `test/TESTING.md`
+
+New section "Verifying Prometheus metrics":
+
+```
+1. docker compose up -d
+2. curl -s http://<host>:8080/metrics | grep dragonfork_webrtc_
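+   - expect: the unlabelled gauges (`active_streams`, `udp_ports_*`) at once, all
+     with `core="dragonfork-truenas"`; families labelled by `stream_id` appear
+     only after their first sample, so re-check for all 11 after step 5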
+3. Open http://<host>:3000 (Grafana), log in with GRAFANA_ADMIN_PASSWORD
+4. Navigate to Dashboards → Dragon Fork → WebRTC Health
+   - expect: all 5 rows render, no "no data" panels except where stream traffic is absent
+5. Trigger one of each error in test/whep-player.html (intentional codec
+   mismatch via SDP edit, kill the publisher mid-stream, etc.)
+6. Watch the Grafana panels and verify counters tick within 15s.
+```
+
+### CI
+
+Existing test runner picks up the new `_test.go` files. No new CI gates
+beyond standard build+test — observability isn't a contract; the unit
+tests verify shape only. Grafana dashboard JSON is *not* validated in CI
+(no good lightweight validator); manual verification only.
+
+### Load test alignment
+
+The deferred 5-peer × 10-min load test (separate spec) will use this
+dashboard as its primary observation surface. Recording rules for the
+load test's specific aggregations can be added in that spec without
+touching this one.
+
+## Rollout
+
+The TrueNAS v0.1.0-dragonfork deploy upgrades via:
+
+```sh
+cd deploy/truenas/core
+git pull                          # latest main with this change
+# Add new lines to .env (see template above)
+docker compose pull               # grabs prom + grafana images
+docker compose up -d              # core unchanged, prom + grafana new
+```
+
+Core continues on host networking. Prometheus reaches it at
+`host.docker.internal`, which `extra_hosts` maps to `host-gateway`; no
+firewall changes are required for intra-host traffic. External Grafana
+access is on `${GRAFANA_PORT}`.
+
+### Backwards compatibility
+
+- No upstream metric names or labels modified. New metrics are purely
+  additive in `dragonfork_webrtc_*` namespace.
+- No API changes. `/metrics` payload grows but stays well-formed
+  Prometheus exposition.
+- Existing config, env vars, and process JSON formats unchanged.
+
+### Forward compatibility
+
+- The `core` label being a `ConstLabels` value (not a per-event dimension)
+  means future federated multi-Core scrapes will distinguish series cleanly
+  by setting `core="dragonfork-truenas-east"` etc. in each deploy's config
+  loader. Spec'd here, implemented when needed.
+- New metrics in this spec follow the `dragonfork_<subsystem>_<noun>` naming
+  pattern. Future Dragon-Fork-specific metrics (WHIP, keyframe cache,
+  bandwidth) should adopt the same convention.
+
+### Known gaps post-rollout
+
+- No paging. Alerts evaluate, no Alertmanager. If `WebRTCFFmpegLegFailure`
+  fires at 3am, no notification — operator notices at next dashboard check.
+  Acceptable for v0.2 single-operator deploy. Track as a v0.3 spec.
+- Grafana dashboard JSON is hand-edited via Grafana UI then re-exported.
+  No JSON-as-code library used. If dashboard maintenance gets painful,
+  Grafonnet/Grafana-as-code is a v0.3+ refactor.
+- `/metrics` itself is unauthenticated by default in v0.1 (matches
+  upstream). If Core's deploy bundle is exposed to untrusted networks,
+  the operator should already be using auth on Core's HTTP listener. Not
+  this spec's problem to solve, but worth a one-line note in
+  `deploy/truenas/core/README.md`.
+
+## Open Decisions
+
+1. **Should the `Stats()` method live on `Subsystem` or on `Handler`?**
+   The peer count is in `Handler`'s per-stream peer index; stream count
+   is in `Subsystem`'s registry; UDP port pool is in `portalloc`. Easiest
+   shape: `Subsystem.Stats()` is the public surface and internally
+   gathers from `Handler` (via the existing teardown-hook plumbing) and
+   `portalloc`. Decide at implementation time based on which surface
+   exposes the cleanest seams.
+
+2. **Should histograms also include a `core` label, given it's already a
+   `ConstLabels`?** Yes — `ConstLabels` is automatically present on every
+   sample, no per-call overhead, and federations need it.
+
+3. **Should Prometheus retention be configurable via `.env`?** Defaulting
+   to 15d covers the realistic window for "what happened last week?"
+   queries. Adding `PROM_RETENTION_DAYS=15d` to `.env` and interpolating it
+   into `--storage.tsdb.retention.time` is a small change (the compose
+   sketch above still hardcodes `15d`). Include it as optional, defaulting
+   to 15d.
+
+4. **Import-alias collision.** The local package is `package prometheus`
+   (at `github.com/datarhei/core/v16/prometheus`) and `client_golang` is
+   also `package prometheus`. Files in `app/webrtc/` that need both must
+   alias one — convention is `coreprom "github.com/datarhei/core/v16/prometheus"`.
+   Implementation note only; doesn't change the design.
+
+## References
+
+- [Prometheus client_golang](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus)
+- [Prometheus instrumentation best practices](https://prometheus.io/docs/practices/instrumentation/)
+- [Histogram bucket design](https://prometheus.io/docs/practices/histograms/)
+- [Grafana provisioning docs](https://grafana.com/docs/grafana/latest/administration/provisioning/)
+- v0.1 design: `docs/design/2026-04-16-datarhei-dragon-fork-webrtc-design.md`
+- M2 integration: `docs/design/2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`