# Datarhei - Dragon Fork: WebRTC Prometheus Metrics

**Status:** Draft for review
**Author:** Zac (Wild Dragon)
**Date:** 2026-05-03
**Predecessors:**
- [`2026-04-16-datarhei-dragon-fork-webrtc-design.md`](2026-04-16-datarhei-dragon-fork-webrtc-design.md)
- [`2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`](2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md)
- v0.1.0-dragonfork released 2026-05-03

---

## Summary

Add Prometheus instrumentation to Dragon Fork's WebRTC subsystem and ship a
collection-and-dashboard stack in the existing TrueNAS deploy bundle. Closes
the v0.1 observability gap: the WHEP egress has been running in production
since 2026-04-17 with zero per-subsystem signal.

The deliverable is a RED-method dashboard ("rate, errors, duration") that
answers a single operator question — _is the WebRTC stack healthy right now?_
Eleven new metrics in the `dragonfork_webrtc_*` namespace, two new containers
(Prometheus + Grafana) in `deploy/truenas/core/`, four pre-loaded alert rules,
one pre-provisioned dashboard.

## Goals

- Operator can answer "is WebRTC healthy right now?" from a single Grafana
  dashboard, without tailing logs or hitting the API.
- Per-stream drill-down available when the dashboard goes red — labels carry
  `stream_id` everywhere it's meaningful, never `peer_id`.
- Deploy is one-command on a fresh TrueNAS box (`docker compose up -d`),
  matching the existing v0.1 deploy ergonomics.
- Backwards-compatible: zero changes to upstream's `/metrics` payload. New
  metrics are purely additive.
- Bucket choices and label sets are tuned for the realistic latency ranges
  observed in v0.1 (server-hop p95 ≈ 240µs, ICE establishment seconds-scale).

## Non-Goals

- **Alertmanager bundling.** Alert rules are loaded into Prometheus but not
  routed. Paging configuration is too opinionated to ship a default; separate
  spec if/when paging is wanted.
- **Per-peer metric labels.** Peer-level forensics (individual session
  lifetimes, per-resource teardown reasons) is out of scope. `peer_id` is
  unbounded under churn and risks cardinality bloat.
- **Federated multi-Core scrape.** Single-deploy scrape config only. The
  `core` label is set statically to `dragonfork-truenas`.
- **Latency p95 CI gate via Prometheus.** Server-hop latency stays a Go
  test gate (`-tags latency`); not a Prometheus histogram.
- **Server-hop microsecond histogram.** The 240µs server-hop is well below
  HTTP request scales and would need its own bucket set; it's already
  covered by the latency CI test, no need to duplicate in Prom.
- **Custom monitor/metric bus integration.** Upstream pulls from
  `monitor/metric.Reader`. We diverge — see Module Layout for rationale.

## Context

v0.1 surface area:

- WHEP HTTP routes: `POST /api/v3/whep/{id}`, `DELETE /api/v3/whep/{id}/{r}`,
  `PATCH /api/v3/whep/{id}/{r}`, plus admin `GET /api/v3/webrtc/streams`
  and `GET /api/v3/webrtc/streams/{id}/peers`.
- Error matrix in v0.1: `406` codec mismatch, `503` cap reached (split into
  global vs per-stream in response body), `504` ICE timeout, `204` DELETE
  idempotent, `404` unknown stream.
- Pion-mediated peer connection lifecycle in `app/webrtc/lifecycle.go` —
  ICE state transitions are the natural hook for ICE timing/failure metrics.
- FFmpeg RTP output legs supervised by the existing process supervisor;
  silent leg failure is a known "quietly degrading" risk worth instrumenting.

Existing Prometheus integration (upstream):

- `prometheus/prometheus.go` exposes a `Metrics` interface with `Register`
  and an `HTTPHandler()`. Single shared `prometheus.Registry`.
- `prometheus/restream.go` is the reference collector — pulls from
  `monitor/metric.Reader` via `metric.Pattern` queries, emits via
  `prometheus.MustNewConstMetric`. All upstream collectors carry a `core`
  label as the first dimension.
- `/metrics` endpoint already exposed by Core; auth handled at the same
  layer as the rest of the API.

## Approach

**Hybrid instrumentation, in two surfaces:**

1. **Direct `prometheus/client_golang` instrumentation** in `app/webrtc/`
   for hot-path counters and histograms (request rate, request duration,
   ICE establishment duration, error counters by reason). Histograms can't
   be reconstructed from a scrape-time snapshot, so this is non-negotiable
   for RED-method.

2. **Snapshot-style collector** in `prometheus/webrtc.go` for slow-changing
   gauges (active streams, active peers per stream, UDP port pool usage).
   Calls a new `Stats()` method on the WebRTC subsystem at scrape time.

Both surfaces register against the same `prometheus.Registerer` exposed by
`prometheus.Metrics`. No new HTTP endpoint, no new auth path. Both carry
`core` as their first label, matching the upstream collector convention.

### Why not pure snapshot?

Upstream's `prometheus/restream.go` pulls from a `monitor/metric` bus that
the FFmpeg supervision layer writes into. We could mirror that for WebRTC
— have `app/webrtc/lifecycle.go` and `handler.go` push events onto the bus,
have `prometheus/webrtc.go` pull them. Two reasons not to:

- **Histograms don't fit the pattern.** The bus stores point-in-time values
  (gauges and counters), not distributions. RED-method needs duration p50
  and p95; you'd end up maintaining an in-process sliding-window quantile
  estimator inside the WebRTC subsystem, which is more code than just using
  `client_golang.Histogram` directly.
- **The bus is FFmpeg-shaped.** `metric.Pattern` queries are designed for
  process-state metrics (process IDs, FFmpeg states). Bolting WebRTC
  semantics on requires defining new patterns the bus consumers all need
  to know about, for a payload only the WebRTC collector cares about.

The hybrid keeps each metric type on the cleanest path. The cost is two
patterns in the codebase instead of one — accepted, with a comment in
`prometheus/webrtc.go` pointing at this rationale so the next contributor
doesn't try to "fix" the divergence.

### Why not pure direct?

Pure `client_golang` everywhere would mean the gauges (active streams,
active peers, UDP ports) sit in `app/webrtc/` alongside histograms. Workable,
but loses the "one collector file per subsystem in `prometheus/`" pattern
that anyone reading the repo's existing structure would expect. Snapshot
gauges are cheap to implement via the existing pattern, so we keep them
where a casual reader would look.

## Module Layout

### New files

```
app/webrtc/metrics.go       (~150 LOC)
app/webrtc/metrics_test.go  (~200 LOC)
prometheus/webrtc.go        (~120 LOC)
prometheus/webrtc_test.go   (~150 LOC)
deploy/truenas/core/prom/prometheus.yml
deploy/truenas/core/prom/rules/webrtc-alerts.yml
deploy/truenas/core/grafana/provisioning/datasources/prometheus.yml
deploy/truenas/core/grafana/provisioning/dashboards/webrtc.yml
deploy/truenas/core/grafana/dashboards/dragonfork-webrtc-health.json
```

### Modified files

```
app/webrtc/handler.go       — add metric middleware around WHEP routes
app/webrtc/lifecycle.go     — record ICE timing in OnConnectionStateChange
app/webrtc/subsystem.go     — add Stats() method, instrument process hooks
deploy/truenas/core/docker-compose.yml  — add prom + grafana services
deploy/truenas/core/README.md           — document new env vars + ports
README.md                   — quick-start mentions Grafana URL
CHANGELOG.md                — v0.2.0-dragonfork section
```

### `app/webrtc/metrics.go` — direct instrumentation

`promauto`-registered into the shared registry. A single
`Init(reg prometheus.Registerer, core string)`, called from `subsystem.New`
once the registry is available, builds the `metrics` struct sketched below
and stores it in a package-level var, so `handler.go` and `lifecycle.go` can
increment without dependency injection.

```go
// Sketch — exact wire format finalized at implementation.
package webrtc

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var histBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

type metrics struct {
    whepRequests          *prometheus.CounterVec   // route, code, stream_id
    whepRequestDuration   *prometheus.HistogramVec // route, stream_id
    iceEstablishment      *prometheus.HistogramVec // stream_id, result
    iceFailures           *prometheus.CounterVec   // stream_id, reason
    codecMismatches       *prometheus.CounterVec   // stream_id, kind
    capRejections         *prometheus.CounterVec   // stream_id, scope
    ffmpegLegFailures     *prometheus.CounterVec   // stream_id, leg
}

func newMetrics(reg prometheus.Registerer, core string) *metrics {
    factory := promauto.With(reg)
    return &metrics{
        whepRequests: factory.NewCounterVec(prometheus.CounterOpts{
            Name:        "dragonfork_webrtc_whep_requests_total",
            Help:        "Count of WHEP requests by route, status code, and stream.",
            ConstLabels: prometheus.Labels{"core": core},
        }, []string{"route", "code", "stream_id"}),
        // ... etc
    }
}
```

The `core` label is a `ConstLabels` (set once at construction) rather than a
per-request dimension — matches the upstream collector pattern and avoids
threading it through every call site.

### `prometheus/webrtc.go` — snapshot collector

Standard `prometheus.Collector` interface (Describe / Collect). Keeps a
reference to a `WebRTCStatsSource` interface, which the WebRTC subsystem
implements via its `Stats()` method. Avoids importing `app/webrtc` from
`prometheus/` — the dependency arrow points the right way.

```go
// Sketch.
type WebRTCStatsSource interface {
    Stats() WebRTCStats
}

type WebRTCStats struct {
    StreamCount        int
    PeersByStream      map[string]int
    UDPPortsInUse      int
    UDPPortsAvailable  int
}

type webrtcCollector struct {
    core   string
    source WebRTCStatsSource

    activeStreamsDesc      *prometheus.Desc
    activePeersDesc        *prometheus.Desc
    udpPortsInUseDesc      *prometheus.Desc
    udpPortsAvailableDesc  *prometheus.Desc
}

func NewWebRTCCollector(core string, source WebRTCStatsSource) prometheus.Collector { ... }
```

The `WebRTCStats` type lives in `prometheus/webrtc.go` (not in `app/webrtc/`)
so the dependency stays one-directional: `app/webrtc` imports `prometheus/`
for the return type of `Stats()`, and `prometheus/` never imports
`app/webrtc`. The subsystem satisfies `WebRTCStatsSource` implicitly, with no
explicit declaration.
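To make the snapshot model concrete, the sketch below flattens a `WebRTCStats` value into the samples a single scrape would carry. It is a simplified, standard-library stand-in for `Collect`: the `sample` type and `flatten` function are hypothetical, and the port numbers in `main` are made up.

```go
package main

import (
	"fmt"
	"sort"
)

type WebRTCStats struct {
	StreamCount       int
	PeersByStream     map[string]int
	UDPPortsInUse     int
	UDPPortsAvailable int
}

type sample struct {
	name     string
	streamID string // empty for metrics without a stream_id label
	value    float64
}

// flatten lists the samples one scrape would emit for a snapshot, in a
// stable order. A real collector would send each one as a const metric.
func flatten(s WebRTCStats) []sample {
	out := []sample{{name: "dragonfork_webrtc_active_streams", value: float64(s.StreamCount)}}

	// Sort stream ids so output does not depend on map iteration order.
	ids := make([]string, 0, len(s.PeersByStream))
	for id := range s.PeersByStream {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		out = append(out, sample{"dragonfork_webrtc_active_peers", id, float64(s.PeersByStream[id])})
	}

	return append(out,
		sample{name: "dragonfork_webrtc_udp_ports_in_use", value: float64(s.UDPPortsInUse)},
		sample{name: "dragonfork_webrtc_udp_ports_available", value: float64(s.UDPPortsAvailable)},
	)
}

func main() {
	stats := WebRTCStats{
		StreamCount:       2,
		PeersByStream:     map[string]int{"live": 3, "cam": 1},
		UDPPortsInUse:     8,
		UDPPortsAvailable: 92,
	}
	for _, s := range flatten(stats) {
		fmt.Println(s.name, s.streamID, s.value)
	}
}
```

Because every scrape rebuilds the samples from the current map, a stream that is removed simply stops appearing. No explicit label cleanup is needed, which a long-lived `GaugeVec` would require.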

### `app/webrtc/subsystem.go` — `Stats()` method

```go
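// prometheus here is Core's own package (github.com/datarhei/core/v16/prometheus),
// not client_golang; see Open Decisions for the alias convention.
func (s *Subsystem) Stats() prometheus.WebRTCStats {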
    s.mu.Lock()
    defer s.mu.Unlock()
    peers := make(map[string]int, len(s.streams))
    for id, st := range s.streams {
        peers[id] = len(st.peers)  // assume peers tracked per-stream
    }
    return prometheus.WebRTCStats{
        StreamCount:       len(s.streams),
        PeersByStream:     peers,
        UDPPortsInUse:     s.portAlloc.InUse(),
        UDPPortsAvailable: s.portAlloc.Available(),
    }
}
```

The existing subsystem tracks streams in `s.streams` under `s.mu`. Peer
count per stream needs the per-stream peer index that already exists in
`handler.go` — the `Stats()` method consults it via the existing teardown
hook plumbing or a small new accessor on `Handler`. Pick whichever surface
introduces the smaller blast radius at implementation time.

## Metric Inventory

Eleven metrics. Eight new label dimensions across them. ~50 active series
at the low end of the typical 1-5 stream range; the worst case is worked
through under Cardinality budget.

### Direct instrumentation (`app/webrtc/metrics.go`)

| Name | Type | Labels | Description |
|---|---|---|---|
| `dragonfork_webrtc_whep_requests_total` | Counter | core, route, code, stream_id | Count of WHEP requests by route+status code. |
| `dragonfork_webrtc_whep_request_duration_seconds` | Histogram | core, route, stream_id | Server-side WHEP request duration. Buckets: `[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]`. |
| `dragonfork_webrtc_ice_establishment_duration_seconds` | Histogram | core, stream_id, result | Time from `SetLocalDescription` to first `connected` or `failed` ICE state. Same buckets. |
| `dragonfork_webrtc_ice_failures_total` | Counter | core, stream_id, reason | ICE failure count. `reason` ∈ {timeout, disconnected, failed}. |
| `dragonfork_webrtc_codec_mismatches_total` | Counter | core, stream_id, kind | 406 rejections by kind. `kind` ∈ {video, audio}. |
| `dragonfork_webrtc_cap_rejections_total` | Counter | core, stream_id, scope | 503 rejections. `scope` ∈ {global, stream}. |
| `dragonfork_webrtc_ffmpeg_leg_failures_total` | Counter | core, stream_id, leg | RTP output leg failures. `leg` ∈ {video, audio}. |

### Snapshot collector (`prometheus/webrtc.go`)

| Name | Type | Labels | Description |
|---|---|---|---|
| `dragonfork_webrtc_active_streams` | Gauge | core | Streams currently registered (processes with `webrtc.enabled=true` running). |
| `dragonfork_webrtc_active_peers` | Gauge | core, stream_id | Currently subscribed WHEP peers per stream. |
| `dragonfork_webrtc_udp_ports_in_use` | Gauge | core | UDP ports currently allocated from the pool. |
| `dragonfork_webrtc_udp_ports_available` | Gauge | core | Pool size minus in-use (explicit for alert friendliness). |

### Label rationale

- `whep_request_duration_seconds` deliberately omits `code` — separating
  distributions per outcome makes p95 noisy, and per-route per-stream p95
  is what an operator actually looks at. Errors get visibility through the
  request-counter ratio.
- `ice_establishment_duration_seconds` includes both `connected` and
  `failed` results in the same histogram via the `result` label —
  intentionally — so the dashboard can compare success latency to
  failure-tail latency on the same axis.
- `cap_rejections_total` keeps the `scope` label because v0.1's response
  body already splits global vs per-stream rejections; metrics mirror that
  distinction so the dashboard shows whether to raise `max_peers_total`
  or just one stream's per-stream cap.
- `ffmpeg_leg_failures_total` is the "quietly degrading" canary — a silent
  RTP-output-leg failure (port bind, encoder crash) is exactly what the
  "is it healthy?" framing is meant to catch.

### Cardinality budget

At typical scale (5 streams, 3 routes, ~6 status codes seen in practice):

- `whep_requests_total`: 5 × 3 × 6 = 90 series (worst case)
- `whep_request_duration_seconds`: 5 × 3 × (8 buckets + `+Inf` + sum + count) = 165 series
- `ice_establishment_duration_seconds`: 5 × 2 × 11 = 110 series
- All others: 5–15 series each
- **Total: <500 active series at 5-stream sustained load**

Well within Prometheus's comfort zone. At 15s scrape interval × 15-day
retention, on-disk storage ~80MB.
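The budget arithmetic can be checked mechanically. The sketch below counts the implicit `+Inf` bucket that `client_golang` adds to the configured bounds, so each histogram label combination contributes `buckets + 3` series. The storage line uses the 1-2 bytes per sample that Prometheus documents as its typical on-disk cost. `histogramSeries` is a hypothetical helper name.

```go
package main

import "fmt"

// histogramSeries returns the series one histogram family exposes: one
// _bucket series per configured bound, plus the implicit +Inf bucket,
// plus _sum and _count, for every label combination.
func histogramSeries(labelCombos, buckets int) int {
	return labelCombos * (buckets + 1 + 2)
}

func main() {
	const streams, routes, codes, iceResults, buckets = 5, 3, 6, 2, 8

	requests := streams * routes * codes
	duration := histogramSeries(streams*routes, buckets)
	ice := histogramSeries(streams*iceResults, buckets)
	fmt.Println(requests, duration, ice, requests+duration+ice) // 90 165 110 365

	// Storage for the full 500-series budget at a 15s scrape and 15d retention.
	const scrapeSeconds, retentionDays, budgetSeries = 15, 15, 500
	samples := budgetSeries * (retentionDays * 24 * 3600 / scrapeSeconds)
	fmt.Printf("%d samples, roughly %d-%d MB at 1-2 bytes per sample\n",
		samples, samples/1e6, 2*samples/1e6)
}
```

The three largest families sum to 365 series, which leaves room for the remaining counters and gauges under the 500-series budget.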

### Specifically excluded metrics

- **Per-peer session metrics.** Listed under non-goals.
- **Bytes-out / bandwidth.** Pion exposes RTP write bytes via stats; would
  be useful but pulls peer-level state. Defer to a future v0.3 spec
  ("WebRTC bandwidth observability") if needed.
- **Server-hop latency (FFmpeg → peer).** Microsecond scale, already
  covered by `-tags latency` test gate, would need its own bucket set.

## Deploy Bundle

### `deploy/truenas/core/docker-compose.yml` additions

Two new services on a new bridge network `dragonfork-mon`. Core continues
on `network_mode: host` unchanged. Prometheus reaches Core via
`host.docker.internal:${CORE_HTTP_PORT}` (Linux Docker resolves this when
`extra_hosts: ["host.docker.internal:host-gateway"]` is set on the service);
Grafana only talks to Prometheus over the bridge.

```yaml
services:
  core:
    # ... existing definition unchanged

  prom:
    image: prom/prometheus:v2.55.0
    container_name: dragonfork-prom
    restart: unless-stopped
    networks: [dragonfork-mon]
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./prom/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prom/rules:/etc/prometheus/rules:ro
      - ./prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=${PROM_RETENTION_DAYS:-15d}
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
    ports:
      - "${PROM_PORT:-9090}:9090"

  grafana:
    image: grafana/grafana-oss:11.3.0
    container_name: dragonfork-grafana
    restart: unless-stopped
    networks: [dragonfork-mon]
    depends_on: [prom]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:?set in .env}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
      - ./grafana-data:/var/lib/grafana
    ports:
      - "${GRAFANA_PORT:-3000}:3000"

networks:
  dragonfork-mon:
    driver: bridge
```

### `prom/prometheus.yml`

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    core: dragonfork-truenas

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: dragonfork-core
    static_configs:
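      # Must match CORE_HTTP_PORT; Prometheus does not expand environment
      # variables in this file, so the port is written out.
      - targets: ["host.docker.internal:8080"]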
    metrics_path: /metrics
    # If API auth is enabled on /metrics, uncomment and provide creds via
    # env-substituted file. v0.1 leaves /metrics public by default.
    # basic_auth:
    #   username_file: /run/secrets/prom_basic_user
    #   password_file: /run/secrets/prom_basic_pass
```

### `prom/rules/webrtc-alerts.yml`

```yaml
groups:
  - name: dragonfork-webrtc
    rules:
      - alert: WebRTCWHEPErrorRateHigh
        expr: |
          sum by (stream_id) (
            rate(dragonfork_webrtc_whep_requests_total{code=~"4..|5.."}[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WHEP error rate high on stream {{ $labels.stream_id }}"
          description: "Sustained 4xx/5xx rate >0.5/sec for 5m."

      - alert: WebRTCICEEstablishmentSlow
        expr: |
          histogram_quantile(0.95,
            sum by (le, stream_id) (
              rate(dragonfork_webrtc_ice_establishment_duration_seconds_bucket[10m])
            )
          ) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ICE establishment p95 >3s on {{ $labels.stream_id }}"

      - alert: WebRTCICEFailureRateHigh
        expr: |
          sum by (stream_id) (rate(dragonfork_webrtc_ice_failures_total[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ICE failures sustained on {{ $labels.stream_id }}"

      - alert: WebRTCFFmpegLegFailure
        expr: |
          increase(dragonfork_webrtc_ffmpeg_leg_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "FFmpeg RTP leg failed on {{ $labels.stream_id }} ({{ $labels.leg }})"
          description: "Silent degradation of RTP output. Check FFmpeg logs."
```

Alerts evaluate but route nowhere. Alertmanager bundling deferred — see
non-goals.
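One caveat for `WebRTCFFmpegLegFailure`: in `client_golang`, a `CounterVec` child is exposed only after `WithLabelValues` (or `With`) has been called for that label combination. If the first call is the `Inc()` on the first failure, the series is first scraped at 1 and `increase()` sees no change. The sketch below illustrates this with a deliberately simplified `simpleIncrease` (a hypothetical helper, not PromQL's exact algorithm).

```go
package main

import "fmt"

// simpleIncrease approximates PromQL increase() over the scraped samples in
// a window: last value minus first value. It ignores counter resets and the
// extrapolation Prometheus applies, and returns 0 where Prometheus would
// return no result; neither changes the point made here.
func simpleIncrease(samples []float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	return samples[len(samples)-1] - samples[0]
}

func main() {
	// Series created by the first Inc(): it is first scraped at 1.
	lazy := []float64{1, 1, 1}
	// Series created at 0 when the stream registers, then incremented once.
	preinit := []float64{0, 0, 1, 1}

	fmt.Println(simpleIncrease(lazy) > 0)    // false: the first failure is invisible
	fmt.Println(simpleIncrease(preinit) > 0) // true
}
```

Calling `WithLabelValues(streamID, leg)` for both legs when a stream registers, without incrementing, is the usual way to export the zero-valued child up front. Whether the subsystem has a convenient hook for that is left to implementation.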

### Grafana provisioning

Datasource provisioning at `grafana/provisioning/datasources/prometheus.yml`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prom:9090
    isDefault: true
    editable: false
```

Dashboard provisioning at `grafana/provisioning/dashboards/webrtc.yml`:

```yaml
apiVersion: 1
providers:
  - name: dragonfork
    orgId: 1
    folder: "Dragon Fork"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```

### Dashboard JSON: `dragonfork-webrtc-health.json`

Single dashboard, five rows aligned to the questions from the metric
inventory:

1. **WHEP API health** — request rate by route (stat panel), error rate
   stacked by code (timeseries), p95 request duration by route (timeseries).
2. **ICE establishment** — success/failure rate (gauge), p50/p95
   establishment duration (timeseries with a 3s threshold line for the
   alert), failure breakdown by reason (table).
3. **What's flowing** — `active_streams` (stat), `active_peers` per stream
   (timeseries), top 5 streams by peer count (table).
4. **Capacity headroom** — `udp_ports_available` (gauge with red-zone <10),
   cap rejection rate by scope (timeseries).
5. **Silent degradation** — FFmpeg leg failure timeline (timeseries with
   annotations), codec mismatch counter (stat).

Built in Grafana 11.3, exported as JSON, committed to the repo. Refresh
default 30s.

### `.env` template additions

Append to `deploy/truenas/core/README.md`'s example `.env`:

```sh
# --- Observability (added in v0.2) ---
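# Generate once with `openssl rand -base64 24` and paste the result here;
# Compose does not run command substitution in .env files.
GRAFANA_ADMIN_PASSWORD=change-me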
GRAFANA_PORT=3000
PROM_PORT=9090
```

## Testing

### Unit tests — `prometheus/webrtc_test.go`

Mock `WebRTCStatsSource`. Drive the collector through three states (no
streams, one stream with N peers, multiple streams). Use
`testutil.CollectAndCompare` to assert exact metric/label/value output
against a golden plaintext fixture.

```go
// Golden fixture (excerpt):
// # HELP dragonfork_webrtc_active_streams ...
// # TYPE dragonfork_webrtc_active_streams gauge
// dragonfork_webrtc_active_streams{core="test"} 2
// # HELP dragonfork_webrtc_active_peers ...
// # TYPE dragonfork_webrtc_active_peers gauge
// dragonfork_webrtc_active_peers{core="test",stream_id="live"} 3
// dragonfork_webrtc_active_peers{core="test",stream_id="cam"} 1
```

### Unit tests — `app/webrtc/metrics_test.go`

Reuse `handler_test.go` setup (fake registry, in-process Echo router).
Hit each WHEP route and assert the expected counter increment via
`testutil.ToFloat64`. `ToFloat64` panics on histograms, so check those by
gathering from the registry and reading the histogram's sample count. Drive
forced error paths (unknown stream → 404, codec-less SDP → 406, cap
exceeded → 503, ICE timeout → 504) and assert that the matching error
counters were incremented.
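The forced error paths lend themselves to a table-driven test. The sketch below only lays out the table. Which counter family moves for each status is an assumption inferred from the Metric Inventory (in particular, that a `504` increments `ice_failures_total`), and `expectation` and `errorMatrix` are hypothetical names.

```go
package main

import "fmt"

// expectation is one row of a table-driven test: the forced error and the
// counter family expected to move in addition to whep_requests_total.
type expectation struct {
	scenario string
	status   int
	counter  string // empty when only whep_requests_total changes
}

// errorMatrix lists the forced error paths. The counter column is an
// assumption derived from the Metric Inventory, not confirmed behavior.
func errorMatrix() []expectation {
	return []expectation{
		{"unknown stream", 404, ""},
		{"codec-less SDP", 406, "dragonfork_webrtc_codec_mismatches_total"},
		{"cap exceeded", 503, "dragonfork_webrtc_cap_rejections_total"},
		{"ICE timeout", 504, "dragonfork_webrtc_ice_failures_total"},
	}
}

func main() {
	for _, e := range errorMatrix() {
		fmt.Printf("%-16s -> %d %s\n", e.scenario, e.status, e.counter)
	}
}
```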

### Integration verification — `test/TESTING.md`

New section "Verifying Prometheus metrics":

```
1. docker compose up -d
2. curl -s http://<host>:8080/metrics | grep dragonfork_webrtc_
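   - expect: the snapshot gauges present, all with `core="dragonfork-truenas"`;
     counter and histogram families appear only after their first observation,
     so all 11 show up once step 5 has generated traffic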
3. Open http://<host>:3000 (Grafana), log in with GRAFANA_ADMIN_PASSWORD
4. Navigate to Dashboards → Dragon Fork → WebRTC Health
   - expect: all 5 rows render, no "no data" panels except where stream traffic is absent
5. Trigger one of each error in test/whep-player.html (intentional codec
   mismatch via SDP edit, kill the publisher mid-stream, etc.)
6. Watch the Grafana panels and verify counters tick within one 15s scrape
   plus the dashboard's 30s refresh.
```

### CI

Existing test runner picks up the new `_test.go` files. No new CI gates
beyond standard build+test — observability isn't a contract; the unit
tests verify shape only. Grafana dashboard JSON is *not* validated in CI
(no good lightweight validator); manual verification only.

### Load test alignment

The deferred 5-peer × 10-min load test (separate spec) will use this
dashboard as its primary observation surface. Recording rules for the
load test's specific aggregations can be added in that spec without
touching this one.

## Rollout

The TrueNAS v0.1.0-dragonfork deploy upgrades via:

```sh
cd deploy/truenas/core
git pull                          # latest main with this change
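# Add new lines to .env (see template above)
mkdir -p prom-data grafana-data   # bind-mount targets must be writable
chown 65534:65534 prom-data       # prom/prometheus runs as nobody
chown 472:472 grafana-data        # grafana/grafana-oss runs as uid 472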
docker compose pull               # grabs prom + grafana images
docker compose up -d              # core unchanged, prom + grafana new
```

Core continues on host networking. Prometheus reaches it at
`host.docker.internal`, which the `extra_hosts` entry maps to the host
gateway, so no firewall changes are required for intra-host traffic.
External Grafana access is on `${GRAFANA_PORT}`.

### Backwards compatibility

- No upstream metric names or labels modified. New metrics are purely
  additive in `dragonfork_webrtc_*` namespace.
- No API changes. `/metrics` payload grows but stays well-formed
  Prometheus exposition.
- Existing config, env vars, and process JSON formats unchanged.

### Forward compatibility

- The `core` label being a `ConstLabels` value (not a per-event dimension)
  means future federated multi-Core scrapes will distinguish series cleanly
  by setting `core="dragonfork-truenas-east"` etc. in each deploy's config
  loader. Spec'd here, implemented when needed.
- New metrics in this spec follow the `dragonfork_<subsystem>_<noun>` naming
  pattern. Future Dragon-Fork-specific metrics (WHIP, keyframe cache,
  bandwidth) should adopt the same convention.

### Known gaps post-rollout

- No paging. Alerts evaluate, no Alertmanager. If `WebRTCFFmpegLegFailure`
  fires at 3am, no notification — operator notices at next dashboard check.
  Acceptable for v0.2 single-operator deploy. Track as a v0.3 spec.
- Grafana dashboard JSON is hand-edited via Grafana UI then re-exported.
  No JSON-as-code library used. If dashboard maintenance gets painful,
  Grafonnet/Grafana-as-code is a v0.3+ refactor.
- `/metrics` itself is unauthenticated by default in v0.1 (matches
  upstream). If Core's deploy bundle is exposed to untrusted networks,
  the operator should already be using auth on Core's HTTP listener. Not
  this spec's problem to solve, but worth a one-line note in
  `deploy/truenas/core/README.md`.

## Open Decisions

1. **Should the `Stats()` method live on `Subsystem` or on `Handler`?**
   The peer count is in `Handler`'s per-stream peer index; stream count
   is in `Subsystem`'s registry; UDP port pool is in `portalloc`. Easiest
   shape: `Subsystem.Stats()` is the public surface and internally
   gathers from `Handler` (via the existing teardown-hook plumbing) and
   `portalloc`. Decide at implementation time based on which surface
   exposes the cleanest seams.

2. **Should histograms also include a `core` label, given it's already a
   `ConstLabels`?** Yes — `ConstLabels` is automatically present on every
   sample, no per-call overhead, and federations need it.

3. **Should Prometheus retention be configurable via `.env`?** Defaulting
   to 15d covers the realistic window for "what happened last week?"
   queries. Adding `PROM_RETENTION_DAYS=15d` to `.env` is a one-line
   change. Including it as optional, defaulting to 15d.

4. **Import-alias collision.** The local package is `package prometheus`
   (at `github.com/datarhei/core/v16/prometheus`) and `client_golang` is
   also `package prometheus`. Files in `app/webrtc/` that need both must
   alias one — convention is `coreprom "github.com/datarhei/core/v16/prometheus"`.
   Implementation note only; doesn't change the design.

## References

- [Prometheus client_golang](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus)
- [Prometheus instrumentation best practices](https://prometheus.io/docs/practices/instrumentation/)
- [Histogram bucket design](https://prometheus.io/docs/practices/histograms/)
- [Grafana provisioning docs](https://grafana.com/docs/grafana/latest/administration/provisioning/)
- v0.1 design: `docs/design/2026-04-16-datarhei-dragon-fork-webrtc-design.md`
- M2 integration: `docs/design/2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`