diff --git a/docs/design/2026-05-03-datarhei-dragon-fork-webrtc-prometheus-metrics-design.md b/docs/design/2026-05-03-datarhei-dragon-fork-webrtc-prometheus-metrics-design.md new file mode 100644 index 0000000..ae4b697 --- /dev/null +++ b/docs/design/2026-05-03-datarhei-dragon-fork-webrtc-prometheus-metrics-design.md @@ -0,0 +1,666 @@ +# Datarhei - Dragon Fork: WebRTC Prometheus Metrics + +**Status:** Draft for review +**Author:** Zac (Wild Dragon) +**Date:** 2026-05-03 +**Predecessors:** +- [`2026-04-16-datarhei-dragon-fork-webrtc-design.md`](2026-04-16-datarhei-dragon-fork-webrtc-design.md) +- [`2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`](2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md) +- v0.1.0-dragonfork released 2026-05-03 + +--- + +## Summary + +Add Prometheus instrumentation to Dragon Fork's WebRTC subsystem and ship a +collection-and-dashboard stack in the existing TrueNAS deploy bundle. Closes +the v0.1 observability gap: the WHEP egress has been running in production +since 2026-04-17 with zero per-subsystem signal. + +The deliverable is a RED-method dashboard ("rate, errors, duration") that +answers a single operator question — _is the WebRTC stack healthy right now?_ +Eleven new metrics in the `dragonfork_webrtc_*` namespace, two new containers +(Prometheus + Grafana) in `deploy/truenas/core/`, four pre-loaded alert rules, +one pre-provisioned dashboard. + +## Goals + +- Operator can answer "is WebRTC healthy right now?" from a single Grafana + dashboard, without tailing logs or hitting the API. +- Per-stream drill-down available when the dashboard goes red — labels carry + `stream_id` everywhere it's meaningful, never `peer_id`. +- Deploy is one-command on a fresh TrueNAS box (`docker compose up -d`), + matching the existing v0.1 deploy ergonomics. +- Backwards-compatible: zero changes to upstream's `/metrics` payload. New + metrics are purely additive. 
+- Bucket choices and label sets are tuned for the realistic latency ranges + observed in v0.1 (server-hop p95 ≈ 240µs, ICE establishment seconds-scale). + +## Non-Goals + +- **Alertmanager bundling.** Alert rules are loaded into Prometheus but not + routed. Paging configuration is too opinionated to ship a default; separate + spec if/when paging is wanted. +- **Per-peer metric labels.** Peer-level forensics (individual session + lifetimes, per-resource teardown reasons) is out of scope. `peer_id` is + unbounded under churn and risks cardinality bloat. +- **Federated multi-Core scrape.** Single-deploy scrape config only. The + `core` label is set statically to `dragonfork-truenas`. +- **Latency p95 CI gate via Prometheus.** Server-hop latency stays a Go + test gate (`-tags latency`); not a Prometheus histogram. +- **Server-hop microsecond histogram.** The 240µs server-hop is well below + HTTP request scales and would need its own bucket set; it's already + covered by the latency CI test, no need to duplicate in Prom. +- **Custom monitor/metric bus integration.** Upstream pulls from + `monitor/metric.Reader`. We diverge — see Module Layout for rationale. + +## Context + +v0.1 surface area: + +- WHEP HTTP routes: `POST /api/v3/whep/{id}`, `DELETE /api/v3/whep/{id}/{r}`, + `PATCH /api/v3/whep/{id}/{r}`, plus admin `GET /api/v3/webrtc/streams` + and `GET /api/v3/webrtc/streams/{id}/peers`. +- Error matrix in v0.1: `406` codec mismatch, `503` cap reached (split into + global vs per-stream in response body), `504` ICE timeout, `204` DELETE + idempotent, `404` unknown stream. +- Pion-mediated peer connection lifecycle in `app/webrtc/lifecycle.go` — + ICE state transitions are the natural hook for ICE timing/failure metrics. +- FFmpeg RTP output legs supervised by the existing process supervisor; + silent leg failure is a known "quietly degrading" risk worth instrumenting. 
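As a rough illustration of how the v0.1 error matrix above could feed the error counters, the sketch below maps a WHEP response status to the `dragonfork_webrtc_*` counter that would record it alongside the general request counter. The function name `failureCounter` is hypothetical, and routing `504` to the ICE failure counter is an assumption drawn from the `504` ICE timeout entry, not a confirmed wiring.

```go
package main

import "net/http"

// failureCounter returns the name of the dedicated error counter for a
// WHEP response status, or ok=false when only the request counter applies.
// Hypothetical helper for illustration; not part of the existing codebase.
func failureCounter(status int) (metric string, ok bool) {
	switch status {
	case http.StatusNotAcceptable: // 406 codec mismatch
		return "dragonfork_webrtc_codec_mismatches_total", true
	case http.StatusServiceUnavailable: // 503 cap reached (global or per-stream)
		return "dragonfork_webrtc_cap_rejections_total", true
	case http.StatusGatewayTimeout: // 504 ICE timeout (assumed mapping)
		return "dragonfork_webrtc_ice_failures_total", true
	default: // 204, 404, and successes are visible via the request counter only
		return "", false
	}
}
```

Statuses without a dedicated counter, such as `404` for an unknown stream, would still be visible through the `code` label on the request counter.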
+ +Existing Prometheus integration (upstream): + +- `prometheus/prometheus.go` exposes a `Metrics` interface with `Register` + and an `HTTPHandler()`. Single shared `prometheus.Registry`. +- `prometheus/restream.go` is the reference collector — pulls from + `monitor/metric.Reader` via `metric.Pattern` queries, emits via + `prometheus.MustNewConstMetric`. All upstream collectors carry a `core` + label as the first dimension. +- `/metrics` endpoint already exposed by Core; auth handled at the same + layer as the rest of the API. + +## Approach + +**Hybrid instrumentation, in two surfaces:** + +1. **Direct `prometheus/client_golang` instrumentation** in `app/webrtc/` + for hot-path counters and histograms (request rate, request duration, + ICE establishment duration, error counters by reason). Histograms can't + be reconstructed from a scrape-time snapshot, so this is non-negotiable + for RED-method. + +2. **Snapshot-style collector** in `prometheus/webrtc.go` for slow-changing + gauges (active streams, active peers per stream, UDP port pool usage). + Calls a new `Stats()` method on the WebRTC subsystem at scrape time. + +Both surfaces register against the same `prometheus.Registerer` exposed by +`prometheus.Metrics`. No new HTTP endpoint, no new auth path. Both take a +`core` first-label dimension to match upstream collector convention. + +### Why not pure snapshot? + +Upstream's `prometheus/restream.go` pulls from a `monitor/metric` bus that +the FFmpeg supervision layer writes into. We could mirror that for WebRTC +— have `app/webrtc/lifecycle.go` and `handler.go` push events onto the bus, +have `prometheus/webrtc.go` pull them. Two reasons not to: + +- **Histograms don't fit the pattern.** The bus stores point-in-time values + (gauges and counters), not distributions. 
RED-method needs duration p50 + and p95; you'd end up maintaining an in-process sliding-window quantile + estimator inside the WebRTC subsystem, which is more code than just using + `client_golang.Histogram` directly. +- **The bus is FFmpeg-shaped.** `metric.Pattern` queries are designed for + process-state metrics (process IDs, FFmpeg states). Bolting WebRTC + semantics on requires defining new patterns the bus consumers all need + to know about, for a payload only the WebRTC collector cares about. + +The hybrid keeps each metric type on the cleanest path. The cost is two +patterns in the codebase instead of one — accepted, with a comment in +`prometheus/webrtc.go` pointing at this rationale so the next contributor +doesn't try to "fix" the divergence. + +### Why not pure direct? + +Pure `client_golang` everywhere would mean the gauges (active streams, +active peers, UDP ports) sit in `app/webrtc/` alongside histograms. Workable, +but loses the "one collector file per subsystem in `prometheus/`" pattern +that anyone reading the repo's existing structure would expect. Snapshot +gauges are cheap to implement via the existing pattern, so we keep them +where a casual reader would look. 
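To make the histogram argument concrete: a duration has to be sorted into buckets at the moment it is observed, which a scrape-time snapshot cannot do after the fact. The following is a minimal stdlib-only sketch using the bucket bounds proposed in this design. `toyHistogram` is a hypothetical stand-in for illustration and does not reflect how `client_golang` stores its data internally.

```go
package main

import "sort"

// histBuckets mirrors the upper bounds (in seconds) proposed in this design.
var histBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// toyHistogram keeps cumulative counts the way they appear in the text
// exposition: counts[i] is the number of observations <= histBuckets[i],
// and the final slot plays the role of the +Inf bucket.
type toyHistogram struct {
	counts []uint64
	sum    float64
	count  uint64
}

func newToyHistogram() *toyHistogram {
	return &toyHistogram{counts: make([]uint64, len(histBuckets)+1)}
}

// Observe records one duration at event time. Once the event has passed,
// the raw value is gone; only the bucket counts, sum, and count remain.
func (h *toyHistogram) Observe(seconds float64) {
	// Index of the first bucket whose upper bound is >= the value;
	// len(histBuckets) means the value only fits the +Inf slot.
	first := sort.SearchFloat64s(histBuckets, seconds)
	for i := first; i < len(h.counts); i++ {
		h.counts[i]++
	}
	h.sum += seconds
	h.count++
}
```

A gauge-style snapshot only sees the state at scrape time, so the individual durations between two scrapes would never reach it. That is the reason the duration metrics stay on the direct path.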
+ +## Module Layout + +### New files + +``` +app/webrtc/metrics.go (~150 LOC) +app/webrtc/metrics_test.go (~200 LOC) +prometheus/webrtc.go (~120 LOC) +prometheus/webrtc_test.go (~150 LOC) +deploy/truenas/core/prom/prometheus.yml +deploy/truenas/core/prom/rules/webrtc-alerts.yml +deploy/truenas/core/grafana/provisioning/datasources/prometheus.yml +deploy/truenas/core/grafana/provisioning/dashboards/webrtc.yml +deploy/truenas/core/grafana/dashboards/dragonfork-webrtc-health.json +``` + +### Modified files + +``` +app/webrtc/handler.go — add metric middleware around WHEP routes +app/webrtc/lifecycle.go — record ICE timing in OnConnectionStateChange +app/webrtc/subsystem.go — add Stats() method, instrument process hooks +deploy/truenas/core/docker-compose.yml — add prom + grafana services +deploy/truenas/core/README.md — document new env vars + ports +README.md — quick-start mentions Grafana URL +CHANGELOG.md — v0.2.0-dragonfork section +``` + +### `app/webrtc/metrics.go` — direct instrumentation + +`promauto`-registered into the shared registry, exposed as package-level +vars so `handler.go` and `lifecycle.go` can increment without dependency +injection. Single `Init(reg prometheus.Registerer, core string)` called +from `subsystem.New` after the registry is available. + +```go +// Sketch — exact wire format finalized at implementation. 
+package webrtc + +import ( + "github.com/prometheus/client_golang/prometheus" + "github.com/prometheus/client_golang/prometheus/promauto" +) + +var histBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10} + +type metrics struct { + whepRequests *prometheus.CounterVec // route, code, stream_id + whepRequestDuration *prometheus.HistogramVec // route, stream_id + iceEstablishment *prometheus.HistogramVec // stream_id, result + iceFailures *prometheus.CounterVec // stream_id, reason + codecMismatches *prometheus.CounterVec // stream_id, kind + capRejections *prometheus.CounterVec // stream_id, scope + ffmpegLegFailures *prometheus.CounterVec // stream_id, leg +} + +func newMetrics(reg prometheus.Registerer, core string) *metrics { + factory := promauto.With(reg) + return &metrics{ + whepRequests: factory.NewCounterVec(prometheus.CounterOpts{ + Name: "dragonfork_webrtc_whep_requests_total", + Help: "Count of WHEP requests by route, status code, and stream.", + ConstLabels: prometheus.Labels{"core": core}, + }, []string{"route", "code", "stream_id"}), + // ... etc + } +} +``` + +The `core` label is a `ConstLabels` (set once at construction) rather than a +per-request dimension — matches the upstream collector pattern and avoids +threading it through every call site. + +### `prometheus/webrtc.go` — snapshot collector + +Standard `prometheus.Collector` interface (Describe / Collect). Keeps a +reference to a `WebRTCStatsSource` interface, which the WebRTC subsystem +implements via its `Stats()` method. Avoids importing `app/webrtc` from +`prometheus/` — the dependency arrow points the right way. + +```go +// Sketch. 
+type WebRTCStatsSource interface { + Stats() WebRTCStats +} + +type WebRTCStats struct { + StreamCount int + PeersByStream map[string]int + UDPPortsInUse int + UDPPortsAvailable int +} + +type webrtcCollector struct { + core string + source WebRTCStatsSource + + activeStreamsDesc *prometheus.Desc + activePeersDesc *prometheus.Desc + udpPortsInUseDesc *prometheus.Desc + udpPortsAvailableDesc *prometheus.Desc +} + +func NewWebRTCCollector(core string, source WebRTCStatsSource) prometheus.Collector { ... } +``` + +The `WebRTCStats` type lives in `prometheus/webrtc.go` (not in `app/webrtc/`) +so the dependency stays one-directional. The subsystem satisfies the +interface structurally, so `prometheus/` never imports `app/webrtc`. The +reverse import is needed for the `WebRTCStats` return type and uses the +`coreprom` alias (see Open Decisions). + +### `app/webrtc/subsystem.go` — `Stats()` method + +```go +// coreprom is the aliased import of github.com/datarhei/core/v16/prometheus. +func (s *Subsystem) Stats() coreprom.WebRTCStats { + s.mu.Lock() + defer s.mu.Unlock() + peers := make(map[string]int, len(s.streams)) + for id, st := range s.streams { + peers[id] = len(st.peers) // assume peers tracked per-stream + } + return coreprom.WebRTCStats{ + StreamCount: len(s.streams), + PeersByStream: peers, + UDPPortsInUse: s.portAlloc.InUse(), + UDPPortsAvailable: s.portAlloc.Available(), + } +} +``` + +The existing subsystem tracks streams in `s.streams` under `s.mu`. Peer +count per stream needs the per-stream peer index that already exists in +`handler.go`; the `Stats()` method consults it via the existing teardown +hook plumbing or a small new accessor on `Handler`. Pick whichever surface +introduces the smaller blast radius at implementation time. + +## Metric Inventory + +Eleven metrics. Eight new label dimensions across them. Under 500 active +series at the 5-stream scale budgeted in the cardinality section. + +### Direct instrumentation (`app/webrtc/metrics.go`) + +| Name | Type | Labels | Description | +|---|---|---|---| +| `dragonfork_webrtc_whep_requests_total` | Counter | core, route, code, stream_id | Count of WHEP requests by route+status code.
| +| `dragonfork_webrtc_whep_request_duration_seconds` | Histogram | core, route, stream_id | Server-side WHEP request duration. Buckets: `[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]`. | +| `dragonfork_webrtc_ice_establishment_duration_seconds` | Histogram | core, stream_id, result | Time from `SetLocalDescription` to first `connected` or `failed` ICE state. Same buckets. | +| `dragonfork_webrtc_ice_failures_total` | Counter | core, stream_id, reason | ICE failure count. `reason` ∈ {timeout, disconnected, failed}. | +| `dragonfork_webrtc_codec_mismatches_total` | Counter | core, stream_id, kind | 406 rejections by kind. `kind` ∈ {video, audio}. | +| `dragonfork_webrtc_cap_rejections_total` | Counter | core, stream_id, scope | 503 rejections. `scope` ∈ {global, stream}. | +| `dragonfork_webrtc_ffmpeg_leg_failures_total` | Counter | core, stream_id, leg | RTP output leg failures. `leg` ∈ {video, audio}. | + +### Snapshot collector (`prometheus/webrtc.go`) + +| Name | Type | Labels | Description | +|---|---|---|---| +| `dragonfork_webrtc_active_streams` | Gauge | core | Streams currently registered (processes with `webrtc.enabled=true` running). | +| `dragonfork_webrtc_active_peers` | Gauge | core, stream_id | Currently subscribed WHEP peers per stream. | +| `dragonfork_webrtc_udp_ports_in_use` | Gauge | core | UDP ports currently allocated from the pool. | +| `dragonfork_webrtc_udp_ports_available` | Gauge | core | Pool size minus in-use (explicit for alert friendliness). | + +### Label rationale + +- `whep_request_duration_seconds` deliberately omits `code` — separating + distributions per outcome makes p95 noisy, and per-route per-stream p95 + is what an operator actually looks at. Errors get visibility through the + request-counter ratio. 
+- `ice_establishment_duration_seconds` includes both `connected` and + `failed` results in the same histogram via the `result` label, + intentionally, so the dashboard can compare success latency to + failure-tail latency on the same axis. +- `cap_rejections_total` keeps the `scope` label because v0.1's response + body already splits global vs per-stream rejections; metrics mirror that + distinction so the dashboard shows whether to raise `max_peers_total` + or just one stream's per-stream cap. +- `ffmpeg_leg_failures_total` is the "quietly degrading" canary: a silent + RTP-output-leg failure (port bind, encoder crash) is exactly what the + "is it healthy?" framing is meant to catch. + +### Cardinality budget + +At typical scale (5 streams, 3 routes, ~6 status codes seen in practice): + +- `whep_requests_total`: 5 × 3 × 6 = 90 series (worst case) +- `whep_request_duration_seconds`: 5 × 3 × (8 buckets + the implicit `+Inf` bucket + sum + count) = 165 series +- `ice_establishment_duration_seconds`: 5 × 2 × 11 = 110 series +- All others: 5–15 series each +- **Total: <500 active series at 5-stream sustained load** + +Well within Prometheus's comfort zone. At 15s scrape interval × 15-day +retention, on-disk storage ~80MB. + +### Specifically excluded metrics + +- **Per-peer session metrics.** Listed under non-goals. +- **Bytes-out / bandwidth.** Pion exposes RTP write bytes via stats; would + be useful but pulls peer-level state. Defer to a future v0.3 spec + ("WebRTC bandwidth observability") if needed. +- **Server-hop latency (FFmpeg → peer).** Microsecond scale, already + covered by `-tags latency` test gate, would need its own bucket set. + +## Deploy Bundle + +### `deploy/truenas/core/docker-compose.yml` additions + +Two new services on a new bridge network `dragonfork-mon`. Core continues +on `network_mode: host` unchanged.
The new containers reach Core via +`host.docker.internal:${CORE_HTTP_PORT}` (Linux Docker resolves this when +`extra_hosts: ["host.docker.internal:host-gateway"]` is set on the service). + +```yaml +services: + core: + # ... existing definition unchanged + + prom: + image: prom/prometheus:v2.55.0 + container_name: dragonfork-prom + restart: unless-stopped + networks: [dragonfork-mon] + extra_hosts: + - "host.docker.internal:host-gateway" + volumes: + - ./prom/prometheus.yml:/etc/prometheus/prometheus.yml:ro + - ./prom/rules:/etc/prometheus/rules:ro + - ./prom-data:/prometheus + command: + - --config.file=/etc/prometheus/prometheus.yml + - --storage.tsdb.retention.time=15d + - --storage.tsdb.path=/prometheus + - --web.console.libraries=/usr/share/prometheus/console_libraries + - --web.console.templates=/usr/share/prometheus/consoles + ports: + - "${PROM_PORT:-9090}:9090" + + grafana: + image: grafana/grafana-oss:11.3.0 + container_name: dragonfork-grafana + restart: unless-stopped + networks: [dragonfork-mon] + depends_on: [prom] + environment: + GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:?set in .env}" + GF_USERS_ALLOW_SIGN_UP: "false" + GF_AUTH_ANONYMOUS_ENABLED: "false" + volumes: + - ./grafana/provisioning:/etc/grafana/provisioning:ro + - ./grafana/dashboards:/var/lib/grafana/dashboards:ro + - ./grafana-data:/var/lib/grafana + ports: + - "${GRAFANA_PORT:-3000}:3000" + +networks: + dragonfork-mon: + driver: bridge +``` + +### `prom/prometheus.yml` + +```yaml +global: + scrape_interval: 15s + scrape_timeout: 10s + evaluation_interval: 15s + external_labels: + core: dragonfork-truenas + +rule_files: + - /etc/prometheus/rules/*.yml + +scrape_configs: + - job_name: dragonfork-core + static_configs: + - targets: ["host.docker.internal:8080"] + metrics_path: /metrics + # If API auth is enabled on /metrics, uncomment and provide creds via + # env-substituted file. v0.1 leaves /metrics public by default. 
+ # basic_auth: + # username_file: /run/secrets/prom_basic_user + # password_file: /run/secrets/prom_basic_pass +``` + +### `prom/rules/webrtc-alerts.yml` + +```yaml +groups: + - name: dragonfork-webrtc + rules: + - alert: WebRTCWHEPErrorRateHigh + expr: | + sum by (stream_id) ( + rate(dragonfork_webrtc_whep_requests_total{code=~"4..|5.."}[5m]) + ) > 0.5 + for: 5m + labels: + severity: warning + annotations: + summary: "WHEP error rate high on stream {{ $labels.stream_id }}" + description: "Sustained 4xx/5xx rate >0.5/sec for 5m." + + - alert: WebRTCICEEstablishmentSlow + expr: | + histogram_quantile(0.95, + sum by (le, stream_id) ( + rate(dragonfork_webrtc_ice_establishment_duration_seconds_bucket[10m]) + ) + ) > 3 + for: 10m + labels: + severity: warning + annotations: + summary: "ICE establishment p95 >3s on {{ $labels.stream_id }}" + + - alert: WebRTCICEFailureRateHigh + expr: | + sum by (stream_id) (rate(dragonfork_webrtc_ice_failures_total[5m])) > 0.2 + for: 5m + labels: + severity: warning + annotations: + summary: "ICE failures sustained on {{ $labels.stream_id }}" + + - alert: WebRTCFFmpegLegFailure + expr: | + increase(dragonfork_webrtc_ffmpeg_leg_failures_total[5m]) > 0 + labels: + severity: critical + annotations: + summary: "FFmpeg RTP leg failed on {{ $labels.stream_id }} ({{ $labels.leg }})" + description: "Silent degradation of RTP output. Check FFmpeg logs." +``` + +Alerts evaluate but route nowhere. Alertmanager bundling deferred — see +non-goals. 
+ +### Grafana provisioning + +Datasource provisioning at `grafana/provisioning/datasources/prometheus.yml`: + +```yaml +apiVersion: 1 +datasources: + - name: Prometheus + type: prometheus + access: proxy + url: http://prom:9090 + isDefault: true + editable: false +``` + +Dashboard provisioning at `grafana/provisioning/dashboards/webrtc.yml`: + +```yaml +apiVersion: 1 +providers: + - name: dragonfork + orgId: 1 + folder: "Dragon Fork" + type: file + disableDeletion: false + updateIntervalSeconds: 30 + options: + path: /var/lib/grafana/dashboards +``` + +### Dashboard JSON: `dragonfork-webrtc-health.json` + +Single dashboard, five rows aligned to the questions from the metric +inventory: + +1. **WHEP API health**: request rate by route (stat panel), error rate + stacked by code (timeseries), p95 request duration by route (timeseries). +2. **ICE establishment**: success/failure rate (gauge), p50/p95 + establishment duration (timeseries with a 3s threshold line for the + alert), failure breakdown by reason (table). +3. **What's flowing**: `active_streams` (stat), `active_peers` per stream + (timeseries), top 5 streams by peer count (table). +4. **Capacity headroom**: `udp_ports_available` (gauge with red-zone <10), + cap rejection rate by scope (timeseries). +5. **Silent degradation**: FFmpeg leg failure timeline (timeseries with + annotations), codec mismatch counter (stat). + +Built in Grafana 11.3, exported as JSON, committed to the repo. Refresh +default 30s. + +### `.env` template additions + +Append to `deploy/truenas/core/README.md`'s example `.env`: + +```sh +# --- Observability (added in v0.2) --- +# Compose reads .env values literally and does not run $(...) substitutions. +# Generate a value once with `openssl rand -base64 24` and paste it here. +# Left empty on purpose: compose refuses to start until it is set. +GRAFANA_ADMIN_PASSWORD= +GRAFANA_PORT=3000 +PROM_PORT=9090 +``` + +## Testing + +### Unit tests — `prometheus/webrtc_test.go` + +Mock `WebRTCStatsSource`. Drive the collector through three states (no +streams, one stream with N peers, multiple streams).
Use +`testutil.CollectAndCompare` to assert exact metric/label/value output +against a golden plaintext fixture. + +```go +// Golden fixture (excerpt): +// # HELP dragonfork_webrtc_active_streams ... +// # TYPE dragonfork_webrtc_active_streams gauge +// dragonfork_webrtc_active_streams{core="test"} 2 +// # HELP dragonfork_webrtc_active_peers ... +// # TYPE dragonfork_webrtc_active_peers gauge +// dragonfork_webrtc_active_peers{core="test",stream_id="live"} 3 +// dragonfork_webrtc_active_peers{core="test",stream_id="cam"} 1 +``` + +### Unit tests — `app/webrtc/metrics_test.go` + +Reuse `handler_test.go` setup (fake registry, in-process Echo router). +Hit each WHEP route, assert the corresponding counter and histogram have +the expected increment via `testutil.ToFloat64`. Drive forced error paths +(unknown stream → 404, codec-less SDP → 406, cap exceeded → 503, ICE +timeout → 504) and assert the right error-bucket counters bumped. + +### Integration verification — `test/TESTING.md` + +New section "Verifying Prometheus metrics": + +``` +1. docker compose up -d +2. curl -s http://:8080/metrics | grep dragonfork_webrtc_ + - expect: 11 metric families present, all with `core="dragonfork-truenas"` +3. Open http://:3000 (Grafana), log in with GRAFANA_ADMIN_PASSWORD +4. Navigate to Dashboards → Dragon Fork → WebRTC Health + - expect: all 5 rows render, no "no data" panels except where stream traffic is absent +5. Trigger one of each error in test/whep-player.html (intentional codec + mismatch via SDP edit, kill the publisher mid-stream, etc.) +6. Watch the Grafana panels and verify counters tick within 15s. +``` + +### CI + +Existing test runner picks up the new `_test.go` files. No new CI gates +beyond standard build+test — observability isn't a contract; the unit +tests verify shape only. Grafana dashboard JSON is *not* validated in CI +(no good lightweight validator); manual verification only. 
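The "11 metric families present" check in the integration verification could be scripted instead of eyeballed. A minimal sketch, assuming the `/metrics` body has already been fetched into a string: `webrtcFamilies` is a hypothetical helper that collects the family names declared on `# TYPE` lines.

```go
package main

import (
	"bufio"
	"sort"
	"strings"
)

// webrtcFamilies returns the sorted, de-duplicated metric family names
// declared via "# TYPE <name> <type>" lines whose name starts with prefix.
func webrtcFamilies(exposition, prefix string) []string {
	seen := map[string]struct{}{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 4 && fields[0] == "#" && fields[1] == "TYPE" &&
			strings.HasPrefix(fields[2], prefix) {
			seen[fields[2]] = struct{}{}
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```

One caveat for the expected count: labelled vectors in `client_golang` emit no samples, and no `# TYPE` line, until a label combination has been observed. A freshly started Core with no WHEP traffic may therefore report fewer than eleven families until each error path has fired at least once.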
+ +### Load test alignment + +The deferred 5-peer × 10-min load test (separate spec) will use this +dashboard as its primary observation surface. Recording rules for the +load test's specific aggregations can be added in that spec without +touching this one. + +## Rollout + +The TrueNAS v0.1.0-dragonfork deploy upgrades via: + +```sh +cd deploy/truenas/core +git pull # latest main with this change +# Add new lines to .env (see template above) +docker compose pull # grabs prom + grafana images +docker compose up -d # core unchanged, prom + grafana new +``` + +Core continues on host networking. The new containers connect via +`host.docker.internal:host-gateway`, no firewall changes required for +intra-host traffic. External Grafana access is on `${GRAFANA_PORT}`. + +### Backwards compatibility + +- No upstream metric names or labels modified. New metrics are purely + additive in `dragonfork_webrtc_*` namespace. +- No API changes. `/metrics` payload grows but stays well-formed + Prometheus exposition. +- Existing config, env vars, and process JSON formats unchanged. + +### Forward compatibility + +- The `core` label being a `ConstLabels` value (not a per-event dimension) + means future federated multi-Core scrapes will distinguish series cleanly + by setting `core="dragonfork-truenas-east"` etc. in each deploy's config + loader. Spec'd here, implemented when needed. +- New metrics in this spec follow the `dragonfork__` naming + pattern. Future Dragon-Fork-specific metrics (WHIP, keyframe cache, + bandwidth) should adopt the same convention. + +### Known gaps post-rollout + +- No paging. Alerts evaluate, no Alertmanager. If `WebRTCFFmpegLegFailure` + fires at 3am, no notification — operator notices at next dashboard check. + Acceptable for v0.2 single-operator deploy. Track as a v0.3 spec. +- Grafana dashboard JSON is hand-edited via Grafana UI then re-exported. + No JSON-as-code library used. 
If dashboard maintenance gets painful, + Grafonnet/Grafana-as-code is a v0.3+ refactor. +- `/metrics` itself is unauthenticated by default in v0.1 (matches + upstream). If Core's deploy bundle is exposed to untrusted networks, + the operator should already be using auth on Core's HTTP listener. Not + this spec's problem to solve, but worth a one-line note in + `deploy/truenas/core/README.md`. + +## Open Decisions + +1. **Should the `Stats()` method live on `Subsystem` or on `Handler`?** + The peer count is in `Handler`'s per-stream peer index; stream count + is in `Subsystem`'s registry; UDP port pool is in `portalloc`. Easiest + shape: `Subsystem.Stats()` is the public surface and internally + gathers from `Handler` (via the existing teardown-hook plumbing) and + `portalloc`. Decide at implementation time based on which surface + exposes the cleanest seams. + +2. **Should histograms also include a `core` label, given it's already a + `ConstLabels`?** Yes — `ConstLabels` is automatically present on every + sample, no per-call overhead, and federations need it. + +3. **Should Prometheus retention be configurable via `.env`?** Defaulting + to 15d covers the realistic window for "what happened last week?" + queries. Adding `PROM_RETENTION_DAYS=15d` to `.env` is a one-line + change. Including it as optional, defaulting to 15d. + +4. **Import-alias collision.** The local package is `package prometheus` + (at `github.com/datarhei/core/v16/prometheus`) and `client_golang` is + also `package prometheus`. Files in `app/webrtc/` that need both must + alias one — convention is `coreprom "github.com/datarhei/core/v16/prometheus"`. + Implementation note only; doesn't change the design. 
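One possible shape for the first open decision, sketched with the standard library only: `Stats()` stays the public surface and gathers from narrow interfaces. The `peerIndex` and `portPool` interfaces, the `PeerCounts` accessor, and the two test doubles are hypothetical names introduced for illustration. The real `Handler` and `portalloc` APIs may differ.

```go
package main

import "sync"

// WebRTCStats mirrors the snapshot struct sketched in this design.
type WebRTCStats struct {
	StreamCount       int
	PeersByStream     map[string]int
	UDPPortsInUse     int
	UDPPortsAvailable int
}

// peerIndex and portPool are hypothetical seams over Handler and portalloc.
type peerIndex interface{ PeerCounts() map[string]int }
type portPool interface {
	InUse() int
	Available() int
}

type subsystem struct {
	mu      sync.Mutex
	streams map[string]struct{}
	peers   peerIndex
	ports   portPool
}

// Stats copies the stream IDs under the lock, then releases it before
// calling into the other components so two locks are never held at once.
func (s *subsystem) Stats() WebRTCStats {
	s.mu.Lock()
	ids := make([]string, 0, len(s.streams))
	for id := range s.streams {
		ids = append(ids, id)
	}
	s.mu.Unlock()

	counts := s.peers.PeerCounts()
	byStream := make(map[string]int, len(ids))
	for _, id := range ids {
		byStream[id] = counts[id] // zero when a registered stream has no peers
	}
	return WebRTCStats{
		StreamCount:       len(ids),
		PeersByStream:     byStream,
		UDPPortsInUse:     s.ports.InUse(),
		UDPPortsAvailable: s.ports.Available(),
	}
}

// Test doubles standing in for Handler and portalloc.
type staticPeers map[string]int

func (p staticPeers) PeerCounts() map[string]int { return p }

type fixedPool struct{ inUse, size int }

func (p fixedPool) InUse() int     { return p.inUse }
func (p fixedPool) Available() int { return p.size - p.inUse }
```

The trade-off in this shape is that the stream list and the peer counts are read at slightly different instants, which seems acceptable for gauges sampled every 15s. Streams known only to the peer index are dropped so that `active_peers` never reports a `stream_id` that `active_streams` does not count.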
+ +## References + +- [Prometheus client_golang](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus) +- [Prometheus instrumentation best practices](https://prometheus.io/docs/practices/instrumentation/) +- [Histogram bucket design](https://prometheus.io/docs/practices/histograms/) +- [Grafana provisioning docs](https://grafana.com/docs/grafana/latest/administration/provisioning/) +- v0.1 design: `docs/design/2026-04-16-datarhei-dragon-fork-webrtc-design.md` +- M2 integration: `docs/design/2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`