# Datarhei - Dragon Fork: WebRTC Prometheus Metrics

**Status:** Draft for review
**Author:** Zac (Wild Dragon)
**Date:** 2026-05-03
**Predecessors:**

- [`2026-04-16-datarhei-dragon-fork-webrtc-design.md`](2026-04-16-datarhei-dragon-fork-webrtc-design.md)
- [`2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`](2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md)
- v0.1.0-dragonfork released 2026-05-03

---

## Summary

Add Prometheus instrumentation to Dragon Fork's WebRTC subsystem and ship a collection-and-dashboard stack in the existing TrueNAS deploy bundle. Closes the v0.1 observability gap: the WHEP egress has been running in production since 2026-04-17 with zero per-subsystem signal.

The deliverable is a RED-method dashboard ("rate, errors, duration") that answers a single operator question — _is the WebRTC stack healthy right now?_

Eleven new metrics in the `dragonfork_webrtc_*` namespace, two new containers (Prometheus + Grafana) in `deploy/truenas/core/`, four pre-loaded alert rules, one pre-provisioned dashboard.

## Goals

- Operator can answer "is WebRTC healthy right now?" from a single Grafana dashboard, without tailing logs or hitting the API.
- Per-stream drill-down available when the dashboard goes red — labels carry `stream_id` everywhere it's meaningful, never `peer_id`.
- Deploy is one-command on a fresh TrueNAS box (`docker compose up -d`), matching the existing v0.1 deploy ergonomics.
- Backwards-compatible: zero changes to upstream's `/metrics` payload. New metrics are purely additive.
- Bucket choices and label sets are tuned for the realistic latency ranges observed in v0.1 (server-hop p95 ≈ 240µs, ICE establishment seconds-scale).

## Non-Goals

- **Alertmanager bundling.** Alert rules are loaded into Prometheus but not routed. Paging configuration is too opinionated to ship a default; separate spec if/when paging is wanted.
- **Per-peer metric labels.** Peer-level forensics (individual session lifetimes, per-resource teardown reasons) is out of scope. `peer_id` is unbounded under churn and risks cardinality bloat.
- **Federated multi-Core scrape.** Single-deploy scrape config only. The `core` label is set statically to `dragonfork-truenas`.
- **Latency p95 CI gate via Prometheus.** Server-hop latency stays a Go test gate (`-tags latency`); not a Prometheus histogram.
- **Server-hop microsecond histogram.** The 240µs server-hop is well below HTTP request scales and would need its own bucket set; it's already covered by the latency CI test, no need to duplicate in Prom.
- **Custom monitor/metric bus integration.** Upstream pulls from `monitor/metric.Reader`. We diverge; see "Why not pure snapshot?" under Approach for the rationale.

## Context

v0.1 surface area:

- WHEP HTTP routes: `POST /api/v3/whep/{id}`, `DELETE /api/v3/whep/{id}/{r}`, `PATCH /api/v3/whep/{id}/{r}`, plus admin `GET /api/v3/webrtc/streams` and `GET /api/v3/webrtc/streams/{id}/peers`.
- Error matrix in v0.1: `406` codec mismatch, `503` cap reached (split into global vs per-stream in response body), `504` ICE timeout, `204` DELETE idempotent, `404` unknown stream.
- Pion-mediated peer connection lifecycle in `app/webrtc/lifecycle.go` — ICE state transitions are the natural hook for ICE timing/failure metrics.
- FFmpeg RTP output legs supervised by the existing process supervisor; silent leg failure is a known "quietly degrading" risk worth instrumenting.

Existing Prometheus integration (upstream):

- `prometheus/prometheus.go` exposes a `Metrics` interface with `Register` and an `HTTPHandler()`. Single shared `prometheus.Registry`.
- `prometheus/restream.go` is the reference collector — pulls from `monitor/metric.Reader` via `metric.Pattern` queries, emits via `prometheus.MustNewConstMetric`. All upstream collectors carry a `core` label as the first dimension.
- `/metrics` endpoint already exposed by Core; auth handled at the same layer as the rest of the API.

## Approach

**Hybrid instrumentation, in two surfaces:**

1. **Direct `prometheus/client_golang` instrumentation** in `app/webrtc/` for hot-path counters and histograms (request rate, request duration, ICE establishment duration, error counters by reason). Histograms can't be reconstructed from a scrape-time snapshot, so this is non-negotiable for RED-method.
2. **Snapshot-style collector** in `prometheus/webrtc.go` for slow-changing gauges (active streams, active peers per stream, UDP port pool usage). Calls a new `Stats()` method on the WebRTC subsystem at scrape time.

Both surfaces register against the same `prometheus.Registerer` exposed by `prometheus.Metrics`. No new HTTP endpoint, no new auth path. Both take a `core` first-label dimension to match upstream collector convention.

### Why not pure snapshot?

Upstream's `prometheus/restream.go` pulls from a `monitor/metric` bus that the FFmpeg supervision layer writes into. We could mirror that for WebRTC — have `app/webrtc/lifecycle.go` and `handler.go` push events onto the bus, have `prometheus/webrtc.go` pull them. Two reasons not to:

- **Histograms don't fit the pattern.** The bus stores point-in-time values (gauges and counters), not distributions. RED-method needs duration p50 and p95; you'd end up maintaining an in-process sliding-window quantile estimator inside the WebRTC subsystem, which is more code than just using `client_golang.Histogram` directly.
- **The bus is FFmpeg-shaped.** `metric.Pattern` queries are designed for process-state metrics (process IDs, FFmpeg states). Bolting WebRTC semantics on requires defining new patterns the bus consumers all need to know about, for a payload only the WebRTC collector cares about.

The hybrid keeps each metric type on the cleanest path.
The cost is two patterns in the codebase instead of one — accepted, with a comment in `prometheus/webrtc.go` pointing at this rationale so the next contributor doesn't try to "fix" the divergence.

### Why not pure direct?

Pure `client_golang` everywhere would mean the gauges (active streams, active peers, UDP ports) sit in `app/webrtc/` alongside histograms. Workable, but loses the "one collector file per subsystem in `prometheus/`" pattern that anyone reading the repo's existing structure would expect. Snapshot gauges are cheap to implement via the existing pattern, so we keep them where a casual reader would look.

## Module Layout

### New files

```
app/webrtc/metrics.go                                                (~150 LOC)
app/webrtc/metrics_test.go                                           (~200 LOC)
prometheus/webrtc.go                                                 (~120 LOC)
prometheus/webrtc_test.go                                            (~150 LOC)
deploy/truenas/core/prom/prometheus.yml
deploy/truenas/core/prom/rules/webrtc-alerts.yml
deploy/truenas/core/grafana/provisioning/datasources/prometheus.yml
deploy/truenas/core/grafana/provisioning/dashboards/webrtc.yml
deploy/truenas/core/grafana/dashboards/dragonfork-webrtc-health.json
```

### Modified files

```
app/webrtc/handler.go                  — add metric middleware around WHEP routes
app/webrtc/lifecycle.go                — record ICE timing in OnConnectionStateChange
app/webrtc/subsystem.go                — add Stats() method, instrument process hooks
deploy/truenas/core/docker-compose.yml — add prom + grafana services
deploy/truenas/core/README.md          — document new env vars + ports
README.md                              — quick-start mentions Grafana URL
CHANGELOG.md                           — v0.2.0-dragonfork section
```

### `app/webrtc/metrics.go` — direct instrumentation

`promauto`-registered into the shared registry, exposed as package-level vars so `handler.go` and `lifecycle.go` can increment without dependency injection. Single `Init(reg prometheus.Registerer, core string)` called from `subsystem.New` after the registry is available.

```go
// Sketch — exact wire format finalized at implementation.
package webrtc

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var histBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

type metrics struct {
	whepRequests        *prometheus.CounterVec   // route, code, stream_id
	whepRequestDuration *prometheus.HistogramVec // route, stream_id
	iceEstablishment    *prometheus.HistogramVec // stream_id, result
	iceFailures         *prometheus.CounterVec   // stream_id, reason
	codecMismatches     *prometheus.CounterVec   // stream_id, kind
	capRejections       *prometheus.CounterVec   // stream_id, scope
	ffmpegLegFailures   *prometheus.CounterVec   // stream_id, leg
}

func newMetrics(reg prometheus.Registerer, core string) *metrics {
	factory := promauto.With(reg)
	return &metrics{
		whepRequests: factory.NewCounterVec(prometheus.CounterOpts{
			Name:        "dragonfork_webrtc_whep_requests_total",
			Help:        "Count of WHEP requests by route, status code, and stream.",
			ConstLabels: prometheus.Labels{"core": core},
		}, []string{"route", "code", "stream_id"}),
		// ... etc
	}
}
```

The `core` label is a `ConstLabels` (set once at construction) rather than a per-request dimension — matches the upstream collector pattern and avoids threading it through every call site.

### `prometheus/webrtc.go` — snapshot collector

Standard `prometheus.Collector` interface (Describe / Collect). Keeps a reference to a `WebRTCStatsSource` interface, which the WebRTC subsystem implements via its `Stats()` method. Avoids importing `app/webrtc` from `prometheus/` — the dependency arrow points the right way.

```go
// Sketch.
type WebRTCStatsSource interface {
	Stats() WebRTCStats
}

type WebRTCStats struct {
	StreamCount       int
	PeersByStream     map[string]int
	UDPPortsInUse     int
	UDPPortsAvailable int
}

type webrtcCollector struct {
	core   string
	source WebRTCStatsSource

	activeStreamsDesc     *prometheus.Desc
	activePeersDesc       *prometheus.Desc
	udpPortsInUseDesc     *prometheus.Desc
	udpPortsAvailableDesc *prometheus.Desc
}

func NewWebRTCCollector(core string, source WebRTCStatsSource) prometheus.Collector { ... }
```

The `WebRTCStats` type lives in `prometheus/webrtc.go` (not in `app/webrtc/`) so the dependency stays one-directional. The subsystem satisfies the interface structurally, with no explicit declaration, but its `Stats()` method must return this exact type, so `app/webrtc` imports `prometheus/` for `WebRTCStats`. `prometheus/` never imports `app/webrtc`.

### `app/webrtc/subsystem.go` — `Stats()` method

```go
// coreprom aliases github.com/datarhei/core/v16/prometheus
// (see Open Decisions, import-alias collision).
func (s *Subsystem) Stats() coreprom.WebRTCStats {
	s.mu.Lock()
	defer s.mu.Unlock()

	peers := make(map[string]int, len(s.streams))
	for id, st := range s.streams {
		peers[id] = len(st.peers) // assume peers tracked per-stream
	}

	return coreprom.WebRTCStats{
		StreamCount:       len(s.streams),
		PeersByStream:     peers,
		UDPPortsInUse:     s.portAlloc.InUse(),
		UDPPortsAvailable: s.portAlloc.Available(),
	}
}
```

The existing subsystem tracks streams in `s.streams` under `s.mu`. Peer count per stream needs the per-stream peer index that already exists in `handler.go` — the `Stats()` method consults it via the existing teardown hook plumbing or a small new accessor on `Handler`. Pick whichever surface introduces the smaller blast radius at implementation time.

## Metric Inventory

Eleven metrics. Eight new label dimensions across them. Fewer than 500 active series at the typical 1-5 stream scale (see Cardinality budget).

### Direct instrumentation (`app/webrtc/metrics.go`)

| Name | Type | Labels | Description |
|---|---|---|---|
| `dragonfork_webrtc_whep_requests_total` | Counter | core, route, code, stream_id | Count of WHEP requests by route+status code. |
| `dragonfork_webrtc_whep_request_duration_seconds` | Histogram | core, route, stream_id | Server-side WHEP request duration. Buckets: `[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]`. |
| `dragonfork_webrtc_ice_establishment_duration_seconds` | Histogram | core, stream_id, result | Time from `SetLocalDescription` to first `connected` or `failed` ICE state. Same buckets. |
| `dragonfork_webrtc_ice_failures_total` | Counter | core, stream_id, reason | ICE failure count. `reason` ∈ {timeout, disconnected, failed}. |
| `dragonfork_webrtc_codec_mismatches_total` | Counter | core, stream_id, kind | 406 rejections by kind. `kind` ∈ {video, audio}. |
| `dragonfork_webrtc_cap_rejections_total` | Counter | core, stream_id, scope | 503 rejections. `scope` ∈ {global, stream}. |
| `dragonfork_webrtc_ffmpeg_leg_failures_total` | Counter | core, stream_id, leg | RTP output leg failures. `leg` ∈ {video, audio}. |

### Snapshot collector (`prometheus/webrtc.go`)

| Name | Type | Labels | Description |
|---|---|---|---|
| `dragonfork_webrtc_active_streams` | Gauge | core | Streams currently registered (processes with `webrtc.enabled=true` running). |
| `dragonfork_webrtc_active_peers` | Gauge | core, stream_id | Currently subscribed WHEP peers per stream. |
| `dragonfork_webrtc_udp_ports_in_use` | Gauge | core | UDP ports currently allocated from the pool. |
| `dragonfork_webrtc_udp_ports_available` | Gauge | core | Pool size minus in-use (explicit for alert friendliness). |

### Label rationale

- `whep_request_duration_seconds` deliberately omits `code` — separating distributions per outcome makes p95 noisy, and per-route per-stream p95 is what an operator actually looks at. Errors get visibility through the request-counter ratio.
- `ice_establishment_duration_seconds` includes both `connected` and `failed` results in the same histogram via the `result` label — intentionally — so the dashboard can compare success latency to failure-tail latency on the same axis.
- `cap_rejections_total` keeps the `scope` label because v0.1's response body already splits global vs per-stream rejections; metrics mirror that distinction so the dashboard shows whether to raise `max_peers_total` or just one stream's per-stream cap.
- `ffmpeg_leg_failures_total` is the "quietly degrading" canary — a silent RTP-output-leg failure (port bind, encoder crash) is exactly what the "is it healthy?" framing is meant to catch.

### Cardinality budget

At typical scale (5 streams, 3 routes, ~6 status codes seen in practice):

- `whep_requests_total`: 5 × 3 × 6 = 90 series (worst case)
- `whep_request_duration_seconds`: 5 × 3 × (8 buckets + `+Inf` bucket + sum + count) = 165 series
- `ice_establishment_duration_seconds`: 5 × 2 × 11 = 110 series
- All others: 5–15 series each
- **Total: <500 active series at 5-stream sustained load**

The histogram rows count the implicit `+Inf` bucket that `client_golang` appends to the eight configured bounds. Well within Prometheus's comfort zone. At 15s scrape interval × 15-day retention, on-disk storage ~80MB.

### Specifically excluded metrics

- **Per-peer session metrics.** Listed under non-goals.
- **Bytes-out / bandwidth.** Pion exposes RTP write bytes via stats; would be useful but pulls peer-level state. Defer to a future v0.3 spec ("WebRTC bandwidth observability") if needed.
- **Server-hop latency (FFmpeg → peer).** Microsecond scale, already covered by `-tags latency` test gate, would need its own bucket set.

## Deploy Bundle

### `deploy/truenas/core/docker-compose.yml` additions

Two new services on a new bridge network `dragonfork-mon`. Core continues on `network_mode: host` unchanged. The new containers reach Core via `host.docker.internal:${CORE_HTTP_PORT}` (Linux Docker resolves this when `extra_hosts: ["host.docker.internal:host-gateway"]` is set on the service).

```yaml
services:
  core:
    # ... existing definition unchanged

  prom:
    image: prom/prometheus:v2.55.0
    container_name: dragonfork-prom
    restart: unless-stopped
    networks: [dragonfork-mon]
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./prom/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prom/rules:/etc/prometheus/rules:ro
      - ./prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
    ports:
      - "${PROM_PORT:-9090}:9090"

  grafana:
    image: grafana/grafana-oss:11.3.0
    container_name: dragonfork-grafana
    restart: unless-stopped
    networks: [dragonfork-mon]
    depends_on: [prom]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:?set in .env}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
      - ./grafana-data:/var/lib/grafana
    ports:
      - "${GRAFANA_PORT:-3000}:3000"

networks:
  dragonfork-mon:
    driver: bridge
```

### `prom/prometheus.yml`

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    core: dragonfork-truenas

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: dragonfork-core
    static_configs:
      - targets: ["host.docker.internal:8080"]
    metrics_path: /metrics
    # If API auth is enabled on /metrics, uncomment and provide creds via
    # env-substituted file. v0.1 leaves /metrics public by default.
    # basic_auth:
    #   username_file: /run/secrets/prom_basic_user
    #   password_file: /run/secrets/prom_basic_pass
```

### `prom/rules/webrtc-alerts.yml`

```yaml
groups:
  - name: dragonfork-webrtc
    rules:
      - alert: WebRTCWHEPErrorRateHigh
        expr: |
          sum by (stream_id) (
            rate(dragonfork_webrtc_whep_requests_total{code=~"4..|5.."}[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WHEP error rate high on stream {{ $labels.stream_id }}"
          description: "Sustained 4xx/5xx rate >0.5/sec for 5m."

      - alert: WebRTCICEEstablishmentSlow
        expr: |
          histogram_quantile(0.95,
            sum by (le, stream_id) (
              rate(dragonfork_webrtc_ice_establishment_duration_seconds_bucket[10m])
            )
          ) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ICE establishment p95 >3s on {{ $labels.stream_id }}"

      - alert: WebRTCICEFailureRateHigh
        expr: |
          sum by (stream_id) (rate(dragonfork_webrtc_ice_failures_total[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ICE failures sustained on {{ $labels.stream_id }}"

      - alert: WebRTCFFmpegLegFailure
        expr: |
          increase(dragonfork_webrtc_ffmpeg_leg_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "FFmpeg RTP leg failed on {{ $labels.stream_id }} ({{ $labels.leg }})"
          description: "Silent degradation of RTP output. Check FFmpeg logs."
```

Alerts evaluate but route nowhere. Alertmanager bundling deferred — see non-goals.

### Grafana provisioning

Datasource provisioning at `grafana/provisioning/datasources/prometheus.yml`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prom:9090
    isDefault: true
    editable: false
```

Dashboard provisioning at `grafana/provisioning/dashboards/webrtc.yml`:

```yaml
apiVersion: 1
providers:
  - name: dragonfork
    orgId: 1
    folder: "Dragon Fork"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```

### Dashboard JSON: `dragonfork-webrtc-health.json`

Single dashboard, five rows aligned to the questions from the metric inventory:

1. **WHEP API health** — request rate by route (stat panel), error rate stacked by code (timeseries), p95 request duration by route (timeseries).
2. **ICE establishment** — success/failure rate (gauge), p50/p95 establishment duration (timeseries with a 3s threshold line for the alert), failure breakdown by reason (table).
3. **What's flowing** — `active_streams` (stat), `active_peers` per stream (timeseries), top 5 streams by peer count (table).
4. **Capacity headroom** — `udp_ports_available` (gauge with red-zone <10), cap rejection rate by scope (timeseries).
5. **Silent degradation** — FFmpeg leg failure timeline (timeseries with annotations), codec mismatch counter (stat).

Built in Grafana 11.3, exported as JSON, committed to the repo. Refresh default 30s.

### `.env` template additions

Append to `deploy/truenas/core/README.md`'s example `.env`:

```sh
# --- Observability (added in v0.2) ---
# Compose does not run shell command substitution in .env files, so paste a
# generated value here, e.g. the output of: openssl rand -base64 24
GRAFANA_ADMIN_PASSWORD=
GRAFANA_PORT=3000
PROM_PORT=9090
```

## Testing

### Unit tests — `prometheus/webrtc_test.go`

Mock `WebRTCStatsSource`. Drive the collector through three states (no streams, one stream with N peers, multiple streams). Use `testutil.CollectAndCompare` to assert exact metric/label/value output against a golden plaintext fixture.
```go
// Golden fixture (excerpt):
// # HELP dragonfork_webrtc_active_streams ...
// # TYPE dragonfork_webrtc_active_streams gauge
// dragonfork_webrtc_active_streams{core="test"} 2
// # HELP dragonfork_webrtc_active_peers ...
// # TYPE dragonfork_webrtc_active_peers gauge
// dragonfork_webrtc_active_peers{core="test",stream_id="live"} 3
// dragonfork_webrtc_active_peers{core="test",stream_id="cam"} 1
```

### Unit tests — `app/webrtc/metrics_test.go`

Reuse `handler_test.go` setup (fake registry, in-process Echo router). Hit each WHEP route, assert the corresponding counter and histogram have the expected increment via `testutil.ToFloat64`. Drive forced error paths (unknown stream → 404, codec-less SDP → 406, cap exceeded → 503, ICE timeout → 504) and assert the right error-bucket counters bumped.

### Integration verification — `test/TESTING.md`

New section "Verifying Prometheus metrics":

```
1. docker compose up -d
2. curl -s http://:8080/metrics | grep dragonfork_webrtc_
   - expect: up to 11 metric families, all with `core="dragonfork-truenas"`
     (labelled counters and histograms export no series until their first
     observation, so the full set appears only after traffic and step 5)
3. Open http://:3000 (Grafana), log in with GRAFANA_ADMIN_PASSWORD
4. Navigate to Dashboards → Dragon Fork → WebRTC Health
   - expect: all 5 rows render, no "no data" panels except where stream traffic is absent
5. Trigger one of each error in test/whep-player.html (intentional codec
   mismatch via SDP edit, kill the publisher mid-stream, etc.)
6. Watch the Grafana panels and verify counters tick within 15s.
```

### CI

Existing test runner picks up the new `_test.go` files. No new CI gates beyond standard build+test — observability isn't a contract; the unit tests verify shape only.

Grafana dashboard JSON is *not* validated in CI (no good lightweight validator); manual verification only.

### Load test alignment

The deferred 5-peer × 10-min load test (separate spec) will use this dashboard as its primary observation surface.
Recording rules for the load test's specific aggregations can be added in that spec without touching this one.

## Rollout

The TrueNAS v0.1.0-dragonfork deploy upgrades via:

```sh
cd deploy/truenas/core
git pull                      # latest main with this change
# Add new lines to .env (see template above)
docker compose pull           # grabs prom + grafana images
docker compose up -d          # core unchanged, prom + grafana new
```

Core continues on host networking. The new containers connect via `host.docker.internal:host-gateway`, no firewall changes required for intra-host traffic. External Grafana access is on `${GRAFANA_PORT}`.

### Backwards compatibility

- No upstream metric names or labels modified. New metrics are purely additive in `dragonfork_webrtc_*` namespace.
- No API changes. `/metrics` payload grows but stays well-formed Prometheus exposition.
- Existing config, env vars, and process JSON formats unchanged.

### Forward compatibility

- The `core` label being a `ConstLabels` value (not a per-event dimension) means future federated multi-Core scrapes will distinguish series cleanly by setting `core="dragonfork-truenas-east"` etc. in each deploy's config loader. Spec'd here, implemented when needed.
- New metrics in this spec follow the `dragonfork__` naming pattern. Future Dragon-Fork-specific metrics (WHIP, keyframe cache, bandwidth) should adopt the same convention.

### Known gaps post-rollout

- No paging. Alerts evaluate, no Alertmanager. If `WebRTCFFmpegLegFailure` fires at 3am, no notification — operator notices at next dashboard check. Acceptable for v0.2 single-operator deploy. Track as a v0.3 spec.
- Grafana dashboard JSON is hand-edited via Grafana UI then re-exported. No JSON-as-code library used. If dashboard maintenance gets painful, Grafonnet/Grafana-as-code is a v0.3+ refactor.
- `/metrics` itself is unauthenticated by default in v0.1 (matches upstream).
  If Core's deploy bundle is exposed to untrusted networks, the operator should already be using auth on Core's HTTP listener. Not this spec's problem to solve, but worth a one-line note in `deploy/truenas/core/README.md`.

## Open Decisions

1. **Should the `Stats()` method live on `Subsystem` or on `Handler`?** The peer count is in `Handler`'s per-stream peer index; stream count is in `Subsystem`'s registry; UDP port pool is in `portalloc`. Easiest shape: `Subsystem.Stats()` is the public surface and internally gathers from `Handler` (via the existing teardown-hook plumbing) and `portalloc`. Decide at implementation time based on which surface exposes the cleanest seams.
2. **Should histograms also include a `core` label, given it's already a `ConstLabels`?** Yes — `ConstLabels` is automatically present on every sample, no per-call overhead, and federations need it.
3. **Should Prometheus retention be configurable via `.env`?** Defaulting to 15d covers the realistic window for "what happened last week?" queries. Adding `PROM_RETENTION_DAYS=15d` to `.env` is a one-line change. Including it as optional, defaulting to 15d.
4. **Import-alias collision.** The local package is `package prometheus` (at `github.com/datarhei/core/v16/prometheus`) and `client_golang` is also `package prometheus`. Files in `app/webrtc/` that need both must alias one — convention is `coreprom "github.com/datarhei/core/v16/prometheus"`. Implementation note only; doesn't change the design.

## References

- [Prometheus client_golang](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus)
- [Prometheus instrumentation best practices](https://prometheus.io/docs/practices/instrumentation/)
- [Histogram bucket design](https://prometheus.io/docs/practices/histograms/)
- [Grafana provisioning docs](https://grafana.com/docs/grafana/latest/administration/provisioning/)
- v0.1 design: `docs/design/2026-04-16-datarhei-dragon-fork-webrtc-design.md`
- M2 integration: `docs/design/2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`