# Datarhei - Dragon Fork: WebRTC Prometheus Metrics
**Status:** Draft for review
**Author:** Zac (Wild Dragon)
**Date:** 2026-05-03
**Predecessors:**
- [`2026-04-16-datarhei-dragon-fork-webrtc-design.md`](2026-04-16-datarhei-dragon-fork-webrtc-design.md)
- [`2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`](2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md)
- v0.1.0-dragonfork released 2026-05-03
---
## Summary
Add Prometheus instrumentation to Dragon Fork's WebRTC subsystem and ship a
collection-and-dashboard stack in the existing TrueNAS deploy bundle. Closes
the v0.1 observability gap: the WHEP egress has been running in production
since 2026-04-17 with zero per-subsystem signal.
The deliverable is a RED-method dashboard ("rate, errors, duration") that
answers a single operator question — _is the WebRTC stack healthy right now?_
Eleven new metrics in the `dragonfork_webrtc_*` namespace, two new containers
(Prometheus + Grafana) in `deploy/truenas/core/`, four pre-loaded alert rules,
one pre-provisioned dashboard.
## Goals
- Operator can answer "is WebRTC healthy right now?" from a single Grafana
dashboard, without tailing logs or hitting the API.
- Per-stream drill-down available when the dashboard goes red — labels carry
`stream_id` everywhere it's meaningful, never `peer_id`.
- Deploy is one-command on a fresh TrueNAS box (`docker compose up -d`),
matching the existing v0.1 deploy ergonomics.
- Backwards-compatible: zero changes to upstream's `/metrics` payload. New
metrics are purely additive.
- Bucket choices and label sets are tuned for the realistic latency ranges
observed in v0.1 (server-hop p95 ≈ 240µs, ICE establishment seconds-scale).
## Non-Goals
- **Alertmanager bundling.** Alert rules are loaded into Prometheus but not
routed. Paging configuration is too opinionated to ship a default; separate
spec if/when paging is wanted.
- **Per-peer metric labels.** Peer-level forensics (individual session
lifetimes, per-resource teardown reasons) is out of scope. `peer_id` is
unbounded under churn and risks cardinality bloat.
- **Federated multi-Core scrape.** Single-deploy scrape config only. The
`core` label is set statically to `dragonfork-truenas`.
- **Latency p95 CI gate via Prometheus.** Server-hop latency stays a Go
test gate (`-tags latency`); not a Prometheus histogram.
- **Server-hop microsecond histogram.** The 240µs server-hop is well below
HTTP request scales and would need its own bucket set; it's already
covered by the latency CI test, no need to duplicate in Prom.
- **Custom monitor/metric bus integration.** Upstream pulls from
`monitor/metric.Reader`. We diverge — see Module Layout for rationale.
## Context
v0.1 surface area:
- WHEP HTTP routes: `POST /api/v3/whep/{id}`, `DELETE /api/v3/whep/{id}/{r}`,
`PATCH /api/v3/whep/{id}/{r}`, plus admin `GET /api/v3/webrtc/streams`
and `GET /api/v3/webrtc/streams/{id}/peers`.
- Error matrix in v0.1: `406` codec mismatch, `503` cap reached (split into
global vs per-stream in response body), `504` ICE timeout, `204` DELETE
idempotent, `404` unknown stream.
- Pion-mediated peer connection lifecycle in `app/webrtc/lifecycle.go`;
ICE state transitions are the natural hook for ICE timing/failure metrics.
- FFmpeg RTP output legs supervised by the existing process supervisor;
silent leg failure is a known "quietly degrading" risk worth instrumenting.
Existing Prometheus integration (upstream):
- `prometheus/prometheus.go` exposes a `Metrics` interface with `Register`
and an `HTTPHandler()`. Single shared `prometheus.Registry`.
- `prometheus/restream.go` is the reference collector — pulls from
`monitor/metric.Reader` via `metric.Pattern` queries, emits via
`prometheus.MustNewConstMetric`. All upstream collectors carry a `core`
label as the first dimension.
- `/metrics` endpoint already exposed by Core; auth handled at the same
layer as the rest of the API.
## Approach
**Hybrid instrumentation, in two surfaces:**
1. **Direct `prometheus/client_golang` instrumentation** in `app/webrtc/`
for hot-path counters and histograms (request rate, request duration,
ICE establishment duration, error counters by reason). Histograms can't
be reconstructed from a scrape-time snapshot, so this is non-negotiable
for RED-method.
2. **Snapshot-style collector** in `prometheus/webrtc.go` for slow-changing
gauges (active streams, active peers per stream, UDP port pool usage).
Calls a new `Stats()` method on the WebRTC subsystem at scrape time.
Both surfaces register against the same `prometheus.Registerer` exposed by
`prometheus.Metrics`. No new HTTP endpoint, no new auth path. Both take a
`core` first-label dimension to match upstream collector convention.
### Why not pure snapshot?
Upstream's `prometheus/restream.go` pulls from a `monitor/metric` bus that
the FFmpeg supervision layer writes into. We could mirror that for WebRTC
— have `app/webrtc/lifecycle.go` and `handler.go` push events onto the bus,
have `prometheus/webrtc.go` pull them. Two reasons not to:
- **Histograms don't fit the pattern.** The bus stores point-in-time values
(gauges and counters), not distributions. RED-method needs duration p50
and p95; you'd end up maintaining an in-process sliding-window quantile
estimator inside the WebRTC subsystem, which is more code than using
`client_golang`'s `prometheus.Histogram` directly.
- **The bus is FFmpeg-shaped.** `metric.Pattern` queries are designed for
process-state metrics (process IDs, FFmpeg states). Bolting WebRTC
semantics on requires defining new patterns the bus consumers all need
to know about, for a payload only the WebRTC collector cares about.
The hybrid keeps each metric type on the cleanest path. The cost is two
patterns in the codebase instead of one — accepted, with a comment in
`prometheus/webrtc.go` pointing at this rationale so the next contributor
doesn't try to "fix" the divergence.
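
To make the histogram argument concrete, here is a minimal stdlib-only sketch of cumulative buckets. `bucketHistogram` and its methods are hypothetical stand-ins, not `client_golang` types, and the quantile estimate only resembles what PromQL's `histogram_quantile` does. The point it illustrates is that `Observe` has to run at event time with the individual duration in hand, which a scrape-time snapshot never sees.

```go
package main

import "sort"

// bucketHistogram is a hypothetical, minimal stand-in for a Prometheus
// histogram: cumulative counts per upper bound plus an implicit +Inf bucket.
type bucketHistogram struct {
	bounds []float64 // sorted finite upper bounds, in seconds
	counts []uint64  // counts[i] = observations <= bounds[i]; last entry is +Inf
}

func newBucketHistogram(bounds []float64) *bucketHistogram {
	return &bucketHistogram{
		bounds: bounds,
		counts: make([]uint64, len(bounds)+1),
	}
}

// Observe increments every bucket whose upper bound is >= the observed value.
// It needs the individual duration, so it cannot be replayed from a snapshot.
func (h *bucketHistogram) Observe(seconds float64) {
	first := sort.SearchFloat64s(h.bounds, seconds) // first bound >= seconds
	for i := first; i < len(h.counts); i++ {
		h.counts[i]++
	}
}

// Quantile estimates q (0..1) by linear interpolation inside the bucket that
// contains the target rank.
func (h *bucketHistogram) Quantile(q float64) float64 {
	total := h.counts[len(h.counts)-1]
	if total == 0 {
		return 0
	}
	rank := q * float64(total)
	for i, upper := range h.bounds {
		if float64(h.counts[i]) >= rank {
			lower, below := 0.0, 0.0
			if i > 0 {
				lower, below = h.bounds[i-1], float64(h.counts[i-1])
			}
			inBucket := float64(h.counts[i]) - below
			if inBucket == 0 {
				return upper
			}
			return lower + (upper-lower)*(rank-below)/inBucket
		}
	}
	// Rank falls in the +Inf bucket: report the highest finite bound.
	return h.bounds[len(h.bounds)-1]
}
```

Four observations of 0.2s with the bucket set used in this spec all land in the (0.1, 0.25] bucket, so the estimated median is interpolated to about 0.175s rather than the true 0.2s. That bucket-resolution error is the usual trade-off of fixed-bucket histograms.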
### Why not pure direct?
Pure `client_golang` everywhere would mean the gauges (active streams,
active peers, UDP ports) sit in `app/webrtc/` alongside histograms. Workable,
but loses the "one collector file per subsystem in `prometheus/`" pattern
that anyone reading the repo's existing structure would expect. Snapshot
gauges are cheap to implement via the existing pattern, so we keep them
where a casual reader would look.
## Module Layout
### New files
```
app/webrtc/metrics.go (~150 LOC)
app/webrtc/metrics_test.go (~200 LOC)
prometheus/webrtc.go (~120 LOC)
prometheus/webrtc_test.go (~150 LOC)
deploy/truenas/core/prom/prometheus.yml
deploy/truenas/core/prom/rules/webrtc-alerts.yml
deploy/truenas/core/grafana/provisioning/datasources/prometheus.yml
deploy/truenas/core/grafana/provisioning/dashboards/webrtc.yml
deploy/truenas/core/grafana/dashboards/dragonfork-webrtc-health.json
```
### Modified files
```
app/webrtc/handler.go — add metric middleware around WHEP routes
app/webrtc/lifecycle.go — record ICE timing in OnConnectionStateChange
app/webrtc/subsystem.go — add Stats() method, instrument process hooks
deploy/truenas/core/docker-compose.yml — add prom + grafana services
deploy/truenas/core/README.md — document new env vars + ports
README.md — quick-start mentions Grafana URL
CHANGELOG.md — v0.2.0-dragonfork section
```
### `app/webrtc/metrics.go` — direct instrumentation
Metrics are registered through `promauto` into the shared registry and held
at package level so `handler.go` and `lifecycle.go` can increment them
without dependency injection. A single `Init(reg prometheus.Registerer, core string)`
is called from `subsystem.New` once the registry is available; the
`newMetrics` constructor sketched below is what it would invoke.
```go
// Sketch: exact wire format finalized at implementation.
package webrtc

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// histBuckets is shared by the WHEP request and ICE establishment histograms.
var histBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

type metrics struct {
	whepRequests        *prometheus.CounterVec   // route, code, stream_id
	whepRequestDuration *prometheus.HistogramVec // route, stream_id
	iceEstablishment    *prometheus.HistogramVec // stream_id, result
	iceFailures         *prometheus.CounterVec   // stream_id, reason
	codecMismatches     *prometheus.CounterVec   // stream_id, kind
	capRejections       *prometheus.CounterVec   // stream_id, scope
	ffmpegLegFailures   *prometheus.CounterVec   // stream_id, leg
}

func newMetrics(reg prometheus.Registerer, core string) *metrics {
	factory := promauto.With(reg)
	return &metrics{
		whepRequests: factory.NewCounterVec(prometheus.CounterOpts{
			Name:        "dragonfork_webrtc_whep_requests_total",
			Help:        "Count of WHEP requests by route, status code, and stream.",
			ConstLabels: prometheus.Labels{"core": core},
		}, []string{"route", "code", "stream_id"}),
		// ... etc
	}
}
```
The `core` label is a `ConstLabels` (set once at construction) rather than a
per-request dimension — matches the upstream collector pattern and avoids
threading it through every call site.
### `prometheus/webrtc.go` — snapshot collector
Standard `prometheus.Collector` interface (Describe / Collect). Keeps a
reference to a `WebRTCStatsSource` interface, which the WebRTC subsystem
implements via its `Stats()` method. Avoids importing `app/webrtc` from
`prometheus/` — the dependency arrow points the right way.
```go
// Sketch.
type WebRTCStatsSource interface {
	Stats() WebRTCStats
}

type WebRTCStats struct {
	StreamCount       int
	PeersByStream     map[string]int
	UDPPortsInUse     int
	UDPPortsAvailable int
}

type webrtcCollector struct {
	core                  string
	source                WebRTCStatsSource
	activeStreamsDesc     *prometheus.Desc
	activePeersDesc       *prometheus.Desc
	udpPortsInUseDesc     *prometheus.Desc
	udpPortsAvailableDesc *prometheus.Desc
}

func NewWebRTCCollector(core string, source WebRTCStatsSource) prometheus.Collector { ... }
```
The `WebRTCStats` type lives in `prometheus/webrtc.go` (not in `app/webrtc/`)
so the dependency stays one-directional: `app/webrtc` imports the Core
`prometheus` package to name the `Stats()` return type, and `prometheus/`
never imports `app/webrtc`.
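
The mapping from one `Stats()` snapshot to gauge samples can be sketched without any Prometheus dependency. The helper below is hypothetical: `renderGauges` builds text-exposition lines as strings purely for illustration, whereas the real collector would emit const metrics through the `prometheus.Collector` interface. `webRTCStats` copies the field names from the sketch above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// webRTCStats copies the shape of WebRTCStats from the collector sketch.
type webRTCStats struct {
	StreamCount       int
	PeersByStream     map[string]int
	UDPPortsInUse     int
	UDPPortsAvailable int
}

// renderGauges formats one snapshot as text-exposition sample lines
// (HELP and TYPE lines omitted for brevity).
func renderGauges(core string, s webRTCStats) string {
	var b strings.Builder
	fmt.Fprintf(&b, "dragonfork_webrtc_active_streams{core=%q} %d\n", core, s.StreamCount)

	// Sort stream IDs so repeated scrapes produce a stable ordering.
	ids := make([]string, 0, len(s.PeersByStream))
	for id := range s.PeersByStream {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		fmt.Fprintf(&b, "dragonfork_webrtc_active_peers{core=%q,stream_id=%q} %d\n",
			core, id, s.PeersByStream[id])
	}

	fmt.Fprintf(&b, "dragonfork_webrtc_udp_ports_in_use{core=%q} %d\n", core, s.UDPPortsInUse)
	fmt.Fprintf(&b, "dragonfork_webrtc_udp_ports_available{core=%q} %d\n", core, s.UDPPortsAvailable)
	return b.String()
}
```

Only `active_peers` fans out per stream; the other three gauges are a single series each, which is why the snapshot side contributes so little to the cardinality budget.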
### `app/webrtc/subsystem.go` — `Stats()` method
```go
// coreprom aliases github.com/datarhei/core/v16/prometheus (see Open Decisions).
func (s *Subsystem) Stats() coreprom.WebRTCStats {
	s.mu.Lock()
	defer s.mu.Unlock()

	peers := make(map[string]int, len(s.streams))
	for id, st := range s.streams {
		peers[id] = len(st.peers) // assume peers tracked per-stream
	}
	return coreprom.WebRTCStats{
		StreamCount:       len(s.streams),
		PeersByStream:     peers,
		UDPPortsInUse:     s.portAlloc.InUse(),
		UDPPortsAvailable: s.portAlloc.Available(),
	}
}
```
The existing subsystem tracks streams in `s.streams` under `s.mu`. Peer
count per stream needs the per-stream peer index that already exists in
`handler.go` — the `Stats()` method consults it via the existing teardown
hook plumbing or a small new accessor on `Handler`. Pick whichever surface
introduces the smaller blast radius at implementation time.
## Metric Inventory
Eleven metrics. Eight new label dimensions across them. From roughly 50 to
under 500 active series across the typical 1-5 stream scale (see Cardinality
budget for the worst case).
### Direct instrumentation (`app/webrtc/metrics.go`)
| Name | Type | Labels | Description |
|---|---|---|---|
| `dragonfork_webrtc_whep_requests_total` | Counter | core, route, code, stream_id | Count of WHEP requests by route+status code. |
| `dragonfork_webrtc_whep_request_duration_seconds` | Histogram | core, route, stream_id | Server-side WHEP request duration. Buckets: `[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]`. |
| `dragonfork_webrtc_ice_establishment_duration_seconds` | Histogram | core, stream_id, result | Time from `SetLocalDescription` to first `connected` or `failed` ICE state. Same buckets. |
| `dragonfork_webrtc_ice_failures_total` | Counter | core, stream_id, reason | ICE failure count. `reason` ∈ {timeout, disconnected, failed}. |
| `dragonfork_webrtc_codec_mismatches_total` | Counter | core, stream_id, kind | 406 rejections by kind. `kind` ∈ {video, audio}. |
| `dragonfork_webrtc_cap_rejections_total` | Counter | core, stream_id, scope | 503 rejections. `scope` ∈ {global, stream}. |
| `dragonfork_webrtc_ffmpeg_leg_failures_total` | Counter | core, stream_id, leg | RTP output leg failures. `leg` ∈ {video, audio}. |
### Snapshot collector (`prometheus/webrtc.go`)
| Name | Type | Labels | Description |
|---|---|---|---|
| `dragonfork_webrtc_active_streams` | Gauge | core | Streams currently registered (processes with `webrtc.enabled=true` running). |
| `dragonfork_webrtc_active_peers` | Gauge | core, stream_id | Currently subscribed WHEP peers per stream. |
| `dragonfork_webrtc_udp_ports_in_use` | Gauge | core | UDP ports currently allocated from the pool. |
| `dragonfork_webrtc_udp_ports_available` | Gauge | core | Pool size minus in-use (explicit for alert friendliness). |
### Label rationale
- `whep_request_duration_seconds` deliberately omits `code` — separating
distributions per outcome makes p95 noisy, and per-route per-stream p95
is what an operator actually looks at. Errors get visibility through the
request-counter ratio.
- `ice_establishment_duration_seconds` includes both `connected` and
`failed` results in the same histogram via the `result` label —
intentionally — so the dashboard can compare success latency to
failure-tail latency on the same axis.
- `cap_rejections_total` keeps the `scope` label because v0.1's response
body already splits global vs per-stream rejections; metrics mirror that
distinction so the dashboard shows whether to raise `max_peers_total`
or just one stream's per-stream cap.
- `ffmpeg_leg_failures_total` is the "quietly degrading" canary — a silent
RTP-output-leg failure (port bind, encoder crash) is exactly what the
"is it healthy?" framing is meant to catch.
### Cardinality budget
At typical scale (5 streams, 3 routes, ~6 status codes seen in practice):
- `whep_requests_total`: 5 × 3 × 6 = 90 series (worst case)
- `whep_request_duration_seconds`: 5 × 3 × (8 buckets + `+Inf` + sum + count) = 165 series
- `ice_establishment_duration_seconds`: 5 × 2 × 11 = 110 series
- All others: at most 15 series each
- **Total: <500 active series at 5-stream sustained load**
Well within Prometheus's comfort zone. At 15s scrape interval × 15-day
retention, on-disk storage ~80MB.
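
The budget arithmetic can be reproduced with a few lines of Go. This is a sketch under the planning assumptions stated above (5 streams, 3 routes, about 6 status codes, 8 configured buckets); the type and function names are hypothetical. It counts the implicit `+Inf` bucket that Prometheus histograms export alongside the configured ones, so its histogram figures come out slightly higher than a count of the eight configured buckets alone.

```go
package main

// histogramSeries returns the series one histogram exports per label
// combination: one per finite bucket, plus +Inf, _sum, and _count.
func histogramSeries(finiteBuckets int) int {
	return finiteBuckets + 3
}

// budget holds the planning inputs; these are assumptions, not measurements.
type budget struct {
	streams, routes, codes, buckets int
}

// whepRequests is the worst-case series count for the request counter.
func (b budget) whepRequests() int {
	return b.streams * b.routes * b.codes
}

// whepDuration is the series count for the request-duration histogram.
func (b budget) whepDuration() int {
	return b.streams * b.routes * histogramSeries(b.buckets)
}

// iceEstablishment assumes two result values (connected, failed).
func (b budget) iceEstablishment() int {
	return b.streams * 2 * histogramSeries(b.buckets)
}

// samplesRetained returns how many samples the TSDB holds at steady state
// for a given series count, scrape interval, and retention window.
func samplesRetained(series, scrapeSeconds, retentionDays int) int {
	return series * (retentionDays * 24 * 60 * 60 / scrapeSeconds)
}
```

With the inputs above this gives 90, 165, and 110 series for the three largest families. At a 500-series ceiling, a 15s scrape, and 15 days of retention, the TSDB holds 43.2 million samples; at the 1-2 bytes per sample that the Prometheus storage documentation cites as typical, that is roughly 43-86MB, in line with the ~80MB estimate.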
### Specifically excluded metrics
- **Per-peer session metrics.** Listed under non-goals.
- **Bytes-out / bandwidth.** Pion exposes RTP write bytes via stats; would
be useful but pulls peer-level state. Defer to a future v0.3 spec
("WebRTC bandwidth observability") if needed.
- **Server-hop latency (FFmpeg → peer).** Microsecond scale, already
covered by `-tags latency` test gate, would need its own bucket set.
## Deploy Bundle
### `deploy/truenas/core/docker-compose.yml` additions
Two new services on a new bridge network `dragonfork-mon`. Core continues
on `network_mode: host` unchanged. The new containers reach Core via
`host.docker.internal:${CORE_HTTP_PORT}` (Linux Docker resolves this when
`extra_hosts: ["host.docker.internal:host-gateway"]` is set on the service).
```yaml
services:
  core:
    # ... existing definition unchanged
  prom:
    image: prom/prometheus:v2.55.0
    container_name: dragonfork-prom
    restart: unless-stopped
    networks: [dragonfork-mon]
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./prom/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prom/rules:/etc/prometheus/rules:ro
      - ./prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
    ports:
      - "${PROM_PORT:-9090}:9090"
  grafana:
    image: grafana/grafana-oss:11.3.0
    container_name: dragonfork-grafana
    restart: unless-stopped
    networks: [dragonfork-mon]
    depends_on: [prom]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:?set in .env}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
      - ./grafana-data:/var/lib/grafana
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
networks:
  dragonfork-mon:
    driver: bridge
```
### `prom/prometheus.yml`
```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    core: dragonfork-truenas
rule_files:
  - /etc/prometheus/rules/*.yml
scrape_configs:
  - job_name: dragonfork-core
    static_configs:
      - targets: ["host.docker.internal:8080"]
    metrics_path: /metrics
    # If API auth is enabled on /metrics, uncomment and provide creds via
    # env-substituted file. v0.1 leaves /metrics public by default.
    # basic_auth:
    #   username_file: /run/secrets/prom_basic_user
    #   password_file: /run/secrets/prom_basic_pass
```
### `prom/rules/webrtc-alerts.yml`
```yaml
groups:
  - name: dragonfork-webrtc
    rules:
      - alert: WebRTCWHEPErrorRateHigh
        expr: |
          sum by (stream_id) (
            rate(dragonfork_webrtc_whep_requests_total{code=~"4..|5.."}[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WHEP error rate high on stream {{ $labels.stream_id }}"
          description: "Sustained 4xx/5xx rate >0.5/sec for 5m."
      - alert: WebRTCICEEstablishmentSlow
        expr: |
          histogram_quantile(0.95,
            sum by (le, stream_id) (
              rate(dragonfork_webrtc_ice_establishment_duration_seconds_bucket[10m])
            )
          ) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ICE establishment p95 >3s on {{ $labels.stream_id }}"
      - alert: WebRTCICEFailureRateHigh
        expr: |
          sum by (stream_id) (rate(dragonfork_webrtc_ice_failures_total[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ICE failures sustained on {{ $labels.stream_id }}"
      - alert: WebRTCFFmpegLegFailure
        expr: |
          increase(dragonfork_webrtc_ffmpeg_leg_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "FFmpeg RTP leg failed on {{ $labels.stream_id }} ({{ $labels.leg }})"
          description: "Silent degradation of RTP output. Check FFmpeg logs."
```
Alerts evaluate but route nowhere. Alertmanager bundling deferred — see
non-goals.
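
For readers less familiar with PromQL, the threshold checks in these rules boil down to a per-second rate over a window of counter samples. The sketch below approximates that in plain Go; it is not how Prometheus implements `rate()`, which additionally extrapolates to the window edges. `sample`, `perSecondRate`, and `errorRateHigh` are hypothetical names.

```go
package main

// sample is one scraped counter value at a Unix timestamp in seconds.
type sample struct {
	unix  float64
	value float64
}

// perSecondRate approximates rate() over a window: the total increase across
// the samples divided by the elapsed time. A drop in value is treated as a
// counter reset (for example a Core restart), in which case the post-reset
// value counts as new increase.
func perSecondRate(window []sample) float64 {
	if len(window) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(window); i++ {
		delta := window[i].value - window[i-1].value
		if delta < 0 { // counter reset
			delta = window[i].value
		}
		increase += delta
	}
	elapsed := window[len(window)-1].unix - window[0].unix
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

// errorRateHigh applies the 0.5 errors/sec threshold used by
// WebRTCWHEPErrorRateHigh to a single window.
func errorRateHigh(window []sample) bool {
	return perSecondRate(window) > 0.5
}
```

Forty errors over 60 seconds (about 0.67/sec) crosses the threshold; thirty over the same window (exactly 0.5/sec) does not, because the rule uses a strict `>`. The alert's `for: 5m` clause then requires the condition to hold continuously before firing.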
### Grafana provisioning
Datasource provisioning at `grafana/provisioning/datasources/prometheus.yml`:
```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prom:9090
    isDefault: true
    editable: false
```
Dashboard provisioning at `grafana/provisioning/dashboards/webrtc.yml`:
```yaml
apiVersion: 1
providers:
  - name: dragonfork
    orgId: 1
    folder: "Dragon Fork"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```
### Dashboard JSON: `dragonfork-webrtc-health.json`
Single dashboard, five rows aligned to the questions from the metric
inventory:
1. **WHEP API health**: request rate by route (stat panel), error rate
stacked by code (timeseries), p95 request duration by route (timeseries).
2. **ICE establishment**: success/failure rate (gauge), p50/p95
establishment duration (timeseries with a 3s threshold line for the
alert), failure breakdown by reason (table).
3. **What's flowing**: `active_streams` (stat), `active_peers` per stream
(timeseries), top 5 streams by peer count (table).
4. **Capacity headroom**: `udp_ports_available` (gauge with red-zone <10),
cap rejection rate by scope (timeseries).
5. **Silent degradation**: FFmpeg leg failure timeline (timeseries with
annotations), codec mismatch counter (stat).
Built in Grafana 11.3, exported as JSON, committed to the repo. Refresh
default 30s.
### `.env` template additions
Append to `deploy/truenas/core/README.md`'s example `.env`:
```sh
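# --- Observability (added in v0.2) ---
# Compose does not run shell commands inside .env. Generate a value with
# `openssl rand -base64 24` and paste the result below.
GRAFANA_ADMIN_PASSWORD=replace-with-generated-value
GRAFANA_PORT=3000
PROM_PORT=9090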
```
## Testing
### Unit tests — `prometheus/webrtc_test.go`
Mock `WebRTCStatsSource`. Drive the collector through three states (no
streams, one stream with N peers, multiple streams). Use
`testutil.CollectAndCompare` to assert exact metric/label/value output
against a golden plaintext fixture.
```go
// Golden fixture (excerpt):
// # HELP dragonfork_webrtc_active_streams ...
// # TYPE dragonfork_webrtc_active_streams gauge
// dragonfork_webrtc_active_streams{core="test"} 2
// # HELP dragonfork_webrtc_active_peers ...
// # TYPE dragonfork_webrtc_active_peers gauge
// dragonfork_webrtc_active_peers{core="test",stream_id="live"} 3
// dragonfork_webrtc_active_peers{core="test",stream_id="cam"} 1
```
### Unit tests — `app/webrtc/metrics_test.go`
Reuse `handler_test.go` setup (fake registry, in-process Echo router).
Hit each WHEP route and assert the corresponding counter has the expected
increment via `testutil.ToFloat64`. `ToFloat64` accepts only a single
counter, gauge, or untyped metric and panics otherwise, so histograms are
checked by gathering from the registry and comparing the sample count.
Drive forced error paths (unknown stream → 404, codec-less SDP → 406, cap
exceeded → 503, ICE timeout → 504) and assert the right error-bucket
counters bumped.
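
The forced error paths lend themselves to a table-driven test. The sketch below shows only the table and lookup, not the assertions against a registry. The pairing of each status with a dedicated counter is an assumption drawn from the metric inventory; in particular, mapping 504 to `dragonfork_webrtc_ice_failures_total` is inferred rather than stated in this spec. `errorCase` and `dedicatedCounter` are hypothetical names.

```go
package main

import "net/http"

// errorCase describes one forced error path and the dedicated counter that
// is expected to move in addition to dragonfork_webrtc_whep_requests_total.
type errorCase struct {
	name    string
	status  int
	counter string // empty when only the request counter is expected to move
}

// errorCases lists the four paths named in the test plan. The counter
// pairings are assumptions for illustration.
var errorCases = []errorCase{
	{"unknown stream", http.StatusNotFound, ""},
	{"codec-less SDP", http.StatusNotAcceptable, "dragonfork_webrtc_codec_mismatches_total"},
	{"cap exceeded", http.StatusServiceUnavailable, "dragonfork_webrtc_cap_rejections_total"},
	{"ICE timeout", http.StatusGatewayTimeout, "dragonfork_webrtc_ice_failures_total"},
}

// dedicatedCounter looks up which extra counter a status code should bump.
// ok is false when the status has no dedicated counter.
func dedicatedCounter(status int) (name string, ok bool) {
	for _, c := range errorCases {
		if c.status == status && c.counter != "" {
			return c.counter, true
		}
	}
	return "", false
}
```

A test would range over `errorCases`, trigger each path against the in-process router, and compare counter values before and after.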
### Integration verification — `test/TESTING.md`
New section "Verifying Prometheus metrics":
```
1. docker compose up -d
2. curl -s http://<host>:8080/metrics | grep dragonfork_webrtc_
   - expect: unlabeled gauges present immediately, with `core="dragonfork-truenas"`
   - labeled families appear only after their first observation (unless
     pre-initialized), so all 11 metric families are present once step 5 has run
3. Open http://<host>:3000 (Grafana), log in with GRAFANA_ADMIN_PASSWORD
4. Navigate to Dashboards → Dragon Fork → WebRTC Health
   - expect: all 5 rows render, no "no data" panels except where stream traffic is absent
5. Trigger one of each error in test/whep-player.html (intentional codec
   mismatch via SDP edit, kill the publisher mid-stream, etc.)
6. Watch the Grafana panels and verify counters tick within 15s.
```
### CI
Existing test runner picks up the new `_test.go` files. No new CI gates
beyond standard build+test — observability isn't a contract; the unit
tests verify shape only. Grafana dashboard JSON is *not* validated in CI
(no good lightweight validator); manual verification only.
### Load test alignment
The deferred 5-peer × 10-min load test (separate spec) will use this
dashboard as its primary observation surface. Recording rules for the
load test's specific aggregations can be added in that spec without
touching this one.
## Rollout
The TrueNAS v0.1.0-dragonfork deploy upgrades via:
```sh
cd deploy/truenas/core
git pull # latest main with this change
# Add new lines to .env (see template above)
docker compose pull # grabs prom + grafana images
docker compose up -d # core unchanged, prom + grafana new
```
Core continues on host networking. The new containers reach it through the
`host.docker.internal` name, which `extra_hosts` maps to `host-gateway`, so
no firewall changes are required for intra-host traffic. External Grafana
access is on `${GRAFANA_PORT}`.
### Backwards compatibility
- No upstream metric names or labels modified. New metrics are purely
additive in `dragonfork_webrtc_*` namespace.
- No API changes. `/metrics` payload grows but stays well-formed
Prometheus exposition.
- Existing config, env vars, and process JSON formats unchanged.
### Forward compatibility
- The `core` label being a `ConstLabels` value (not a per-event dimension)
means future federated multi-Core scrapes will distinguish series cleanly
by setting `core="dragonfork-truenas-east"` etc. in each deploy's config
loader. Spec'd here, implemented when needed.
- New metrics in this spec follow the `dragonfork_<subsystem>_<noun>` naming
pattern. Future Dragon-Fork-specific metrics (WHIP, keyframe cache,
bandwidth) should adopt the same convention.
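
The naming convention could be enforced with a small lint-style check. This is a sketch only: the regular expression and the rule that counters (and only counters) end in `_total` are assumptions inferred from the metric inventory, and `checkMetricName` is a hypothetical helper.

```go
package main

import (
	"regexp"
	"strings"
)

// metricNamePattern encodes dragonfork_<subsystem>_<noun>: the fixed prefix,
// a subsystem segment, and one or more further lowercase segments.
var metricNamePattern = regexp.MustCompile(`^dragonfork_[a-z][a-z0-9]*(_[a-z][a-z0-9]*)+$`)

// checkMetricName reports whether name follows the convention for the given
// kind ("counter", "gauge", or "histogram"). Counters must end in _total and
// other kinds must not.
func checkMetricName(name, kind string) bool {
	if !metricNamePattern.MatchString(name) {
		return false
	}
	isTotal := strings.HasSuffix(name, "_total")
	return isTotal == (kind == "counter")
}
```

Running such a check over the names registered in a test registry would catch drift as WHIP or bandwidth metrics are added.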
### Known gaps post-rollout
- No paging. Alerts evaluate, no Alertmanager. If `WebRTCFFmpegLegFailure`
fires at 3am, no notification — operator notices at next dashboard check.
Acceptable for v0.2 single-operator deploy. Track as a v0.3 spec.
- Grafana dashboard JSON is hand-edited via Grafana UI then re-exported.
No JSON-as-code library used. If dashboard maintenance gets painful,
Grafonnet/Grafana-as-code is a v0.3+ refactor.
- `/metrics` itself is unauthenticated by default in v0.1 (matches
upstream). If Core's deploy bundle is exposed to untrusted networks,
the operator should already be using auth on Core's HTTP listener. Not
this spec's problem to solve, but worth a one-line note in
`deploy/truenas/core/README.md`.
## Open Decisions
1. **Should the `Stats()` method live on `Subsystem` or on `Handler`?**
The peer count is in `Handler`'s per-stream peer index; stream count
is in `Subsystem`'s registry; UDP port pool is in `portalloc`. Easiest
shape: `Subsystem.Stats()` is the public surface and internally
gathers from `Handler` (via the existing teardown-hook plumbing) and
`portalloc`. Decide at implementation time based on which surface
exposes the cleanest seams.
2. **Should histograms also include a `core` label, given it's already a
`ConstLabels`?** Yes — `ConstLabels` is automatically present on every
sample, no per-call overhead, and federations need it.
3. **Should Prometheus retention be configurable via `.env`?** Defaulting
to 15d covers the realistic window for "what happened last week?"
queries. Adding `PROM_RETENTION_DAYS=15d` to `.env` is a one-line
change. Decision: include it as an optional variable that defaults to 15d.
4. **Import-alias collision.** The local package is `package prometheus`
(at `github.com/datarhei/core/v16/prometheus`) and `client_golang` is
also `package prometheus`. Files in `app/webrtc/` that need both must
alias one — convention is `coreprom "github.com/datarhei/core/v16/prometheus"`.
Implementation note only; doesn't change the design.
## References
- [Prometheus client_golang](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus)
- [Prometheus instrumentation best practices](https://prometheus.io/docs/practices/instrumentation/)
- [Histogram bucket design](https://prometheus.io/docs/practices/histograms/)
- [Grafana provisioning docs](https://grafana.com/docs/grafana/latest/administration/provisioning/)
- v0.1 design: `docs/design/2026-04-16-datarhei-dragon-fork-webrtc-design.md`
- M2 integration: `docs/design/2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`