docs(design): WebRTC Prometheus metrics + Grafana stack design
Closes the v0.1 observability gap. Eleven new metrics in the dragonfork_webrtc_* namespace (RED-method on the WHEP surface plus state gauges from the WebRTC subsystem), Prom + Grafana containers added to deploy/truenas/core/, four pre-loaded alert rules, one pre-provisioned dashboard. Hybrid instrumentation: direct client_golang in app/webrtc/ for hot-path counters and histograms; snapshot collector in prometheus/webrtc.go for slow-changing gauges. Rationale and trade-offs against the upstream monitor/metric bus pattern documented in the Approach section. Targets v0.2.0-dragonfork.
parent 75afcbc0d1
commit 949daa26b5
1 changed file with 666 additions and 0 deletions

# Datarhei - Dragon Fork: WebRTC Prometheus Metrics

**Status:** Draft for review
**Author:** Zac (Wild Dragon)
**Date:** 2026-05-03
**Predecessors:**
- [`2026-04-16-datarhei-dragon-fork-webrtc-design.md`](2026-04-16-datarhei-dragon-fork-webrtc-design.md)
- [`2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`](2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md)
- v0.1.0-dragonfork released 2026-05-03

---

## Summary

Add Prometheus instrumentation to Dragon Fork's WebRTC subsystem and ship a
collection-and-dashboard stack in the existing TrueNAS deploy bundle. Closes
the v0.1 observability gap: the WHEP egress has been running in production
since 2026-04-17 with zero per-subsystem signal.

The deliverable is a RED-method dashboard ("rate, errors, duration") that
answers a single operator question — _is the WebRTC stack healthy right now?_
Eleven new metrics in the `dragonfork_webrtc_*` namespace, two new containers
(Prometheus + Grafana) in `deploy/truenas/core/`, four pre-loaded alert rules,
one pre-provisioned dashboard.

## Goals

- Operator can answer "is WebRTC healthy right now?" from a single Grafana
  dashboard, without tailing logs or hitting the API.
- Per-stream drill-down available when the dashboard goes red — labels carry
  `stream_id` everywhere it's meaningful, never `peer_id`.
- Deploy is one-command on a fresh TrueNAS box (`docker compose up -d`),
  matching the existing v0.1 deploy ergonomics.
- Backwards-compatible: zero changes to upstream's `/metrics` payload. New
  metrics are purely additive.
- Bucket choices and label sets are tuned for the realistic latency ranges
  observed in v0.1 (server-hop p95 ≈ 240µs, ICE establishment seconds-scale).

## Non-Goals

- **Alertmanager bundling.** Alert rules are loaded into Prometheus but not
  routed. Paging configuration is too opinionated to ship a default; separate
  spec if/when paging is wanted.
- **Per-peer metric labels.** Peer-level forensics (individual session
  lifetimes, per-resource teardown reasons) is out of scope. `peer_id` is
  unbounded under churn and risks cardinality bloat.
- **Federated multi-Core scrape.** Single-deploy scrape config only. The
  `core` label is set statically to `dragonfork-truenas`.
- **Latency p95 CI gate via Prometheus.** Server-hop latency stays a Go
  test gate (`-tags latency`); not a Prometheus histogram.
- **Server-hop microsecond histogram.** The 240µs server-hop is well below
  HTTP request scales and would need its own bucket set; it's already
  covered by the latency CI test, no need to duplicate in Prom.
- **Custom monitor/metric bus integration.** Upstream pulls from
  `monitor/metric.Reader`. We diverge; see Approach ("Why not pure
  snapshot?") for rationale.

## Context

v0.1 surface area:

- WHEP HTTP routes: `POST /api/v3/whep/{id}`, `DELETE /api/v3/whep/{id}/{r}`,
  `PATCH /api/v3/whep/{id}/{r}`, plus admin `GET /api/v3/webrtc/streams`
  and `GET /api/v3/webrtc/streams/{id}/peers`.
- Error matrix in v0.1: `406` codec mismatch, `503` cap reached (split into
  global vs per-stream in response body), `504` ICE timeout, `204` DELETE
  idempotent, `404` unknown stream.
- Pion-mediated peer connection lifecycle in `app/webrtc/lifecycle.go` —
  ICE state transitions are the natural hook for ICE timing/failure metrics.
- FFmpeg RTP output legs supervised by the existing process supervisor;
  silent leg failure is a known "quietly degrading" risk worth instrumenting.

Existing Prometheus integration (upstream):

- `prometheus/prometheus.go` exposes a `Metrics` interface with `Register`
  and an `HTTPHandler()`. Single shared `prometheus.Registry`.
- `prometheus/restream.go` is the reference collector — pulls from
  `monitor/metric.Reader` via `metric.Pattern` queries, emits via
  `prometheus.MustNewConstMetric`. All upstream collectors carry a `core`
  label as the first dimension.
- `/metrics` endpoint already exposed by Core; auth handled at the same
  layer as the rest of the API.

## Approach

**Hybrid instrumentation, in two surfaces:**

1. **Direct `prometheus/client_golang` instrumentation** in `app/webrtc/`
   for hot-path counters and histograms (request rate, request duration,
   ICE establishment duration, error counters by reason). Histograms can't
   be reconstructed from a scrape-time snapshot, so this is non-negotiable
   for RED-method.

2. **Snapshot-style collector** in `prometheus/webrtc.go` for slow-changing
   gauges (active streams, active peers per stream, UDP port pool usage).
   Calls a new `Stats()` method on the WebRTC subsystem at scrape time.

Both surfaces register against the same `prometheus.Registerer` exposed by
`prometheus.Metrics`. No new HTTP endpoint, no new auth path. Both take a
`core` first-label dimension to match upstream collector convention.

### Why not pure snapshot?

Upstream's `prometheus/restream.go` pulls from a `monitor/metric` bus that
the FFmpeg supervision layer writes into. We could mirror that for WebRTC
— have `app/webrtc/lifecycle.go` and `handler.go` push events onto the bus,
have `prometheus/webrtc.go` pull them. Two reasons not to:

- **Histograms don't fit the pattern.** The bus stores point-in-time values
  (gauges and counters), not distributions. RED-method needs duration p50
  and p95; you'd end up maintaining an in-process sliding-window quantile
  estimator inside the WebRTC subsystem, which is more code than just using
  `client_golang`'s `Histogram` directly.
- **The bus is FFmpeg-shaped.** `metric.Pattern` queries are designed for
  process-state metrics (process IDs, FFmpeg states). Bolting WebRTC
  semantics on requires defining new patterns the bus consumers all need
  to know about, for a payload only the WebRTC collector cares about.

The hybrid keeps each metric type on the cleanest path. The cost is two
patterns in the codebase instead of one — accepted, with a comment in
`prometheus/webrtc.go` pointing at this rationale so the next contributor
doesn't try to "fix" the divergence.

### Why not pure direct?

Pure `client_golang` everywhere would mean the gauges (active streams,
active peers, UDP ports) sit in `app/webrtc/` alongside histograms. Workable,
but loses the "one collector file per subsystem in `prometheus/`" pattern
that anyone reading the repo's existing structure would expect. Snapshot
gauges are cheap to implement via the existing pattern, so we keep them
where a casual reader would look.

## Module Layout

### New files

```
app/webrtc/metrics.go                      (~150 LOC)
app/webrtc/metrics_test.go                 (~200 LOC)
prometheus/webrtc.go                       (~120 LOC)
prometheus/webrtc_test.go                  (~150 LOC)
deploy/truenas/core/prom/prometheus.yml
deploy/truenas/core/prom/rules/webrtc-alerts.yml
deploy/truenas/core/grafana/provisioning/datasources/prometheus.yml
deploy/truenas/core/grafana/provisioning/dashboards/webrtc.yml
deploy/truenas/core/grafana/dashboards/dragonfork-webrtc-health.json
```

### Modified files

```
app/webrtc/handler.go                      — add metric middleware around WHEP routes
app/webrtc/lifecycle.go                    — record ICE timing in OnConnectionStateChange
app/webrtc/subsystem.go                    — add Stats() method, instrument process hooks
deploy/truenas/core/docker-compose.yml     — add prom + grafana services
deploy/truenas/core/README.md              — document new env vars + ports
README.md                                  — quick-start mentions Grafana URL
CHANGELOG.md                               — v0.2.0-dragonfork section
```

### `app/webrtc/metrics.go` — direct instrumentation

Metrics are registered into the shared registry via `promauto` and exposed
as package-level vars so `handler.go` and `lifecycle.go` can increment them
without dependency injection. A single
`Init(reg prometheus.Registerer, core string)` is called from
`subsystem.New` once the registry is available.

```go
// Sketch — exact wire format finalized at implementation.
package webrtc

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Bucket upper bounds in seconds, shared by both duration histograms
// (see Metric Inventory).
var histBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// Trailing comments list each vector's variable labels; `core` is added
// to every metric as a ConstLabel.
type metrics struct {
	whepRequests        *prometheus.CounterVec   // route, code, stream_id
	whepRequestDuration *prometheus.HistogramVec // route, stream_id
	iceEstablishment    *prometheus.HistogramVec // stream_id, result
	iceFailures         *prometheus.CounterVec   // stream_id, reason
	codecMismatches     *prometheus.CounterVec   // stream_id, kind
	capRejections       *prometheus.CounterVec   // stream_id, scope
	ffmpegLegFailures   *prometheus.CounterVec   // stream_id, leg
}

func newMetrics(reg prometheus.Registerer, core string) *metrics {
	factory := promauto.With(reg)
	return &metrics{
		whepRequests: factory.NewCounterVec(prometheus.CounterOpts{
			Name:        "dragonfork_webrtc_whep_requests_total",
			Help:        "Count of WHEP requests by route, status code, and stream.",
			ConstLabels: prometheus.Labels{"core": core},
		}, []string{"route", "code", "stream_id"}),
		// ... etc
	}
}
```

The `core` label is a `ConstLabels` (set once at construction) rather than a
per-request dimension — matches the upstream collector pattern and avoids
threading it through every call site.

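
The middleware that `handler.go` gains could look roughly like the sketch below. It uses only the standard library so it runs on its own: the in-memory `recorder` stands in for the `whepRequests` and `whepRequestDuration` vectors, and plain `net/http` stands in for the Echo router the real handlers use. `statusRecorder`, `recorder`, `instrument`, `simulate`, and the `whep_post` route name are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
	"time"
)

// statusRecorder remembers the status code a handler writes so the
// middleware can use it as the `code` label. Hypothetical helper.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// recorder is an in-memory stand-in for the CounterVec/HistogramVec pair.
type recorder struct {
	mu        sync.Mutex
	requests  map[string]int       // key: route|code|stream_id
	durations map[string][]float64 // key: route|stream_id (no code label)
}

func newRecorder() *recorder {
	return &recorder{requests: map[string]int{}, durations: map[string][]float64{}}
}

func (m *recorder) observe(route, code, streamID string, seconds float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.requests[route+"|"+code+"|"+streamID]++
	key := route + "|" + streamID
	m.durations[key] = append(m.durations[key], seconds)
}

// instrument wraps next and records one counter increment and one duration
// observation per request. route is a fixed template name, never the raw
// URL path, so the label set stays bounded.
func instrument(m *recorder, route string, streamID func(*http.Request) string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r)
		m.observe(route, strconv.Itoa(rec.code), streamID(r), time.Since(start).Seconds())
	})
}

// simulate sends one request through the middleware to a handler that
// replies with the given status code, and returns what was recorded.
func simulate(code int) *recorder {
	m := newRecorder()
	h := instrument(m, "whep_post",
		func(*http.Request) string { return "live" },
		http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { w.WriteHeader(code) }))
	req := httptest.NewRequest(http.MethodPost, "/api/v3/whep/live", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
	return m
}

func main() {
	m := simulate(http.StatusNotAcceptable) // a 406 codec mismatch
	fmt.Println(m.requests["whep_post|406|live"], len(m.durations["whep_post|live"]))
}
```

Recording the duration under a key without `code` mirrors the label choice for `whep_request_duration_seconds`: one distribution per route and stream, with outcomes visible only through the counter.
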
### `prometheus/webrtc.go` — snapshot collector

Standard `prometheus.Collector` interface (Describe / Collect). Keeps a
reference to a `WebRTCStatsSource` interface, which the WebRTC subsystem
implements via its `Stats()` method. Avoids importing `app/webrtc` from
`prometheus/` — the dependency arrow points the right way.

```go
// Sketch.
type WebRTCStatsSource interface {
	Stats() WebRTCStats
}

type WebRTCStats struct {
	StreamCount       int
	PeersByStream     map[string]int
	UDPPortsInUse     int
	UDPPortsAvailable int
}

type webrtcCollector struct {
	core   string
	source WebRTCStatsSource

	activeStreamsDesc     *prometheus.Desc
	activePeersDesc       *prometheus.Desc
	udpPortsInUseDesc     *prometheus.Desc
	udpPortsAvailableDesc *prometheus.Desc
}

func NewWebRTCCollector(core string, source WebRTCStatsSource) prometheus.Collector { ... }
```

The `WebRTCStats` type lives in `prometheus/webrtc.go` (not in `app/webrtc/`)
so the dependency stays one-directional. The subsystem satisfies the
interface implicitly, without declaring it. `app/webrtc` does import
`prometheus/` for the `WebRTCStats` return type (under the `coreprom` alias,
see Open Decisions), but `prometheus/` never imports `app/webrtc`.

### `app/webrtc/subsystem.go` — `Stats()` method

```go
// coreprom aliases github.com/datarhei/core/v16/prometheus (see Open Decisions).
func (s *Subsystem) Stats() coreprom.WebRTCStats {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Copy counts out under the lock so the collector never touches live state.
	peers := make(map[string]int, len(s.streams))
	for id, st := range s.streams {
		peers[id] = len(st.peers) // assume peers tracked per-stream
	}
	return coreprom.WebRTCStats{
		StreamCount:       len(s.streams),
		PeersByStream:     peers,
		UDPPortsInUse:     s.portAlloc.InUse(),
		UDPPortsAvailable: s.portAlloc.Available(),
	}
}
```

The existing subsystem tracks streams in `s.streams` under `s.mu`. Peer
count per stream needs the per-stream peer index that already exists in
`handler.go` — the `Stats()` method consults it via the existing teardown
hook plumbing or a small new accessor on `Handler`. Pick whichever surface
introduces the smaller blast radius at implementation time.

## Metric Inventory

Eleven metrics. Eight new label dimensions across them. ~50 active series
at typical 1-5 stream scale.

### Direct instrumentation (`app/webrtc/metrics.go`)

| Name | Type | Labels | Description |
|---|---|---|---|
| `dragonfork_webrtc_whep_requests_total` | Counter | core, route, code, stream_id | Count of WHEP requests by route+status code. |
| `dragonfork_webrtc_whep_request_duration_seconds` | Histogram | core, route, stream_id | Server-side WHEP request duration. Buckets: `[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]`. |
| `dragonfork_webrtc_ice_establishment_duration_seconds` | Histogram | core, stream_id, result | Time from `SetLocalDescription` to first `connected` or `failed` ICE state. Same buckets. |
| `dragonfork_webrtc_ice_failures_total` | Counter | core, stream_id, reason | ICE failure count. `reason` ∈ {timeout, disconnected, failed}. |
| `dragonfork_webrtc_codec_mismatches_total` | Counter | core, stream_id, kind | 406 rejections by kind. `kind` ∈ {video, audio}. |
| `dragonfork_webrtc_cap_rejections_total` | Counter | core, stream_id, scope | 503 rejections. `scope` ∈ {global, stream}. |
| `dragonfork_webrtc_ffmpeg_leg_failures_total` | Counter | core, stream_id, leg | RTP output leg failures. `leg` ∈ {video, audio}. |

### Snapshot collector (`prometheus/webrtc.go`)

| Name | Type | Labels | Description |
|---|---|---|---|
| `dragonfork_webrtc_active_streams` | Gauge | core | Streams currently registered (processes with `webrtc.enabled=true` running). |
| `dragonfork_webrtc_active_peers` | Gauge | core, stream_id | Currently subscribed WHEP peers per stream. |
| `dragonfork_webrtc_udp_ports_in_use` | Gauge | core | UDP ports currently allocated from the pool. |
| `dragonfork_webrtc_udp_ports_available` | Gauge | core | Pool size minus in-use (explicit for alert friendliness). |

### Label rationale

- `whep_request_duration_seconds` deliberately omits `code` — separating
  distributions per outcome makes p95 noisy, and per-route per-stream p95
  is what an operator actually looks at. Errors get visibility through the
  request-counter ratio.
- `ice_establishment_duration_seconds` includes both `connected` and
  `failed` results in the same histogram via the `result` label —
  intentionally — so the dashboard can compare success latency to
  failure-tail latency on the same axis.
- `cap_rejections_total` keeps the `scope` label because v0.1's response
  body already splits global vs per-stream rejections; metrics mirror that
  distinction so the dashboard shows whether to raise `max_peers_total`
  or just one stream's per-stream cap.
- `ffmpeg_leg_failures_total` is the "quietly degrading" canary — a silent
  RTP-output-leg failure (port bind, encoder crash) is exactly what the
  "is it healthy?" framing is meant to catch.

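
The ICE establishment histogram is defined as the time from `SetLocalDescription` to the first `connected` or `failed` state, which implies exactly one observation per peer even if later states arrive. A minimal sketch of that rule follows, assuming a per-peer timer object; `iceTimer`, `startICETimer`, `onState`, and `replay` are hypothetical names, timestamps are injected so the example is deterministic, and the `record` callback stands in for the real histogram observation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// iceTimer measures SetLocalDescription -> first terminal ICE state.
type iceTimer struct {
	start  time.Time
	once   sync.Once
	record func(result string, seconds float64) // stand-in for the histogram
}

// startICETimer would be called right after SetLocalDescription succeeds.
func startICETimer(now time.Time, record func(string, float64)) *iceTimer {
	return &iceTimer{start: now, record: record}
}

// onState is what the state-change callback would invoke. Only "connected"
// and "failed" count as terminal, and only the first one is recorded.
func (t *iceTimer) onState(state string, now time.Time) {
	if state != "connected" && state != "failed" {
		return
	}
	t.once.Do(func() { t.record(state, now.Sub(t.start).Seconds()) })
}

// replay feeds the given states to a fresh timer, 500ms apart, and returns
// the recorded result label, the duration, and the number of observations.
func replay(states ...string) (result string, seconds float64, observations int) {
	t0 := time.Unix(0, 0)
	timer := startICETimer(t0, func(r string, s float64) {
		result, seconds = r, s
		observations++
	})
	for i, st := range states {
		timer.onState(st, t0.Add(time.Duration(i+1)*500*time.Millisecond))
	}
	return result, seconds, observations
}

func main() {
	// "connected" arrives second (1s in); the later "failed" is ignored.
	fmt.Println(replay("checking", "connected", "disconnected", "failed"))
}
```

Guarding the observation with `sync.Once` keeps a late `failed` (for example after a disconnect) from adding a second sample to a peer that already connected; such events would belong in `ice_failures_total` instead.
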
### Cardinality budget

At typical scale (5 streams, 3 routes, ~6 status codes seen in practice):

- `whep_requests_total`: 5 × 3 × 6 = 90 series (worst case)
- `whep_request_duration_seconds`: 5 × 3 × (8 buckets + `+Inf` bucket + sum + count) = 165 series
- `ice_establishment_duration_seconds`: 5 × 2 × 11 = 110 series
- All others: 5–15 series each
- **Total: <500 active series at 5-stream sustained load**

Well within Prometheus's comfort zone. At 15s scrape interval × 15-day
retention, on-disk storage ~80MB.

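
The budget can be re-derived mechanically. The sketch below is a worst-case count under the assumptions stated in this section (3 routes, 6 status codes) and the label value sets from the metric inventory; it counts the implicit `+Inf` bucket that a Prometheus histogram exposes alongside its configured buckets. The `histSeries`, `budget`, and `total` helpers are illustrative names, not part of the design.

```go
package main

import "fmt"

// histSeries returns how many series one label combination of a classic
// histogram exposes: configured buckets, the +Inf bucket, _sum and _count.
func histSeries(buckets int) int { return buckets + 1 + 2 }

// budget returns the worst-case series count per metric for a given number
// of streams, assuming every label combination is actually observed.
func budget(streams int) map[string]int {
	const (
		routes  = 3 // POST, DELETE, PATCH
		codes   = 6 // status codes seen in practice
		results = 2 // connected, failed
		reasons = 3 // timeout, disconnected, failed
		twoWay  = 2 // kind, scope, and leg each have two values
	)
	h := histSeries(8)
	return map[string]int{
		"whep_requests_total":                streams * routes * codes,
		"whep_request_duration_seconds":      streams * routes * h,
		"ice_establishment_duration_seconds": streams * results * h,
		"ice_failures_total":                 streams * reasons,
		"codec_mismatches_total":             streams * twoWay,
		"cap_rejections_total":               streams * twoWay,
		"ffmpeg_leg_failures_total":          streams * twoWay,
		"active_streams":                     1,
		"active_peers":                       streams,
		"udp_ports_in_use":                   1,
		"udp_ports_available":                1,
	}
}

// total sums the per-metric counts.
func total(b map[string]int) int {
	sum := 0
	for _, n := range b {
		sum += n
	}
	return sum
}

func main() {
	b := budget(5)
	fmt.Println(b["whep_request_duration_seconds"], total(b))
}
```

At five streams this comes to 418 series, which stays under the 500-series ceiling. The count grows linearly with the number of streams, so the same function gives a quick estimate for larger deploys.
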
### Specifically excluded metrics

- **Per-peer session metrics.** Listed under non-goals.
- **Bytes-out / bandwidth.** Pion exposes RTP write bytes via stats; would
  be useful but pulls peer-level state. Defer to a future v0.3 spec
  ("WebRTC bandwidth observability") if needed.
- **Server-hop latency (FFmpeg → peer).** Microsecond scale, already
  covered by `-tags latency` test gate, would need its own bucket set.

## Deploy Bundle

### `deploy/truenas/core/docker-compose.yml` additions

Two new services on a new bridge network `dragonfork-mon`. Core continues
on `network_mode: host` unchanged. The new containers reach Core via
`host.docker.internal:${CORE_HTTP_PORT}` (Linux Docker resolves this when
`extra_hosts: ["host.docker.internal:host-gateway"]` is set on the service).

```yaml
services:
  core:
    # ... existing definition unchanged

  prom:
    image: prom/prometheus:v2.55.0
    container_name: dragonfork-prom
    restart: unless-stopped
    networks: [dragonfork-mon]
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./prom/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prom/rules:/etc/prometheus/rules:ro
      - ./prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
    ports:
      - "${PROM_PORT:-9090}:9090"

  grafana:
    image: grafana/grafana-oss:11.3.0
    container_name: dragonfork-grafana
    restart: unless-stopped
    networks: [dragonfork-mon]
    depends_on: [prom]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:?set in .env}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
      - ./grafana-data:/var/lib/grafana
    ports:
      - "${GRAFANA_PORT:-3000}:3000"

networks:
  dragonfork-mon:
    driver: bridge
```

### `prom/prometheus.yml`

```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    core: dragonfork-truenas

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: dragonfork-core
    static_configs:
      # Keep the port in sync with CORE_HTTP_PORT; Prometheus does not
      # expand environment variables in scrape targets.
      - targets: ["host.docker.internal:8080"]
    metrics_path: /metrics
    # If API auth is enabled on /metrics, uncomment and provide creds via
    # env-substituted file. v0.1 leaves /metrics public by default.
    # basic_auth:
    #   username_file: /run/secrets/prom_basic_user
    #   password_file: /run/secrets/prom_basic_pass
```

### `prom/rules/webrtc-alerts.yml`

```yaml
groups:
  - name: dragonfork-webrtc
    rules:
      - alert: WebRTCWHEPErrorRateHigh
        expr: |
          sum by (stream_id) (
            rate(dragonfork_webrtc_whep_requests_total{code=~"4..|5.."}[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WHEP error rate high on stream {{ $labels.stream_id }}"
          description: "Sustained 4xx/5xx rate >0.5/sec for 5m."

      - alert: WebRTCICEEstablishmentSlow
        expr: |
          histogram_quantile(0.95,
            sum by (le, stream_id) (
              rate(dragonfork_webrtc_ice_establishment_duration_seconds_bucket[10m])
            )
          ) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ICE establishment p95 >3s on {{ $labels.stream_id }}"

      - alert: WebRTCICEFailureRateHigh
        expr: |
          sum by (stream_id) (rate(dragonfork_webrtc_ice_failures_total[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ICE failures sustained on {{ $labels.stream_id }}"

      - alert: WebRTCFFmpegLegFailure
        expr: |
          increase(dragonfork_webrtc_ffmpeg_leg_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "FFmpeg RTP leg failed on {{ $labels.stream_id }} ({{ $labels.leg }})"
          description: "Silent degradation of RTP output. Check FFmpeg logs."
```

Alerts evaluate but route nowhere. Alertmanager bundling deferred — see
non-goals.

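
One property of the `WebRTCICEEstablishmentSlow` rule is worth spelling out: its 3s threshold sits between the 2.5s and 5s bucket bounds, so the p95 it compares against is an interpolated estimate, not a measured value. The sketch below is a simplified re-implementation of the linear interpolation that `histogram_quantile` applies to classic histograms, written only to show the effect. It skips the edge cases the real function handles, assumes at least two buckets, and the `quantile` and `p95` helpers plus the sample counts are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Upper bounds of the shared bucket set, plus the implicit +Inf bucket.
var upperBounds = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, math.Inf(1)}

// quantile estimates the q-quantile from cumulative bucket counts by
// interpolating linearly inside the bucket that contains the target rank.
func quantile(q float64, upper, cumulative []float64) float64 {
	total := cumulative[len(cumulative)-1]
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// First bucket whose cumulative count reaches the rank.
	i := sort.Search(len(cumulative), func(i int) bool { return cumulative[i] >= rank })
	if i == len(cumulative)-1 {
		// Rank falls in +Inf: report the highest finite bound.
		return upper[len(upper)-2]
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = upper[i-1], cumulative[i-1]
	}
	return lower + (upper[i]-lower)*(rank-below)/(cumulative[i]-below)
}

// p95 applies quantile to the shared bucket set.
func p95(cumulative []float64) float64 {
	return quantile(0.95, upperBounds, cumulative)
}

func main() {
	// 100 establishments; 10 of them landed in the (2.5s, 5s] bucket.
	slow := []float64{0, 0, 10, 40, 80, 90, 100, 100, 100}
	// Same total, but only 6 landed in the (2.5s, 5s] bucket.
	fine := []float64{0, 0, 10, 40, 80, 94, 100, 100, 100}
	fmt.Printf("slow p95=%.2fs fires=%v\n", p95(slow), p95(slow) > 3)
	fmt.Printf("fine p95=%.2fs fires=%v\n", p95(fine), p95(fine) > 3)
}
```

In both sample sets no observation is known to be above or below 3s specifically; the rule fires based on how many samples share the 2.5s to 5s bucket. If a sharper 3s boundary matters, adding a bucket bound at 3 would be the usual remedy, at the cost of one more series per label combination.
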
### Grafana provisioning

Datasource provisioning at `grafana/provisioning/datasources/prometheus.yml`:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prom:9090
    isDefault: true
    editable: false
```

Dashboard provisioning at `grafana/provisioning/dashboards/webrtc.yml`:

```yaml
apiVersion: 1
providers:
  - name: dragonfork
    orgId: 1
    folder: "Dragon Fork"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
```

### Dashboard JSON: `dragonfork-webrtc-health.json`

Single dashboard, five rows aligned to the questions from the metric
inventory:

1. **WHEP API health** — request rate by route (stat panel), error rate
   stacked by code (timeseries), p95 request duration by route (timeseries).
2. **ICE establishment** — success/failure rate (gauge), p50/p95
   establishment duration (timeseries with a 3s threshold line for the
   alert), failure breakdown by reason (table).
3. **What's flowing** — `active_streams` (stat), `active_peers` per stream
   (timeseries), top 5 streams by peer count (table).
4. **Capacity headroom** — `udp_ports_available` (gauge with red-zone <10),
   cap rejection rate by scope (timeseries).
5. **Silent degradation** — FFmpeg leg failure timeline (timeseries with
   annotations), codec mismatch counter (stat).

Built in Grafana 11.3, exported as JSON, committed to the repo. Refresh
default 30s.

### `.env` template additions

Append to `deploy/truenas/core/README.md`'s example `.env`:

```sh
# --- Observability (added in v0.2) ---
# Compose reads .env values literally and does not run command
# substitution, so generate the password first and paste the result:
#   openssl rand -base64 24
GRAFANA_ADMIN_PASSWORD=<paste generated value here>
GRAFANA_PORT=3000
PROM_PORT=9090
```

## Testing

### Unit tests — `prometheus/webrtc_test.go`

Mock `WebRTCStatsSource`. Drive the collector through three states (no
streams, one stream with N peers, multiple streams). Use
`testutil.CollectAndCompare` to assert exact metric/label/value output
against a golden plaintext fixture.

```go
// Golden fixture (excerpt):
// # HELP dragonfork_webrtc_active_streams ...
// # TYPE dragonfork_webrtc_active_streams gauge
// dragonfork_webrtc_active_streams{core="test"} 2
// # HELP dragonfork_webrtc_active_peers ...
// # TYPE dragonfork_webrtc_active_peers gauge
// dragonfork_webrtc_active_peers{core="test",stream_id="live"} 3
// dragonfork_webrtc_active_peers{core="test",stream_id="cam"} 1
```

### Unit tests — `app/webrtc/metrics_test.go`

Reuse `handler_test.go` setup (fake registry, in-process Echo router).
Hit each WHEP route and assert the corresponding counter has the expected
increment via `testutil.ToFloat64`. `ToFloat64` only accepts counters,
gauges, and untyped metrics, so assert histogram observations through
`testutil.CollectAndCompare` or by gathering the registry and checking the
sample count. Drive forced error paths (unknown stream → 404, codec-less
SDP → 406, cap exceeded → 503, ICE timeout → 504) and assert the right
error-bucket counters bumped.

### Integration verification — `test/TESTING.md`

New section "Verifying Prometheus metrics":

```
1. docker compose up -d
2. curl -s http://<host>:8080/metrics | grep dragonfork_webrtc_
   - expect: every sample carries `core="dragonfork-truenas"`; labelled
     counter and histogram families only appear after their first
     observation, so all 11 metric families are present once steps 5-6
     have been run
3. Open http://<host>:3000 (Grafana), log in with GRAFANA_ADMIN_PASSWORD
4. Navigate to Dashboards → Dragon Fork → WebRTC Health
   - expect: all 5 rows render, no "no data" panels except where stream traffic is absent
5. Trigger one of each error in test/whep-player.html (intentional codec
   mismatch via SDP edit, kill the publisher mid-stream, etc.)
6. Watch the Grafana panels and verify counters tick within 15s.
```

### CI

Existing test runner picks up the new `_test.go` files. No new CI gates
beyond standard build+test — observability isn't a contract; the unit
tests verify shape only. Grafana dashboard JSON is *not* validated in CI
(no good lightweight validator); manual verification only.

### Load test alignment

The deferred 5-peer × 10-min load test (separate spec) will use this
dashboard as its primary observation surface. Recording rules for the
load test's specific aggregations can be added in that spec without
touching this one.

## Rollout

The TrueNAS v0.1.0-dragonfork deploy upgrades via:

```sh
cd deploy/truenas/core
git pull                                    # latest main with this change
# Add new lines to .env (see template above)
docker compose pull                         # grabs prom + grafana images
docker compose up -d                        # core unchanged, prom + grafana new
```

Core continues on host networking. The new containers reach it through the
`host.docker.internal` name (mapped to `host-gateway` via `extra_hosts`), so
no firewall changes are required for intra-host traffic. External Grafana
access is on `${GRAFANA_PORT}`.

### Backwards compatibility

- No upstream metric names or labels modified. New metrics are purely
  additive in `dragonfork_webrtc_*` namespace.
- No API changes. `/metrics` payload grows but stays well-formed
  Prometheus exposition.
- Existing config, env vars, and process JSON formats unchanged.

### Forward compatibility

- The `core` label being a `ConstLabels` value (not a per-event dimension)
  means future federated multi-Core scrapes will distinguish series cleanly
  by setting `core="dragonfork-truenas-east"` etc. in each deploy's config
  loader. Spec'd here, implemented when needed.
- New metrics in this spec follow the `dragonfork_<subsystem>_<noun>` naming
  pattern. Future Dragon-Fork-specific metrics (WHIP, keyframe cache,
  bandwidth) should adopt the same convention.

### Known gaps post-rollout

- No paging. Alerts evaluate, no Alertmanager. If `WebRTCFFmpegLegFailure`
  fires at 3am, no notification — operator notices at next dashboard check.
  Acceptable for v0.2 single-operator deploy. Track as a v0.3 spec.
- Grafana dashboard JSON is hand-edited via Grafana UI then re-exported.
  No JSON-as-code library used. If dashboard maintenance gets painful,
  Grafonnet/Grafana-as-code is a v0.3+ refactor.
- `/metrics` itself is unauthenticated by default in v0.1 (matches
  upstream). If Core's deploy bundle is exposed to untrusted networks,
  the operator should already be using auth on Core's HTTP listener. Not
  this spec's problem to solve, but worth a one-line note in
  `deploy/truenas/core/README.md`.

## Open Decisions

1. **Should the `Stats()` method live on `Subsystem` or on `Handler`?**
   The peer count is in `Handler`'s per-stream peer index; stream count
   is in `Subsystem`'s registry; UDP port pool is in `portalloc`. Easiest
   shape: `Subsystem.Stats()` is the public surface and internally
   gathers from `Handler` (via the existing teardown-hook plumbing) and
   `portalloc`. Decide at implementation time based on which surface
   exposes the cleanest seams.

2. **Should histograms also include a `core` label, given it's already a
   `ConstLabels`?** Yes — `ConstLabels` is automatically present on every
   sample, no per-call overhead, and federations need it.

3. **Should Prometheus retention be configurable via `.env`?** Defaulting
   to 15d covers the realistic window for "what happened last week?"
   queries. Adding `PROM_RETENTION_DAYS=15d` to `.env` is a one-line
   change. Including it as optional, defaulting to 15d.

4. **Import-alias collision.** The local package is `package prometheus`
   (at `github.com/datarhei/core/v16/prometheus`) and `client_golang` is
   also `package prometheus`. Files in `app/webrtc/` that need both must
   alias one — convention is `coreprom "github.com/datarhei/core/v16/prometheus"`.
   Implementation note only; doesn't change the design.

## References

- [Prometheus client_golang](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus)
- [Prometheus instrumentation best practices](https://prometheus.io/docs/practices/instrumentation/)
- [Histogram bucket design](https://prometheus.io/docs/practices/histograms/)
- [Grafana provisioning docs](https://grafana.com/docs/grafana/latest/administration/provisioning/)
- v0.1 design: `docs/design/2026-04-16-datarhei-dragon-fork-webrtc-design.md`
- M2 integration: `docs/design/2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md`