zgaetano/datarhei-dragonfork-core

ZGaetano 949daa26b5

docs(design): WebRTC Prometheus metrics + Grafana stack design

Closes the v0.1 observability gap. Eleven new metrics in the
dragonfork_webrtc_* namespace (RED-method on the WHEP surface plus
state gauges from the WebRTC subsystem), Prom + Grafana containers
added to deploy/truenas/core/, four pre-loaded alert rules, one
pre-provisioned dashboard.

Hybrid instrumentation: direct client_golang in app/webrtc/ for
hot-path counters and histograms; snapshot collector in
prometheus/webrtc.go for slow-changing gauges. Rationale and
trade-offs against the upstream monitor/metric bus pattern documented
in the Approach section.

Targets v0.2.0-dragonfork.

2026-05-03 14:50:56 -04:00

Datarhei - Dragon Fork: WebRTC Prometheus Metrics

Status: Draft for review
Author: Zac (Wild Dragon)
Date: 2026-05-03
Predecessors:

2026-04-16-datarhei-dragon-fork-webrtc-design.md
2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md
v0.1.0-dragonfork released 2026-05-03

Summary

Add Prometheus instrumentation to Dragon Fork's WebRTC subsystem and ship a collection-and-dashboard stack in the existing TrueNAS deploy bundle. Closes the v0.1 observability gap: the WHEP egress has been running in production since 2026-04-17 with zero per-subsystem signal.

The deliverable is a RED-method dashboard ("rate, errors, duration") that answers a single operator question — is the WebRTC stack healthy right now? Eleven new metrics in the dragonfork_webrtc_* namespace, two new containers (Prometheus + Grafana) in deploy/truenas/core/, four pre-loaded alert rules, one pre-provisioned dashboard.

Goals

Operator can answer "is WebRTC healthy right now?" from a single Grafana dashboard, without tailing logs or hitting the API.
Per-stream drill-down available when the dashboard goes red — labels carry stream_id everywhere it's meaningful, never peer_id.
Deploy is one-command on a fresh TrueNAS box (docker compose up -d), matching the existing v0.1 deploy ergonomics.
Backwards-compatible: zero changes to upstream's /metrics payload. New metrics are purely additive.
Bucket choices and label sets are tuned for the latency ranges observed in v0.1. ICE establishment is seconds-scale; the server-hop p95 of ≈ 240µs falls below the smallest 50ms bucket and stays out of Prometheus (see Non-Goals).

Non-Goals

Alertmanager bundling. Alert rules are loaded into Prometheus but not routed. Paging configuration is too opinionated to ship a default; separate spec if/when paging is wanted.
Per-peer metric labels. Peer-level forensics (individual session lifetimes, per-resource teardown reasons) is out of scope. peer_id is unbounded under churn and risks cardinality bloat.
Federated multi-Core scrape. Single-deploy scrape config only. The core label is set statically to dragonfork-truenas.
Latency p95 CI gate via Prometheus. Server-hop latency stays a Go test gate (-tags latency); not a Prometheus histogram.
Server-hop microsecond histogram. The 240µs server-hop is well below HTTP request scales and would need its own bucket set; it's already covered by the latency CI test, no need to duplicate in Prom.
Custom monitor/metric bus integration. Upstream pulls from monitor/metric.Reader. We diverge; see Approach for rationale.

Context

v0.1 surface area:

WHEP HTTP routes: POST /api/v3/whep/{id}, DELETE /api/v3/whep/{id}/{r}, PATCH /api/v3/whep/{id}/{r}, plus admin GET /api/v3/webrtc/streams and GET /api/v3/webrtc/streams/{id}/peers.
Error matrix in v0.1: 406 codec mismatch, 503 cap reached (split into global vs per-stream in response body), 504 ICE timeout, 204 DELETE idempotent, 404 unknown stream.
Pion-mediated peer connection lifecycle in app/webrtc/lifecycle.go — ICE state transitions are the natural hook for ICE timing/failure metrics.
FFmpeg RTP output legs supervised by the existing process supervisor; silent leg failure is a known "quietly degrading" risk worth instrumenting.

Existing Prometheus integration (upstream):

prometheus/prometheus.go exposes a Metrics interface with Register and an HTTPHandler(). Single shared prometheus.Registry.
prometheus/restream.go is the reference collector — pulls from monitor/metric.Reader via metric.Pattern queries, emits via prometheus.MustNewConstMetric. All upstream collectors carry a core label as the first dimension.
/metrics endpoint already exposed by Core; auth handled at the same layer as the rest of the API.

Approach

Hybrid instrumentation, in two surfaces:

Direct prometheus/client_golang instrumentation in app/webrtc/ for hot-path counters and histograms (request rate, request duration, ICE establishment duration, error counters by reason). Histograms can't be reconstructed from a scrape-time snapshot, so this is non-negotiable for RED-method.
Snapshot-style collector in prometheus/webrtc.go for slow-changing gauges (active streams, active peers per stream, UDP port pool usage). Calls a new Stats() method on the WebRTC subsystem at scrape time.

Both surfaces register against the same prometheus.Registerer exposed by prometheus.Metrics. No new HTTP endpoint, no new auth path. Both take a core first-label dimension to match upstream collector convention.

Why not pure snapshot?

Upstream's prometheus/restream.go pulls from a monitor/metric bus that the FFmpeg supervision layer writes into. We could mirror that for WebRTC — have app/webrtc/lifecycle.go and handler.go push events onto the bus, have prometheus/webrtc.go pull them. Two reasons not to:

Histograms don't fit the pattern. The bus stores point-in-time values (gauges and counters), not distributions. RED-method needs duration p50 and p95; you'd end up maintaining an in-process sliding-window quantile estimator inside the WebRTC subsystem, which is more code than just using client_golang.Histogram directly.
The bus is FFmpeg-shaped. metric.Pattern queries are designed for process-state metrics (process IDs, FFmpeg states). Bolting WebRTC semantics on requires defining new patterns the bus consumers all need to know about, for a payload only the WebRTC collector cares about.
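
The first point can be illustrated with a standard-library-only sketch. miniHistogram and lastValueGauge are hypothetical toy types, not client_golang or monitor/metric APIs; they only contrast what per-event bucket counting keeps with what a point-in-time value keeps.

```go
package main

import "sort"

// bucketBounds mirrors the upper bounds proposed in this design, in seconds.
var bucketBounds = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// miniHistogram is a toy, non-concurrent stand-in for a histogram: one
// counter per bucket (the last slot is +Inf) plus a running sum and count.
type miniHistogram struct {
	counts []uint64
	sum    float64
	count  uint64
}

func newMiniHistogram() *miniHistogram {
	return &miniHistogram{counts: make([]uint64, len(bucketBounds)+1)}
}

// Observe has to run once per event; that is why it sits on the hot path.
func (h *miniHistogram) Observe(seconds float64) {
	// SearchFloat64s returns the first index whose bound is >= seconds,
	// or len(bucketBounds) when the value only fits the +Inf bucket.
	i := sort.SearchFloat64s(bucketBounds, seconds)
	h.counts[i]++
	h.sum += seconds
	h.count++
}

// Cumulative returns counts in the "le" form that Prometheus exposes.
func (h *miniHistogram) Cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

// lastValueGauge is all a point-in-time store can carry: the latest value.
type lastValueGauge struct{ value float64 }

func (g *lastValueGauge) Set(seconds float64) { g.value = seconds }
```

Feeding both types the same five durations leaves the gauge holding only the last one, while the histogram still knows how many fell under each bound, which is the information a p95 query needs.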

The hybrid keeps each metric type on the cleanest path. The cost is two patterns in the codebase instead of one — accepted, with a comment in prometheus/webrtc.go pointing at this rationale so the next contributor doesn't try to "fix" the divergence.

Why not pure direct?

Pure client_golang everywhere would mean the gauges (active streams, active peers, UDP ports) sit in app/webrtc/ alongside histograms. Workable, but loses the "one collector file per subsystem in prometheus/" pattern that anyone reading the repo's existing structure would expect. Snapshot gauges are cheap to implement via the existing pattern, so we keep them where a casual reader would look.

Module Layout

New files

app/webrtc/metrics.go       (~150 LOC)
app/webrtc/metrics_test.go  (~200 LOC)
prometheus/webrtc.go        (~120 LOC)
prometheus/webrtc_test.go   (~150 LOC)
deploy/truenas/core/prom/prometheus.yml
deploy/truenas/core/prom/rules/webrtc-alerts.yml
deploy/truenas/core/grafana/provisioning/datasources/prometheus.yml
deploy/truenas/core/grafana/provisioning/dashboards/webrtc.yml
deploy/truenas/core/grafana/dashboards/dragonfork-webrtc-health.json

Modified files

app/webrtc/handler.go       — add metric middleware around WHEP routes
app/webrtc/lifecycle.go     — record ICE timing in OnConnectionStateChange
app/webrtc/subsystem.go     — add Stats() method, instrument process hooks
deploy/truenas/core/docker-compose.yml  — add prom + grafana services
deploy/truenas/core/README.md           — document new env vars + ports
README.md                   — quick-start mentions Grafana URL
CHANGELOG.md                — v0.2.0-dragonfork section

`app/webrtc/metrics.go` — direct instrumentation

Metrics are created via promauto against the shared registry and held in a package-level *metrics value, so handler.go and lifecycle.go can increment without dependency injection. A single Init(reg prometheus.Registerer, core string), called from subsystem.New once the registry is available, builds that value with the newMetrics constructor sketched below.

// Sketch: exact wire format finalized at implementation.
package webrtc

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// histBuckets is shared by both duration histograms (upper bounds in seconds).
var histBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

type metrics struct {
    whepRequests        *prometheus.CounterVec   // route, code, stream_id
    whepRequestDuration *prometheus.HistogramVec // route, stream_id
    iceEstablishment    *prometheus.HistogramVec // stream_id, result
    iceFailures         *prometheus.CounterVec   // stream_id, reason
    codecMismatches     *prometheus.CounterVec   // stream_id, kind
    capRejections       *prometheus.CounterVec   // stream_id, scope
    ffmpegLegFailures   *prometheus.CounterVec   // stream_id, leg
}

func newMetrics(reg prometheus.Registerer, core string) *metrics {
    // promauto.With registers each metric on reg as it is created.
    factory := promauto.With(reg)
    constLabels := prometheus.Labels{"core": core}
    return &metrics{
        whepRequests: factory.NewCounterVec(prometheus.CounterOpts{
            Name:        "dragonfork_webrtc_whep_requests_total",
            Help:        "Count of WHEP requests by route, status code, and stream.",
            ConstLabels: constLabels,
        }, []string{"route", "code", "stream_id"}),
        whepRequestDuration: factory.NewHistogramVec(prometheus.HistogramOpts{
            Name:        "dragonfork_webrtc_whep_request_duration_seconds",
            Help:        "Server-side WHEP request duration in seconds.",
            Buckets:     histBuckets,
            ConstLabels: constLabels,
        }, []string{"route", "stream_id"}),
        // ... remaining metrics follow the same shape
    }
}

The core label is a ConstLabels (set once at construction) rather than a per-request dimension — matches the upstream collector pattern and avoids threading it through every call site.

`prometheus/webrtc.go` — snapshot collector

Standard prometheus.Collector interface (Describe / Collect). Keeps a reference to a WebRTCStatsSource interface, which the WebRTC subsystem implements via its Stats() method. Avoids importing app/webrtc from prometheus/ — the dependency arrow points the right way.

// Sketch.
type WebRTCStatsSource interface {
    Stats() WebRTCStats
}

type WebRTCStats struct {
    StreamCount        int
    PeersByStream      map[string]int
    UDPPortsInUse      int
    UDPPortsAvailable  int
}

type webrtcCollector struct {
    core   string
    source WebRTCStatsSource

    activeStreamsDesc      *prometheus.Desc
    activePeersDesc        *prometheus.Desc
    udpPortsInUseDesc      *prometheus.Desc
    udpPortsAvailableDesc  *prometheus.Desc
}

func NewWebRTCCollector(core string, source WebRTCStatsSource) prometheus.Collector { ... }

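The WebRTCStats type lives in prometheus/webrtc.go (not in app/webrtc/) so the dependency stays one-directional: app/webrtc imports prometheus/ for the WebRTCStats return type, and prometheus/ never imports app/webrtc. The subsystem satisfies WebRTCStatsSource implicitly, by having a matching Stats() method, with no explicit declaration.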
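
To make the dependency direction concrete, here is a minimal single-file sketch of the same shape. It collapses both packages into one for brevity; fakeSubsystem, its fields, and totalPeers are hypothetical names used only for illustration, and the real Subsystem will look different.

```go
package main

import "sync"

// WebRTCStats and WebRTCStatsSource follow the sketch above. In the real
// tree they would live in prometheus/webrtc.go.
type WebRTCStats struct {
	StreamCount       int
	PeersByStream     map[string]int
	UDPPortsInUse     int
	UDPPortsAvailable int
}

type WebRTCStatsSource interface {
	Stats() WebRTCStats
}

// fakeSubsystem is a hypothetical stand-in for the WebRTC subsystem.
type fakeSubsystem struct {
	mu         sync.Mutex
	peers      map[string]int // stream_id -> subscribed peers
	portsInUse int
	poolSize   int
}

// Stats copies state under the lock so callers never share the live map.
func (s *fakeSubsystem) Stats() WebRTCStats {
	s.mu.Lock()
	defer s.mu.Unlock()
	peers := make(map[string]int, len(s.peers))
	for id, n := range s.peers {
		peers[id] = n
	}
	return WebRTCStats{
		StreamCount:       len(s.peers),
		PeersByStream:     peers,
		UDPPortsInUse:     s.portsInUse,
		UDPPortsAvailable: s.poolSize - s.portsInUse,
	}
}

// Compile-time check: the interface is satisfied by method shape alone.
var _ WebRTCStatsSource = (*fakeSubsystem)(nil)

// totalPeers shows the collector side depending only on the interface.
func totalPeers(src WebRTCStatsSource) int {
	total := 0
	for _, n := range src.Stats().PeersByStream {
		total += n
	}
	return total
}
```

The blank-identifier assignment is a common way to have the compiler flag drift between the interface and its implementation before any test runs.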

`app/webrtc/subsystem.go` — `Stats()` method

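// Note: prometheus here is Core's own package (github.com/datarhei/core/v16/prometheus).
// Alias it (e.g. coreprom) in any file that also imports client_golang.
func (s *Subsystem) Stats() prometheus.WebRTCStats {
    s.mu.Lock()
    defer s.mu.Unlock()
    // Build a fresh map so the collector never reads subsystem state outside the lock.
    peers := make(map[string]int, len(s.streams))
    for id, st := range s.streams {
        peers[id] = len(st.peers)  // assume peers tracked per-stream
    }
    return prometheus.WebRTCStats{
        StreamCount:       len(s.streams),
        PeersByStream:     peers,
        UDPPortsInUse:     s.portAlloc.InUse(),
        UDPPortsAvailable: s.portAlloc.Available(),
    }
}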

The existing subsystem tracks streams in s.streams under s.mu. Peer count per stream needs the per-stream peer index that already exists in handler.go — the Stats() method consults it via the existing teardown hook plumbing or a small new accessor on Handler. Pick whichever surface introduces the smaller blast radius at implementation time.

Metric Inventory

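Eleven metrics. Eight new label dimensions across them (core is inherited from the upstream convention). Active series range from roughly 50 on a lightly used deployment to under 500 at five busy streams; see Cardinality budget.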

Direct instrumentation (`app/webrtc/metrics.go`)

Name	Type	Labels	Description
`dragonfork_webrtc_whep_requests_total`	Counter	core, route, code, stream_id	Count of WHEP requests by route+status code.
`dragonfork_webrtc_whep_request_duration_seconds`	Histogram	core, route, stream_id	Server-side WHEP request duration. Buckets: `[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]`.
`dragonfork_webrtc_ice_establishment_duration_seconds`	Histogram	core, stream_id, result	Time from `SetLocalDescription` to first `connected` or `failed` ICE state. Same buckets.
`dragonfork_webrtc_ice_failures_total`	Counter	core, stream_id, reason	ICE failure count. `reason` ∈ {timeout, disconnected, failed}.
`dragonfork_webrtc_codec_mismatches_total`	Counter	core, stream_id, kind	406 rejections by kind. `kind` ∈ {video, audio}.
`dragonfork_webrtc_cap_rejections_total`	Counter	core, stream_id, scope	503 rejections. `scope` ∈ {global, stream}.
`dragonfork_webrtc_ffmpeg_leg_failures_total`	Counter	core, stream_id, leg	RTP output leg failures. `leg` ∈ {video, audio}.
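
The ice_establishment_duration_seconds row implies "first terminal state wins" semantics. A minimal sketch of that rule, assuming a hypothetical iceTimer helper (not a Pion API) that is created when SetLocalDescription returns and fed state names as plain strings:

```go
package main

import (
	"sync"
	"time"
)

// iceTimer is a hypothetical helper; record stands in for observing the
// histogram with the result label set to "connected" or "failed".
type iceTimer struct {
	start  time.Time
	once   sync.Once
	record func(result string, seconds float64)
}

func newICETimer(record func(result string, seconds float64)) *iceTimer {
	return &iceTimer{start: time.Now(), record: record}
}

// OnState is what a state-change callback would call.
func (t *iceTimer) OnState(state string) bool {
	return t.Observe(state, time.Since(t.start))
}

// Observe records the elapsed time for the first terminal state only and
// reports whether this call was the one that recorded it.
func (t *iceTimer) Observe(state string, elapsed time.Duration) bool {
	if state != "connected" && state != "failed" {
		return false // intermediate states such as "checking" are ignored
	}
	recorded := false
	t.once.Do(func() {
		t.record(state, elapsed.Seconds())
		recorded = true
	})
	return recorded
}
```

Using sync.Once keeps a later connected-then-failed transition from adding a second sample for the same peer, even if callbacks arrive on different goroutines.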

Snapshot collector (`prometheus/webrtc.go`)

Name	Type	Labels	Description
`dragonfork_webrtc_active_streams`	Gauge	core	Streams currently registered (processes with `webrtc.enabled=true` running).
`dragonfork_webrtc_active_peers`	Gauge	core, stream_id	Currently subscribed WHEP peers per stream.
`dragonfork_webrtc_udp_ports_in_use`	Gauge	core	UDP ports currently allocated from the pool.
`dragonfork_webrtc_udp_ports_available`	Gauge	core	Pool size minus in-use (explicit for alert friendliness).

Label rationale

whep_request_duration_seconds deliberately omits code — separating distributions per outcome makes p95 noisy, and per-route per-stream p95 is what an operator actually looks at. Errors get visibility through the request-counter ratio.
ice_establishment_duration_seconds includes both connected and failed results in the same histogram via the result label — intentionally — so the dashboard can compare success latency to failure-tail latency on the same axis.
cap_rejections_total keeps the scope label because v0.1's response body already splits global vs per-stream rejections; metrics mirror that distinction so the dashboard shows whether to raise max_peers_total or just one stream's per-stream cap.
ffmpeg_leg_failures_total is the "quietly degrading" canary — a silent RTP-output-leg failure (port bind, encoder crash) is exactly what the "is it healthy?" framing is meant to catch.

Cardinality budget

At typical scale (5 streams, 3 routes, ~6 status codes seen in practice):

whep_requests_total: 5 × 3 × 6 = 90 series (worst case)
whep_request_duration_seconds: 5 × 3 × (8 buckets + +Inf bucket + sum + count) = 165 series
ice_establishment_duration_seconds: 5 × 2 × 11 = 110 series
All others: 5–15 series each
Total: <500 active series at 5-stream sustained load

Well within Prometheus's comfort zone. At 15s scrape interval × 15-day retention, on-disk storage ~80MB.
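
The arithmetic behind these numbers can be reproduced with a few helper functions. This is a back-of-envelope sketch: the function names are made up for illustration, the histogram count includes the implicit +Inf bucket that client_golang adds, and the 2 bytes per sample figure is an assumption taken from the upper end of the 1-2 bytes Prometheus documents as typical.

```go
package main

// histogramSeries counts series for one histogram: explicit buckets plus the
// implicit +Inf bucket, _sum and _count, per label combination.
func histogramSeries(labelCombos, explicitBuckets int) int {
	return labelCombos * (explicitBuckets + 3)
}

// samplesPerSeries is how many samples one series accumulates over the
// retention window at a fixed scrape interval.
func samplesPerSeries(retentionSeconds, scrapeIntervalSeconds int) int {
	return retentionSeconds / scrapeIntervalSeconds
}

// storageBytes estimates on-disk size; bytesPerSample is an assumption.
func storageBytes(series, samples int, bytesPerSample float64) float64 {
	return float64(series) * float64(samples) * bytesPerSample
}
```

With 5 streams × 3 routes, histogramSeries(15, 8) gives 165. Fifteen days at a 15s interval is 86,400 samples per series, so 500 series at 2 bytes per sample comes to about 86 MB, in line with the estimate above.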

Specifically excluded metrics

Per-peer session metrics. Listed under non-goals.
Bytes-out / bandwidth. Pion exposes RTP write bytes via stats; would be useful but pulls peer-level state. Defer to a future v0.3 spec ("WebRTC bandwidth observability") if needed.
Server-hop latency (FFmpeg → peer). Microsecond scale, already covered by -tags latency test gate, would need its own bucket set.

Deploy Bundle

`deploy/truenas/core/docker-compose.yml` additions

Two new services on a new bridge network dragonfork-mon. Core continues on network_mode: host unchanged. The new containers reach Core via host.docker.internal:${CORE_HTTP_PORT} (Linux Docker resolves this when extra_hosts: ["host.docker.internal:host-gateway"] is set on the service).

services:
  core:
    # ... existing definition unchanged

  prom:
    image: prom/prometheus:v2.55.0
    container_name: dragonfork-prom
    restart: unless-stopped
    networks: [dragonfork-mon]
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./prom/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prom/rules:/etc/prometheus/rules:ro
      - ./prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
    ports:
      - "${PROM_PORT:-9090}:9090"

  grafana:
    image: grafana/grafana-oss:11.3.0
    container_name: dragonfork-grafana
    restart: unless-stopped
    networks: [dragonfork-mon]
    depends_on: [prom]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:?set in .env}"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
      - ./grafana-data:/var/lib/grafana
    ports:
      - "${GRAFANA_PORT:-3000}:3000"

networks:
  dragonfork-mon:
    driver: bridge

`prom/prometheus.yml`

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    core: dragonfork-truenas

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: dragonfork-core
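    static_configs:
      # Prometheus does not expand environment variables here; keep this
      # port in sync with CORE_HTTP_PORT in .env.
      - targets: ["host.docker.internal:8080"]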
    metrics_path: /metrics
    # If API auth is enabled on /metrics, uncomment and provide creds via
    # env-substituted file. v0.1 leaves /metrics public by default.
    # basic_auth:
    #   username_file: /run/secrets/prom_basic_user
    #   password_file: /run/secrets/prom_basic_pass

`prom/rules/webrtc-alerts.yml`

groups:
  - name: dragonfork-webrtc
    rules:
      - alert: WebRTCWHEPErrorRateHigh
        expr: |
          sum by (stream_id) (
            rate(dragonfork_webrtc_whep_requests_total{code=~"4..|5.."}[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WHEP error rate high on stream {{ $labels.stream_id }}"
          description: "Sustained 4xx/5xx rate >0.5/sec for 5m."

      - alert: WebRTCICEEstablishmentSlow
        expr: |
          histogram_quantile(0.95,
            sum by (le, stream_id) (
              rate(dragonfork_webrtc_ice_establishment_duration_seconds_bucket[10m])
            )
          ) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ICE establishment p95 >3s on {{ $labels.stream_id }}"

      - alert: WebRTCICEFailureRateHigh
        expr: |
          sum by (stream_id) (rate(dragonfork_webrtc_ice_failures_total[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ICE failures sustained on {{ $labels.stream_id }}"

      - alert: WebRTCFFmpegLegFailure
        expr: |
          increase(dragonfork_webrtc_ffmpeg_leg_failures_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "FFmpeg RTP leg failed on {{ $labels.stream_id }} ({{ $labels.leg }})"
          description: "Silent degradation of RTP output. Check FFmpeg logs."

Alerts evaluate but route nowhere. Alertmanager bundling deferred — see non-goals.
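
The 3s threshold in WebRTCICEEstablishmentSlow falls between the 2.5 and 5 bucket bounds, so the p95 it compares against is an interpolated estimate. The sketch below is a simplified re-implementation of the linear interpolation that histogram_quantile is documented to use, not Prometheus's code. It assumes ascending finite bounds, a lower bound of 0 for the first bucket, and it omits the +Inf special case; bucket and estimateQuantile are illustrative names.

```go
package main

import "math"

// bucket pairs an upper bound ("le") with the cumulative count of
// observations at or below that bound.
type bucket struct {
	upperBound float64
	cumulative float64
}

// estimateQuantile interpolates linearly inside the bucket that contains
// the requested rank.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].cumulative
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.cumulative >= rank {
			inBucket := b.cumulative - lowerCount
			if inBucket == 0 {
				return b.upperBound
			}
			return lowerBound + (b.upperBound-lowerBound)*(rank-lowerCount)/inBucket
		}
		lowerBound, lowerCount = b.upperBound, b.cumulative
	}
	return lowerBound
}
```

For example, with 100 observations of which 90 finished within 2.5s and all within 5s, the estimated p95 is 3.75s regardless of where in the 2.5-5s range the slow connections actually landed. If the alert proves too coarse in practice, adding a bucket bound near 3 would be the lever.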

Grafana provisioning

Datasource provisioning at grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prom:9090
    isDefault: true
    editable: false

Dashboard provisioning at grafana/provisioning/dashboards/webrtc.yml:

apiVersion: 1
providers:
  - name: dragonfork
    orgId: 1
    folder: "Dragon Fork"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards

Dashboard JSON: `dragonfork-webrtc-health.json`

Single dashboard, five rows aligned to the questions from the metric inventory:

WHEP API health — request rate by route (stat panel), error rate stacked by code (timeseries), p95 request duration by route (timeseries).
ICE establishment — success/failure rate (gauge), p50/p95 establishment duration (timeseries with a 3s threshold line for the alert), failure breakdown by reason (table).
What's flowing — active_streams (stat), active_peers per stream (timeseries), top 5 streams by peer count (table).
Capacity headroom — udp_ports_available (gauge with red-zone <10), cap rejection rate by scope (timeseries).
Silent degradation — FFmpeg leg failure timeline (timeseries with annotations), codec mismatch counter (stat).

Built in Grafana 11.3, exported as JSON, committed to the repo. Refresh default 30s.

`.env` template additions

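Append to deploy/truenas/core/README.md's example .env:

# --- Observability (added in v0.2) ---
# Compose does not run shell commands in .env. Generate a value first,
# e.g. with `openssl rand -base64 24`, and paste the result here.
GRAFANA_ADMIN_PASSWORD=<paste generated password>
GRAFANA_PORT=3000
PROM_PORT=9090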

Testing

Unit tests — `prometheus/webrtc_test.go`

Mock WebRTCStatsSource. Drive the collector through three states (no streams, one stream with N peers, multiple streams). Use testutil.CollectAndCompare to assert exact metric/label/value output against a golden plaintext fixture.

// Golden fixture (excerpt):
// # HELP dragonfork_webrtc_active_streams ...
// # TYPE dragonfork_webrtc_active_streams gauge
// dragonfork_webrtc_active_streams{core="test"} 2
// # HELP dragonfork_webrtc_active_peers ...
// # TYPE dragonfork_webrtc_active_peers gauge
// dragonfork_webrtc_active_peers{core="test",stream_id="live"} 3
// dragonfork_webrtc_active_peers{core="test",stream_id="cam"} 1

Unit tests — `app/webrtc/metrics_test.go`

Reuse handler_test.go setup (fake registry, in-process Echo router). Hit each WHEP route, assert the corresponding counter and histogram have the expected increment via testutil.ToFloat64. Drive forced error paths (unknown stream → 404, codec-less SDP → 406, cap exceeded → 503, ICE timeout → 504) and assert the right error-bucket counters bumped.
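
The forced error paths lend themselves to a table-driven test. The sketch below shows only the expected status-to-counter mapping such a table might encode. errorCounterFor is a hypothetical helper, and routing 504 to ice_failures_total (with reason="timeout") is an assumption drawn from the error matrix and metric inventory rather than a confirmed implementation detail.

```go
package main

import "net/http"

// errorCounterFor returns the dedicated error counter expected to increase
// for a WHEP response status, or "" when only whep_requests_total changes.
func errorCounterFor(status int) string {
	switch status {
	case http.StatusNotAcceptable: // 406 codec mismatch
		return "dragonfork_webrtc_codec_mismatches_total"
	case http.StatusServiceUnavailable: // 503 cap reached
		return "dragonfork_webrtc_cap_rejections_total"
	case http.StatusGatewayTimeout: // 504 ICE timeout
		return "dragonfork_webrtc_ice_failures_total"
	default: // e.g. 404 unknown stream, 204 idempotent DELETE
		return ""
	}
}
```

A test can range over the four forced paths, read each counter before and after the request, and assert that exactly the counter named here moved.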

Integration verification — `test/TESTING.md`

New section "Verifying Prometheus metrics":

1. docker compose up -d
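2. curl -s http://<host>:8080/metrics | grep dragonfork_webrtc_
   - expect: every sample carries `core="dragonfork-truenas"`; all 11 metric
     families appear once each has recorded at least one observation
     (label-vector metrics are absent until then, so re-check after step 5)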
3. Open http://<host>:3000 (Grafana), log in with GRAFANA_ADMIN_PASSWORD
4. Navigate to Dashboards → Dragon Fork → WebRTC Health
   - expect: all 5 rows render, no "no data" panels except where stream traffic is absent
5. Trigger one of each error in test/whep-player.html (intentional codec
   mismatch via SDP edit, kill the publisher mid-stream, etc.)
6. Watch the Grafana panels and verify counters tick within 15s.
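
Step 2 can be turned into a scripted check by counting `# TYPE` lines in the exposition text. This is a minimal sketch: countFamilies is a hypothetical helper, and fetching the /metrics body (for example with net/http) is left to the caller.

```go
package main

import (
	"bufio"
	"strings"
)

// countFamilies returns the number of distinct metric families whose name
// starts with prefix, based on "# TYPE <name> <type>" lines.
func countFamilies(exposition, prefix string) int {
	seen := map[string]struct{}{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 4 && fields[0] == "#" && fields[1] == "TYPE" &&
			strings.HasPrefix(fields[2], prefix) {
			seen[fields[2]] = struct{}{}
		}
	}
	return len(seen)
}
```

Counting TYPE lines rather than sample lines avoids over-counting histograms, which expose _bucket, _sum and _count samples under one family.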

CI

Existing test runner picks up the new _test.go files. No new CI gates beyond standard build+test — observability isn't a contract; the unit tests verify shape only. Grafana dashboard JSON is not validated in CI (no good lightweight validator); manual verification only.

Load test alignment

The deferred 5-peer × 10-min load test (separate spec) will use this dashboard as its primary observation surface. Recording rules for the load test's specific aggregations can be added in that spec without touching this one.

Rollout

The TrueNAS v0.1.0-dragonfork deploy upgrades via:

cd deploy/truenas/core
git pull                          # latest main with this change
# Add new lines to .env (see template above)
docker compose pull               # grabs prom + grafana images
docker compose up -d              # core unchanged, prom + grafana new

Core continues on host networking. Prometheus reaches it through host.docker.internal, which the extra_hosts entry maps to host-gateway, so intra-host traffic needs no firewall changes. External Grafana access is on ${GRAFANA_PORT}.

Backwards compatibility

No upstream metric names or labels modified. New metrics are purely additive in dragonfork_webrtc_* namespace.
No API changes. /metrics payload grows but stays well-formed Prometheus exposition.
Existing config, env vars, and process JSON formats unchanged.

Forward compatibility

The core label being a ConstLabels value (not a per-event dimension) means future federated multi-Core scrapes will distinguish series cleanly by setting core="dragonfork-truenas-east" etc. in each deploy's config loader. Spec'd here, implemented when needed.
New metrics in this spec follow the dragonfork_<subsystem>_<noun> naming pattern. Future Dragon-Fork-specific metrics (WHIP, keyframe cache, bandwidth) should adopt the same convention.

Known gaps post-rollout

No paging. Alerts evaluate, no Alertmanager. If WebRTCFFmpegLegFailure fires at 3am, no notification — operator notices at next dashboard check. Acceptable for v0.2 single-operator deploy. Track as a v0.3 spec.
Grafana dashboard JSON is hand-edited via Grafana UI then re-exported. No JSON-as-code library used. If dashboard maintenance gets painful, Grafonnet/Grafana-as-code is a v0.3+ refactor.
/metrics itself is unauthenticated by default in v0.1 (matches upstream). If Core's deploy bundle is exposed to untrusted networks, the operator should already be using auth on Core's HTTP listener. Not this spec's problem to solve, but worth a one-line note in deploy/truenas/core/README.md.

Open Decisions

Should the Stats() method live on Subsystem or on Handler? The peer count is in Handler's per-stream peer index; stream count is in Subsystem's registry; UDP port pool is in portalloc. Easiest shape: Subsystem.Stats() is the public surface and internally gathers from Handler (via the existing teardown-hook plumbing) and portalloc. Decide at implementation time based on which surface exposes the cleanest seams.
Should histograms also include a core label, given it's already a ConstLabels? Yes — ConstLabels is automatically present on every sample, no per-call overhead, and federations need it.
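Should Prometheus retention be configurable via .env? Defaulting to 15d covers the realistic window for "what happened last week?" queries. Adding PROM_RETENTION_DAYS=15d to .env is a one-line change. Decision: include it as an optional variable defaulting to 15d; the compose command would then pass ${PROM_RETENTION_DAYS:-15d} to --storage.tsdb.retention.time instead of the hardcoded value.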
Import-alias collision. The local package is package prometheus (at github.com/datarhei/core/v16/prometheus) and client_golang is also package prometheus. Files in app/webrtc/ that need both must alias one — convention is coreprom "github.com/datarhei/core/v16/prometheus". Implementation note only; doesn't change the design.

References

Prometheus client_golang
Prometheus instrumentation best practices
Histogram bucket design
Grafana provisioning docs
v0.1 design: docs/design/2026-04-16-datarhei-dragon-fork-webrtc-design.md
M2 integration: docs/design/2026-04-17-datarhei-dragon-fork-m2-webrtc-core-integration.md
