zampp2 GPU capabilities stuck on raw /dev detection — GPU_COUNT env override blocks nvidia-smi enrichment #108

Closed
opened 2026-05-26 18:19:21 -04:00 by zgaetano · 1 comment
Owner

Fixed in 04ce096. GPU_COUNT override now merges with the nvidia-smi cache when present, so the model / memory / driver version still flow through for the overridden indexes instead of being discarded.
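For readers who want a picture of what "merges with the nvidia-smi cache" could look like, here is a minimal sketch. It is not the code from 04ce096. The helper name `mergeOverrideWithCache`, the `/dev/nvidia<N>` device path, and the `'nvidia'` type value are assumptions made for illustration; only the `{device, type, index}` shape and the `name` / `memory_mb` fields come from this thread.

```javascript
// Hypothetical helper, not the project's actual implementation.
// gpuCount: the GPU_COUNT override as a number.
// gpuCache: records from the nvidia-smi probe, assumed to carry an `index` field.
function mergeOverrideWithCache(gpuCount, gpuCache) {
  // Index the cached nvidia-smi records by GPU index for direct lookup.
  const byIndex = new Map(gpuCache.map((gpu) => [gpu.index, gpu]));
  const gpus = [];
  for (let index = 0; index < gpuCount; index++) {
    // Raw triple as produced by the override path (device path and type are assumed values).
    const raw = { device: `/dev/nvidia${index}`, type: 'nvidia', index };
    // Cached fields (name, memory_mb, ...) are layered on top when this index was probed.
    gpus.push({ ...raw, ...(byIndex.get(index) || {}) });
  }
  return gpus;
}

// Example: the override says 1 GPU and the cache knows about index 0.
const merged = mergeOverrideWithCache(1, [
  { index: 0, name: 'NVIDIA L4', memory_mb: 23028 },
]);
console.log(merged);
```

With this shape, an overridden index that nvidia-smi never reported simply keeps its raw triple, so the override still controls the count while enrichment is applied wherever data exists.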

Author
Owner

Fix Plan — #108 GPU_COUNT env override blocks nvidia-smi enrichment

Root cause: node-agent/index.js:269-284. The zampp2 .env.worker sets GPU_COUNT=1. The override path pushes raw {device, type, index} triples and never sets name or memory_mb. The _gpuCache from probeGpusViaSmi() ("NVIDIA L4", 23028 MB) is discarded whenever GPU_COUNT is set.

Fix: Always prefer nvidia-smi enrichment when available. Use override only as fallback count:

if (_gpuCache.length > 0) {
  // use real nvidia-smi data regardless of GPU_COUNT
  gpus = _gpuCache;
} else {
  // fallback to override
  gpus = buildGpusFromOverride();
}

On zampp2: remove GPU_COUNT=1 from .env.worker once cache works.
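A slightly fuller sketch of the plan above, including one way the fallback could parse the override. The body of `buildGpusFromOverride`, the `selectGpus` wrapper, the device path pattern, and the `'nvidia'` type value are all assumptions for illustration; the real function in index.js may differ.

```javascript
// Hypothetical fallback: builds raw {device, type, index} triples from a GPU_COUNT string.
function buildGpusFromOverride(rawCount) {
  const count = Number.parseInt(rawCount, 10);
  // Missing, non-numeric, or non-positive values yield no GPUs.
  if (!Number.isInteger(count) || count <= 0) return [];
  return Array.from({ length: count }, (_, index) => ({
    device: `/dev/nvidia${index}`, // assumed device path pattern
    type: 'nvidia',                // assumed type value
    index,
  }));
}

// Hypothetical wrapper: prefer nvidia-smi data, use the override only as a fallback count.
function selectGpus(gpuCache, env = process.env) {
  if (gpuCache.length > 0) return gpuCache;
  return buildGpusFromOverride(env.GPU_COUNT);
}

// Example: with a populated cache, GPU_COUNT=1 no longer hides the model and memory.
console.log(selectGpus([{ index: 0, name: 'NVIDIA L4', memory_mb: 23028 }], { GPU_COUNT: '1' }));
```

Passing `env` as a parameter keeps the selection logic testable without touching the real process environment.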

Files: services/node-agent/index.js:269-284
Effort: ~1h
Priority: P2, blocks GPU-aware scheduling

Reference: WildDragonLLC/dragonflight#108
No description provided.