DELETE /jobs/:id was returning "404 not found" when the operator tried
to cancel a running job. BullMQ refuses job.remove() while a job is in
the active state; the route caught that error and fell through to the
404 branch. That was misleading because the job actually exists: the
queue was just refusing to drop it from under the worker.
Fix:
- Detect 'active' state explicitly and call moveToFailed(err, '0', false)
first. Token '0' bypasses the per-worker lock check (the operator-side
cancel doesn't hold the worker lock). That transitions active -> failed
and frees the queue's concurrency slot.
- If moveToFailed itself fails (lock owned by a live worker), fall back
to job.discard() so the job is at least marked as not to be retried.
- If remove() then fails (stalled, broken state), drop the job's Redis
key directly via queue.client. Last-resort obliteration.
- Stop swallowing getJob() errors: if Redis is unreachable or erroring,
surface it via next(err) instead of returning a misleading 404.
- Return { cancelled: true } when the job was active, so the client
can show "Cancelled" rather than "Removed" in any future toast.
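The steps above can be sketched roughly as follows. This is an
illustration, not the committed route: cancelJob is a hypothetical
helper, `queue` is anything shaped like a BullMQ Queue, and the 404 body,
the { removed: true } body for non-active jobs, and the use of
queue.toKey(id) to find the job's Redis key are all assumptions.

```javascript
// Hypothetical helper; `queue` is assumed to expose getJob(), client and toKey()
// the way a BullMQ Queue does. Nothing here imports bullmq directly.
async function cancelJob(queue, id) {
  // No try/catch here: a Redis failure in getJob() propagates to the caller,
  // which can hand it to next(err) instead of answering 404.
  const job = await queue.getJob(id);
  if (!job) return { status: 404, body: { error: 'not found' } }; // assumed shape

  const wasActive = (await job.getState()) === 'active';
  if (wasActive) {
    try {
      // Token '0' skips the lock check; `false` asks not to fetch a next job.
      await job.moveToFailed(new Error('Cancelled by operator'), '0', false);
    } catch (err) {
      // Fallback named in the commit message.
      await job.discard();
    }
  }

  try {
    await job.remove();
  } catch (err) {
    // Last resort: delete the job's key directly (key lookup is an assumption).
    const client = await queue.client;
    await client.del(queue.toKey(id));
  }

  // { removed: true } is an assumed body for the non-active case.
  return { status: 200, body: wasActive ? { cancelled: true } : { removed: true } };
}
```

An Express handler would then only need to await cancelJob(queue,
req.params.id), send the returned status and body, and pass any thrown
error to next(err).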
While here: thumbnail jobs now run with concurrency 4 by default
(proxy 2, conform 1, import 1 unchanged). Every queue defaulted to
concurrency 1 before, so a single stalled job blocked the entire queue.
The proxy, thumbnail, and conform values are overridable via
PROXY_CONCURRENCY / THUMBNAIL_CONCURRENCY / CONFORM_CONCURRENCY env vars
for nodes with more headroom.
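A minimal sketch of how those defaults and overrides could be resolved.
The env var names and default values come from this message;
resolveConcurrency and the rule of ignoring non-positive or non-numeric
values are assumptions for illustration.

```javascript
// Defaults as stated in this commit message.
const DEFAULT_CONCURRENCY = { proxy: 2, thumbnail: 4, conform: 1, import: 1 };

// Only these three queues have an env override.
const ENV_OVERRIDES = {
  proxy: 'PROXY_CONCURRENCY',
  thumbnail: 'THUMBNAIL_CONCURRENCY',
  conform: 'CONFORM_CONCURRENCY',
};

// Hypothetical helper: returns the effective concurrency per queue.
function resolveConcurrency(env = process.env) {
  const result = { ...DEFAULT_CONCURRENCY };
  for (const [queueName, varName] of Object.entries(ENV_OVERRIDES)) {
    const parsed = Number.parseInt(env[varName] ?? '', 10);
    // Assumption: anything that is not a positive integer keeps the default.
    if (Number.isInteger(parsed) && parsed > 0) result[queueName] = parsed;
  }
  return result;
}
```

The result for a given queue would be passed as the `concurrency` option
when that queue's Worker is constructed.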
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Ingest → YouTube. UI takes a URL + project, API enqueues a BullMQ
"import" job, worker shells out to yt-dlp, lands the MP4 in S3 at the
same originals/{assetId}/... path uploads use, then hands off to the
existing proxy queue. Imported assets share one lifecycle with uploads
from that point on.
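The hand-off can be pictured with a small sketch. Everything named here
is hypothetical: runImport, the injected download and uploadOriginal
steps, and the 'proxy' job name and payload are assumptions, and the
real S3 key layout is whatever the upload path already produces.

```javascript
// Hypothetical processor for the "import" queue. The three steps are
// injected so the flow is visible without yt-dlp, S3 or Redis in hand.
async function runImport(data, deps) {
  const { url, assetId } = data;

  // Step 1: fetch the video to a local file (the real worker shells out to yt-dlp).
  const localPath = await deps.download(url);

  // Step 2: store it under the same originals/ prefix that uploads use;
  // the helper is assumed to return the S3 key it wrote.
  const key = await deps.uploadOriginal(assetId, localPath);

  // Step 3: enqueue on the existing proxy queue (job name and payload assumed).
  await deps.proxyQueue.add('proxy', { assetId, key });

  return { assetId, key };
}
```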
Worker container picks up yt-dlp + python3 (apk on alpine, apt on the
GPU variant). The new 'import' queue is registered in jobs.js so it
appears in the Jobs SSE stream and retry/delete work for free.
Spec: docs/superpowers/specs/2026-05-23-youtube-importer-design.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A thumbnail job from earlier stayed 'active' for 6+ hours: the worker
was restarted at 70% progress, BullMQ left the job in the active set,
and there was no stall reaper because the worker was created with only
the default options.
Worker now passes stalledInterval: 30000, lockDuration: 60000,
lockRenewTime: 15000, maxStalledCount: 1 to the Worker constructor. If a
run dies, its lock lapses, the stalled check (which runs every 30s)
moves the job back to waiting, and a 'stalled' event is logged. While a
job is running normally, its lock is renewed every 15s.
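A sketch of that wiring, for illustration only. The four option values
are the ones listed above; createWorker, the injected WorkerClass
(expected to be BullMQ's Worker), and the log line format are
assumptions.

```javascript
// Stall-handling options as listed in this commit message (milliseconds).
const STALL_OPTIONS = {
  stalledInterval: 30000, // how often the stalled check runs
  lockDuration: 60000,    // how long a job lock lives without renewal
  lockRenewTime: 15000,   // how often a healthy job renews its lock
  maxStalledCount: 1,     // stalls tolerated before the job is failed
};

// Hypothetical factory; pass BullMQ's Worker as WorkerClass in real use.
function createWorker(WorkerClass, queueName, processor, connection, log = console) {
  const worker = new WorkerClass(queueName, processor, { connection, ...STALL_OPTIONS });
  // BullMQ workers emit 'stalled' with the job id when a job is reclaimed.
  worker.on('stalled', (jobId) => {
    log.warn(`[${queueName}] job ${jobId} stalled and was moved back to waiting`);
  });
  return worker;
}
```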
Jobs UI gains a 'Kill' button per row next to Details. It calls
DELETE /api/v1/jobs/:id, which already removes the job from Redis. Use
it on any row that looks stuck.
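On the client side the button's handler might look something like this.
killJob is a hypothetical name, the injectable fetchImpl exists only to
make the sketch testable, and the "Cancelled" / "Removed" labels assume
the API returns { cancelled: true } for jobs that were active.

```javascript
// Hypothetical click handler body for the Kill button.
async function killJob(id, fetchImpl = fetch) {
  const res = await fetchImpl(`/api/v1/jobs/${encodeURIComponent(id)}`, {
    method: 'DELETE',
  });
  if (!res.ok) throw new Error(`Kill failed: HTTP ${res.status}`);

  const body = await res.json();
  // Label for a toast; assumes `cancelled` is only set for active jobs.
  return body.cancelled === true ? 'Cancelled' : 'Removed';
}
```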