fix(jobs): real cancel for active jobs + multi-threaded thumbnail worker #29
Reference: WildDragonLLC/dragonflight#29
Summary
Two problems the operator hit while trying to unstick the thumbnail queue:
1. Cancel returned a misleading "404 not found"
DELETE /jobs/:id called job.remove(), which BullMQ flat-out refuses while a job is in the active state. The route caught the error and fell through to the 404 branch, even though the job clearly exists: Redis just won't let us drop it from under the worker.
Now we detect active explicitly and:
- Call moveToFailed(err, '0', false) first. Token '0' bypasses the per-worker lock check (the operator-side cancel doesn't hold the worker lock); that transitions active → failed and frees the concurrency slot.
- If moveToFailed itself fails (lock owned by a live worker), call discard() as a fallback so at least the result is thrown away.
- If remove() still fails, drop the job's Redis key directly via queue.client.del. Last-resort obliteration for genuinely stalled rows.
- Stop swallowing getJob() errors: if Redis is sad, surface via next(err) instead of returning a misleading 404.
- Return { cancelled: true } when the job was active, so any future client can distinguish "cancelled a running job" from "removed a queued job".
2. Every queue ran with concurrency 1
A single slow or stalled thumbnail blocked every other thumbnail. Now:
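- proxy: default 2, override via PROXY_CONCURRENCY
- thumbnail: default 4, override via THUMBNAIL_CONCURRENCY
- conform: default 1, override via CONFORM_CONCURRENCY
Proxy + conform stay lower because they're heavy ffmpeg transcodes; tune via env on bigger nodes. Imports stay at 1 to avoid burning rate limits.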
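As a rough illustration of the env overrides, the following is a minimal sketch and not the code from this PR. The helper names parseConcurrency and resolveConcurrency are hypothetical, and the fallback-on-invalid-input behaviour is an assumption. The defaults are the ones stated in this PR.

```typescript
// Defaults as stated in the PR text. The PR describes no env override for import.
const DEFAULTS = { proxy: 2, thumbnail: 4, conform: 1, import: 1 } as const;

// Hypothetical helper: accept a positive integer, otherwise use the default.
// Rejecting 0, negatives and non-numeric input is an assumption of this sketch.
function parseConcurrency(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isInteger(n) && n >= 1 ? n : fallback;
}

// Hypothetical helper: resolve the per-queue concurrency from an env-like map.
function resolveConcurrency(env: Record<string, string | undefined>) {
  return {
    proxy: parseConcurrency(env.PROXY_CONCURRENCY, DEFAULTS.proxy),
    thumbnail: parseConcurrency(env.THUMBNAIL_CONCURRENCY, DEFAULTS.thumbnail),
    conform: parseConcurrency(env.CONFORM_CONCURRENCY, DEFAULTS.conform),
    import: DEFAULTS.import, // pinned at 1 to protect rate limits
  };
}
```

Each resolved value would then presumably be passed as the concurrency option when the corresponding BullMQ worker is constructed, for example with THUMBNAIL_CONCURRENCY=8 on a node with more headroom.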
Roll-out
Test plan
- active count drops to 0 after cancel
- Concurrency: proxy=2 thumbnail=4 conform=1 import=1
🤖 Generated with Claude Code
DELETE /jobs/:id was throwing "404 not found" when the operator tried to cancel a running job. BullMQ refuses job.remove() while a job is in the active state; the route caught that error and fell through to the 404 branch, which was misleading because the job actually exists. The queue was just refusing to drop it from under the worker.

Fix:
- Detect 'active' state explicitly and call moveToFailed(err, '0', false) first. Token '0' bypasses the per-worker lock check (the operator-side cancel doesn't hold the worker lock). That transitions active -> failed and frees the queue's concurrency slot.
- If moveToFailed itself fails (lock owned by a live worker), fall back to job.discard() so at least the result is thrown away.
- If remove() then fails (stalled, broken state), drop the job's Redis key directly via queue.client. Last-resort obliteration.
- Stop swallowing getJob() errors: if Redis is sad, surface it via next(err) instead of returning a misleading 404.
- Return { cancelled: true } when the job was active, so the client can show "Cancelled" rather than "Removed" in any future toast.

While here: thumbnail jobs now run with concurrency 4 by default (proxy 2, conform 1, import 1 unchanged). Every queue defaulted to concurrency 1 before, so a single stalled job blocked the entire queue. All three are overridable via PROXY_CONCURRENCY / THUMBNAIL_CONCURRENCY / CONFORM_CONCURRENCY env vars for nodes with more headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
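The cancel sequence from the fix list can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the route code from this PR. JobLike is a hypothetical structural type covering only the job methods named above, cancelJob is a hypothetical helper, and dropKey is a hypothetical callback standing in for the direct Redis key deletion via queue.client. Returning cancelled: false for non-active jobs is also an assumption.

```typescript
// Hypothetical structural type: only the job methods this PR mentions.
interface JobLike {
  getState(): Promise<string>;
  moveToFailed(err: Error, token: string, fetchNext: boolean): Promise<unknown>;
  discard(): unknown;
  remove(): Promise<unknown>;
}

// Hypothetical helper mirroring the described order of operations.
// dropKey is an assumed callback that deletes the job's Redis key directly.
async function cancelJob(
  job: JobLike,
  dropKey: () => Promise<unknown>,
): Promise<{ cancelled: boolean }> {
  // Let getState() errors propagate so the caller can pass them to next(err).
  const wasActive = (await job.getState()) === "active";

  if (wasActive) {
    try {
      // Token '0' is the lock bypass described in the PR text.
      await job.moveToFailed(new Error("Cancelled by operator"), "0", false);
    } catch {
      // Lock owned by a live worker: discard as a fallback.
      job.discard();
    }
  }

  try {
    await job.remove();
  } catch {
    // Last resort for stalled rows: drop the Redis key directly.
    await dropKey();
  }

  return { cancelled: wasActive };
}
```

In a route handler, the caller would look the job up first, return 404 only when the lookup yields no job, and forward any thrown error to next(err).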