fix(jobs): real cancel for active jobs + multi-threaded thumbnail worker #29

Merged
zgaetano merged 1 commit from fix/jobs-cancel-and-concurrency into main 2026-05-23 17:23:52 -04:00

Summary

Two problems the operator hit while trying to unstick the thumbnail queue:

1. Cancel returned a misleading "404 not found"

DELETE /jobs/:id called job.remove(), which BullMQ flat-out refuses while a job is in the active state. The route caught the error and fell through to the 404 branch, even though the job clearly exists — Redis just won't let us drop it from under the worker.

Now we detect active explicitly and:

  • Call moveToFailed(err, '0', false) first. Token '0' bypasses the per-worker lock check (the operator-side cancel doesn't hold the worker lock); that transitions active → failed and frees the concurrency slot.
  • If moveToFailed itself fails (lock owned by a live worker), discard() as a fallback so at least the result is thrown away.
  • Then attempt remove() as before. If it still fails, drop the job's Redis key directly via queue.client.del. Last-resort obliteration for genuinely stalled jobs.
  • Stop swallowing getJob() errors — if Redis is sad, surface via next(err) instead of returning a misleading 404.
  • Return { cancelled: true } when the job was active, so any future client can distinguish "cancelled a running job" from "removed a queued job".
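
The cancel sequence above can be sketched roughly as follows. This is a minimal TypeScript sketch under stated assumptions, not the code from this PR: `JobLike`, `cancelJob`, and the `dropKey` callback are hypothetical names, `JobLike` mirrors only the BullMQ `Job` methods named in the bullets, and both the error message and the `{ cancelled: false }` result for non-active jobs are assumptions.

```typescript
// Hypothetical subset of the BullMQ Job methods used by the cancel path.
interface JobLike {
  getState(): Promise<string>;
  moveToFailed(err: Error, token: string, fetchNext?: boolean): Promise<unknown>;
  discard(): void;
  remove(): Promise<void>;
}

// Assumed response shape: only `cancelled` is described in the PR.
interface CancelResult {
  cancelled: boolean;
}

// dropKey is a hypothetical callback standing in for the direct Redis
// key deletion (the "last resort" step).
async function cancelJob(
  job: JobLike,
  dropKey: () => Promise<void>,
): Promise<CancelResult> {
  // Errors from getState() propagate to the caller instead of being swallowed.
  const wasActive = (await job.getState()) === "active";

  if (wasActive) {
    try {
      // Token "0" skips the per-worker lock check; false = do not fetch a next job.
      await job.moveToFailed(new Error("Cancelled by operator"), "0", false);
    } catch {
      // Fallback when the active -> failed transition is refused.
      job.discard();
    }
  }

  try {
    await job.remove();
  } catch {
    // Last resort: delete the job's Redis key directly.
    await dropKey();
  }

  return { cancelled: wasActive };
}
```

In the real route, `dropKey` would presumably wrap the `queue.client.del` call on the job's key, and any `getJob()` failure would already have been passed to `next(err)` before this function runs.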

2. Every queue ran with concurrency 1

A single slow or stalled thumbnail blocked every other thumbnail. Now:

| Queue | Concurrency | Env override |
|---|---|---|
| proxy | 2 | PROXY_CONCURRENCY |
| thumbnail | 4 | THUMBNAIL_CONCURRENCY |
| conform | 1 | CONFORM_CONCURRENCY |
| import | 1 (locked) | (none) |

Proxy + conform stay lower because they're heavy ffmpeg transcodes; tune via env on bigger nodes. Imports stay at 1 to avoid burning rate limits.
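
One way the defaults and env overrides could be resolved is sketched below. This is an illustration only: `resolveConcurrency`, `formatConcurrency`, `DEFAULTS`, and `ENV_OVERRIDES` are hypothetical names, and falling back to the default on a missing or invalid value is an assumption, since the PR text does not say how bad input is handled. The env map is passed in as a parameter so the sketch stays self-contained; a worker would likely pass `process.env`.

```typescript
type QueueName = "proxy" | "thumbnail" | "conform" | "import";

// Defaults taken from the table above.
const DEFAULTS: Record<QueueName, number> = {
  proxy: 2,
  thumbnail: 4,
  conform: 1,
  import: 1,
};

// import has no entry: it is locked at 1.
const ENV_OVERRIDES: Partial<Record<QueueName, string>> = {
  proxy: "PROXY_CONCURRENCY",
  thumbnail: "THUMBNAIL_CONCURRENCY",
  conform: "CONFORM_CONCURRENCY",
};

function resolveConcurrency(
  env: Record<string, string | undefined>,
): Record<QueueName, number> {
  const result: Record<QueueName, number> = { ...DEFAULTS };
  for (const name of Object.keys(ENV_OVERRIDES) as QueueName[]) {
    const envVar = ENV_OVERRIDES[name];
    const raw = envVar === undefined ? undefined : env[envVar];
    const parsed = raw === undefined ? NaN : Number(raw);
    // Assumption: anything that is not a positive integer keeps the default.
    if (Number.isInteger(parsed) && parsed >= 1) {
      result[name] = parsed;
    }
  }
  return result;
}

// Builds the startup log line in the format listed in the test plan.
function formatConcurrency(c: Record<QueueName, number>): string {
  return `Concurrency: proxy=${c.proxy} thumbnail=${c.thumbnail} conform=${c.conform} import=${c.import}`;
}
```

With an empty env map, `formatConcurrency(resolveConcurrency({}))` yields `Concurrency: proxy=2 thumbnail=4 conform=1 import=1`.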

Roll-out

  1. Merge → rebuild mam-api + worker
  2. Worker restart automatically picks up the new concurrency
  3. Operator can now cancel the stuck thumbnail; queue drains; future thumbnails run 4-up

Test plan

  • Trigger a thumbnail job, click Cancel while it's running — succeeds without 404
  • Verify queue's active count drops to 0 after cancel
  • Drop 5+ thumbnail jobs at once — up to 4 process in parallel
  • Worker startup log: Concurrency: proxy=2 thumbnail=4 conform=1 import=1

🤖 Generated with Claude Code

zgaetano added 1 commit 2026-05-23 17:23:42 -04:00
DELETE /jobs/:id was throwing "404 not found" when the operator tried to
cancel a running job. BullMQ refuses job.remove() while a job is in the
active state; the route caught that error and fell through to the
404 branch, which was misleading because the job actually exists — the
queue was just refusing to drop it from under the worker.

Fix:
- Detect 'active' state explicitly and call moveToFailed(err, '0', false)
  first. Token '0' bypasses the per-worker lock check (the operator-side
  cancel doesn't hold the worker lock). That transitions active -> failed
  and frees the queue's concurrency slot.
- If moveToFailed itself fails (lock owned by a live worker), fall back
  to job.discard() so at least the result is thrown away.
- If remove() then fails (stalled, broken state), drop the job's Redis
  key directly via queue.client. Last-resort obliteration.
- Stop swallowing getJob() errors — if Redis is sad, surface it via
  next(err) instead of returning a misleading 404.
- Return { cancelled: true } when the job was active, so the client
  can show "Cancelled" rather than "Removed" in any future toast.

While here: thumbnail jobs now run with concurrency 4 by default and
proxy jobs with 2; conform and import stay at 1. Every queue defaulted
to concurrency 1 before, so a single stalled job blocked the entire queue.
All three are overridable via PROXY_CONCURRENCY / THUMBNAIL_CONCURRENCY
/ CONFORM_CONCURRENCY env vars for nodes with more headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zgaetano merged commit a6d789279c into main 2026-05-23 17:23:52 -04:00
zgaetano deleted branch fix/jobs-cancel-and-concurrency 2026-05-23 17:23:52 -04:00
Reference: WildDragonLLC/dragonflight#29