Scheduler tick has race condition — multi-node deploy will double-fire recorder starts #103

Closed
opened 2026-05-26 18:18:31 -04:00 by zgaetano · 1 comment
Owner

Fixed in 04ce096. `scheduler.js` now wraps each tick in `pg_try_advisory_lock(8210301)` so only one replica processes a given interval. The `pending` → `starting` and `running` → `stopping` transitions are atomic via `UPDATE … FOR UPDATE SKIP LOCKED`, so even without the lock each row can be claimed only once.
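
For readers who have not seen the commit, the advisory-lock wrapper described above might look roughly like the sketch below. This is not the code from 04ce096: only the lock key `8210301` comes from this comment. `runTickExclusively` and `processTick` are hypothetical names, and `pool` is assumed to be a node-postgres `Pool` passed in by the caller.

```javascript
// Lock key taken from the comment above; everything else is a hypothetical sketch.
const TICK_LOCK_KEY = 8210301;

// `pool` is assumed to be a node-postgres Pool. `processTick` is a hypothetical
// callback that does the actual claim-and-start work using the given client.
async function runTickExclusively(pool, processTick) {
  // Advisory locks are session-scoped, so lock and unlock must share one connection.
  const client = await pool.connect();
  try {
    const { rows } = await client.query(
      "SELECT pg_try_advisory_lock($1) AS acquired",
      [TICK_LOCK_KEY]
    );
    // Another replica holds the lock: skip this tick instead of waiting.
    if (!rows[0].acquired) return false;
    try {
      await processTick(client);
      return true;
    } finally {
      // Release the lock even if the tick threw.
      await client.query("SELECT pg_advisory_unlock($1)", [TICK_LOCK_KEY]);
    }
  } finally {
    client.release();
  }
}
```

Because `pg_try_advisory_lock` returns immediately rather than blocking, a replica that loses the race simply skips the interval and tries again on the next poll.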
Author
Owner

## Fix Plan: #103 Scheduler race condition (multi-node double-fire)

**Root cause:** `scheduler.js:31-130` polls every 15s. Two mam-api instances both `SELECT` the same `pending` row and both call `/start`; the second call then races on container creation.

**Fix: claim the row with `SELECT ... FOR UPDATE SKIP LOCKED`:**

```js
// In scheduler tick:
const client = await pool.connect();
try {
  await client.query("BEGIN");
  // Atomically claim the oldest due row; rows locked by another instance are skipped.
  // String literals use single quotes: in PostgreSQL, "pending" would be an identifier.
  const { rows: [job] } = await client.query(
    `UPDATE recorder_schedules SET status = 'triggering'
     WHERE id = (
       SELECT id FROM recorder_schedules
       WHERE status = 'pending' AND next_run <= NOW()
       ORDER BY next_run ASC
       LIMIT 1
       FOR UPDATE SKIP LOCKED
     )
     RETURNING *`
  );
  await client.query("COMMIT");
  if (job) await startRecorder(job);
} catch (err) {
  // Best-effort rollback; harmless if the transaction has already committed.
  await client.query("ROLLBACK").catch(() => {});
  throw err;
} finally {
  client.release();
}
```

This ensures only one instance claims each pending row.

**Files:** `src/scheduler.js:31-130`
**Effort:** ~1h
**Priority:** P1 (only triggers in multi-node deploy)
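
To make the root cause concrete, here is a small in-memory simulation, offered as an illustration rather than as code from the repository. The table, `racyTick`, `atomicTick`, and `simulate` are all hypothetical stand-ins: the array plays the role of `recorder_schedules`, and `await Promise.resolve()` plays the role of the round trip between the `SELECT` and the later write.

```javascript
// Hypothetical in-memory stand-in for the recorder_schedules table.
function makeTable() {
  return [{ id: 1, status: "pending" }];
}

// Racy pattern: read, yield, then write. Both instances still see "pending".
async function racyTick(table, started, instance) {
  const job = table.find((row) => row.status === "pending");
  await Promise.resolve(); // gap between the SELECT and the status update
  if (job) {
    job.status = "triggering";
    started.push(`${instance}:${job.id}`);
  }
}

// Atomic pattern: find and flip the status in one step (analogous to UPDATE ... RETURNING).
async function atomicTick(table, started, instance) {
  const job = table.find((row) => row.status === "pending");
  if (job) job.status = "triggering";
  await Promise.resolve(); // the gap no longer matters: the row is already claimed
  if (job) started.push(`${instance}:${job.id}`);
}

// Run two scheduler "instances" concurrently against the same table.
async function simulate(tick) {
  const table = makeTable();
  const started = [];
  await Promise.all([tick(table, started, "A"), tick(table, started, "B")]);
  return started;
}
```

With `simulate(racyTick)` both instances start recorder 1, which is the double-fire described in this issue. With `simulate(atomicTick)` only the first instance to claim the row starts it.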
Reference: WildDragonLLC/dragonflight#103