Replay Failed Jobs

When to use this runbook

  • Queue tasks are in a permanent error state
  • Tasks need to be replayed after a root cause has been fixed
  • A batch of tasks failed due to a transient incident

Step 1: Identify failed tasks

# List the 50 most recent failed tasks
# queue_table and its columns are placeholders; adapt them to your queue table schema
mysql -e "SELECT id, type, error_message, created_at FROM queue_table WHERE status = 'error' ORDER BY created_at DESC LIMIT 50;"

Step 2: Analyze the errors

Group errors by type:

  • Connection errors (external API) — likely transient, replay directly
  • Validation errors (incorrect data) — require a fix before replay
  • Deadlocks — direct replay possible after resolving the concurrency issue
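
The grouping above can be sketched as a small shell helper that sorts each error message into one of the three categories. The function name `classify_error` and the match patterns are assumptions made for illustration; the real patterns depend on the messages your workers write to `error_message`.

```shell
# Hypothetical helper: map an error message to one of the runbook's categories.
# The patterns below are assumptions; adjust them to your own error messages.
classify_error() {
  local msg
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *deadlock*)                          echo "deadlock" ;;
    *connection*|*timeout*|*timed\ out*) echo "connection" ;;
    *validation*|*invalid*)              echo "validation" ;;
    *)                                   echo "unknown" ;;
  esac
}

# Possible usage against the example table (not run here):
# mysql -N -B -e "SELECT error_message FROM queue_table WHERE status = 'error';" |
#   while IFS= read -r m; do classify_error "$m"; done | sort | uniq -c
```

Anything that lands in `unknown` is worth reading by hand before deciding whether it can be replayed.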

Step 3: Replay the tasks

# Reset failed tasks of one type back to pending
# TARGET_TYPE is a placeholder for the task type to replay
# Adapt the command to your project's replay mechanism
mysql -e "UPDATE queue_table SET status = 'pending', error_message = NULL WHERE status = 'error' AND type = 'TARGET_TYPE';"

:::caution Warning
Do not replay tasks whose source data has changed without verifying that the processing is idempotent.
:::
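
A more cautious variant of the reset in Step 3 is to replay in bounded batches and to review the statement before running it. The sketch below only prints the SQL. The helper name `build_replay_sql`, the example type `send_email`, and the batch size are hypothetical; the table and column names follow the example schema used in this runbook.

```shell
# Hypothetical helper: print a bounded replay statement for one task type.
# Nothing is executed; pipe the output to mysql once it has been reviewed.
build_replay_sql() {
  local task_type="$1" limit="$2"
  # Accept only letters, digits, and underscores so the type is safe to inline.
  case "$task_type" in
    ''|*[!A-Za-z0-9_]*) echo "invalid task type" >&2; return 1 ;;
  esac
  # Accept only a non-empty run of digits as the batch size.
  case "$limit" in
    ''|*[!0-9]*) echo "invalid limit" >&2; return 1 ;;
  esac
  printf "UPDATE queue_table SET status = 'pending', error_message = NULL WHERE status = 'error' AND type = '%s' ORDER BY created_at LIMIT %s;\n" "$task_type" "$limit"
}

# Possible usage (not run here): review first, then execute.
# build_replay_sql send_email 100
# build_replay_sql send_email 100 | mysql
```

Replaying the oldest failures first in small batches limits the impact if the processing turns out not to be idempotent.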

Step 4: Validate the replay

  1. Restart the queue processor if needed
  2. Monitor replayed tasks to confirm they complete successfully
  3. Check for side effects in the database
  4. Compare results with expected data
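
The monitoring step can be partly automated. As a sketch, the helper below reads `status<TAB>count` rows, prints them, and returns a non-zero exit code if any task is back in the `error` state. The name `check_replay` and the `done` status are assumptions; only `error` and `pending` appear in this runbook's example schema.

```shell
# Hypothetical check: read "status<TAB>count" rows on stdin, print a summary,
# and exit non-zero if any task is in the 'error' state.
check_replay() {
  awk -F'\t' '
    { print $1 ": " $2 }
    $1 == "error" && $2 > 0 { failed = 1 }
    END { exit (failed ? 1 : 0) }
  '
}

# Possible usage against the example table (not run here):
# mysql -N -B -e "SELECT status, COUNT(*) FROM queue_table WHERE type = 'TARGET_TYPE' GROUP BY status;" | check_replay
```

A zero exit code only shows that nothing has failed again so far; tasks still in `pending` have not yet been processed.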

Step 5: Document the incident

  • The root cause
  • The number of affected tasks
  • The fix that was applied
  • Precautions to prevent recurrence