Replay Failed Jobs

When to use this runbook

Queue tasks are in a permanent error state
After fixing a root cause, tasks need to be replayed
A batch of tasks failed due to a transient incident

Step 1: Identify failed tasks

# List failed tasks
# Adapt to your queue table schema
mysql -e "SELECT id, type, error_message, created_at FROM queue_table WHERE status = 'error' ORDER BY created_at DESC LIMIT 50;"

Step 2: Analyze the errors

Group errors by type:

Connection errors (external API): likely transient; replay directly once the external service is reachable again
Validation errors (incorrect data): fix the data or the validation logic before replaying
Deadlocks: resolve the concurrency issue first, then replay directly
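
The grouping above can be roughly automated by keyword. The following is a minimal sketch, not a definitive classifier: the function names `classify_error` and `classify_rows` and the matched substrings are assumptions made for illustration, so adjust the patterns to the messages your workers actually write to `error_message`.

```shell
# Hypothetical helper: map one error message to a triage bucket.
# The keyword patterns are illustrative assumptions; tune them to your own logs.
classify_error() {
  # Lowercase the message so matching is case-insensitive.
  ce_msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ce_msg" in
    *deadlock*)                           echo "deadlock" ;;
    *connection*|*timeout*|*"timed out"*) echo "connection" ;;
    *validation*|*invalid*)               echo "validation" ;;
    *)                                    echo "unknown" ;;
  esac
}

# Read tab-separated "id<TAB>error_message" rows on stdin
# and print "id<TAB>bucket" for each row.
classify_rows() {
  cr_tab=$(printf '\t')
  while IFS="$cr_tab" read -r cr_id cr_msg; do
    printf '%s\t%s\n' "$cr_id" "$(classify_error "$cr_msg")"
  done
}
```

One way to feed it, assuming the same `queue_table` columns as in Step 1, is `mysql -N -B -e "SELECT id, error_message FROM queue_table WHERE status = 'error';" | classify_rows`. Rows that land in `unknown` still need a manual look.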

Step 3: Replay the tasks

# Reset failed tasks back to pending
# Adapt the command to your project's replay mechanism
mysql -e "UPDATE queue_table SET status = 'pending', error_message = NULL WHERE status = 'error' AND type = 'TARGET_TYPE';"

:::caution Warning
Do not replay tasks whose source data has changed without verifying that the processing is idempotent.
:::
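
Before resetting anything, it can help to preview how many rows the reset would touch. The sketch below only generates SQL text. The function `build_replay_sql`, its `preview` and `apply` modes, and the identifier check are assumptions for illustration, not part of any existing replay tool. It assumes the same `status`, `type`, and `error_message` columns used above.

```shell
# Hypothetical helper: print the SQL for previewing or applying a replay.
# Usage: build_replay_sql preview|apply TABLE TYPE
build_replay_sql() {
  brs_mode=$1
  brs_table=$2
  brs_type=$3
  # Accept only letters, digits, and underscores so that the values
  # are safe to interpolate into the statement.
  for brs_arg in "$brs_table" "$brs_type"; do
    case "$brs_arg" in
      ''|*[!A-Za-z0-9_]*)
        echo "invalid table or type name" >&2
        return 1 ;;
    esac
  done
  case "$brs_mode" in
    preview)
      # Count the rows the reset would touch, without changing them.
      printf "SELECT COUNT(*) AS to_replay FROM %s WHERE status = 'error' AND type = '%s';\n" \
        "$brs_table" "$brs_type" ;;
    apply)
      # Same reset as the command above, restricted to one task type.
      printf "UPDATE %s SET status = 'pending', error_message = NULL WHERE status = 'error' AND type = '%s';\n" \
        "$brs_table" "$brs_type" ;;
    *)
      echo "mode must be preview or apply" >&2
      return 1 ;;
  esac
}
```

A possible workflow is to pipe `build_replay_sql preview queue_table TARGET_TYPE` into `mysql` (with your usual connection options), check that the count matches what Step 1 showed, and only then pipe the `apply` variant.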

Step 4: Validate the replay

Restart the queue processor if needed
Monitor replayed tasks to confirm they complete successfully
Check for side effects in the database
Compare results with expected data
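
Monitoring can be as simple as polling the per-status counts until nothing is left in `pending` or `error`. The sketch below is one way to summarize those counts. The function name `replay_status` is hypothetical, and it assumes that any status other than `pending` and `error` (for example a `done` status, which this runbook does not define) means the task has finished.

```shell
# Hypothetical helper: read tab-separated "status<TAB>count" rows on stdin,
# print the pending and error totals, and exit non-zero while any remain.
replay_status() {
  awk -F '\t' '
    $1 == "pending" { pending += $2 }
    $1 == "error"   { failed  += $2 }
    END {
      printf "pending=%d error=%d\n", pending, failed
      # Exit status 1 signals that the replay has not settled yet.
      exit ((pending + failed) > 0 ? 1 : 0)
    }
  '
}
```

It could be fed with `mysql -N -B -e "SELECT status, COUNT(*) FROM queue_table WHERE type = 'TARGET_TYPE' GROUP BY status;" | replay_status`. A non-zero exit status means tasks are still waiting or have failed again.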

Step 5: Document the incident

Record the root cause
Record the number of affected tasks
Record the fix that was applied
Record the precautions taken to prevent recurrence
