Replay Failed Jobs
When to use this runbook
- Queue tasks are in a permanent error state
- A root cause has been fixed and the affected tasks need to be replayed
- A batch of tasks failed due to a transient incident
Step 1: Identify failed tasks
```shell
# List failed tasks
# Adapt to your queue table schema
mysql -e "SELECT id, type, error_message, created_at FROM queue_table WHERE status = 'error' ORDER BY created_at DESC LIMIT 50;"
```
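The listing above returns individual rows. When many tasks have failed, an aggregated view can make clusters easier to spot. The following is a minimal sketch, assuming the same `queue_table` columns as the query above; the helper name `summarize_failed_tasks` and the 80-character prefix length are illustrative choices, not part of any existing tooling.

```shell
# Hypothetical helper: count failed tasks per task type and error-message prefix.
# Assumes the queue_table schema used in Step 1; adapt to your own schema.
summarize_failed_tasks() {
  mysql -e "
    SELECT type,
           LEFT(error_message, 80) AS error_prefix,
           COUNT(*)                AS failures,
           MIN(created_at)         AS first_seen,
           MAX(created_at)         AS last_seen
    FROM queue_table
    WHERE status = 'error'
    GROUP BY type, error_prefix
    ORDER BY failures DESC;"
}
```

Calling `summarize_failed_tasks` prints one row per task type and error prefix, with the largest groups first.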
Step 2: Analyze the errors
Group the errors by category, since each category calls for a different replay strategy:
- Connection errors (external API): likely transient, replay directly
- Validation errors (incorrect data): require a fix before replay
- Deadlocks: replay directly once the underlying concurrency issue is resolved
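The grouping above can be sketched as a small shell function that maps an error message to one of the three categories. This is only an illustration: the function name `classify_error` and the match patterns are assumptions, and they should be replaced with the messages your own workers actually emit.

```shell
# Hypothetical classifier: maps an error message to a replay category.
# The patterns below are illustrative assumptions, not an exhaustive list.
classify_error() {
  # Lowercase the message so that matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case $msg in
    *deadlock*)
      echo "deadlock" ;;      # replay after resolving the concurrency issue
    *"connection refused"*|*"timed out"*|*timeout*)
      echo "transient" ;;     # likely safe to replay directly
    *validation*|*invalid*)
      echo "validation" ;;    # fix the data before replaying
    *)
      echo "unknown" ;;       # needs manual review
  esac
}
```

Messages that match none of the patterns fall into `unknown`, so they are reviewed by hand instead of being replayed by default.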
Step 3: Replay the tasks
```shell
# Reset failed tasks back to pending
# Adapt the command to your project's replay mechanism
mysql -e "UPDATE queue_table SET status = 'pending', error_message = NULL WHERE status = 'error' AND type = 'TARGET_TYPE';"
```
:::caution Warning
Do not replay tasks whose source data has changed without verifying that the processing is idempotent.
:::
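One way to reduce the risk of replaying more than intended is to count the affected rows before updating them. The sketch below wraps the reset in a function that performs a dry run unless `--apply` is passed. The function name `replay_failed_tasks` and the `--apply` flag are hypothetical, and the table and column names are the same assumptions as in the command above.

```shell
# Hypothetical helper: replay failed tasks of one type, dry run by default.
# Usage: replay_failed_tasks TASK_TYPE [--apply]
replay_failed_tasks() {
  replay_type=$1
  replay_mode=${2:-}

  # Accept only simple identifiers so the value is safe to embed in SQL.
  case $replay_type in
    ''|*[!A-Za-z0-9_]*)
      echo "invalid task type: '$replay_type'" >&2
      return 1
      ;;
  esac

  replay_where="status = 'error' AND type = '$replay_type'"

  if [ "$replay_mode" = "--apply" ]; then
    # Reset the matching tasks so the queue processor picks them up again.
    mysql -e "UPDATE queue_table SET status = 'pending', error_message = NULL WHERE $replay_where;"
  else
    # Dry run: report how many rows the UPDATE would touch.
    mysql -e "SELECT COUNT(*) AS would_replay FROM queue_table WHERE $replay_where;"
  fi
}
```

Run it once without `--apply` to check the count, then again with `--apply` when the number matches what Step 1 reported.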
Step 4: Validate the replay
- Restart the queue processor if needed
- Monitor replayed tasks to confirm they complete successfully
- Check for side effects in the database
- Compare results with expected data
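Monitoring the replayed tasks can be as simple as counting them per status until none remain in `error`. The following sketch assumes the same `queue_table` schema as the earlier steps. The function name `check_replay` is hypothetical, and the only status values it relies on are those used in this runbook.

```shell
# Hypothetical check: prints task counts per status for one task type and
# returns non-zero while any task of that type is still in the error state.
check_replay() {
  check_type=$1

  # Accept only simple identifiers so the value is safe to embed in SQL.
  case $check_type in
    ''|*[!A-Za-z0-9_]*)
      echo "invalid task type: '$check_type'" >&2
      return 2
      ;;
  esac

  # -N omits the header row, -B produces tab-separated output.
  check_counts=$(mysql -N -B -e "SELECT status, COUNT(*) FROM queue_table WHERE type = '$check_type' GROUP BY status;") || return 2

  printf '%s\n' "$check_counts" | awk -F'\t' '
    NF < 2        { next }                 # skip blank lines
                  { print $1 ": " $2 }     # echo each status with its count
    $1 == "error" { failed = $2 }
    END           { exit (failed + 0 > 0 ? 1 : 0) }
  '
}
```

Because the function returns a non-zero status while failures remain, it can be used in a polling loop such as `until check_replay TARGET_TYPE; do sleep 30; done`, where the 30-second interval is an arbitrary choice.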
Step 5: Document the incident
- Note the root cause
- Record the number of affected tasks
- Describe the fix that was applied
- List precautions to prevent recurrence