Design Distributed Job Scheduler - System Design Interview

📝 Problem Description

Design a distributed job scheduler that can execute tasks at specified times or intervals. It should handle millions of jobs, support cron-like scheduling, and guarantee at-least-once execution with proper failure handling.

👤 Use Cases

Developer wants to schedule a one-time job so that job executes at specified time

Developer wants to schedule a recurring job so that job executes on schedule (e.g., every hour)

System wants to detects job failure so that retries with exponential backoff

Developer wants to cancels a scheduled job so that job is removed from schedule

✅ Functional Requirements

•Schedule one-time jobs for future execution
•Schedule recurring jobs (cron syntax)
•Cancel or modify scheduled jobs
•Retry failed jobs with backoff
•Job prioritization
•Job dependencies (job B runs after job A)
•Job history and logging

⚡ Non-Functional Requirements

•Execute jobs within 1 second of scheduled time
•Handle 10M+ scheduled jobs
•Process 10K job executions per second
•At-least-once execution guarantee
•99.99% availability

⚠️ Constraints & Assumptions

•Jobs should not be missed even during failures
•No duplicate concurrent executions
•Support variable execution environments

📊 Capacity Estimation

👥 Users

1000 internal services scheduling jobs

💾 Storage

100GB for job metadata (10M jobs × 10KB)

⚡ QPS

Job triggers: 10K/sec, Schedule updates: 100/sec

📐 Assumptions

• 10M scheduled jobs active at any time
• Average job metadata: 10KB
• 60% recurring, 40% one-time jobs
• Average job execution: 10 seconds

💡 Key Concepts

CRITICAL

Time-Based Partitioning

Partition jobs by execution time for efficient due-job queries.

CRITICAL

Distributed Locking

Prevent multiple schedulers from claiming the same job.

HIGH

At-Least-Once Delivery

Jobs are requeued if executor doesn't acknowledge completion.

HIGH

Idempotent Jobs

Jobs should be safe to retry - design for at-least-once execution.

💡 Interview Tips

💡Start with the core polling mechanism
💡Address exactly-once vs at-least-once early
💡Discuss time-based partitioning for scale
💡Cover failure scenarios and retries