📝 Problem Description
Design a distributed job scheduler that can execute tasks at specified times or intervals. It should handle millions of jobs, support cron-like scheduling, and guarantee at-least-once execution with proper failure handling.
👤 Use Cases
1.
Developer wants to schedule a one-time job so that job executes at specified time
2.
Developer wants to schedule a recurring job so that job executes on schedule (e.g., every hour)
3.
System wants to detects job failure so that retries with exponential backoff
4.
Developer wants to cancels a scheduled job so that job is removed from schedule
✅ Functional Requirements
- •Schedule one-time jobs for future execution
- •Schedule recurring jobs (cron syntax)
- •Cancel or modify scheduled jobs
- •Retry failed jobs with backoff
- •Job prioritization
- •Job dependencies (job B runs after job A)
- •Job history and logging
⚡ Non-Functional Requirements
- •Execute jobs within 1 second of scheduled time
- •Handle 10M+ scheduled jobs
- •Process 10K job executions per second
- •At-least-once execution guarantee
- •99.99% availability
⚠️ Constraints & Assumptions
- •Jobs should not be missed even during failures
- •No duplicate concurrent executions
- •Support variable execution environments
📊 Capacity Estimation
👥 Users
1000 internal services scheduling jobs
💾 Storage
100GB for job metadata (10M jobs × 10KB)
⚡ QPS
Job triggers: 10K/sec, Schedule updates: 100/sec
📐 Assumptions
- • 10M scheduled jobs active at any time
- • Average job metadata: 10KB
- • 60% recurring, 40% one-time jobs
- • Average job execution: 10 seconds
💡 Key Concepts
CRITICAL
Time-Based Partitioning
Partition jobs by execution time for efficient due-job queries.
CRITICAL
Distributed Locking
Prevent multiple schedulers from claiming the same job.
HIGH
At-Least-Once Delivery
Jobs are requeued if executor doesn't acknowledge completion.
HIGH
Idempotent Jobs
Jobs should be safe to retry - design for at-least-once execution.
💡 Interview Tips
- 💡Start with the core polling mechanism
- 💡Address exactly-once vs at-least-once early
- 💡Discuss time-based partitioning for scale
- 💡Cover failure scenarios and retries