← Back to All Questions
Easy~40 minInfrastructure

Design Distributed Job Scheduler

UberNetflixLinkedInGoogleAmazon

📝 Problem Description

Design a distributed job scheduler that can execute tasks at specified times or intervals. It should handle millions of jobs, support cron-like scheduling, and guarantee at-least-once execution with proper failure handling.

👤 Use Cases

1.
Developer wants to schedule a one-time job so that job executes at specified time
2.
Developer wants to schedule a recurring job so that job executes on schedule (e.g., every hour)
3.
System wants to detects job failure so that retries with exponential backoff
4.
Developer wants to cancels a scheduled job so that job is removed from schedule

✅ Functional Requirements

  • Schedule one-time jobs for future execution
  • Schedule recurring jobs (cron syntax)
  • Cancel or modify scheduled jobs
  • Retry failed jobs with backoff
  • Job prioritization
  • Job dependencies (job B runs after job A)
  • Job history and logging

⚡ Non-Functional Requirements

  • Execute jobs within 1 second of scheduled time
  • Handle 10M+ scheduled jobs
  • Process 10K job executions per second
  • At-least-once execution guarantee
  • 99.99% availability

⚠️ Constraints & Assumptions

  • Jobs should not be missed even during failures
  • No duplicate concurrent executions
  • Support variable execution environments

📊 Capacity Estimation

👥 Users
1000 internal services scheduling jobs
💾 Storage
100GB for job metadata (10M jobs × 10KB)
⚡ QPS
Job triggers: 10K/sec, Schedule updates: 100/sec
📐 Assumptions
  • 10M scheduled jobs active at any time
  • Average job metadata: 10KB
  • 60% recurring, 40% one-time jobs
  • Average job execution: 10 seconds

💡 Key Concepts

CRITICAL
Time-Based Partitioning
Partition jobs by execution time for efficient due-job queries.
CRITICAL
Distributed Locking
Prevent multiple schedulers from claiming the same job.
HIGH
At-Least-Once Delivery
Jobs are requeued if executor doesn't acknowledge completion.
HIGH
Idempotent Jobs
Jobs should be safe to retry - design for at-least-once execution.

💡 Interview Tips

  • 💡Start with the core polling mechanism
  • 💡Address exactly-once vs at-least-once early
  • 💡Discuss time-based partitioning for scale
  • 💡Cover failure scenarios and retries