Medium · ~40 min · Infrastructure

Design On-Call Escalation System

PagerDuty · OpsGenie · VictorOps · Datadog · Slack

📝 Problem Description

Design an on-call escalation system like PagerDuty. When an alert fires, it notifies the on-call engineer. If not acknowledged, it escalates to the next person. Support schedules, rotations, and multiple notification channels.

👤 Use Cases

1. Monitoring wants to send an alert so that the on-call person is notified
2. Engineer wants to acknowledge an alert so that escalation stops
3. System wants to detect a missing acknowledgment after the timeout so that the alert escalates to the next person
4. Admin wants to configure a schedule so that rotations are applied automatically
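
The alert, acknowledge, and escalate flow in the use cases above can be sketched as a small state machine. This is a minimal illustration, not how PagerDuty implements it: the names `EscalationLevel`, `Alert`, `acknowledge`, and `tick` are hypothetical, and timestamps are plain seconds so the logic is easy to follow.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationLevel:
    responder: str   # who to notify at this level (hypothetical field)
    timeout_s: int   # how long to wait for an ack before moving on


@dataclass
class Alert:
    alert_id: str
    triggered_at: float
    acked_by: Optional[str] = None
    next_level: int = 0  # index of the next level that has not been notified


def acknowledge(alert: Alert, responder: str) -> None:
    """Record the first acknowledgment; later acks are ignored."""
    if alert.acked_by is None:
        alert.acked_by = responder


def tick(alert: Alert, levels: List[EscalationLevel], now: float) -> List[str]:
    """Return the responders that become due for notification at `now`.

    Level i is due at triggered_at plus the timeouts of all earlier levels.
    An acknowledged alert never escalates further.
    """
    if alert.acked_by is not None:
        return []
    newly_notified = []
    due_at = alert.triggered_at + sum(l.timeout_s for l in levels[:alert.next_level])
    while alert.next_level < len(levels) and now >= due_at:
        level = levels[alert.next_level]
        newly_notified.append(level.responder)
        due_at += level.timeout_s
        alert.next_level += 1
    return newly_notified


policy = [
    EscalationLevel("alice", 300),
    EscalationLevel("bob", 300),
    EscalationLevel("carol", 600),
]
alert = Alert("a-1", triggered_at=0)
first = tick(alert, policy, now=0)      # alice is notified immediately
quiet = tick(alert, policy, now=100)    # still inside alice's timeout
second = tick(alert, policy, now=300)   # no ack, so bob is notified
acknowledge(alert, "bob")
after_ack = tick(alert, policy, now=1000)  # acked, so nothing further
```

In a real deployment `tick` would be driven by a durable timer or delayed queue rather than called by hand, so that a process restart cannot silently drop a pending escalation.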

✅ Functional Requirements

  • Create alerts from external systems
  • Notify on-call person (SMS, call, push, email)
  • Escalate if not acknowledged
  • Schedule management with rotations
  • Override schedules temporarily
  • Alert grouping and deduplication
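
One common way to approach the grouping and deduplication requirement is a deduplication key supplied by the monitoring system. The sketch below assumes such a key exists and that repeats arriving within a fixed quiet window belong to the same incident. The `Deduplicator` class, the `window_s` parameter, and the key format are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Incident:
    dedup_key: str
    opened_at: float
    last_seen: float
    count: int = 1  # number of raw alerts folded into this incident


class Deduplicator:
    """Fold alerts with the same key into one incident (in-memory sketch)."""

    def __init__(self, window_s: float = 300):
        self.window_s = window_s
        self.open: Dict[str, Incident] = {}

    def ingest(self, dedup_key: str, now: float) -> Tuple[Incident, bool]:
        """Return (incident, is_new). Only new incidents should page someone."""
        incident = self.open.get(dedup_key)
        if incident is not None and now - incident.last_seen <= self.window_s:
            incident.count += 1
            incident.last_seen = now
            return incident, False
        incident = Incident(dedup_key, opened_at=now, last_seen=now)
        self.open[dedup_key] = incident
        return incident, True


dedup = Deduplicator(window_s=300)
_, new_first = dedup.ingest("db-1/cpu_high", now=0)      # opens an incident
inc, new_repeat = dedup.ingest("db-1/cpu_high", now=100)  # folded into it
_, new_later = dedup.ingest("db-1/cpu_high", now=500)     # window expired
```

Only the `is_new` case needs to start an escalation; repeats just update the counter, which is what keeps a flapping check from paging the same person many times.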

⚡ Non-Functional Requirements

  • Alert delivery < 30 seconds
  • Notification reliability > 99.99%
  • 99.99% availability
  • Support thousands of teams

⚠️ Constraints & Assumptions

  • Notifications must be delivered reliably
  • Timezone-aware scheduling
  • Multiple notification channels needed
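
To make the timezone-aware scheduling constraint concrete, here is a minimal sketch of resolving who is on call for a fixed-length rotation with a local handoff time and temporary overrides. The function name `on_call` and all of its parameters are assumptions for illustration. Counting local calendar days, rather than elapsed seconds, is one way to keep the handoff at the same wall-clock time across daylight saving changes when a DST-aware `tzinfo` such as `zoneinfo.ZoneInfo("America/New_York")` is passed in.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo
from typing import Sequence, Tuple

# (start, end, person) with timezone-aware datetimes; end is exclusive.
Override = Tuple[datetime, datetime, str]


def on_call(
    participants: Sequence[str],
    start_date: date,      # local date of the first handoff
    handoff: time,         # local wall-clock handoff time
    tz: tzinfo,            # the team's timezone
    shift_days: int,       # 7 for weekly, 1 for daily
    now: datetime,         # timezone-aware "current" instant
    overrides: Sequence[Override] = (),
) -> str:
    # A temporary override wins over the regular rotation.
    for start, end, person in overrides:
        if start <= now < end:
            return person
    local = now.astimezone(tz)
    days = (local.date() - start_date).days
    if local.time() < handoff:
        days -= 1  # today's handoff has not happened yet
    if days < 0:
        raise ValueError("rotation has not started yet")
    return participants[(days // shift_days) % len(participants)]


# A fixed UTC-5 offset keeps the example independent of the tz database.
eastern = timezone(timedelta(hours=-5))
rotation = dict(
    participants=["alice", "bob", "carol"],
    start_date=date(2024, 1, 1),
    handoff=time(9, 0),
    tz=eastern,
    shift_days=7,
)
before = on_call(now=datetime(2024, 1, 8, 13, 59, tzinfo=timezone.utc), **rotation)
after = on_call(now=datetime(2024, 1, 8, 14, 0, tzinfo=timezone.utc), **rotation)
```

Here `before` falls one minute ahead of the 09:00 local handoff and still resolves to the first participant, while `after` resolves to the second.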

📊 Capacity Estimation

👥 Users
100K users across 10K teams
💾 Storage
100GB (alerts, schedules)
⚡ QPS
Alerts: ~100/sec at peak (about 12/sec average), Notifications: up to ~500/sec at peak (about 35/sec average)
📐 Assumptions
  • 10K teams
  • 1M alerts per day
  • Average 3 notifications per alert
  • Peak: 10x average during incidents
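
As a quick sanity check, the assumptions above can be turned into rates. The arithmetic below uses only the listed inputs (1M alerts per day, 3 notifications per alert, 10x peak). The function name `estimate_qps` is hypothetical. The result suggests that the quoted 100/sec and 500/sec figures are best read as rounded peak rates rather than averages.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(alerts_per_day: int, notifications_per_alert: int, peak_factor: int) -> dict:
    """Derive average and peak request rates from daily volume."""
    alerts_avg = alerts_per_day / SECONDS_PER_DAY
    notifications_avg = alerts_avg * notifications_per_alert
    return {
        "alerts_avg": alerts_avg,
        "alerts_peak": alerts_avg * peak_factor,
        "notifications_avg": notifications_avg,
        "notifications_peak": notifications_avg * peak_factor,
    }


est = estimate_qps(alerts_per_day=1_000_000, notifications_per_alert=3, peak_factor=10)
for name, value in est.items():
    print(f"{name}: {value:.1f}/sec")
```

The averages come out near 12 alerts/sec and 35 notifications/sec, and the 10x peaks near 116 and 347 per second.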

💡 Key Concepts

  • Escalation Policies (CRITICAL): Define who to notify and when to escalate if no ack.
  • Multi-channel Notification (HIGH): Try SMS, then call, then email in sequence.
  • Rotation Schedules (HIGH): Weekly/daily rotations with timezone support.
  • Alert Deduplication (MEDIUM): Group related alerts to avoid alert fatigue.
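
The multi-channel idea (SMS, then call, then email) can be sketched as an ordered fallback loop. This is an illustration under assumptions: each channel is modeled as a hypothetical callable that returns `True` when the provider confirms delivery, and `notify_with_fallback` and `attempts_per_channel` are made-up names, not any vendor's API.

```python
from typing import Callable, List, Optional, Tuple

# (recipient, message) -> True if the provider confirmed delivery
Sender = Callable[[str, str], bool]
AttemptLog = List[Tuple[str, int, bool]]  # (channel, attempt number, delivered)


def notify_with_fallback(
    recipient: str,
    message: str,
    channels: List[Tuple[str, Sender]],
    attempts_per_channel: int = 2,
) -> Tuple[Optional[str], AttemptLog]:
    """Try channels in order; return the channel that delivered, plus a log."""
    log: AttemptLog = []
    for name, send in channels:
        for attempt in range(1, attempts_per_channel + 1):
            try:
                delivered = send(recipient, message)
            except Exception:
                delivered = False  # treat provider errors as a failed attempt
            log.append((name, attempt, delivered))
            if delivered:
                return name, log
    return None, log  # every channel failed; the caller should escalate


# Stub senders standing in for real SMS, voice, and email providers.
call_attempts = {"n": 0}


def sms(recipient: str, message: str) -> bool:
    return False  # simulate an SMS provider outage


def call(recipient: str, message: str) -> bool:
    call_attempts["n"] += 1
    return call_attempts["n"] >= 2  # succeeds on the second try


def email(recipient: str, message: str) -> bool:
    return True


used, attempts = notify_with_fallback(
    "alice", "db-1 CPU high", [("sms", sms), ("call", call), ("email", email)]
)
```

Keeping the attempt log is a deliberate choice: it gives the acknowledgment tracker and any post-incident review a record of which channels were tried and in what order.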

💡 Interview Tips

  • Start with the escalation policy model
  • Discuss the notification delivery mechanism
  • Emphasize the reliability requirements
  • Be prepared to discuss acknowledgment tracking
  • Know the tradeoffs between push and poll
  • Understand the scheduling complexity