📝 Problem Description
Design a system to monitor the health of a large cluster of servers. Track metrics such as CPU, memory, disk, and network usage; detect anomalies; aggregate data; and trigger alerts when thresholds are exceeded.
👤 Use Cases
1. An agent wants to report metrics so that data is stored and aggregated.
2. An operator wants to view a dashboard so that they can see a cluster health overview.
3. The system wants to detect an anomaly so that it can trigger an alert.
4. An operator wants to query historical data so that they can analyze trends.
✅ Functional Requirements
- Collect metrics from all nodes (CPU, memory, disk, network)
- Aggregate metrics (avg, min, max, percentiles)
- Store historical data with configurable retention
- Real-time dashboards
- Alerting on thresholds and anomalies (a minimal rule-evaluation sketch follows this list)
- Query interface for ad-hoc analysis
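To make the alerting requirement concrete, here is a minimal sketch of threshold-based rule evaluation over a rolling window of recent samples. The rule structure, field names, and sample layout are illustrative assumptions, not any specific product's API.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class AlertRule:
    metric: str          # e.g. "cpu.usage"
    threshold: float     # fire when the windowed average exceeds this
    window: int          # number of recent samples to average over

def evaluate(rule: AlertRule, samples: dict[str, list[float]]) -> bool:
    """Return True if the rule fires for the latest window of samples."""
    recent = samples.get(rule.metric, [])[-rule.window:]
    if len(recent) < rule.window:
        return False                     # not enough data yet
    return mean(recent) > rule.threshold

# Example: CPU above 90% for the last 6 samples (~1 minute at 10s collection)
rule = AlertRule(metric="cpu.usage", threshold=0.90, window=6)
samples = {"cpu.usage": [0.95, 0.93, 0.97, 0.92, 0.96, 0.94]}
if evaluate(rule, samples):
    print(f"ALERT: {rule.metric} above {rule.threshold:.0%}")
```

In a real pipeline this evaluation would run continuously against the ingestion stream or against short-range queries on the time-series store, with anomaly detection layered on top of the simple threshold rules.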
⚡ Non-Functional Requirements
- Support 100K nodes
- Metric collection every 10 seconds
- Query latency < 1 second
- Store 1 year of data
- 99.9% availability
⚠️ Constraints & Assumptions
- High volume of time-series data
- Must handle node failures gracefully
- Dashboard must be responsive
📊 Capacity Estimation
👥 Users
100K monitored nodes
💾 Storage
50TB (1 year of metrics)
⚡ QPS
Writes: 1M metrics/sec, Queries: 1K/sec
📐 Assumptions
- 100K nodes
- 100 metrics per node
- Collection every 10 seconds
- 10M data points every 10 seconds ≈ 1M writes/sec (worked through in the sketch below)
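A quick back-of-the-envelope check of these numbers. The per-sample storage cost of roughly 1.5 bytes after time-series compression is an assumption here (delta-of-delta style encodings are often quoted near that range), not a figure from the requirements.

```python
nodes = 100_000
metrics_per_node = 100
interval_s = 10

# Write rate: total samples produced per second
samples_per_cycle = nodes * metrics_per_node      # 10M samples every 10 s
writes_per_sec = samples_per_cycle / interval_s   # 1M samples/sec

# Storage for one year, assuming ~1.5 bytes/sample after compression
bytes_per_sample = 1.5
seconds_per_year = 365 * 24 * 3600
year_bytes = writes_per_sec * seconds_per_year * bytes_per_sample

print(f"writes/sec: {writes_per_sec:,.0f}")           # 1,000,000
print(f"storage/yr: {year_bytes / 1e12:.1f} TB")      # ~47 TB, consistent with the ~50TB estimate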
💡 Key Concepts
- Time-Series Database (CRITICAL): Optimized for write-heavy, time-stamped data with efficient compression.
- Data Rollups (HIGH): Pre-aggregate old data (1min → 5min → 1hr) to reduce storage; a minimal rollup sketch follows this list.
- Push vs Pull (HIGH): A push model scales better for very large clusters; a pull model (as in Prometheus) is simpler for smaller ones; see the agent sketch after this list.
- Cardinality Control (MEDIUM): Limit unique label combinations to prevent TSDB explosion; a simple admission check follows this list.
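A minimal sketch of the rollup idea: downsampling raw 10-second samples into per-minute averages before they age out of the hot tier. Bucket sizes and the (timestamp, value) layout are illustrative assumptions.

```python
from collections import defaultdict

def rollup(points: list[tuple[float, float]], bucket_s: int = 60) -> list[tuple[float, float]]:
    """Downsample (timestamp, value) points into per-bucket averages.

    Raw 10s samples -> 1min averages; applying the same function again with a
    larger bucket gives 5min or 1hr rollups.
    """
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[int(ts // bucket_s) * bucket_s].append(value)
    return [(float(b), sum(vs) / len(vs)) for b, vs in sorted(buckets.items())]

# Six 10s CPU samples collapse into one 1-minute point
raw = [(0, 0.40), (10, 0.42), (20, 0.44), (30, 0.50), (40, 0.48), (50, 0.46)]
print(rollup(raw))   # [(0.0, 0.45)]
```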
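And a sketch of the push side of collection: an agent samples local metrics every 10 seconds and POSTs them to an ingestion endpoint. The endpoint URL, node id, payload shape, and the use of the third-party `psutil` and `requests` packages are all assumptions for illustration.

```python
import time

import psutil     # assumed available for local metric sampling
import requests   # assumed available; any HTTP client works

INGEST_URL = "http://metrics-gateway.internal/ingest"   # hypothetical endpoint
NODE_ID = "node-0042"                                   # hypothetical node id
INTERVAL_S = 10

def sample() -> dict:
    """Collect a small set of local metrics as one timestamped payload."""
    return {
        "node": NODE_ID,
        "ts": time.time(),
        "metrics": {
            "cpu.usage": psutil.cpu_percent() / 100.0,
            "mem.usage": psutil.virtual_memory().percent / 100.0,
            "disk.usage": psutil.disk_usage("/").percent / 100.0,
        },
    }

def run_agent() -> None:
    """Push metrics every INTERVAL_S seconds; drop a failed batch rather than
    blocking, since the next cycle arrives in 10 seconds anyway."""
    while True:
        payload = sample()
        try:
            requests.post(INGEST_URL, json=payload, timeout=5)
        except requests.RequestException:
            pass  # node-local failure handling kept deliberately simple
        time.sleep(INTERVAL_S)

if __name__ == "__main__":
    run_agent()
```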
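Finally, one way to picture cardinality control at the ingestion layer: admit a sample only if its label combination is already known or there is room left in a series budget. The cap value and in-memory set are simplifying assumptions; a real system would track this per tenant and persist it.

```python
MAX_SERIES = 1_000_000   # hypothetical cap on distinct label combinations
seen_series: set[tuple] = set()

def admit(metric: str, labels: dict[str, str]) -> bool:
    """Accept a sample only if its label set is known or the budget has room;
    otherwise reject it to keep time-series cardinality bounded."""
    key = (metric, tuple(sorted(labels.items())))
    if key in seen_series:
        return True
    if len(seen_series) >= MAX_SERIES:
        return False          # drop, or route to an overflow/aggregate series
    seen_series.add(key)
    return True
```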
💡 Interview Tips
- 💡 Start with the metric collection architecture
- 💡 Discuss the time-series storage layer
- 💡 Emphasize the alerting pipeline
- 💡 Be prepared to discuss pull vs push collection
- 💡 Know the tradeoffs between resolution and storage
- 💡 Understand the Prometheus/Grafana architecture