📝 Problem Description
Design a system to monitor the health of a large cluster of servers. Track metrics such as CPU, memory, disk, and network usage; detect anomalies; aggregate data; and trigger alerts when thresholds are exceeded.
👤 Use Cases
1. An agent wants to report metrics so that data is stored and aggregated.
2. An operator wants to view a dashboard so that they can see a cluster health overview.
3. The system wants to detect an anomaly so that it can trigger an alert.
4. An operator wants to query historical data so that they can analyze trends.
✅ Functional Requirements
- Collect metrics from all nodes (CPU, memory, disk, network)
- Aggregate metrics (avg, min, max, percentiles)
- Store historical data with configurable retention
- Real-time dashboards
- Alerting on thresholds and anomalies (a minimal rule-evaluation sketch follows this list)
- Query interface for ad-hoc analysis
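To make the alerting requirement concrete, here is a minimal sketch of threshold-based rule evaluation over a rolling window of recent samples. The rule structure, field names, and sample layout are illustrative assumptions, not any specific product's API.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class AlertRule:
    metric: str          # e.g. "cpu.usage"
    threshold: float     # fire when the windowed average exceeds this
    window: int          # number of recent samples to average over

def evaluate(rule: AlertRule, samples: dict[str, list[float]]) -> bool:
    """Return True if the rule fires for the latest window of samples."""
    recent = samples.get(rule.metric, [])[-rule.window:]
    if len(recent) < rule.window:
        return False                     # not enough data yet
    return mean(recent) > rule.threshold

# Example: CPU above 90% for the last 6 samples (~1 minute at 10s collection)
rule = AlertRule(metric="cpu.usage", threshold=0.90, window=6)
samples = {"cpu.usage": [0.95, 0.93, 0.97, 0.92, 0.96, 0.94]}
if evaluate(rule, samples):
    print(f"ALERT: {rule.metric} above {rule.threshold:.0%}")
```

In a real pipeline this evaluation would run continuously against the ingestion stream or against short-range queries on the time-series store, with anomaly detection layered on top of the simple threshold rules.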
⚡ Non-Functional Requirements
- Support 100K nodes
- Metric collection every 10 seconds
- Query latency < 1 second
- Store 1 year of data
- 99.9% availability
⚠️ Constraints & Assumptions
- High volume of time-series data
- Must handle node failures gracefully
- Dashboard must be responsive
📊 Capacity Estimation
👥 Users
100K monitored nodes
💾 Storage
50TB (1 year of metrics)
⚡ QPS
Writes: 1M metrics/sec, Queries: 1K/sec
📐 Assumptions
- 100K nodes
- 100 metrics per node
- Collection every 10 seconds
- 10M data points every 10 seconds ≈ 1M writes/sec (worked through in the sketch below)
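A quick back-of-the-envelope check of these numbers. The per-sample storage cost of roughly 1.5 bytes after time-series compression is an assumption here (delta-of-delta style encodings are often quoted near that range), not a figure from the requirements.

```python
nodes = 100_000
metrics_per_node = 100
interval_s = 10

# Write rate: total samples produced per second
samples_per_cycle = nodes * metrics_per_node      # 10M samples every 10 s
writes_per_sec = samples_per_cycle / interval_s   # 1M samples/sec

# Storage for one year, assuming ~1.5 bytes/sample after compression
bytes_per_sample = 1.5
seconds_per_year = 365 * 24 * 3600
year_bytes = writes_per_sec * seconds_per_year * bytes_per_sample

print(f"writes/sec: {writes_per_sec:,.0f}")           # 1,000,000
print(f"storage/yr: {year_bytes / 1e12:.1f} TB")      # ~47 TB, consistent with the ~50TB estimate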
💡 Key Concepts
- Time-Series Database (CRITICAL): Optimized for write-heavy, time-stamped data with efficient compression.
- Data Rollups (HIGH): Pre-aggregate old data (1min → 5min → 1hr) to reduce storage; a minimal rollup sketch follows this list.
- Push vs Pull (HIGH): A push model scales better for very large clusters; a pull model (as in Prometheus) is simpler for smaller ones; see the agent sketch after this list.
- Cardinality Control (MEDIUM): Limit unique label combinations to prevent TSDB explosion; a simple admission check follows this list.
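A minimal sketch of the rollup idea: downsampling raw 10-second samples into per-minute averages before they age out of the hot tier. Bucket sizes and the (timestamp, value) layout are illustrative assumptions.

```python
from collections import defaultdict

def rollup(points: list[tuple[float, float]], bucket_s: int = 60) -> list[tuple[float, float]]:
    """Downsample (timestamp, value) points into per-bucket averages.

    Raw 10s samples -> 1min averages; applying the same function again with a
    larger bucket gives 5min or 1hr rollups.
    """
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[int(ts // bucket_s) * bucket_s].append(value)
    return [(float(b), sum(vs) / len(vs)) for b, vs in sorted(buckets.items())]

# Six 10s CPU samples collapse into one 1-minute point
raw = [(0, 0.40), (10, 0.42), (20, 0.44), (30, 0.50), (40, 0.48), (50, 0.46)]
print(rollup(raw))   # [(0.0, 0.45)]
```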
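And a sketch of the push side of collection: an agent samples local metrics every 10 seconds and POSTs them to an ingestion endpoint. The endpoint URL, node id, payload shape, and the use of the third-party `psutil` and `requests` packages are all assumptions for illustration.

```python
import time

import psutil     # assumed available for local metric sampling
import requests   # assumed available; any HTTP client works

INGEST_URL = "http://metrics-gateway.internal/ingest"   # hypothetical endpoint
NODE_ID = "node-0042"                                   # hypothetical node id
INTERVAL_S = 10

def sample() -> dict:
    """Collect a small set of local metrics as one timestamped payload."""
    return {
        "node": NODE_ID,
        "ts": time.time(),
        "metrics": {
            "cpu.usage": psutil.cpu_percent() / 100.0,
            "mem.usage": psutil.virtual_memory().percent / 100.0,
            "disk.usage": psutil.disk_usage("/").percent / 100.0,
        },
    }

def run_agent() -> None:
    """Push metrics every INTERVAL_S seconds; drop a failed batch rather than
    blocking, since the next cycle arrives in 10 seconds anyway."""
    while True:
        payload = sample()
        try:
            requests.post(INGEST_URL, json=payload, timeout=5)
        except requests.RequestException:
            pass  # node-local failure handling kept deliberately simple
        time.sleep(INTERVAL_S)

if __name__ == "__main__":
    run_agent()
```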
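Finally, one way to picture cardinality control at the ingestion layer: admit a sample only if its label combination is already known or there is room left in a series budget. The cap value and in-memory set are simplifying assumptions; a real system would track this per tenant and persist it.

```python
MAX_SERIES = 1_000_000   # hypothetical cap on distinct label combinations
seen_series: set[tuple] = set()

def admit(metric: str, labels: dict[str, str]) -> bool:
    """Accept a sample only if its label set is known or the budget has room;
    otherwise reject it to keep time-series cardinality bounded."""
    key = (metric, tuple(sorted(labels.items())))
    if key in seen_series:
        return True
    if len(seen_series) >= MAX_SERIES:
        return False          # drop, or route to an overflow/aggregate series
    seen_series.add(key)
    return True
```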
💡 Interview Tips
- 💡 Start with the metric collection architecture
- 💡 Discuss the time-series storage layer
- 💡 Emphasize the alerting pipeline
- 💡 Be prepared to discuss pull vs push collection
- 💡 Know the tradeoffs between resolution and storage
- 💡 Understand the Prometheus/Grafana architecture