Design Server Metrics Aggregation System - System Design Interview

📝 Problem Description

Design a system to collect, aggregate, and query server metrics at scale. Handle millions of metrics per second, provide real-time dashboards, and support flexible queries. Think Prometheus/Datadog.

👤 Use Cases

A server emits metrics so that they can be collected and stored.

An operator views a dashboard to see aggregated metrics.

The system detects an anomaly and triggers an alert.

An analyst queries historical data to get aggregated results.

✅ Functional Requirements

• Collect metrics from servers (push/pull)
• Aggregate by time window and dimensions
• Store with configurable retention
• Real-time dashboards
• Alerting on thresholds
• Ad-hoc query interface

⚡ Non-Functional Requirements

• Ingest 10M metric samples/sec
• Query latency < 1 second
• Store 1 year of data
• 99.9% availability

⚠️ Constraints & Assumptions

• High-cardinality labels can explode storage
• Both real-time and historical queries are needed
• Must handle bursty traffic

📊 Capacity Estimation

👥 Users

100K servers, 1000 metrics each

💾 Storage

100TB (1 year with rollups)

⚡ QPS

Writes: 10M/sec, Queries: 10K/sec

📐 Assumptions

• 100K servers
• 1000 metrics per server
• Collection every 10 seconds
• Rollup: 10s → 1m → 1h → 1d
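
The write rate follows directly from the assumptions above. A minimal sketch of that arithmetic, plus a rough storage estimate per rollup tier. The 2 bytes per compressed sample and the retention period of each tier are illustrative assumptions, not figures from this design:

```python
# Back-of-envelope capacity math for the assumptions listed above.
SERVERS = 100_000
METRICS_PER_SERVER = 1_000
COLLECTION_INTERVAL_S = 10
SECONDS_PER_DAY = 86_400

# ASSUMPTION: average compressed size of one stored value, in bytes.
BYTES_PER_SAMPLE = 2

# ASSUMPTION: hypothetical retention per rollup tier: (resolution_s, days kept).
TIERS = {
    "10s": (10, 7),
    "1m": (60, 30),
    "1h": (3_600, 365),
    "1d": (86_400, 365),
}


def active_series() -> int:
    """Number of distinct time series being collected."""
    return SERVERS * METRICS_PER_SERVER


def samples_per_second() -> int:
    """Sustained write rate implied by the collection interval."""
    return active_series() // COLLECTION_INTERVAL_S


def tier_bytes(resolution_s: int, retention_days: int) -> int:
    """Storage for one tier: series x points kept per series x bytes per point."""
    points_per_series = retention_days * SECONDS_PER_DAY // resolution_s
    return active_series() * points_per_series * BYTES_PER_SAMPLE


def total_tiered_bytes() -> int:
    """Storage summed across all rollup tiers."""
    return sum(tier_bytes(res, days) for res, days in TIERS.values())


if __name__ == "__main__":
    print("active series:", active_series())
    print("samples/sec:", samples_per_second())
    print("raw 10s data for 1 year (TB):", tier_bytes(10, 365) / 1e12)
    print("tiered total (TB):", total_tiered_bytes() / 1e12)
```

Under these assumed values, keeping raw 10-second data for a full year would take about 631 TB, while the tiered layout takes about 22.6 TB with one value per point. Storing several aggregates per rollup point (min, max, sum, count) and replicating the data would raise that total, which is one way to arrive at a budget on the order of 100TB.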

💡 Key Concepts

CRITICAL

Time-Series Data

Metrics are time-stamped values with labels/dimensions.
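
As an illustration, a sample can be modeled as a metric name, a set of labels, a timestamp, and a value. The sketch below uses hypothetical names (`Sample`, `series_key`); the key format imitates the Prometheus text style but is only an assumption about how a series might be identified:

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


def series_key(name: str, labels: Mapping[str, str]) -> str:
    """Build a canonical series identity: labels are sorted so that the
    same label set always yields the same key, whatever the input order."""
    parts = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{parts}}}"


@dataclass(frozen=True)
class Sample:
    """One time-stamped value belonging to one series (hypothetical model)."""
    name: str
    labels: Tuple[Tuple[str, str], ...]  # sorted (key, value) pairs
    timestamp_s: float
    value: float

    @property
    def key(self) -> str:
        return series_key(self.name, dict(self.labels))


def make_sample(name: str, labels: Mapping[str, str],
                timestamp_s: float, value: float) -> Sample:
    """Normalize the labels into a sorted tuple before storing them."""
    return Sample(name, tuple(sorted(labels.items())), timestamp_s, value)
```

Every distinct combination of name and label values is a separate series, which is why label choice drives both storage cost and query flexibility.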

CRITICAL

Rollup/Downsampling

Aggregate older data into coarser time windows to reduce storage (10s → 1m → 1h → 1d).
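
A minimal sketch of downsampling, assuming samples arrive as (timestamp in seconds, value) pairs for a single series. `Bucket` and `rollup` are hypothetical names. Each bucket keeps count, sum, min, and max rather than only an average, so that coarser tiers can be derived from finer ones without averaging averages:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class Bucket:
    """Aggregates for one time window of one series."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        """Fold one raw sample into the window."""
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "Bucket") -> None:
        """Fold a finer-grained bucket into this coarser one."""
        self.count += other.count
        self.total += other.total
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)

    @property
    def mean(self) -> float:
        return self.total / self.count


def rollup(samples: Iterable[Tuple[int, float]], window_s: int) -> Dict[int, Bucket]:
    """Group samples into fixed windows keyed by the window start time."""
    buckets: Dict[int, Bucket] = defaultdict(Bucket)
    for ts, value in samples:
        start = int(ts) - int(ts) % window_s  # align to the window boundary
        buckets[start].add(value)
    return dict(buckets)


def rerollup(fine: Dict[int, Bucket], window_s: int) -> Dict[int, Bucket]:
    """Derive a coarser tier (e.g. 1h) from an existing finer tier (e.g. 1m)."""
    coarse: Dict[int, Bucket] = defaultdict(Bucket)
    for start, bucket in fine.items():
        coarse[start - start % window_s].merge(bucket)
    return dict(coarse)
```

For example, twelve 10-second samples covering two minutes roll up into two 1-minute buckets, and `rerollup` can then fold those into a single 2-minute bucket with the same count, sum, min, and max as rolling up the raw samples directly.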

HIGH

Cardinality Control

Limit unique label combinations to prevent explosion.
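
One possible way to enforce such a limit at ingestion time is a per-metric cap on distinct label sets. The sketch below is an assumption about how that could look, not a description of any particular product; `CardinalityLimiter` and `max_series_per_metric` are hypothetical names, and a production system would likely use an approximate counter rather than an in-memory set:

```python
from typing import Dict, Mapping, Set, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


class CardinalityLimiter:
    """Rejects new series once a metric reaches its series budget."""

    def __init__(self, max_series_per_metric: int) -> None:
        self.max_series_per_metric = max_series_per_metric
        self._seen = {}  # metric name -> set of label sets already admitted

    def admit(self, name: str, labels: Mapping[str, str]) -> bool:
        """Return True if the sample may be stored, False if it would
        create a series beyond the budget for this metric."""
        key: LabelSet = tuple(sorted(labels.items()))
        seen: Set[LabelSet] = self._seen.setdefault(name, set())
        if key in seen:
            return True  # existing series: always accepted
        if len(seen) >= self.max_series_per_metric:
            return False  # new series over budget: drop (or relabel) it
        seen.add(key)
        return True

    def series_count(self, name: str) -> int:
        """Number of distinct series admitted so far for a metric."""
        return len(self._seen.get(name, ()))
```

A label such as a user ID or request ID would exhaust this budget almost immediately, which is the kind of explosion the limit is meant to catch early.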

HIGH

PromQL-style Queries

Flexible query language for aggregation and math.
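
To make the idea concrete, here is a rough sketch of two operations such a language typically offers: grouping by a label (comparable to `sum by (region) (...)` in PromQL) and a per-second rate over a counter. The function names are hypothetical, and `simple_rate` is a deliberate simplification that ignores counter resets and extrapolation, so it is not how Prometheus itself computes `rate()`:

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Sequence, Tuple

GroupKey = Tuple[Tuple[str, str], ...]


def sum_by(vector: Sequence[Tuple[Mapping[str, str], float]],
           by: Sequence[str]) -> Dict[GroupKey, float]:
    """Sum the current value of each series, grouped by the given labels.
    A series missing a grouping label is grouped under the empty string."""
    out: Dict[GroupKey, float] = defaultdict(float)
    for labels, value in vector:
        group = tuple((k, labels.get(k, "")) for k in by)
        out[group] += value
    return dict(out)


def simple_rate(points: List[Tuple[float, float]]) -> float:
    """Per-second increase between the first and last point of a window.
    ASSUMPTION: points are sorted by time and the counter never resets."""
    if len(points) < 2:
        return 0.0
    (t0, v0), (t1, v1) = points[0], points[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)
```

For instance, three series with regions us, us, and eu and values 1, 2, and 5 group into us = 3 and eu = 5, and a counter that goes from 100 to 400 over 60 seconds yields a rate of 5 per second.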

💡 Interview Tips

💡 Start with the metric collection architecture
💡 Discuss the time-series storage
💡 Emphasize the aggregation strategy
💡 Be prepared to discuss push vs pull
💡 Know the Prometheus/Grafana stack
💡 Understand the cardinality challenges