Hard · ~50 min · Infrastructure

Design Server Metrics Aggregation System

Datadog, New Relic, Grafana, AWS, Google

📝 Problem Description

Design a system to collect, aggregate, and query server metrics at scale. Handle millions of metrics per second, provide real-time dashboards, and support flexible queries. Think Prometheus/Datadog.

👤 Use Cases

1. A server emits metrics so that they are collected and stored.
2. An operator views a dashboard to see aggregated metrics.
3. The system detects an anomaly and triggers an alert.
4. An analyst queries historical data to get aggregated results.

✅ Functional Requirements

  • Collect metrics from servers (push/pull)
  • Aggregate by time window and dimensions
  • Store with configurable retention
  • Real-time dashboards
  • Alerting on thresholds
  • Ad-hoc query interface
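To make "aggregate by time window and dimensions" concrete, here is a minimal in-memory sketch. The `Sample` type, the `aggregate` function, and the sample data are hypothetical names chosen for illustration; a real system would do this in a stream processor or inside the time-series store rather than in a Python loop.

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    """One metric data point (hypothetical shape, for illustration only)."""
    name: str
    labels: dict[str, str]
    timestamp: int  # Unix seconds
    value: float


def aggregate(samples, window_s, group_by):
    """Sum values per (window start, selected label values)."""
    totals = defaultdict(float)
    for s in samples:
        # Align the timestamp to the start of its window.
        window_start = s.timestamp - s.timestamp % window_s
        # Keep only the requested dimensions; all other labels are summed away.
        key = (window_start, tuple(s.labels.get(k, "") for k in group_by))
        totals[key] += s.value
    return dict(totals)


samples = [
    Sample("http_requests", {"host": "a", "region": "us"}, 0, 5.0),
    Sample("http_requests", {"host": "b", "region": "us"}, 10, 7.0),
    Sample("http_requests", {"host": "c", "region": "eu"}, 20, 3.0),
    Sample("http_requests", {"host": "a", "region": "us"}, 60, 4.0),
]

# 1-minute windows, grouped by region only (the host dimension is dropped).
per_region = aggregate(samples, window_s=60, group_by=["region"])
```

Grouping by `region` collapses the three hosts into two series per window, which is the same trade a dashboard makes when it shows a regional total instead of per-host lines.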

⚡ Non-Functional Requirements

  • Handle 10M metrics/sec
  • Query latency < 1 second
  • Store 1 year of data
  • 99.9% availability

⚠️ Constraints & Assumptions

  • High cardinality labels can explode storage
  • Real-time and historical queries needed
  • Must handle bursty traffic
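The cardinality constraint is easiest to see as arithmetic: in the worst case, the number of distinct series for one metric is the product of the number of values each label can take. The label names and counts below are made-up assumptions, and the result is an upper bound, since not every combination necessarily occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of distinct series: product of per-label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical label set for a single metric.
base = {"host": 100_000, "endpoint": 50, "status": 5}

# Adding one unbounded label (here an assumed user_id) multiplies everything.
with_user = {**base, "user_id": 1_000_000}

base_series = series_count(base)
exploded_series = series_count(with_user)
```

With these assumed numbers, one extra label takes the bound from 25 million series to 25 trillion, which is why unbounded values such as user or request IDs are usually kept out of metric labels.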

📊 Capacity Estimation

👥 Users
100K servers, 1000 metrics each
💾 Storage
100TB (1 year with rollups)
⚡ QPS
Writes: 10M/sec, Queries: 10K/sec
📐 Assumptions
  • 100K servers
  • 1000 metrics per server
  • Collection every 10 seconds
  • Rollup: 10s → 1m → 1h → 1d
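The write rate follows directly from the assumptions above; the storage figure depends on choices the problem does not state. The sketch below works through both. The per-tier retention periods and the 2 bytes per point are assumptions picked only to show the method, not values from the problem statement.

```python
SERVERS = 100_000
METRICS_PER_SERVER = 1_000
INTERVAL_S = 10

active_series = SERVERS * METRICS_PER_SERVER   # 100,000,000 series
writes_per_sec = active_series // INTERVAL_S   # 10,000,000 samples/sec

# ASSUMPTIONS (not from the problem statement):
# (resolution in seconds, retention in days) for each rollup tier,
# and an average compressed size per stored point.
TIERS = [(10, 7), (60, 30), (3_600, 365), (86_400, 365)]
BYTES_PER_POINT = 2


def points_per_series(resolution_s, retention_days):
    """Points one series accumulates at a resolution over its retention."""
    return retention_days * 86_400 // resolution_s


# Tiered layout: each tier is kept only for its own retention period.
total_points = active_series * sum(points_per_series(r, d) for r, d in TIERS)
total_tb = total_points * BYTES_PER_POINT / 1e12

# Baseline for comparison: keep raw 10 s data for the whole year.
raw_points = active_series * points_per_series(INTERVAL_S, 365)
raw_tb = raw_points * BYTES_PER_POINT / 1e12

print(f"writes/sec: {writes_per_sec:,}")
print(f"tiered: {total_tb:.1f} TB, raw for 1 year: {raw_tb:.1f} TB")
```

Under these assumed values the tiered layout needs about 22.6 TB against about 631 TB for a year of raw data. Rollup points that keep min, max, sum, and count take more bytes each, and replication multiplies the total, so the final figure moves a lot with these parameters.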

💡 Key Concepts

CRITICAL
Time-Series Data
Metrics are time-stamped values with labels/dimensions.
CRITICAL
Rollup/Downsampling
Aggregate old data to reduce storage (10s → 1m → 1h).
HIGH
Cardinality Control
Limit unique label combinations to prevent explosion.
HIGH
PromQL-style Queries
Flexible query language for aggregation and math.
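The rollup concept can be sketched in a few lines. The key design point is to store mergeable aggregates (count, sum, min, max) rather than a precomputed average, so that a 1m bucket can later be folded into a 1h bucket without re-reading raw points. `Bucket`, `rollup`, and `merge` are hypothetical names for this illustration, not the API of any particular time-series database.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    start: int      # bucket start, Unix seconds
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def mean(self):
        return self.total / self.count


def rollup(points, resolution_s):
    """Downsample (timestamp, value) pairs into buckets of resolution_s."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % resolution_s
        b = buckets.get(start)
        if b is None:
            buckets[start] = Bucket(start, 1, value, value, value)
        else:
            b.count += 1
            b.total += value
            b.minimum = min(b.minimum, value)
            b.maximum = max(b.maximum, value)
    return [buckets[k] for k in sorted(buckets)]


def merge(buckets, resolution_s):
    """Fold finer buckets into coarser ones using only stored aggregates."""
    merged = {}
    for b in buckets:
        start = b.start - b.start % resolution_s
        m = merged.get(start)
        if m is None:
            merged[start] = Bucket(start, b.count, b.total, b.minimum, b.maximum)
        else:
            m.count += b.count
            m.total += b.total
            m.minimum = min(m.minimum, b.minimum)
            m.maximum = max(m.maximum, b.maximum)
    return [merged[k] for k in sorted(merged)]


# Twelve raw points at 10 s spacing with values 0.0 .. 11.0.
raw = [(t, float(t // 10)) for t in range(0, 120, 10)]
one_minute = rollup(raw, 60)          # 10s -> 1m
one_hour = merge(one_minute, 3_600)   # 1m -> 1h, no raw data needed
```

Because the mean is derived from the stored sum and count, the 1h mean stays exact even when the merged 1m buckets hold different numbers of points; averaging the per-minute averages would not.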

💡 Interview Tips

  • 💡 Start with the metric collection architecture
  • 💡 Discuss the time-series storage
  • 💡 Emphasize the aggregation strategy
  • 💡 Be prepared to discuss push vs pull
  • 💡 Know the Prometheus/Grafana stack
  • 💡 Understand the cardinality challenges