📝 Problem Description
Design a distributed log aggregation system similar to the ELK Stack or Splunk: collect logs from thousands of servers, index them for search, and provide real-time dashboards.
👤 Use Cases
1. A server emits logs so that they are collected and stored.
2. A developer searches logs so that relevant entries can be found.
3. The system detects patterns so that alerts are triggered.
✅ Functional Requirements
- Collect logs from multiple sources
- Parse and structure logs
- Full-text search
- Time-range queries
- Real-time tailing
- Dashboards and alerts
- Retention policies
⚡ Non-Functional Requirements
- Ingest 1TB of logs/day
- Search latency < 5 seconds
- Near real-time (< 30s delay)
- Store 30 days of logs
⚠️ Constraints & Assumptions
- Log volume is massive
- Logs are unstructured or semi-structured
- Must handle spikes during incidents
📊 Capacity Estimation
👥 Users
10K servers, 1000 users
💾 Storage
30TB (30-day retention)
⚡ QPS
Ingest: 100K/sec, Search: 1K/sec
📐 Assumptions
- 10K servers
- 1KB average log line
- ~1B log lines per day (1TB/day ÷ 1KB/line)
- 1TB per day ingestion
- 30-day hot storage, 1-year cold
- 10x spike during incidents
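The assumptions above can be sanity-checked with quick arithmetic (note that 1TB/day at 1KB per line works out to roughly 1B lines/day, and the 100K/sec ingest target corresponds to the 10x incident spike, not the average):

```python
# Back-of-envelope check for the capacity numbers above.
LINE_SIZE_B = 1_000              # 1KB average log line
LINES_PER_DAY = 1_000_000_000    # ~1B lines/day at 1TB/day

daily_bytes = LINE_SIZE_B * LINES_PER_DAY       # ~1TB/day
hot_storage = daily_bytes * 30                   # 30-day retention -> ~30TB
avg_ingest_qps = LINES_PER_DAY / 86_400          # ~11.6K lines/sec average
peak_ingest_qps = avg_ingest_qps * 10            # 10x incident spike -> ~116K/sec

print(f"{daily_bytes / 1e12:.1f} TB/day, {hot_storage / 1e12:.0f} TB hot, "
      f"{avg_ingest_qps:,.0f} avg lines/sec, {peak_ingest_qps:,.0f} peak lines/sec")
```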
💡 Key Concepts
- [CRITICAL] Log Parsing: Extract structured fields from unstructured text (e.g., via grok patterns).
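Grok patterns are essentially named regular expressions. A minimal sketch of the same idea in plain Python (the access-log format and field names here are illustrative assumptions, not a standard):

```python
import re

# A grok pattern like "%{IP:client} %{WORD:method} %{URIPATH:path} %{NUMBER:status}"
# compiles down to a regex with named capture groups; here we write it directly.
ACCESS_LOG = re.compile(
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) "   # %{IP:client}
    r"(?P<method>[A-Z]+) "                     # %{WORD:method}
    r"(?P<path>/\S*) "                         # %{URIPATH:path}
    r"(?P<status>\d{3})"                       # %{NUMBER:status}
)

def parse_line(line: str):
    """Turn an unstructured log line into a structured document, or None."""
    m = ACCESS_LOG.match(line)
    return m.groupdict() if m else None

print(parse_line("10.0.0.5 GET /api/users 200"))
# {'client': '10.0.0.5', 'method': 'GET', 'path': '/api/users', 'status': '200'}
```

Once fields are extracted, downstream stages can index, filter, and aggregate on them instead of scanning raw text.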
- [CRITICAL] Inverted Index: Map each term to the documents containing it, enabling fast full-text search.
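A toy sketch of an inverted index with AND-semantics search (real engines add tokenization, scoring, and compressed postings lists, all omitted here):

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: term -> set of document IDs containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query: str) -> set:
        # AND semantics: a document must contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for t in terms[1:]:
            result &= self.postings.get(t, set())
        return result

idx = InvertedIndex()
idx.add(1, "connection refused by upstream")
idx.add(2, "connection established")
print(idx.search("connection refused"))  # {1}
```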
- [HIGH] Time-Series Sharding: One index per time window, so retention becomes a cheap whole-index drop.
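A sketch of daily index routing and retention (the `logs-YYYY-MM-DD` naming is an illustrative convention, assumed here):

```python
from datetime import datetime, timedelta, timezone

def index_for(ts: datetime) -> str:
    """Route a log event to a daily index, e.g. 'logs-2024-06-01'."""
    return f"logs-{ts:%Y-%m-%d}"

def expired_indices(existing, now, retention_days=30):
    """Retention is cheap: drop whole indices older than the window,
    instead of deleting individual documents."""
    cutoff = now - timedelta(days=retention_days)
    return [name for name in existing
            if datetime.strptime(name, "logs-%Y-%m-%d")
                       .replace(tzinfo=timezone.utc) < cutoff]

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
indices = [index_for(now - timedelta(days=d)) for d in (0, 10, 31, 45)]
print(expired_indices(indices, now))  # the two indices older than 30 days
```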
- [MEDIUM] Log Sampling: Keep only a subset of high-volume, low-value logs to control cost.
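One simple sampling approach, assuming severity is detectable from the line: always keep errors, hash-sample the rest. Hashing (rather than `random`) makes the keep/drop decision deterministic, so every collector treats a given line the same way. The 1% rate and `"ERROR"` check are illustrative assumptions:

```python
import hashlib

def keep(line: str, rate: float) -> bool:
    """Deterministic sampling: hash the line into 10,000 buckets and
    keep it if its bucket falls below the sample rate."""
    bucket = int(hashlib.md5(line.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

def should_ingest(line: str) -> bool:
    if "ERROR" in line:
        return True          # never drop errors
    return keep(line, 0.01)  # sample noisy lower-severity logs at 1%
```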
💡 Interview Tips
- 💡 Start with the ingestion pipeline: agents → collectors → storage
- 💡 Discuss the tradeoff between indexing and storage costs
- 💡 Emphasize the importance of structured logging
- 💡 Be prepared to discuss the ELK stack vs. alternatives
- 💡 Know the difference between logs, metrics, and traces
- 💡 Understand the scale: billions of log lines per day