📝 Problem Description
Design a distributed log aggregation system similar to the ELK Stack or Splunk: collect logs from thousands of servers, index them for search, and provide real-time dashboards.
👤 Use Cases
1. A server emits logs so that they are collected and stored.
2. A developer searches logs so that relevant entries can be found.
3. The system detects patterns so that alerts are triggered.
✅ Functional Requirements
- Collect logs from multiple sources
- Parse and structure logs
- Full-text search
- Time-range queries
- Real-time tailing
- Dashboards and alerts
- Retention policies
⚡ Non-Functional Requirements
- Ingest 1TB of logs/day
- Search latency < 5 seconds
- Near real-time (< 30s delay)
- Store 30 days of logs
⚠️ Constraints & Assumptions
- Log volume is massive
- Logs are unstructured or semi-structured
- Must handle spikes during incidents
📊 Capacity Estimation
👥 Users
10K servers, 1000 users
💾 Storage
30TB (30-day retention)
⚡ QPS
Ingest: 100K/sec, Search: 1K/sec
📐 Assumptions
- 10K servers
- 1KB average log line
- ~1B log lines per day (1TB/day ÷ 1KB/line)
- 1TB per day ingestion
- 30-day hot storage, 1-year cold
- 10x spike during incidents
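The assumptions above can be sanity-checked with quick arithmetic (note that 1TB/day at 1KB per line works out to roughly 1B lines/day, and the 100K/sec ingest target corresponds to the 10x incident spike, not the average):

```python
# Back-of-envelope check for the capacity numbers above.
LINE_SIZE_B = 1_000              # 1KB average log line
LINES_PER_DAY = 1_000_000_000    # ~1B lines/day at 1TB/day

daily_bytes = LINE_SIZE_B * LINES_PER_DAY       # ~1TB/day
hot_storage = daily_bytes * 30                   # 30-day retention -> ~30TB
avg_ingest_qps = LINES_PER_DAY / 86_400          # ~11.6K lines/sec average
peak_ingest_qps = avg_ingest_qps * 10            # 10x incident spike -> ~116K/sec

print(f"{daily_bytes / 1e12:.1f} TB/day, {hot_storage / 1e12:.0f} TB hot, "
      f"{avg_ingest_qps:,.0f} avg lines/sec, {peak_ingest_qps:,.0f} peak lines/sec")
```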
💡 Key Concepts
- [CRITICAL] Log Parsing: Extract structured fields from unstructured text (e.g., via grok patterns).
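Grok patterns are essentially named regular expressions. A minimal sketch of the same idea in plain Python (the access-log format and field names here are illustrative assumptions, not a standard):

```python
import re

# A grok pattern like "%{IP:client} %{WORD:method} %{URIPATH:path} %{NUMBER:status}"
# compiles down to a regex with named capture groups; here we write it directly.
ACCESS_LOG = re.compile(
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) "   # %{IP:client}
    r"(?P<method>[A-Z]+) "                     # %{WORD:method}
    r"(?P<path>/\S*) "                         # %{URIPATH:path}
    r"(?P<status>\d{3})"                       # %{NUMBER:status}
)

def parse_line(line: str):
    """Turn an unstructured log line into a structured document, or None."""
    m = ACCESS_LOG.match(line)
    return m.groupdict() if m else None

print(parse_line("10.0.0.5 GET /api/users 200"))
# {'client': '10.0.0.5', 'method': 'GET', 'path': '/api/users', 'status': '200'}
```

Once fields are extracted, downstream stages can index, filter, and aggregate on them instead of scanning raw text.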
- [CRITICAL] Inverted Index: Map each term to the documents containing it, enabling fast full-text search.
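A toy sketch of an inverted index with AND-semantics search (real engines add tokenization, scoring, and compressed postings lists, all omitted here):

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: term -> set of document IDs containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query: str) -> set:
        # AND semantics: a document must contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for t in terms[1:]:
            result &= self.postings.get(t, set())
        return result

idx = InvertedIndex()
idx.add(1, "connection refused by upstream")
idx.add(2, "connection established")
print(idx.search("connection refused"))  # {1}
```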
- [HIGH] Time-Series Sharding: One index per time window, so retention becomes a cheap whole-index drop.
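A sketch of daily index routing and retention (the `logs-YYYY-MM-DD` naming is an illustrative convention, assumed here):

```python
from datetime import datetime, timedelta, timezone

def index_for(ts: datetime) -> str:
    """Route a log event to a daily index, e.g. 'logs-2024-06-01'."""
    return f"logs-{ts:%Y-%m-%d}"

def expired_indices(existing, now, retention_days=30):
    """Retention is cheap: drop whole indices older than the window,
    instead of deleting individual documents."""
    cutoff = now - timedelta(days=retention_days)
    return [name for name in existing
            if datetime.strptime(name, "logs-%Y-%m-%d")
                       .replace(tzinfo=timezone.utc) < cutoff]

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
indices = [index_for(now - timedelta(days=d)) for d in (0, 10, 31, 45)]
print(expired_indices(indices, now))  # the two indices older than 30 days
```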
- [MEDIUM] Log Sampling: Keep only a subset of high-volume, low-value logs to control cost.
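One simple sampling approach, assuming severity is detectable from the line: always keep errors, hash-sample the rest. Hashing (rather than `random`) makes the keep/drop decision deterministic, so every collector treats a given line the same way. The 1% rate and `"ERROR"` check are illustrative assumptions:

```python
import hashlib

def keep(line: str, rate: float) -> bool:
    """Deterministic sampling: hash the line into 10,000 buckets and
    keep it if its bucket falls below the sample rate."""
    bucket = int(hashlib.md5(line.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

def should_ingest(line: str) -> bool:
    if "ERROR" in line:
        return True          # never drop errors
    return keep(line, 0.01)  # sample noisy lower-severity logs at 1%
```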
💡 Interview Tips
- 💡 Start with the ingestion pipeline: agents → collectors → storage
- 💡 Discuss the tradeoff between indexing and storage costs
- 💡 Emphasize the importance of structured logging
- 💡 Be prepared to discuss the ELK stack vs. alternatives
- 💡 Know the difference between logs, metrics, and traces
- 💡 Understand the scale: billions of log lines per day