📝 Problem Description
Design a system to find the top K most frequent items in a data stream. Examples: top 10 trending hashtags, most searched queries, most viewed products. Handle high-throughput streams with memory constraints.
👤 Use Cases
1.
System wants to processes event stream so that identifies top K frequent items
2.
User wants to views trending page so that sees top 10 hashtags
3.
Analyst wants to queries top products so that gets real-time ranking
✅ Functional Requirements
- •Track frequency of items in stream
- •Return top K items by frequency
- •Support time-windowed queries (last hour, day)
- •Handle high-cardinality streams (millions of unique items)
⚡ Non-Functional Requirements
- •Process 1M events per second
- •Query top K in < 10ms
- •Memory efficient (can't store all counts)
- •Approximate results acceptable (within 5% error)
⚠️ Constraints & Assumptions
- •Cannot store exact count for every item
- •Stream is unbounded and continuous
- •Memory limited (e.g., 10GB for the system)
📊 Capacity Estimation
👥 Users
N/A (internal system)
💾 Storage
10GB for approximate data structures
⚡ QPS
Ingest: 1M events/sec, Queries: 1K/sec
📐 Assumptions
- • 1M events per second
- • 100M unique items per day
- • Only top 1000 items needed accurately
- • Exact counts not required for long tail
💡 Key Concepts
CRITICAL
Count-Min Sketch
Probabilistic data structure for approximate frequency counting in sublinear space.
HIGH
Min Heap
Track top K items efficiently with O(log K) updates.
HIGH
Space-Saving Algorithm
Alternative to CM Sketch - tracks top K with guaranteed error bounds.
MEDIUM
Time Windowing
Maintain separate counts for different time windows (tumbling, sliding).
💡 Interview Tips
- 💡Start with the streaming constraints
- 💡Discuss Count-Min Sketch and Space-Saving algorithms
- 💡Emphasize the accuracy vs memory tradeoff
- 💡Be prepared to discuss heap-based solutions
- 💡Know the difference between exact and approximate
- 💡Understand the merging problem for distributed systems