📝 Problem Description
Design a system to sort datasets that don't fit in memory. Handle terabytes of data by using disk-based sorting algorithms, distributed processing, and merge operations.
👤 Use Cases
1. The system receives a 1TB input file and produces a sorted output file.
2. The system splits the input into chunks, sorts each chunk in memory, and writes it to disk as a sorted run.
3. The system merges the sorted runs to produce the final sorted output.
✅ Functional Requirements
- Sort datasets larger than available memory
- Handle multiple data formats (CSV, JSON, binary)
- Support custom sort keys and comparators
- Produce a sorted output file
- Track progress and handle restarts
⚡ Non-Functional Requirements
- Process 1TB in < 1 hour
- Minimize disk I/O
- Handle machine failures
- Use available memory efficiently
⚠️ Constraints & Assumptions
- Data doesn't fit in RAM
- Disk I/O is the bottleneck
- Sufficient temporary disk space must be available for intermediate files
📊 Capacity Estimation
👥 Users
N/A (batch processing)
💾 Storage
~3x the input size (input, intermediate sorted runs, and final output)
⚡ QPS
N/A (batch job)
📐 Assumptions
- 1TB input data
- 64GB available RAM
- SSD with 500MB/s read/write
- 100M records to sort
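Plugging these assumptions into a quick back-of-the-envelope check shows how many sorted runs to expect and why the 1-hour target pushes toward parallel I/O. The 32GB sort buffer (roughly half of RAM) is an illustrative choice, not part of the original numbers:

```python
# Rough feasibility check based on the assumptions above.
# The 32 GB sort-buffer budget (~half of RAM) is an illustrative assumption.
INPUT_BYTES = 1 * 10**12          # 1 TB input
SORT_BUFFER_BYTES = 32 * 10**9    # ~half of 64 GB RAM for the in-memory sort
DISK_BYTES_PER_SEC = 500 * 10**6  # 500 MB/s sequential read/write

# Run generation: ceil(1 TB / 32 GB) sorted runs written to disk.
num_runs = -(-INPUT_BYTES // SORT_BUFFER_BYTES)  # = 32

# With ~32 runs, a single k-way merge pass suffices (one read buffer per run),
# so the data is read and written twice in total: once to generate runs,
# once to merge them.
total_io_bytes = 4 * INPUT_BYTES
io_seconds = total_io_bytes / DISK_BYTES_PER_SEC

print(f"sorted runs: {num_runs}")
print(f"total sequential I/O: {total_io_bytes / 10**12:.0f} TB")
print(f"I/O time on one 500 MB/s disk: {io_seconds / 3600:.1f} h")
# ~2.2 h of raw I/O on a single disk, so hitting the < 1 hour target
# requires multiple disks or distributing the sort across machines.
```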
💡 Key Concepts
- External Merge Sort (CRITICAL): Sort chunks in memory, then merge them on disk using a k-way merge (see the sketch after this list).
- K-way Merge (CRITICAL): Merge K sorted files simultaneously using a min-heap.
- Memory Budgeting (HIGH): Allocate memory across the read buffers, the sort buffer, and the write buffer.
- Run Generation (MEDIUM): Create sorted runs from the input using replacement selection.
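To make the two CRITICAL concepts concrete, here is a minimal single-machine sketch, assuming a line-oriented text file with one record per line; the function name, the chunk_lines default, and the lexicographic default key are illustrative choices, not part of the original description:

```python
import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, chunk_lines=1_000_000, key=None):
    """External merge sort over a line-oriented text file.

    Phase 1 (run generation): read chunk_lines lines at a time, sort each
    chunk in memory, and write it out as a temporary sorted run.
    Phase 2 (k-way merge): merge all runs with a min-heap (heapq.merge)
    while streaming the result to output_path.
    """
    run_paths = []
    with open(input_path) as infile:
        while True:
            chunk = list(itertools.islice(infile, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=key)                    # in-memory sort of one chunk
            fd, run_path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run_file:
                run_file.writelines(chunk)
            run_paths.append(run_path)

    # K-way merge: heapq.merge keeps one "current" line per run in a min-heap,
    # so memory use is proportional to the number of runs, not the data size.
    run_files = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*run_files, key=key))
    finally:
        for f in run_files:
            f.close()
        for p in run_paths:
            os.remove(p)

# Example: external_sort("input.txt", "sorted.txt", key=lambda line: line.split(",")[0])
```

When there are more runs than the open-file or buffer budget allows, the merge is done in multiple passes; replacement selection during run generation roughly doubles the average run length on random input, cutting the number of runs that need merging.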
💡 Interview Tips
- 💡 Start with the external merge sort algorithm
- 💡 Discuss how to choose the chunk size
- 💡 Emphasize I/O efficiency
- 💡 Be prepared to discuss the k-way merge in detail
- 💡 Know the MapReduce sorting approach
- 💡 Understand the memory vs. I/O tradeoff