📝 Problem Description
Design a system to sort datasets that don't fit in memory. Handle terabytes of data by using disk-based sorting algorithms, distributed processing, and merge operations.
👤 Use Cases
1. The system receives a 1TB input file and produces a sorted output file.
2. The system splits the input into chunks, sorts each chunk in memory, and writes it to disk as a sorted run.
3. The system merges the sorted runs to produce the final sorted output.
✅ Functional Requirements
- Sort datasets larger than available memory
- Handle multiple data formats (CSV, JSON, binary)
- Support custom sort keys and comparators
- Produce a sorted output file
- Track progress and handle restarts
⚡ Non-Functional Requirements
- Process 1TB in < 1 hour
- Minimize disk I/O
- Handle machine failures
- Use available memory efficiently
⚠️ Constraints & Assumptions
- Data doesn't fit in RAM
- Disk I/O is the bottleneck
- Sufficient temporary disk space must be available for intermediate files
📊 Capacity Estimation
👥 Users
N/A (batch processing)
💾 Storage
~3x the input size (input, intermediate sorted runs, and final output)
⚡ QPS
N/A (batch job)
📐 Assumptions
- 1TB input data
- 64GB available RAM
- SSD with 500MB/s read/write
- 100M records to sort
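Plugging these assumptions into a quick back-of-the-envelope check shows how many sorted runs to expect and why the 1-hour target pushes toward parallel I/O. The 32GB sort buffer (roughly half of RAM) is an illustrative choice, not part of the original numbers:

```python
# Rough feasibility check based on the assumptions above.
# The 32 GB sort-buffer budget (~half of RAM) is an illustrative assumption.
INPUT_BYTES = 1 * 10**12          # 1 TB input
SORT_BUFFER_BYTES = 32 * 10**9    # ~half of 64 GB RAM for the in-memory sort
DISK_BYTES_PER_SEC = 500 * 10**6  # 500 MB/s sequential read/write

# Run generation: ceil(1 TB / 32 GB) sorted runs written to disk.
num_runs = -(-INPUT_BYTES // SORT_BUFFER_BYTES)  # = 32

# With ~32 runs, a single k-way merge pass suffices (one read buffer per run),
# so the data is read and written twice in total: once to generate runs,
# once to merge them.
total_io_bytes = 4 * INPUT_BYTES
io_seconds = total_io_bytes / DISK_BYTES_PER_SEC

print(f"sorted runs: {num_runs}")
print(f"total sequential I/O: {total_io_bytes / 10**12:.0f} TB")
print(f"I/O time on one 500 MB/s disk: {io_seconds / 3600:.1f} h")
# ~2.2 h of raw I/O on a single disk, so hitting the < 1 hour target
# requires multiple disks or distributing the sort across machines.
```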
💡 Key Concepts
- External Merge Sort (CRITICAL): Sort chunks in memory, then merge them on disk using a k-way merge (see the sketch after this list).
- K-way Merge (CRITICAL): Merge K sorted files simultaneously using a min-heap.
- Memory Budgeting (HIGH): Allocate memory across the read buffers, the sort buffer, and the write buffer.
- Run Generation (MEDIUM): Create sorted runs from the input using replacement selection.
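To make the two CRITICAL concepts concrete, here is a minimal single-machine sketch, assuming a line-oriented text file with one record per line; the function name, the chunk_lines default, and the lexicographic default key are illustrative choices, not part of the original description:

```python
import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, chunk_lines=1_000_000, key=None):
    """External merge sort over a line-oriented text file.

    Phase 1 (run generation): read chunk_lines lines at a time, sort each
    chunk in memory, and write it out as a temporary sorted run.
    Phase 2 (k-way merge): merge all runs with a min-heap (heapq.merge)
    while streaming the result to output_path.
    """
    run_paths = []
    with open(input_path) as infile:
        while True:
            chunk = list(itertools.islice(infile, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=key)                    # in-memory sort of one chunk
            fd, run_path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run_file:
                run_file.writelines(chunk)
            run_paths.append(run_path)

    # K-way merge: heapq.merge keeps one "current" line per run in a min-heap,
    # so memory use is proportional to the number of runs, not the data size.
    run_files = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*run_files, key=key))
    finally:
        for f in run_files:
            f.close()
        for p in run_paths:
            os.remove(p)

# Example: external_sort("input.txt", "sorted.txt", key=lambda line: line.split(",")[0])
```

When there are more runs than the open-file or buffer budget allows, the merge is done in multiple passes; replacement selection during run generation roughly doubles the average run length on random input, cutting the number of runs that need merging.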
💡 Interview Tips
- 💡 Start with the external merge sort algorithm
- 💡 Discuss how to choose the chunk size
- 💡 Emphasize I/O efficiency
- 💡 Be prepared to discuss the k-way merge in detail
- 💡 Know the MapReduce sorting approach
- 💡 Understand the memory vs. I/O tradeoff