📋 Problem Description
Design a system to migrate petabytes of data from on-premises data centers to Google Cloud. The system must handle live migrations with minimal downtime, ensure data integrity, and provide progress tracking and rollback capabilities.
🤔 Use Cases
1. DBA wants to initiate a migration so that data transfer begins with monitoring
2. System wants to replicate changes so that live data is kept in sync
3. System wants to validate data so that checksums match and integrity is confirmed
4. Operator wants to perform cutover so that applications switch to the cloud
✅ Functional Requirements
- Support various source systems (Oracle, MySQL, S3, HDFS)
- Continuous data replication for live migrations
- Schema conversion and transformation
- Data validation and checksum verification
- Progress monitoring and ETA estimation (see the sketch after this list)
- Rollback and restart capabilities
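A simple way to produce an ETA is to divide the remaining bytes by a moving average of recent throughput. A minimal sketch; the `EtaEstimator` class and its window size are hypothetical, for illustration only:

```python
import time
from collections import deque

class EtaEstimator:
    """Estimate time remaining from a moving average of recent throughput.

    Hypothetical helper: window_s controls how much recent history
    the throughput average considers.
    """

    def __init__(self, total_bytes: int, window_s: float = 300.0):
        self.total_bytes = total_bytes
        self.window_s = window_s
        self.samples: deque[tuple[float, int]] = deque()  # (timestamp, bytes_done)

    def record(self, bytes_done: int) -> None:
        now = time.monotonic()
        self.samples.append((now, bytes_done))
        # Drop samples older than the averaging window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def eta_seconds(self) -> float | None:
        if len(self.samples) < 2:
            return None
        (t0, b0), (t1, b1) = self.samples[0], self.samples[-1]
        if t1 <= t0 or b1 <= b0:
            return None  # no measurable progress over the window yet
        throughput = (b1 - b0) / (t1 - t0)  # bytes/sec, windowed average
        return (self.total_bytes - b1) / throughput
```

Windowing matters here: throughput on a shared 10 Gbps link fluctuates, so an ETA based on lifetime average drifts badly late in a multi-day transfer.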
⚡ Non-Functional Requirements
- Migrate 1 PB+ with a < 1 hour cutover window
- Zero data loss (exactly-once semantics)
- Bandwidth efficient (compression, deduplication)
- Resume from failure without re-transferring completed data
- Encrypt data in transit and at rest
⚠️ Constraints & Assumptions
- Source continues to change during migration; CDC must keep target lag within seconds to minutes
- Cutover window < 1 hour; rollback must be possible if validation fails
- Network bandwidth is finite (e.g., 10 Gbps); must use compression and chunking with resume
- Exactly-once semantics for change application (idempotency + ordering per primary key; see the sketch after this list)
- Data integrity must be provable (checksums, row counts, sampled deep validation)
- Security/compliance: encrypt in transit, least-privilege credentials, audit logs, and PII handling
- System must checkpoint frequently and resume without re-transferring terabytes
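One way to satisfy the exactly-once constraint is to make change application idempotent and ordered per primary key: track the highest applied log sequence number (LSN) for each row and skip anything at or below it. A minimal sketch, assuming a hypothetical change-event shape and target-DB client; in a real system the watermark would be persisted in the target database, updated in the same transaction as the row change:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeEvent:
    table: str
    pk: str           # primary key of the affected row
    lsn: int          # log sequence number from the source transaction log
    op: str           # "INSERT" | "UPDATE" | "DELETE"
    row: dict | None  # full row image (None for deletes)

class IdempotentApplier:
    """Apply at-least-once-delivered CDC events exactly once in effect.

    Tracks the highest LSN applied per (table, pk); replayed duplicates
    and out-of-order stale events are skipped, so reprocessing a batch
    after a crash cannot corrupt the target.
    """

    def __init__(self, target):
        self.target = target  # hypothetical target-DB client with upsert/delete
        self.applied_lsn: dict[tuple[str, str], int] = {}

    def apply(self, ev: ChangeEvent) -> bool:
        key = (ev.table, ev.pk)
        if self.applied_lsn.get(key, -1) >= ev.lsn:
            return False  # duplicate or stale: already applied
        if ev.op == "DELETE":
            self.target.delete(ev.table, ev.pk)
        else:
            self.target.upsert(ev.table, ev.pk, ev.row)  # insert-or-update
        self.applied_lsn[key] = ev.lsn
        return True
```

The upsert-plus-watermark combination is what makes replays safe: delivery can be at-least-once (much easier to build) while the effect on the target remains exactly-once.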
📊 Capacity Estimation
👥 Users
Internal operations teams and enterprise customers
💾 Storage
Source: 1 PB+; Staging: 100 TB; Target: 1 PB+
⚡ QPS
CDC events: 100K/sec; Validation queries: 1K/sec
🌐 Bandwidth
10 Gbps = 1.25 GB/sec = 108 TB/day
📝 Assumptions
- 1 PB data = 1,000,000 GB = 1,000 TB
- Compression ratio: 3:1 (transfer ~333 TB; see the worked estimate after this list)
- Network: 10 Gbps dedicated link
- Cutover window: 1 hour maximum
- CDC lag target: < 1 minute
- Validation: 100% checksums, 1% deep sampling
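These assumptions pin down the initial-load time: 1,000 TB compressed 3:1 is ~333 TB on the wire, and a fully utilized 10 Gbps link moves 108 TB/day, so the bulk load takes roughly 3.1 days. A quick check of that arithmetic:

```python
# Back-of-the-envelope initial-load estimate under the assumptions above.
data_tb = 1000            # 1 PB of source data
compression_ratio = 3     # 3:1 -> ~333 TB on the wire
link_gbps = 10            # dedicated link, assumed fully utilized

wire_tb = data_tb / compression_ratio               # ~333 TB
tb_per_day = link_gbps / 8 * 86_400 / 1000          # 10 Gbps = 1.25 GB/s = 108 TB/day
print(f"{wire_tb:.0f} TB on the wire, ~{wire_tb / tb_per_day:.1f} days")
# -> 333 TB on the wire, ~3.1 days
```

Real links are shared and lossy, so a planning figure of 4-5 days for the bulk load is safer; CDC replication covers the changes that accumulate meanwhile.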
💡 Key Concepts
| Priority | Concept | Description |
|---|---|---|
| CRITICAL | Change Data Capture | Capture ongoing changes from source transaction logs |
| HIGH | Chunked Transfer | Transfer data in chunks with resume capability (see the sketch below) |
| CRITICAL | Zero Downtime Migration | Keep source and target in sync during cutover |
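Chunked transfer with resume typically means fixed-size chunks, a per-chunk checksum verified by the receiver, and a checkpoint of completed chunk indices so a restart skips finished work. A minimal sketch, assuming a hypothetical `upload_chunk` transport callable and a local JSON checkpoint file:

```python
import hashlib
import json
import os

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB chunks

def load_checkpoint(path: str) -> set[int]:
    """Chunk indices already transferred, persisted as JSON."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def save_checkpoint(path: str, done: set[int]) -> None:
    with open(path, "w") as f:
        json.dump(sorted(done), f)

def transfer_file(src: str, upload_chunk, ckpt_path: str) -> None:
    """Resume-capable chunked transfer.

    upload_chunk(index, data, checksum) is a hypothetical transport call;
    the receiver verifies the checksum before acknowledging, so a chunk
    is only checkpointed once it is durably stored remotely.
    """
    done = load_checkpoint(ckpt_path)
    with open(src, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            if index not in done:  # skip chunks finished before a crash
                checksum = hashlib.sha256(data).hexdigest()
                upload_chunk(index, data, checksum)
                done.add(index)
                save_checkpoint(ckpt_path, done)  # checkpoint each chunk
            index += 1
```

This is what makes "resume from failure without re-transferring" cheap: after a crash, only the in-flight chunk is retried, never terabytes of completed work.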
💡 Interview Tips
- 💡 Start with the phases: initial load, CDC replication, validation, cutover
- 💡 Emphasize zero data loss and exactly-once semantics
- 💡 Discuss checkpointing and resume capability early
- 💡 Be prepared to discuss CDC mechanisms (log-based vs. trigger-based)
- 💡 Know the tradeoffs between downtime and complexity
- 💡 Discuss validation strategies in depth; this is often underestimated (see the sketch after these tips)
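On that last tip: validation is usually layered, with cheap row counts and order-insensitive checksums over 100% of rows, plus a deep field-by-field comparison on a small sample (1% per the assumptions above). A minimal sketch, assuming hypothetical `fetch_source`/`fetch_target` callables that return lists of row dicts keyed by `pk`:

```python
import hashlib
import random

def table_checksum(rows) -> str:
    """Order-insensitive checksum: hash each row, XOR the digests.

    XOR aggregation means source and target need not return rows in the
    same order, which matters across different database engines.
    """
    agg = 0
    for row in rows:
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        agg ^= int.from_bytes(digest, "big")
    return f"{agg:064x}"

def validate_table(fetch_source, fetch_target, sample_rate=0.01, seed=42):
    """Layered validation: counts, then checksums, then deep sampling."""
    src, tgt = fetch_source(), fetch_target()
    if len(src) != len(tgt):
        return False, "row count mismatch"
    if not src:
        return True, "ok (empty table)"
    if table_checksum(src) != table_checksum(tgt):
        return False, "checksum mismatch"
    # Deep sampling: field-by-field comparison of ~1% of rows.
    rng = random.Random(seed)
    sample = rng.sample(range(len(src)), max(1, int(len(src) * sample_rate)))
    by_pk = {row["pk"]: row for row in tgt}
    for i in sample:
        if src[i] != by_pk.get(src[i]["pk"]):
            return False, f"deep mismatch on pk={src[i]['pk']}"
    return True, "ok"
```

The layering is the point to make in an interview: counts and checksums are cheap enough to run on everything, while the expensive row-by-row comparison catches subtle type-conversion bugs that checksums of converted data can miss.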