โ† Back to All Questions
Very Hard · ~85 min · Data Processing

Design System to Migrate Large Data to Cloud

Google · Oracle · AWS · Microsoft · Snowflake

๐Ÿ“ Problem Description

Design a system to migrate petabytes of data from on-premises data centers to Google Cloud. The system must handle live migrations with minimal downtime, ensure data integrity, and provide progress tracking and rollback capabilities.

👤 Use Cases

1. A DBA wants to initiate a migration so that data transfer begins with monitoring.
2. The system wants to replicate changes so that live data is kept in sync.
3. The system wants to validate data so that checksums match and integrity is confirmed.
4. An operator wants to perform a cutover so that applications switch to the cloud.

✅ Functional Requirements

  • Support various source systems (Oracle, MySQL, S3, HDFS)
  • Continuous data replication for live migrations
  • Schema conversion and transformation
  • Data validation and checksum verification (see the sketch after this list)
  • Progress monitoring and ETA estimation
  • Rollback and restart capabilities
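
One way to make the checksum verification concrete: a minimal sketch that hashes both sides chunk by chunk, assuming source and target can each be read as a stream of bytes. The 64 MB chunk size and the function names are illustrative, not tied to any particular tool.

```python
import hashlib
from itertools import zip_longest
from typing import Iterable, Iterator

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks (illustrative size)

def chunk_checksums(stream: Iterable[bytes]) -> Iterator[str]:
    """Yield one SHA-256 digest per fixed-size chunk of a byte stream."""
    buf = bytearray()
    for block in stream:
        buf.extend(block)
        while len(buf) >= CHUNK_SIZE:
            yield hashlib.sha256(bytes(buf[:CHUNK_SIZE])).hexdigest()
            del buf[:CHUNK_SIZE]
    if buf:
        yield hashlib.sha256(bytes(buf)).hexdigest()

def validate(source_stream: Iterable[bytes], target_stream: Iterable[bytes]) -> bool:
    """Compare source and target chunk digests pairwise; any mismatch
    or length difference fails validation."""
    for i, (src, dst) in enumerate(zip_longest(chunk_checksums(source_stream),
                                               chunk_checksums(target_stream))):
        if src != dst:
            print(f"chunk {i}: checksum mismatch or missing chunk")
            return False
    return True
```

In practice the per-chunk digests would also be stored alongside progress metadata, so validation can run incrementally instead of re-reading the full petabyte at cutover time.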

⚡ Non-Functional Requirements

  • Migrate 1 PB+ with a < 1 hour cutover window
  • Zero data loss (exactly-once semantics)
  • Bandwidth efficient (compression, deduplication)
  • Resume from failure without re-transferring
  • Encrypt data in transit and at rest

⚠️ Constraints & Assumptions

  • Source continues to change during migration; CDC must keep target lag within seconds to minutes
  • Cutover window < 1 hour; rollback must be possible if validation fails
  • Network bandwidth is finite (e.g., 10 Gbps); must use compression and chunking with resume
  • Exactly-once semantics for change application (idempotency + ordering per primary key; see the sketch after this list)
  • Data integrity must be provable (checksums, row counts, sampled deep validation)
  • Security/compliance: encrypt in transit, least-privilege credentials, audit logs, and PII handling
  • System must checkpoint frequently and resume without re-transferring terabytes
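
A minimal sketch of the exactly-once requirement, assuming log-based CDC events that carry a monotonically increasing log sequence number (LSN) per source row; the ChangeEvent shape and the target writer interface are illustrative assumptions. The key ideas are idempotent upserts and a per-key watermark of the last applied LSN.

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    pk: str    # primary key of the changed row
    lsn: int   # log sequence number from the source's transaction log
    op: str    # "INSERT" | "UPDATE" | "DELETE"
    row: dict  # new row image (empty for DELETE)

class IdempotentApplier:
    """Apply CDC events at most once per (pk, lsn) by remembering the
    highest LSN already applied for each primary key."""

    def __init__(self, target):
        self.target = target                    # hypothetical target writer
        self.applied_lsn: dict[str, int] = {}   # pk -> last applied LSN

    def apply(self, ev: ChangeEvent) -> None:
        if ev.lsn <= self.applied_lsn.get(ev.pk, -1):
            return  # duplicate or out-of-order replay: safe to drop
        if ev.op == "DELETE":
            self.target.delete(ev.pk)
        else:
            self.target.upsert(ev.pk, ev.row)   # upsert keeps retries idempotent
        self.applied_lsn[ev.pk] = ev.lsn
```

In a real pipeline the watermark would be persisted in the target database and updated in the same transaction as the write itself, so a crash between the two cannot produce duplicates or lost changes.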

📊 Capacity Estimation

👥 Users
Internal operations teams and enterprise customers
💾 Storage
Source: 1 PB+; Staging: 100 TB; Target: 1 PB+
⚡ QPS
CDC events: 100K/sec; Validation queries: 1K/sec
🌐 Bandwidth
10 Gbps = 1.25 GB/sec = 108 TB/day
📝 Assumptions
  • 1 PB data = 1,000,000 GB = 1,000 TB
  • Compression ratio: 3:1 (transfer ~333 TB; the transfer-time math is sketched below)
  • Network: 10 Gbps dedicated link
  • Cutover window: 1 hour maximum
  • CDC lag target: < 1 minute
  • Validation: 100% checksums, 1% deep sampling
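
A quick back-of-the-envelope check of the bulk-load time implied by these assumptions (the 80% link-efficiency factor is an added assumption, not part of the figures above):

```python
DATA_TB = 1000       # 1 PB of logical data
COMPRESSION = 3      # assumed 3:1 compression ratio
LINK_GBPS = 10       # dedicated link
EFFICIENCY = 0.8     # assumed protocol/retry overhead (extra assumption)

bytes_to_send = DATA_TB / COMPRESSION * 1e12           # ~333 TB on the wire
throughput = LINK_GBPS / 8 * 1e9 * EFFICIENCY          # effective bytes/sec
days = bytes_to_send / throughput / 86400
print(f"bulk load: ~{days:.1f} days at {throughput / 1e9:.2f} GB/s")
# ~3.9 days at 80% efficiency; ~3.1 days at the theoretical 1.25 GB/s
```

Either way the initial load takes several days, which is exactly why CDC replication has to run alongside it and catch the target up before the one-hour cutover window.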

💡 Key Concepts

  • Change Data Capture (CRITICAL): capture ongoing changes from source transaction logs
  • Chunked Transfer (HIGH): transfer data in chunks with resume capability (see the sketch after this list)
  • Zero Downtime Migration (CRITICAL): keep source and target in sync during cutover
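
A minimal sketch of chunked transfer with resume, assuming completed chunk indexes are recorded in a local checkpoint file; the chunk size, checkpoint path, and upload_chunk callback are illustrative assumptions.

```python
import json
import os

CHUNK_SIZE = 128 * 1024 * 1024   # 128 MB chunks (illustrative)
CHECKPOINT = "transfer.ckpt"     # hypothetical local checkpoint file

def load_done() -> set[int]:
    """Return the set of chunk indexes already transferred."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def mark_done(done: set[int], idx: int) -> None:
    """Record a completed chunk (fsync/atomic rename omitted for brevity)."""
    done.add(idx)
    with open(CHECKPOINT, "w") as f:
        json.dump(sorted(done), f)

def transfer_file(path: str, upload_chunk) -> None:
    """Upload a large file chunk by chunk; on restart, chunks recorded in
    the checkpoint are skipped instead of re-transferred.
    upload_chunk(idx, data) is a hypothetical uploader callback."""
    done = load_done()
    size = os.path.getsize(path)
    num_chunks = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
    with open(path, "rb") as f:
        for idx in range(num_chunks):
            if idx in done:
                continue                      # resume: already transferred
            f.seek(idx * CHUNK_SIZE)
            upload_chunk(idx, f.read(CHUNK_SIZE))
            mark_done(done, idx)
```

The same pattern scales out by sharding chunks across workers, with the checkpoint held in a shared metadata store rather than a local file.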

💡 Interview Tips

  • 💡 Start with the phases: initial load, CDC replication, validation, cutover (one possible ordering is sketched below)
  • 💡 Emphasize zero data loss and exactly-once semantics
  • 💡 Discuss checkpointing and resume capability early
  • 💡 Be prepared to discuss CDC mechanisms (log-based vs. trigger-based)
  • 💡 Know the tradeoffs between downtime and complexity
  • 💡 Discuss validation strategies in depth; this is often underestimated
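
One way to keep the phases straight while talking through the design; the phase names and ordering below are an illustrative assumption, not a prescribed workflow.

```python
from enum import Enum, auto

class Phase(Enum):
    INITIAL_LOAD = auto()     # bulk copy of a consistent snapshot
    CDC_REPLICATION = auto()  # tail source logs until lag is within target
    VALIDATION = auto()       # checksums, row counts, sampled deep checks
    CUTOVER = auto()          # freeze writes, drain CDC, switch applications
    ROLLBACK = auto()         # point applications back at source if validation fails

# One legal ordering: cutover is entered only after validation passes
# and CDC lag is within the agreed window; otherwise fall back to ROLLBACK.
HAPPY_PATH = [Phase.INITIAL_LOAD, Phase.CDC_REPLICATION,
              Phase.VALIDATION, Phase.CUTOVER]
```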