Very Hard · ~90 min · Database Design

Design Control Plane for Distributed Database

Netflix · CockroachDB · MongoDB · Yugabyte · PlanetScale

πŸ“ Problem Description

Design the control plane for a distributed database like CockroachDB or Spanner. The control plane manages cluster membership, schema changes, data placement, and rebalancing, and exposes a management API for operations.
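
A minimal sketch of the management surface such a control plane might expose, written in Go. All interface and type names here are illustrative assumptions, not the API of CockroachDB, Spanner, or any other system.

```go
// Illustrative sketch of a control-plane surface; names are hypothetical.
package controlplane

import (
	"context"
	"time"
)

// NodeID identifies a data-plane node.
type NodeID string

// Membership tracks which nodes are part of the cluster.
type Membership interface {
	Join(ctx context.Context, id NodeID, addr string) error
	Leave(ctx context.Context, id NodeID) error
	Heartbeat(ctx context.Context, id NodeID, now time.Time) error
	LiveNodes(ctx context.Context) ([]NodeID, error)
}

// SchemaManager drives multi-phase online schema changes (DDL).
type SchemaManager interface {
	ApplyDDL(ctx context.Context, statement string) (jobID int64, err error)
	JobStatus(ctx context.Context, jobID int64) (string, error)
}

// Placement decides where range/shard replicas live and plans moves.
type Placement interface {
	PlanRebalance(ctx context.Context) ([]Move, error)
}

// Move relocates one replica of a range between stores.
type Move struct {
	RangeID   int64
	FromStore NodeID
	ToStore   NodeID
}
```

Splitting membership, schema, and placement into separate interfaces mirrors the usual separation of concerns inside the control plane, while all of them would persist their decisions through the same consensus-backed metadata store.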

👀 Use Cases

1. A node wants to join the cluster so that it is added to the membership and begins receiving data.
2. A DBA wants to add a column so that the schema change propagates without downtime.
3. The system wants to detect imbalance so that it triggers automatic rebalancing.
4. An operator wants to upgrade the database version so that the rolling upgrade completes without downtime.

✅ Functional Requirements

  • Cluster membership: node join/leave/failure detection
  • Online schema changes (DDL) without downtime
  • Data placement based on replication policies (see the placement sketch after this list)
  • Automatic rebalancing when nodes are added or removed
  • Backup and restore orchestration
  • Rolling upgrades without downtime
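
As a concrete example of policy-aware placement, here is a hedged Go sketch of a candidate filter that enforces a simple "at most one replica per zone" rule. The types and function are illustrative assumptions; a production allocator would also score candidates by load, disk fullness, and locality preferences.

```go
// Hypothetical placement filter for choosing a target store for a new replica
// while respecting a simple "at most one replica per zone" policy.
package placement

type Store struct {
	ID   string
	Zone string
}

// ChooseTarget returns a store that does not already hold a replica of the
// range and whose zone is not already covered by an existing replica.
func ChooseTarget(candidates []Store, existingReplicas []Store) (Store, bool) {
	usedZones := make(map[string]bool)
	usedStores := make(map[string]bool)
	for _, r := range existingReplicas {
		usedZones[r.Zone] = true
		usedStores[r.ID] = true
	}
	for _, c := range candidates {
		if usedStores[c.ID] || usedZones[c.Zone] {
			continue
		}
		return c, true
	}
	return Store{}, false // no constraint-satisfying store available
}
```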

⚡ Non-Functional Requirements

  • Converge to a balanced state within minutes
  • Handle 1000+ node clusters
  • Schema changes propagate in < 1 minute
  • Survive the loss of a minority of nodes
  • Zero data loss during rebalancing

⚠️ Constraints & Assumptions

  • Control-plane metadata must be strongly consistent (no split-brain): all decisions go through consensus (Raft/Paxos)
  • Support large clusters: 1000+ nodes, 100K+ ranges/shards, and frequent membership churn
  • Heartbeats every ~3s; failure detection must avoid flapping (tunable suspicion timers; see the detector sketch after this list)
  • Online schema changes only: DDL must be backward/forward compatible across rolling upgrades
  • Rebalancing must not violate replication policies (region/zone constraints) and must throttle to protect IO
  • Admin APIs must be authenticated/authorized (RBAC) and fully audit-logged
  • Network partitions are expected; minority partitions must not accept writes/placements (safety over liveness)
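
A rough Go sketch of the heartbeat/suspicion constraint, assuming the ~3s heartbeat interval above. The "four missed beats" multiplier is an illustrative choice, and any resulting membership change would still be proposed through Raft/Paxos rather than applied unilaterally.

```go
// Sketch of a heartbeat-based failure detector with a suspicion window to
// avoid flapping: a node is only suspected after missing several heartbeat
// intervals, and removal is then proposed via consensus.
package liveness

import (
	"sync"
	"time"
)

const (
	heartbeatInterval = 3 * time.Second
	suspicionTimeout  = 4 * heartbeatInterval // miss ~4 beats before suspecting
)

type Detector struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

func NewDetector() *Detector {
	return &Detector{lastSeen: make(map[string]time.Time)}
}

// RecordHeartbeat is called when a node's heartbeat arrives.
func (d *Detector) RecordHeartbeat(nodeID string, now time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.lastSeen[nodeID] = now
}

// Suspects returns nodes that have been silent longer than the suspicion
// timeout; the caller would propose their removal through Raft/Paxos.
func (d *Detector) Suspects(now time.Time) []string {
	d.mu.Lock()
	defer d.mu.Unlock()
	var out []string
	for id, seen := range d.lastSeen {
		if now.Sub(seen) > suspicionTimeout {
			out = append(out, id)
		}
	}
	return out
}
```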

📊 Capacity Estimation

👥 Users
Internal service - manages cluster operations
💾 Storage
Metadata: ~10GB for large clusters
⚡ QPS
Heartbeats: 10K/sec; Admin ops: 100/sec
📝 Assumptions
  • 1000 nodes in the cluster
  • Heartbeat every 3 seconds (sanity-checked in the sketch after this list)
  • 100K ranges/shards across the cluster
  • Schema changes: ~10/day
  • Rebalancing: ~1000 range moves/hour
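
A quick back-of-envelope check of these numbers in Go, illustrative only: node-level heartbeats at 1000 nodes every 3s come to only ~333/sec, so the 10K/sec heartbeat figure above presumably also counts per-store or per-range liveness and lease traffic; the ~100KB-per-range metadata figure is an assumption chosen to line up with the ~10GB total.

```go
// Back-of-envelope check of the capacity assumptions above.
package main

import "fmt"

func main() {
	const (
		nodes              = 1000
		heartbeatEverySec  = 3.0
		ranges             = 100_000
		metadataPerRangeKB = 100.0 // assumed: descriptors, lease, stats per range
		rangeMovesPerHour  = 1000
	)

	nodeHeartbeatsPerSec := nodes / heartbeatEverySec
	metadataGB := ranges * metadataPerRangeKB / 1e6
	movesPerSec := float64(rangeMovesPerHour) / 3600

	fmt.Printf("node heartbeats/sec: ~%.0f\n", nodeHeartbeatsPerSec) // ~333
	fmt.Printf("range metadata: ~%.0f GB\n", metadataGB)             // ~10 GB
	fmt.Printf("rebalance moves/sec: ~%.2f\n", movesPerSec)          // ~0.28
}
```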

💡 Key Concepts

  • Control Plane (CRITICAL): Centralized coordination for distributed database metadata and schema
  • Online Schema Change (HIGH): Multi-phase schema evolution without downtime (see the state sketch after this list)
  • Consensus Protocol (CRITICAL): Raft/Paxos for distributed agreement on metadata
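
To make "multi-phase schema evolution" concrete, here is a Go sketch of the intermediate states used by F1-style online schema changes (the approach CockroachDB also cites). State names are paraphrased, and the advancement rule is a simplification: the control plane waits for every node to acknowledge the current state (and for leases on the old schema version to expire) before moving on.

```go
// Sketch of the multi-phase online schema change pattern: a new schema element
// moves through intermediate states so that nodes on adjacent schema versions
// never disagree about what must be written or deleted.
package schemachange

type State int

const (
	Absent     State = iota // element does not exist yet
	DeleteOnly              // deletes apply to the element; reads/writes do not
	WriteOnly               // writes and deletes apply; reads still ignore it
	Backfill                // existing rows are filled in asynchronously
	Public                  // element is fully visible to all queries
)

// NextState advances one step once all nodes have acknowledged the current one.
func NextState(s State) State {
	switch s {
	case Absent:
		return DeleteOnly
	case DeleteOnly:
		return WriteOnly
	case WriteOnly:
		return Backfill
	case Backfill:
		return Public
	default:
		return Public
	}
}
```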

💡 Interview Tips

  • 💡 Start with the separation of control plane and data plane
  • 💡 Emphasize consensus protocols (Raft/Paxos) for metadata consistency
  • 💡 Discuss online schema changes - this is a key differentiator
  • 💡 Be prepared to discuss failure detection and recovery in depth
  • 💡 Know the tradeoffs between availability and consistency
  • 💡 Discuss how systems like CockroachDB, Spanner, or YugabyteDB handle these problems