Very Hard · ~90 min · Database Design

Design Control Plane for Distributed Database

Netflix · CockroachDB · MongoDB · Yugabyte · PlanetScale

πŸ“ Problem Description

Design the control plane for a distributed database like CockroachDB or Spanner. The control plane manages cluster membership, schema changes, data placement, and rebalancing, and exposes a management API for operations.
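
A minimal sketch of the management surface such a control plane might expose, written in Go. All interface and type names here are illustrative assumptions, not the API of CockroachDB, Spanner, or any other system.

```go
// Illustrative sketch of a control-plane surface; names are hypothetical.
package controlplane

import (
	"context"
	"time"
)

// NodeID identifies a data-plane node.
type NodeID string

// Membership tracks which nodes are part of the cluster.
type Membership interface {
	Join(ctx context.Context, id NodeID, addr string) error
	Leave(ctx context.Context, id NodeID) error
	Heartbeat(ctx context.Context, id NodeID, now time.Time) error
	LiveNodes(ctx context.Context) ([]NodeID, error)
}

// SchemaManager drives multi-phase online schema changes (DDL).
type SchemaManager interface {
	ApplyDDL(ctx context.Context, statement string) (jobID int64, err error)
	JobStatus(ctx context.Context, jobID int64) (string, error)
}

// Placement decides where range/shard replicas live and plans moves.
type Placement interface {
	PlanRebalance(ctx context.Context) ([]Move, error)
}

// Move relocates one replica of a range between stores.
type Move struct {
	RangeID   int64
	FromStore NodeID
	ToStore   NodeID
}
```

Splitting membership, schema, and placement into separate interfaces mirrors the usual separation of concerns inside the control plane, while all of them would persist their decisions through the same consensus-backed metadata store.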

👀 Use Cases

1. A node wants to join the cluster so that it is added to the membership and begins receiving data.
2. A DBA wants to add a column so that the schema change propagates without downtime.
3. The system wants to detect imbalance so that it triggers automatic rebalancing.
4. An operator wants to upgrade the database version so that the rolling upgrade completes without downtime.

✅ Functional Requirements

  • Cluster membership: node join/leave/failure detection
  • Online schema changes (DDL) without downtime
  • Data placement based on replication policies (see the placement sketch after this list)
  • Automatic rebalancing when nodes are added or removed
  • Backup and restore orchestration
  • Rolling upgrades without downtime
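
As a concrete example of policy-aware placement, here is a hedged Go sketch of a candidate filter that enforces a simple "at most one replica per zone" rule. The types and function are illustrative assumptions; a production allocator would also score candidates by load, disk fullness, and locality preferences.

```go
// Hypothetical placement filter for choosing a target store for a new replica
// while respecting a simple "at most one replica per zone" policy.
package placement

type Store struct {
	ID   string
	Zone string
}

// ChooseTarget returns a store that does not already hold a replica of the
// range and whose zone is not already covered by an existing replica.
func ChooseTarget(candidates []Store, existingReplicas []Store) (Store, bool) {
	usedZones := make(map[string]bool)
	usedStores := make(map[string]bool)
	for _, r := range existingReplicas {
		usedZones[r.Zone] = true
		usedStores[r.ID] = true
	}
	for _, c := range candidates {
		if usedStores[c.ID] || usedZones[c.Zone] {
			continue
		}
		return c, true
	}
	return Store{}, false // no constraint-satisfying store available
}
```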

⚡ Non-Functional Requirements

  • Converge to a balanced state within minutes
  • Handle 1000+ node clusters
  • Schema changes propagate in < 1 minute
  • Survive the loss of a minority of nodes
  • Zero data loss during rebalancing

⚠️ Constraints & Assumptions

  • Control-plane metadata must be strongly consistent (no split-brain): all decisions go through consensus (Raft/Paxos)
  • Support large clusters: 1000+ nodes, 100K+ ranges/shards, and frequent membership churn
  • Heartbeats every ~3s; failure detection must avoid flapping (tunable suspicion timers; see the detector sketch after this list)
  • Online schema changes only: DDL must be backward/forward compatible across rolling upgrades
  • Rebalancing must not violate replication policies (region/zone constraints) and must throttle to protect IO
  • Admin APIs must be authenticated/authorized (RBAC) and fully audit-logged
  • Network partitions are expected; minority partitions must not accept writes/placements (safety over liveness)
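
A rough Go sketch of the heartbeat/suspicion constraint, assuming the ~3s heartbeat interval above. The "four missed beats" multiplier is an illustrative choice, and any resulting membership change would still be proposed through Raft/Paxos rather than applied unilaterally.

```go
// Sketch of a heartbeat-based failure detector with a suspicion window to
// avoid flapping: a node is only suspected after missing several heartbeat
// intervals, and removal is then proposed via consensus.
package liveness

import (
	"sync"
	"time"
)

const (
	heartbeatInterval = 3 * time.Second
	suspicionTimeout  = 4 * heartbeatInterval // miss ~4 beats before suspecting
)

type Detector struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

func NewDetector() *Detector {
	return &Detector{lastSeen: make(map[string]time.Time)}
}

// RecordHeartbeat is called when a node's heartbeat arrives.
func (d *Detector) RecordHeartbeat(nodeID string, now time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.lastSeen[nodeID] = now
}

// Suspects returns nodes that have been silent longer than the suspicion
// timeout; the caller would propose their removal through Raft/Paxos.
func (d *Detector) Suspects(now time.Time) []string {
	d.mu.Lock()
	defer d.mu.Unlock()
	var out []string
	for id, seen := range d.lastSeen {
		if now.Sub(seen) > suspicionTimeout {
			out = append(out, id)
		}
	}
	return out
}
```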

📊 Capacity Estimation

👥 Users
Internal service - manages cluster operations
💾 Storage
Metadata: ~10GB for large clusters
⚡ QPS
Heartbeats: 10K/sec; Admin ops: 100/sec
📝 Assumptions
  • 1000 nodes in the cluster
  • Heartbeat every 3 seconds (sanity-checked in the sketch after this list)
  • 100K ranges/shards across the cluster
  • Schema changes: ~10/day
  • Rebalancing: ~1000 range moves/hour
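
A quick back-of-envelope check of these numbers in Go, illustrative only: node-level heartbeats at 1000 nodes every 3s come to only ~333/sec, so the 10K/sec heartbeat figure above presumably also counts per-store or per-range liveness and lease traffic; the ~100KB-per-range metadata figure is an assumption chosen to line up with the ~10GB total.

```go
// Back-of-envelope check of the capacity assumptions above.
package main

import "fmt"

func main() {
	const (
		nodes              = 1000
		heartbeatEverySec  = 3.0
		ranges             = 100_000
		metadataPerRangeKB = 100.0 // assumed: descriptors, lease, stats per range
		rangeMovesPerHour  = 1000
	)

	nodeHeartbeatsPerSec := nodes / heartbeatEverySec
	metadataGB := ranges * metadataPerRangeKB / 1e6
	movesPerSec := float64(rangeMovesPerHour) / 3600

	fmt.Printf("node heartbeats/sec: ~%.0f\n", nodeHeartbeatsPerSec) // ~333
	fmt.Printf("range metadata: ~%.0f GB\n", metadataGB)             // ~10 GB
	fmt.Printf("rebalance moves/sec: ~%.2f\n", movesPerSec)          // ~0.28
}
```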

💡 Key Concepts

  • Control Plane (CRITICAL): Centralized coordination for distributed database metadata and schema
  • Online Schema Change (HIGH): Multi-phase schema evolution without downtime (see the state sketch after this list)
  • Consensus Protocol (CRITICAL): Raft/Paxos for distributed agreement on metadata
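
To make "multi-phase schema evolution" concrete, here is a Go sketch of the intermediate states used by F1-style online schema changes (the approach CockroachDB also cites). State names are paraphrased, and the advancement rule is a simplification: the control plane waits for every node to acknowledge the current state (and for leases on the old schema version to expire) before moving on.

```go
// Sketch of the multi-phase online schema change pattern: a new schema element
// moves through intermediate states so that nodes on adjacent schema versions
// never disagree about what must be written or deleted.
package schemachange

type State int

const (
	Absent     State = iota // element does not exist yet
	DeleteOnly              // deletes apply to the element; reads/writes do not
	WriteOnly               // writes and deletes apply; reads still ignore it
	Backfill                // existing rows are filled in asynchronously
	Public                  // element is fully visible to all queries
)

// NextState advances one step once all nodes have acknowledged the current one.
func NextState(s State) State {
	switch s {
	case Absent:
		return DeleteOnly
	case DeleteOnly:
		return WriteOnly
	case WriteOnly:
		return Backfill
	case Backfill:
		return Public
	default:
		return Public
	}
}
```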

💡 Interview Tips

  • 💡 Start with the separation of control plane and data plane
  • 💡 Emphasize consensus protocols (Raft/Paxos) for metadata consistency
  • 💡 Discuss online schema changes - this is a key differentiator
  • 💡 Be prepared to discuss failure detection and recovery in depth
  • 💡 Know the tradeoffs between availability and consistency
  • 💡 Discuss how systems like CockroachDB, Spanner, or YugabyteDB handle these problems