📝 Problem Description
Design a feature store that provides a centralized repository for ML features, serving both batch training and real-time inference with consistency guarantees. Support feature discovery, lineage tracking, and point-in-time correctness.
👤 Use Cases
1. A data scientist defines a new feature so that it is registered and computed.
2. A model requests features for inference and receives them in under 10 ms.
3. A training pipeline fetches historical features to build a point-in-time correct dataset.
4. An ML engineer discovers existing features so they can be reused across models.
✅ Functional Requirements
- Define and register feature definitions (schemas); see the registration sketch after this list
- Ingest features from batch (Spark) and streaming (Kafka) sources
- Serve features for online inference (low latency)
- Serve features for offline training (batch retrieval)
- Point-in-time correct feature joins for training
- Feature discovery and documentation
- Feature lineage and data quality monitoring
- Feature sharing across teams and models
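
A minimal sketch of what feature registration could look like. The names (`Feature`, `FeatureView`, `register`) are hypothetical, loosely in the spirit of open-source feature stores such as Feast, not taken from any specific library:

```python
# Hypothetical registration API sketch: a feature view groups features
# under one entity (join key) and one source, plus an online TTL.
from dataclasses import dataclass
from enum import Enum


class ValueType(Enum):
    INT64 = "int64"
    FLOAT = "float"
    STRING = "string"


@dataclass
class Feature:
    name: str                  # e.g. "purchase_count_7d"
    dtype: ValueType
    description: str = ""


@dataclass
class FeatureView:
    name: str                  # logical group of features
    entity: str                # join key, e.g. "user_id"
    features: list             # list of Feature
    source: str                # batch table or Kafka topic
    ttl_seconds: int = 30 * 24 * 3600  # matches the 30-day online retention


registry = {}                  # in-memory stand-in for a metadata registry


def register(view: FeatureView) -> None:
    """Validate and record a feature view; rejects duplicate names."""
    if view.name in registry:
        raise ValueError(f"feature view {view.name!r} already registered")
    registry[view.name] = view


register(FeatureView(
    name="user_purchase_stats",
    entity="user_id",
    features=[Feature("purchase_count_7d", ValueType.INT64)],
    source="kafka://purchases",
))
```

A real registry would persist these definitions and share them between the batch and streaming ingestion paths, which is what makes lineage tracking and online/offline consistency possible.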
⚡ Non-Functional Requirements
- Online serving latency < 10 ms (P99)
- Support 1M+ feature lookups per second (see the sharding sketch after this list)
- Petabytes of historical feature data
- Consistency between online and offline stores
- Support 10,000+ feature definitions
- Handle late-arriving data correctly
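
To make the 1M+ lookups/sec target tangible, here is a sketch of hash-based shard routing with an in-process LRU to blunt hot keys. `NUM_SHARDS` and the helper names are assumptions; a production cache would also need a TTL so cached values respect the freshness SLA:

```python
import hashlib
from functools import lru_cache

NUM_SHARDS = 64  # assumed shard count for the online store


def shard_for(entity_key: str) -> int:
    """Stable shard assignment from a hash of the entity key."""
    digest = hashlib.md5(entity_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def online_store_get(shard: int, key: str) -> bytes:
    """Stub standing in for a network call to that shard's KV store."""
    return f"features-for-{key}@shard{shard}".encode()


@lru_cache(maxsize=100_000)
def cached_lookup(entity_key: str) -> bytes:
    # The LRU shields shards from hot keys (e.g. celebrity user_ids) so a
    # single entity cannot dominate one shard's read capacity.
    return online_store_get(shard_for(entity_key), entity_key)


print(cached_lookup("user_42"))
```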
⚠️ Constraints & Assumptions
- Point-in-time correctness is mandatory for training datasets to prevent data leakage (see the join sketch after this list)
- Online/offline consistency: definitions and transformations must be shared; drift must be detected
- Online P99 latency < 10 ms (feature vector assembly included)
- High-cardinality entities (user_id/item_id) require sharding/partitioning and hot-key protection
- Feature freshness SLAs (e.g., < 5 minutes staleness for streaming features) must be enforced
- Schema evolution: add/rename/deprecate features without breaking existing models
- Backfills and re-computation must be supported without taking serving offline
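
Point-in-time correctness is the constraint interviewers probe hardest, so a concrete sketch helps. The `pandas.merge_asof` call below performs an as-of join: each label row receives the latest feature value at or before its event timestamp, never a future one, which is exactly what prevents leakage (table and column names are illustrative):

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [0, 1, 1],
})

features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-08"]),
    "purchase_count_7d": [3, 7, 2],
})

# merge_asof requires both frames to be sorted by their time keys.
training_set = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",   # only look backward in time
)
print(training_set)
```

User 1's label on 2024-01-20 picks up the feature value from 2024-01-15, not anything computed later; at petabyte scale the same join is typically executed in Spark, but the semantics are identical.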
📊 Capacity Estimation
👥 Users: 1,000 ML engineers; 10,000 models
💾 Storage: Online 10 TB; Offline 1 PB
⚡ QPS: Online reads 1M/sec; Batch writes 100K/sec
🌐 Bandwidth: Online 1 GB/sec; Batch ingestion 10 TB/day
📐 Assumptions
- 10,000 feature definitions
- Average feature: 100 bytes
- 100 features per inference request
- 30-day online retention, 3-year offline (sanity-checked in the sketch after this list)
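
A quick plain-arithmetic sanity check ties these figures together. How "1M/sec" is counted (per feature vs. per request) is our assumption, so both readings are shown:

```python
BYTES_PER_FEATURE = 100
FEATURES_PER_REQUEST = 100
payload_per_request = BYTES_PER_FEATURE * FEATURES_PER_REQUEST  # 10 KB

# Per-feature reading: 1e6 lookups/s * 100 B  = 0.1 GB/s of raw bytes.
# Per-request reading: 1e6 requests/s * 10 KB = 10 GB/s of raw bytes.
# The stated 1 GB/s corresponds to ~100K full feature-vector requests/sec:
print(f"{1e9 / payload_per_request:,.0f} requests/sec at 1 GB/s")  # 100,000

# 10 TB/day of batch ingestion over 3 years is ~11 PB of raw data, so the
# 1 PB offline figure implies compaction, compression, or rollups.
print(f"{10 * 365 * 3 / 1000:.2f} PB raw over 3 years")  # 10.95
```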
💡 Key Concepts
- Feature Store (CRITICAL): centralized repository for ML features serving both training and inference
- Point-in-Time Correctness (CRITICAL): join features as they existed at training time to prevent data leakage
- Dual Store Architecture (HIGH): online store for low-latency serving, offline store for training (see the sketch after this list)
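
A minimal in-memory sketch of the dual-store idea (all names hypothetical): one write path feeds both stores from the same feature definition, which is what keeps training and serving consistent:

```python
online_store = {}   # (feature_name, entity_key) -> latest value
offline_log = []    # append-only (feature_name, entity_key, value, ts)


def write_feature(feature: str, entity: str, value: float, ts: str) -> None:
    """Dual write: latest value to the online store, full history offline."""
    online_store[(feature, entity)] = value
    offline_log.append((feature, entity, value, ts))


def get_online(features: list, entity: str) -> list:
    """Assemble a feature vector for inference; missing features come
    back as None so the caller can impute defaults."""
    return [online_store.get((f, entity)) for f in features]


write_feature("purchase_count_7d", "user_42", 7.0, "2024-01-15T00:00:00Z")
print(get_online(["purchase_count_7d", "avg_basket_size"], "user_42"))
# -> [7.0, None]
```

In production the online map would be a sharded KV store (e.g., Redis or Cassandra) and the offline log a columnar warehouse, but the invariant is the same: both are populated from one definition, never from two independent pipelines.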
💡 Interview Tips
- 💡 Start with the problem: training-serving skew
- 💡 Discuss the dual-store architecture (online + offline)
- 💡 Emphasize point-in-time correctness
- 💡 Be prepared to discuss feature freshness requirements
- 💡 Know the difference between batch and streaming features
- 💡 Understand the role of feature stores in MLOps