📝 Problem Description
Design a feature store that provides a centralized repository for ML features, serving both batch training and real-time inference with consistency guarantees. Support feature discovery, lineage tracking, and point-in-time correctness.
👤 Use Cases
1. A data scientist defines a new feature so that it is registered and computed.
2. A model requests features for inference and receives them in under 10 ms.
3. A training pipeline fetches historical features to build a point-in-time correct dataset.
4. An ML engineer discovers existing features so they can be reused across models.
✅ Functional Requirements
- Define and register feature definitions (schemas); see the registration sketch after this list
- Ingest features from batch (Spark) and streaming (Kafka) sources
- Serve features for online inference (low latency)
- Serve features for offline training (batch retrieval)
- Point-in-time correct feature joins for training
- Feature discovery and documentation
- Feature lineage and data quality monitoring
- Feature sharing across teams and models
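
A minimal sketch of what feature registration could look like. The names (`Feature`, `FeatureView`, `register`) are hypothetical, loosely in the spirit of open-source feature stores such as Feast, not taken from any specific library:

```python
# Hypothetical registration API sketch: a feature view groups features
# under one entity (join key) and one source, plus an online TTL.
from dataclasses import dataclass
from enum import Enum


class ValueType(Enum):
    INT64 = "int64"
    FLOAT = "float"
    STRING = "string"


@dataclass
class Feature:
    name: str                  # e.g. "purchase_count_7d"
    dtype: ValueType
    description: str = ""


@dataclass
class FeatureView:
    name: str                  # logical group of features
    entity: str                # join key, e.g. "user_id"
    features: list             # list of Feature
    source: str                # batch table or Kafka topic
    ttl_seconds: int = 30 * 24 * 3600  # matches the 30-day online retention


registry = {}                  # in-memory stand-in for a metadata registry


def register(view: FeatureView) -> None:
    """Validate and record a feature view; rejects duplicate names."""
    if view.name in registry:
        raise ValueError(f"feature view {view.name!r} already registered")
    registry[view.name] = view


register(FeatureView(
    name="user_purchase_stats",
    entity="user_id",
    features=[Feature("purchase_count_7d", ValueType.INT64)],
    source="kafka://purchases",
))
```

A real registry would persist these definitions and share them between the batch and streaming ingestion paths, which is what makes lineage tracking and online/offline consistency possible.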
⚡ Non-Functional Requirements
- Online serving latency < 10 ms (P99)
- Support 1M+ feature lookups per second (see the sharding sketch after this list)
- Petabytes of historical feature data
- Consistency between online and offline stores
- Support 10,000+ feature definitions
- Handle late-arriving data correctly
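
To make the 1M+ lookups/sec target tangible, here is a sketch of hash-based shard routing with an in-process LRU to blunt hot keys. `NUM_SHARDS` and the helper names are assumptions; a production cache would also need a TTL so cached values respect the freshness SLA:

```python
import hashlib
from functools import lru_cache

NUM_SHARDS = 64  # assumed shard count for the online store


def shard_for(entity_key: str) -> int:
    """Stable shard assignment from a hash of the entity key."""
    digest = hashlib.md5(entity_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def online_store_get(shard: int, key: str) -> bytes:
    """Stub standing in for a network call to that shard's KV store."""
    return f"features-for-{key}@shard{shard}".encode()


@lru_cache(maxsize=100_000)
def cached_lookup(entity_key: str) -> bytes:
    # The LRU shields shards from hot keys (e.g. celebrity user_ids) so a
    # single entity cannot dominate one shard's read capacity.
    return online_store_get(shard_for(entity_key), entity_key)


print(cached_lookup("user_42"))
```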
⚠️ Constraints & Assumptions
- Point-in-time correctness is mandatory for training datasets to prevent data leakage (see the join sketch after this list)
- Online/offline consistency: definitions and transformations must be shared; drift must be detected
- Online P99 latency < 10 ms (feature vector assembly included)
- High-cardinality entities (user_id/item_id) require sharding/partitioning and hot-key protection
- Feature freshness SLAs (e.g., < 5 minutes staleness for streaming features) must be enforced
- Schema evolution: add/rename/deprecate features without breaking existing models
- Backfills and re-computation must be supported without taking serving offline
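
Point-in-time correctness is the constraint interviewers probe hardest, so a concrete sketch helps. The `pandas.merge_asof` call below performs an as-of join: each label row receives the latest feature value at or before its event timestamp, never a future one, which is exactly what prevents leakage (table and column names are illustrative):

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [0, 1, 1],
})

features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-08"]),
    "purchase_count_7d": [3, 7, 2],
})

# merge_asof requires both frames to be sorted by their time keys.
training_set = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",   # only look backward in time
)
print(training_set)
```

User 1's label on 2024-01-20 picks up the feature value from 2024-01-15, not anything computed later; at petabyte scale the same join is typically executed in Spark, but the semantics are identical.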
📊 Capacity Estimation
👥 Users: 1,000 ML engineers; 10,000 models
💾 Storage: Online 10 TB; Offline 1 PB
⚡ QPS: Online reads 1M/sec; Batch writes 100K/sec
🌐 Bandwidth: Online 1 GB/sec; Batch ingestion 10 TB/day
📐 Assumptions
- 10,000 feature definitions
- Average feature: 100 bytes
- 100 features per inference request
- 30-day online retention, 3-year offline (sanity-checked in the sketch after this list)
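
A quick plain-arithmetic sanity check ties these figures together. How "1M/sec" is counted (per feature vs. per request) is our assumption, so both readings are shown:

```python
BYTES_PER_FEATURE = 100
FEATURES_PER_REQUEST = 100
payload_per_request = BYTES_PER_FEATURE * FEATURES_PER_REQUEST  # 10 KB

# Per-feature reading: 1e6 lookups/s * 100 B  = 0.1 GB/s of raw bytes.
# Per-request reading: 1e6 requests/s * 10 KB = 10 GB/s of raw bytes.
# The stated 1 GB/s corresponds to ~100K full feature-vector requests/sec:
print(f"{1e9 / payload_per_request:,.0f} requests/sec at 1 GB/s")  # 100,000

# 10 TB/day of batch ingestion over 3 years is ~11 PB of raw data, so the
# 1 PB offline figure implies compaction, compression, or rollups.
print(f"{10 * 365 * 3 / 1000:.2f} PB raw over 3 years")  # 10.95
```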
💡 Key Concepts
- Feature Store (CRITICAL): centralized repository for ML features serving both training and inference
- Point-in-Time Correctness (CRITICAL): join features as they existed at training time to prevent data leakage
- Dual Store Architecture (HIGH): online store for low-latency serving, offline store for training (see the sketch after this list)
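
A minimal in-memory sketch of the dual-store idea (all names hypothetical): one write path feeds both stores from the same feature definition, which is what keeps training and serving consistent:

```python
online_store = {}   # (feature_name, entity_key) -> latest value
offline_log = []    # append-only (feature_name, entity_key, value, ts)


def write_feature(feature: str, entity: str, value: float, ts: str) -> None:
    """Dual write: latest value to the online store, full history offline."""
    online_store[(feature, entity)] = value
    offline_log.append((feature, entity, value, ts))


def get_online(features: list, entity: str) -> list:
    """Assemble a feature vector for inference; missing features come
    back as None so the caller can impute defaults."""
    return [online_store.get((f, entity)) for f in features]


write_feature("purchase_count_7d", "user_42", 7.0, "2024-01-15T00:00:00Z")
print(get_online(["purchase_count_7d", "avg_basket_size"], "user_42"))
# -> [7.0, None]
```

In production the online map would be a sharded KV store (e.g., Redis or Cassandra) and the offline log a columnar warehouse, but the invariant is the same: both are populated from one definition, never from two independent pipelines.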
💡 Interview Tips
- 💡 Start with the problem: training-serving skew
- 💡 Discuss the dual-store architecture (online + offline)
- 💡 Emphasize point-in-time correctness
- 💡 Be prepared to discuss feature freshness requirements
- 💡 Know the difference between batch and streaming features
- 💡 Understand the role of feature stores in MLOps