Very Hard · ~80 min · Data Processing

Design Feature Store for Machine Learning

Uber · Airbnb · LinkedIn · Spotify · Stripe · Tecton · Feast

📝 Problem Description

Design a feature store that provides a centralized repository for ML features, serving both batch training and real-time inference with consistency guarantees. Support feature discovery, lineage tracking, and point-in-time correctness.

👤 Use Cases

1. A Data Scientist wants to define a new feature so that it is registered and computed.
2. A Model wants to request features at inference time so that it receives them in under 10 ms.
3. A Training Pipeline wants to fetch historical features so that it gets a point-in-time correct dataset.
4. An ML Engineer wants to discover existing features so that they can be reused across models.

✅ Functional Requirements

  • Define and register feature definitions (schemas); see the registry sketch after this list
  • Ingest features from batch (Spark) and streaming (Kafka) sources
  • Serve features for online inference (low latency)
  • Serve features for offline training (batch retrieval)
  • Point-in-time correct feature joins for training
  • Feature discovery and documentation
  • Feature lineage and data quality monitoring
  • Feature sharing across teams and models
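
The first requirement, defining and registering features, is easiest to discuss with a concrete shape in mind. Below is a minimal sketch of a registry, loosely modeled on Feast-style concepts; the names (`Feature`, `FeatureView`, `FeatureRegistry`) and their fields are illustrative assumptions, not any library's actual API.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class Feature:
    name: str            # e.g. "purchases_7d"
    dtype: str           # e.g. "int64", "float32"
    description: str = ""

@dataclass(frozen=True)
class FeatureView:
    name: str                      # logical feature group, e.g. "user_activity"
    entity: str                    # join key, e.g. "user_id"
    features: tuple[Feature, ...]
    source: str                    # "batch:spark" or "stream:kafka"
    ttl: timedelta                 # staleness bound for the online store
    owner: str = ""                # owning team, for discovery and lineage

class FeatureRegistry:
    """In-memory stand-in for the metadata/registry service."""

    def __init__(self) -> None:
        self._views: dict[str, FeatureView] = {}

    def register(self, view: FeatureView) -> None:
        if view.name in self._views:
            raise ValueError(f"feature view {view.name!r} already registered")
        self._views[view.name] = view

    def search(self, keyword: str) -> list[FeatureView]:
        # Feature discovery: naive substring match over view and feature names.
        return [
            v for v in self._views.values()
            if keyword in v.name or any(keyword in f.name for f in v.features)
        ]
```

A production registry would persist and version these definitions (enabling schema evolution) and compile both the Spark and Kafka pipelines from the same definition, which is what keeps online and offline transformations consistent.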

⚡ Non-Functional Requirements

  • Online serving latency < 10ms (P99)
  • Support 1M+ feature lookups per second
  • Petabytes of historical feature data
  • Consistency between online and offline stores
  • Support 10,000+ feature definitions
  • Handle late-arriving data correctly (see the ingestion sketch after this list)
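
One common way to satisfy the last requirement is to let the two stores treat late data differently: the offline store keeps every record, while the online store applies last-write-wins by event time rather than arrival time. A minimal sketch, with in-memory stand-ins for both stores:

```python
from datetime import datetime

# Hypothetical stand-ins: ONLINE maps (entity_id, feature) to the
# freshest (event_ts, value); OFFLINE_LOG is an append-only history.
ONLINE: dict[tuple[str, str], tuple[datetime, float]] = {}
OFFLINE_LOG: list[tuple[str, str, datetime, float]] = []

def ingest(entity_id: str, feature: str, event_ts: datetime, value: float) -> None:
    # Offline store is append-only: late records are always kept, so
    # historical point-in-time queries stay correct after a backfill.
    OFFLINE_LOG.append((entity_id, feature, event_ts, value))

    # Online store is last-write-wins by event time, not arrival time:
    # a late-arriving record never clobbers a fresher value.
    key = (entity_id, feature)
    current = ONLINE.get(key)
    if current is None or event_ts > current[0]:
        ONLINE[key] = (event_ts, value)
```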

⚠️ Constraints & Assumptions

  • Point-in-time correctness is mandatory for training datasets (prevent data leakage); see the join example after this list
  • Online/offline consistency: definitions and transformations must be shared; drift must be detected
  • Online P99 latency < 10ms (feature vector assembly included)
  • High-cardinality entities (user_id/item_id) require sharding/partitioning and hot-key protection
  • Feature freshness SLAs (e.g., < 5 minutes staleness for streaming features) must be enforced
  • Schema evolution: add/rename/deprecate features without breaking existing models
  • Backfills and re-computation must be supported without taking serving offline
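
Point-in-time correctness deserves a concrete illustration, since it is the constraint most often gotten wrong. In pandas, `merge_asof` with `direction="backward"` performs exactly the required join: for each label row, the most recent feature value at or before the label timestamp, never a future one. The data below is invented for the example:

```python
import pandas as pd

# Label events: what we want to predict, and when the label was observed.
labels = pd.DataFrame({
    "user_id":  [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label":    [0, 1, 0],
})

# Feature snapshots: the value of purchases_7d as of each computation time.
features = pd.DataFrame({
    "user_id":      [1, 1, 2, 2],
    "feature_ts":   pd.to_datetime(["2024-01-01", "2024-01-15",
                                    "2024-01-01", "2024-01-12"]),
    "purchases_7d": [3, 7, 1, 4],
})

# For each label row, take the most recent feature row whose timestamp
# is <= the label timestamp. Never a future value: no data leakage.
train = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
print(train[["user_id", "event_ts", "purchases_7d", "label"]])
```

At warehouse scale this join runs in Spark over partitioned snapshots rather than in pandas, but the semantics are the same.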

📊 Capacity Estimation

👥 Users: 1,000 ML engineers; 10,000 models
💾 Storage: Online: 10 TB; Offline: 1 PB
⚡ QPS: Online reads: 1M/sec; Batch writes: 100K/sec
🌐 Bandwidth: Online: 1 GB/sec; Batch ingestion: 10 TB/day
📐 Assumptions
  • 10,000 feature definitions
  • Average feature: 100 bytes
  • 100 features per inference request
  • 30-day online retention, 3-year offline
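
These figures can be sanity-checked with quick arithmetic. The entity count below (1 billion) is an assumption introduced here to make the check concrete; everything else comes from the numbers above, reading "1M/sec" as individual feature lookups:

```python
# Back-of-envelope check of the stated capacity numbers.
feature_bytes       = 100             # average feature size (stated)
features_per_entity = 100             # features per inference request (stated)
entities            = 1_000_000_000   # ASSUMPTION: e.g. a global user base

online_store = entities * features_per_entity * feature_bytes
print(f"online store ~ {online_store / 1e12:.0f} TB")    # ~10 TB, matches

lookups_per_sec = 1_000_000           # individual feature reads (stated)
payload = lookups_per_sec * feature_bytes
print(f"raw read payload ~ {payload / 1e6:.0f} MB/s")    # ~100 MB/s
# The stated 1 GB/sec online budget then leaves roughly 10x headroom
# for serialization, network framing, and replication traffic.
```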

💡 Key Concepts

  • Feature Store (CRITICAL): Centralized repository for ML features serving both training and inference
  • Point-in-Time Correctness (CRITICAL): Join features as they existed at training time to prevent data leakage
  • Dual Store Architecture (HIGH): Online store for low-latency serving, offline store for training (see the sketch below)
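
A minimal sketch of the online half of the dual-store architecture, assuming Redis as the online store (any low-latency KV store works) and an invented `feat:{name}:{entity_id}` key schema:

```python
import redis  # assumed dependency: pip install redis

r = redis.Redis()  # online store; the offline store lives in a warehouse/lake

def get_feature_vector(entity_id: str,
                       feature_names: list[str]) -> dict[str, float | None]:
    # One batched MGET fetches all features for the entity in a single
    # round trip, so latency stays near one RTT even for 100-feature
    # vectors; this is what makes a <10 ms P99 budget feasible.
    keys = [f"feat:{name}:{entity_id}" for name in feature_names]
    values = r.mget(keys)  # list aligned with `keys`; missing keys -> None
    return {
        name: float(v) if v is not None else None
        for name, v in zip(feature_names, values)
    }
```

Missing values (entity never seen, TTL expired) surface as `None`, and the caller decides whether to substitute a default or reject the request.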

💡 Interview Tips

  • Start with the problem: training-serving skew
  • Discuss the dual-store architecture (online + offline)
  • Emphasize point-in-time correctness
  • Be prepared to discuss feature freshness requirements
  • Know the difference between batch and streaming features
  • Understand the role of feature stores in MLOps