Very Hard · ~85 min · Data Processing

Design AI/ML Model Inference Service

OpenAI · Google · Meta · Amazon · NVIDIA · Anthropic · Hugging Face

📝 Problem Description

Design a scalable model inference service that serves machine learning models (including large language models) with low latency, high throughput, and efficient resource utilization. Support model versioning, A/B testing, and auto-scaling based on demand.

👤 Use Cases

1. An application sends an inference request so that it receives a model prediction.
2. An ML engineer deploys a new model version so that the model is served with zero downtime.
3. The system detects high load so that it auto-scales model replicas.
4. A data scientist runs an A/B test so that traffic is split between model versions.

✅ Functional Requirements

  • Deploy and serve ML models via REST/gRPC APIs
  • Support multiple model versions simultaneously
  • A/B testing and canary deployments (a weighted-routing sketch follows this list)
  • Model warm-up and pre-loading
  • Request batching for throughput optimization
  • Streaming responses for LLMs
  • Model monitoring and alerting
  • Support for GPU and CPU inference
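
For the versioning and A/B testing requirements, here is a minimal sketch of weighted traffic routing between model versions. The `ModelVersion` type, the version names, and the split weights are illustrative assumptions, not part of the prompt:

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModelVersion:
    name: str      # hypothetical model/version id, e.g. "ranker:v12"
    weight: float  # fraction of traffic this version should receive

def pick_version(versions: List[ModelVersion],
                 rng: Optional[random.Random] = None) -> ModelVersion:
    """Weighted random routing for A/B tests and canary rollouts."""
    rng = rng or random
    r = rng.uniform(0, sum(v.weight for v in versions))
    upto = 0.0
    for v in versions:
        upto += v.weight
        if r <= upto:
            return v
    return versions[-1]  # guard against floating-point rounding

# Example: 95/5 canary split between two versions of the same model
split = [ModelVersion("ranker:v12", 0.95), ModelVersion("ranker:v13-canary", 0.05)]
print(pick_version(split).name)
```

Keeping the split logic in a routing layer (rather than in the model replicas) is what lets canary weights change without redeploying any model.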

⚡ Non-Functional Requirements

  • P99 latency < 100ms for standard models
  • Support 100K+ inference requests per second
  • Efficient GPU utilization (>80%)
  • Auto-scale based on queue depth and latency (a scaling sketch follows this list)
  • 99.9% availability
  • Support models up to 100B+ parameters
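
For the auto-scaling requirement, a sketch of a queue-depth-driven scaling decision. The target queue depth per replica and the min/max bounds are assumed values, not figures from the prompt:

```python
import math

# Assumptions: each replica can hold ~16 queued requests without breaking
# the P99 budget, and the fleet is bounded between 2 and 64 replicas.
TARGET_QUEUE_PER_REPLICA = 16
MIN_REPLICAS, MAX_REPLICAS = 2, 64

def desired_replicas(total_queue_depth: int) -> int:
    """Scale so each replica carries roughly TARGET_QUEUE_PER_REPLICA queued requests."""
    want = math.ceil(total_queue_depth / TARGET_QUEUE_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, want))

print(desired_replicas(400))  # -> 25 replicas
```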

⚠️ Constraints & Assumptions

  • P99 latency budget: < 100ms for non-streaming models; streaming first-token < 300ms
  • Hard request limits: max input tokens, max output tokens, max concurrent streams per model replica
  • Multi-tenant isolation: per-tenant rate limits, quotas, and noisy-neighbor protection
  • GPU constraints: fixed VRAM per instance; model + KV-cache must fit (or use paging/offload)
  • Batching constraints: dynamic batching must not violate tail latency SLOs (batch window capped)
  • Model artifacts are immutable once deployed; rollbacks use version pinning (no in-place mutation)
  • Backpressure required: if queue depth is high, shed load or return 429/503 with Retry-After (sketched after this list)
  • Observability required: request traces, token-level metrics, and per-model error budgets
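
A minimal admission-control sketch for the backpressure constraint; the queue limit and the Retry-After value are assumptions:

```python
import asyncio

MAX_QUEUE_DEPTH = 512   # assumption: sized so queued work stays within the latency SLO
RETRY_AFTER_SECONDS = 1

request_queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_DEPTH)

def admit(request) -> tuple[int, dict]:
    """Admission control: shed load with 429 + Retry-After instead of queueing without bound."""
    try:
        request_queue.put_nowait(request)
        return 202, {}  # accepted; a worker will pick it up
    except asyncio.QueueFull:
        return 429, {"Retry-After": str(RETRY_AFTER_SECONDS)}

print(admit({"input": "hello"}))  # -> (202, {})
```

Returning 429 with Retry-After lets well-behaved clients back off instead of pushing the queue past the point where the P99 budget can be met.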

📊 Capacity Estimation

  • 👥 Users: 10,000 ML applications; 1M model invocations/minute
  • 💾 Storage: Models: 10PB; Inference logs: 1TB/day
  • ⚡ QPS: Inference: 100K/sec; Model loads: 100/min
  • 🌐 Bandwidth: Requests: 10GB/sec; Model loading: 100GB/sec peak
  • 📐 Assumptions:
      • Average model size: 1GB (ranges from 100MB to 100GB+)
      • Average inference latency: 50ms
      • GPU memory: 80GB per A100
      • LLM tokens: 1,000 tokens/request average
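
Using the figures above, a back-of-envelope sizing check; the batch size and the assumption that one batch completes in a single 50ms pass are not given in the prompt:

```python
import math

qps = 100_000          # inference requests per second (from the estimate above)
latency_s = 0.050      # average inference latency (from the estimate above)
batch_size = 32        # assumed dynamic-batching size

in_flight = qps * latency_s            # Little's law: ~5,000 concurrent requests
per_gpu_rps = batch_size / latency_s   # ~640 req/sec if one batch fits one 50ms pass
gpus_needed = math.ceil(qps / per_gpu_rps)  # ~157 GPUs before redundancy/headroom

print(in_flight, per_gpu_rps, gpus_needed)  # 5000.0 640.0 157
```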

💡 Key Concepts

  • Dynamic Batching (CRITICAL): Accumulate incoming requests into batches to improve GPU efficiency (sketched after this list)
  • Model Serving (CRITICAL): Efficiently load and serve ML models on GPU/CPU
  • Auto-scaling (HIGH): Scale replicas based on queue depth and latency
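
A minimal dynamic-batching loop, assuming an asyncio-based server; the batch size and batch window are illustrative values that would be tuned against the P99 budget:

```python
import asyncio
import time

MAX_BATCH_SIZE = 32       # assumption: model- and GPU-dependent
MAX_BATCH_WAIT_S = 0.005  # cap the batching window so it cannot blow the P99 budget

async def batch_loop(queue: asyncio.Queue, run_batch):
    """Collect requests until the batch is full or the window expires, then run one forward pass."""
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = time.monotonic() + MAX_BATCH_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await run_batch(batch)  # single GPU forward pass over the whole batch
```

Capping the window, rather than always waiting for a full batch, is what keeps batching compatible with the tail-latency constraint listed earlier.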

💡 Interview Tips

  • 💡 Start with the key challenge: GPU efficiency vs. latency
  • 💡 Discuss dynamic batching in detail; it is the core optimization
  • 💡 Be prepared to discuss model parallelism strategies
  • 💡 Know the numbers: GPU memory, inference latency, throughput
  • 💡 Discuss the differences between serving traditional ML models vs. LLMs
  • 💡 Understand the KV cache and its memory implications for LLMs (a back-of-envelope estimate follows this list)
  • 💡 Be ready to discuss cost optimization strategies
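
For the KV-cache tip, a quick memory estimate at the 1,000-token average from the capacity section. The model shape (80 layers, 8 KV heads under grouped-query attention, head_dim 128, fp16) is an assumed 70B-class configuration, not something the prompt specifies:

```python
# KV-cache size per request for a hypothetical 70B-class decoder
# (80 layers, grouped-query attention with 8 KV heads, head_dim 128, fp16).
layers, kv_heads, head_dim = 80, 8, 128
seq_len, bytes_per_value = 1_000, 2  # 1,000 tokens, fp16
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value  # K and V
print(f"{kv_bytes / 2**20:.0f} MiB per request")  # ~313 MiB; a batch of 64 already needs ~20 GB
```

At these assumed numbers, the KV cache (not the weights) is what limits batch size and concurrent streams on an 80GB GPU, which is why paging or offloading appears in the constraints.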