Very Hard · ~85 min · Data Processing

Design AI/ML Model Inference Service

OpenAI · Google · Meta · Amazon · NVIDIA · Anthropic · Hugging Face

📝 Problem Description

Design a scalable model inference service that serves machine learning models (including large language models) with low latency, high throughput, and efficient resource utilization. Support model versioning, A/B testing, and auto-scaling based on demand.

👤 Use Cases

1. An application sends an inference request so that it receives a model prediction.
2. An ML engineer deploys a new model version so that the model is served with zero downtime.
3. The system detects high load so that it auto-scales model replicas.
4. A data scientist runs an A/B test so that traffic is split between model versions.

✅ Functional Requirements

  • Deploy and serve ML models via REST/gRPC APIs
  • Support multiple model versions simultaneously
  • A/B testing and canary deployments (a weighted-routing sketch follows this list)
  • Model warm-up and pre-loading
  • Request batching for throughput optimization
  • Streaming responses for LLMs
  • Model monitoring and alerting
  • Support for GPU and CPU inference
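
For the versioning and A/B testing requirements, here is a minimal sketch of weighted traffic routing between model versions. The `ModelVersion` type, the version names, and the split weights are illustrative assumptions, not part of the prompt:

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModelVersion:
    name: str      # hypothetical model/version id, e.g. "ranker:v12"
    weight: float  # fraction of traffic this version should receive

def pick_version(versions: List[ModelVersion],
                 rng: Optional[random.Random] = None) -> ModelVersion:
    """Weighted random routing for A/B tests and canary rollouts."""
    rng = rng or random
    r = rng.uniform(0, sum(v.weight for v in versions))
    upto = 0.0
    for v in versions:
        upto += v.weight
        if r <= upto:
            return v
    return versions[-1]  # guard against floating-point rounding

# Example: 95/5 canary split between two versions of the same model
split = [ModelVersion("ranker:v12", 0.95), ModelVersion("ranker:v13-canary", 0.05)]
print(pick_version(split).name)
```

Keeping the split logic in a routing layer (rather than in the model replicas) is what lets canary weights change without redeploying any model.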

⚡ Non-Functional Requirements

  • P99 latency < 100ms for standard models
  • Support 100K+ inference requests per second
  • Efficient GPU utilization (>80%)
  • Auto-scale based on queue depth and latency (a scaling sketch follows this list)
  • 99.9% availability
  • Support models up to 100B+ parameters
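
For the auto-scaling requirement, a sketch of a queue-depth-driven scaling decision. The target queue depth per replica and the min/max bounds are assumed values, not figures from the prompt:

```python
import math

# Assumptions: each replica can hold ~16 queued requests without breaking
# the P99 budget, and the fleet is bounded between 2 and 64 replicas.
TARGET_QUEUE_PER_REPLICA = 16
MIN_REPLICAS, MAX_REPLICAS = 2, 64

def desired_replicas(total_queue_depth: int) -> int:
    """Scale so each replica carries roughly TARGET_QUEUE_PER_REPLICA queued requests."""
    want = math.ceil(total_queue_depth / TARGET_QUEUE_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, want))

print(desired_replicas(400))  # -> 25 replicas
```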

⚠️ Constraints & Assumptions

  • P99 latency budget: < 100ms for non-streaming models; streaming first-token < 300ms
  • Hard request limits: max input tokens, max output tokens, max concurrent streams per model replica
  • Multi-tenant isolation: per-tenant rate limits, quotas, and noisy-neighbor protection
  • GPU constraints: fixed VRAM per instance; model + KV-cache must fit (or use paging/offload)
  • Batching constraints: dynamic batching must not violate tail latency SLOs (batch window capped)
  • Model artifacts are immutable once deployed; rollbacks use version pinning (no in-place mutation)
  • Backpressure required: if queue depth is high, shed load or return 429/503 with Retry-After (sketched after this list)
  • Observability required: request traces, token-level metrics, and per-model error budgets
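
A minimal admission-control sketch for the backpressure constraint; the queue limit and the Retry-After value are assumptions:

```python
import asyncio

MAX_QUEUE_DEPTH = 512   # assumption: sized so queued work stays within the latency SLO
RETRY_AFTER_SECONDS = 1

request_queue: asyncio.Queue = asyncio.Queue(maxsize=MAX_QUEUE_DEPTH)

def admit(request) -> tuple[int, dict]:
    """Admission control: shed load with 429 + Retry-After instead of queueing without bound."""
    try:
        request_queue.put_nowait(request)
        return 202, {}  # accepted; a worker will pick it up
    except asyncio.QueueFull:
        return 429, {"Retry-After": str(RETRY_AFTER_SECONDS)}

print(admit({"input": "hello"}))  # -> (202, {})
```

Returning 429 with Retry-After lets well-behaved clients back off instead of pushing the queue past the point where the P99 budget can be met.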

📊 Capacity Estimation

  • 👥 Users: 10,000 ML applications; 1M model invocations/minute
  • 💾 Storage: Models: 10PB; Inference logs: 1TB/day
  • ⚡ QPS: Inference: 100K/sec; Model loads: 100/min
  • 🌐 Bandwidth: Requests: 10GB/sec; Model loading: 100GB/sec peak
  • 📐 Assumptions:
      • Average model size: 1GB (ranges from 100MB to 100GB+)
      • Average inference latency: 50ms
      • GPU memory: 80GB per A100
      • LLM tokens: 1,000 tokens/request average
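
Using the figures above, a back-of-envelope sizing check; the batch size and the assumption that one batch completes in a single 50ms pass are not given in the prompt:

```python
import math

qps = 100_000          # inference requests per second (from the estimate above)
latency_s = 0.050      # average inference latency (from the estimate above)
batch_size = 32        # assumed dynamic-batching size

in_flight = qps * latency_s            # Little's law: ~5,000 concurrent requests
per_gpu_rps = batch_size / latency_s   # ~640 req/sec if one batch fits one 50ms pass
gpus_needed = math.ceil(qps / per_gpu_rps)  # ~157 GPUs before redundancy/headroom

print(in_flight, per_gpu_rps, gpus_needed)  # 5000.0 640.0 157
```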

💡 Key Concepts

  • Dynamic Batching (CRITICAL): Accumulate incoming requests into batches to improve GPU efficiency (sketched after this list)
  • Model Serving (CRITICAL): Efficiently load and serve ML models on GPU/CPU
  • Auto-scaling (HIGH): Scale replicas based on queue depth and latency
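
A minimal dynamic-batching loop, assuming an asyncio-based server; the batch size and batch window are illustrative values that would be tuned against the P99 budget:

```python
import asyncio
import time

MAX_BATCH_SIZE = 32       # assumption: model- and GPU-dependent
MAX_BATCH_WAIT_S = 0.005  # cap the batching window so it cannot blow the P99 budget

async def batch_loop(queue: asyncio.Queue, run_batch):
    """Collect requests until the batch is full or the window expires, then run one forward pass."""
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        deadline = time.monotonic() + MAX_BATCH_WAIT_S
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await run_batch(batch)  # single GPU forward pass over the whole batch
```

Capping the window, rather than always waiting for a full batch, is what keeps batching compatible with the tail-latency constraint listed earlier.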

💡 Interview Tips

  • 💡 Start with the key challenge: GPU efficiency vs. latency
  • 💡 Discuss dynamic batching in detail; it is the core optimization
  • 💡 Be prepared to discuss model parallelism strategies
  • 💡 Know the numbers: GPU memory, inference latency, throughput
  • 💡 Discuss the differences between serving traditional ML models vs. LLMs
  • 💡 Understand the KV cache and its memory implications for LLMs (a back-of-envelope estimate follows this list)
  • 💡 Be ready to discuss cost optimization strategies
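
For the KV-cache tip, a quick memory estimate at the 1,000-token average from the capacity section. The model shape (80 layers, 8 KV heads under grouped-query attention, head_dim 128, fp16) is an assumed 70B-class configuration, not something the prompt specifies:

```python
# KV-cache size per request for a hypothetical 70B-class decoder
# (80 layers, grouped-query attention with 8 KV heads, head_dim 128, fp16).
layers, kv_heads, head_dim = 80, 8, 128
seq_len, bytes_per_value = 1_000, 2  # 1,000 tokens, fp16
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value  # K and V
print(f"{kv_bytes / 2**20:.0f} MiB per request")  # ~313 MiB; a batch of 64 already needs ~20 GB
```

At these assumed numbers, the KV cache (not the weights) is what limits batch size and concurrent streams on an 80GB GPU, which is why paging or offloading appears in the constraints.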