Very Hard · ~55 min · Infrastructure

Design a Distributed Tracing System

Datadog · New Relic · Uber · Google · AWS

📝 Problem Description

Design a distributed tracing system like Jaeger or Zipkin. Track requests across microservices, visualize service dependencies, identify latency bottlenecks, and enable debugging of distributed systems.

👤 Use Cases

1. A developer debugs a slow request by viewing its trace with all service calls.
2. A service making a downstream call propagates the trace context to the callee.
3. The system aggregates spans to build a complete trace tree.
4. An operator views the service map to see dependencies and latencies.

✅ Functional Requirements

  • Trace requests across services
  • Propagate trace context
  • Collect and aggregate spans
  • Search traces by service, operation, tags
  • Visualize trace timeline
  • Generate service dependency graph
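
The last requirement, the service dependency graph, can be derived from the collected spans: a child span whose parent ran in a different service implies a caller-to-callee edge. The following is a minimal sketch under that assumption; the span field names (`span_id`, `parent_id`, `service`) and the function name are illustrative, not taken from any particular tracer.

```python
from collections import Counter

def dependency_edges(spans):
    """Count caller -> callee edges from a flat list of span dicts.

    Each span is assumed to carry 'span_id', 'parent_id' (None for the
    root) and 'service'; these field names are hypothetical.
    """
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = Counter()
    for s in spans:
        parent_service = service_of.get(s["parent_id"])
        # Skip roots, orphans, and calls that stay inside one service.
        if parent_service is not None and parent_service != s["service"]:
            edges[(parent_service, s["service"])] += 1
    return edges

spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "service": "orders"},
    {"span_id": "c", "parent_id": "b", "service": "orders"},
    {"span_id": "d", "parent_id": "b", "service": "payments"},
    {"span_id": "e", "parent_id": "a", "service": "orders"},
]
edges = dependency_edges(spans)
for (caller, callee), calls in sorted(edges.items()):
    print(f"{caller} -> {callee}: {calls} call(s)")
```

In a real deployment this aggregation would likely run as a periodic batch or streaming job over stored spans rather than in memory, and with sampling enabled the edge counts are estimates rather than exact call volumes.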

⚡ Non-Functional Requirements

  • Handle 100K spans/sec
  • Trace retention 7 days
  • Query latency < 2 seconds
  • Minimal performance overhead (< 5%)

⚠️ Constraints & Assumptions

  • Must work with heterogeneous services
  • Trace context must propagate across network
  • Sampling required at high volume

📊 Capacity Estimation

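👥 Users: 1000 services, 1M requests/sec
💾 Storage: 10TB (7-day retention; the raw volume at 100K spans/sec × 1KB is closer to 60TB, so this figure assumes compression or smaller stored spans)
⚡ QPS: 100K spans/sec ingested, 100 queries/sec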
📐 Assumptions
  • 1000 microservices in production
  • 1M requests per second total
  • Average 10 spans per trace
  • 1% sampling rate for normal traffic
  • 100% sampling for errors
  • Average span size: 1KB
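
The assumptions above can be checked with a few lines of arithmetic. This sketch only multiplies the listed figures; it ignores the extra spans from 100% error sampling because no error rate is given, and it treats 1KB as 1,000 bytes. The constant and function names are illustrative.

```python
REQUESTS_PER_SEC = 1_000_000
SPANS_PER_TRACE = 10
SAMPLING_RATE = 0.01          # 1% of normal traffic
SPAN_SIZE_BYTES = 1_000       # 1KB, decimal units (assumption)
RETENTION_DAYS = 7
SECONDS_PER_DAY = 86_400

def spans_per_sec():
    """Sampled spans arriving at the collectors each second."""
    return REQUESTS_PER_SEC * SPANS_PER_TRACE * SAMPLING_RATE

def raw_storage_tb():
    """Uncompressed span bytes held for the full retention window, in TB."""
    bytes_per_day = spans_per_sec() * SPAN_SIZE_BYTES * SECONDS_PER_DAY
    return bytes_per_day * RETENTION_DAYS / 1e12

print(f"ingest: {spans_per_sec():,.0f} spans/sec")
print(f"raw storage: {raw_storage_tb():.2f} TB over {RETENTION_DAYS} days")
```

The ingest rate matches the stated 100K spans/sec. The raw storage result is before any compression, indexing overhead, or replication, all of which would need to be stated explicitly in an interview answer.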

💡 Key Concepts

[CRITICAL] Trace Context Propagation
Pass trace_id and span_id in HTTP headers (the W3C Trace Context `traceparent` header).
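
As a rough illustration of propagation, the sketch below parses an incoming W3C `traceparent` header (`version-traceid-parentid-flags`) and builds the header for a downstream call. It handles only version `00`, ignores `tracestate`, and marks new traces as sampled for simplicity; the function names are hypothetical.

```python
import re
import secrets

# version 00, 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled) or None if malformed."""
    m = _TRACEPARENT.match(header.strip())
    if not m:
        return None
    trace_id, parent_id, flags = m.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid in the W3C format
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)

def outgoing_headers(incoming_headers):
    """Build headers for a downstream call, continuing or starting a trace."""
    ctx = parse_traceparent(incoming_headers.get("traceparent", ""))
    if ctx is None:
        # No usable context: start a new trace (sampled here for simplicity).
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = ctx
    span_id = secrets.token_hex(8)  # this service's span is the callee's parent
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}

incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
out = outgoing_headers(incoming)
print(out["traceparent"])
```

The trace ID stays constant across the whole request, while the span ID field is replaced at every hop so that each callee records the caller's span as its parent.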
[CRITICAL] Span
Unit of work with name, timestamps, parent, and tags.
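
One possible in-memory shape for a span, assuming nanosecond timestamps and string tags. The class and field names are illustrative; real tracers such as OpenTelemetry define richer models with events, links, and status.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    parent_id: Optional[str] = None   # None marks the root span
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None      # set when the unit of work completes
    tags: Dict[str, str] = field(default_factory=dict)

    def finish(self, end_ns=None):
        """Record the end time; an explicit value helps with testing."""
        self.end_ns = time.time_ns() if end_ns is None else end_ns

    @property
    def duration_ms(self):
        """Elapsed time in milliseconds, or None while still open."""
        if self.end_ns is None:
            return None
        return (self.end_ns - self.start_ns) / 1e6

span = Span("t1", "s1", "GET /orders", start_ns=1_000_000_000,
            tags={"http.method": "GET"})
span.finish(end_ns=1_250_000_000)
print(span.name, span.duration_ms, "ms")
```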
[HIGH] Sampling
Collect only a subset of traces to reduce overhead.
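
A minimal sketch of a deterministic, trace-ID-based sampler, assuming trace IDs are random hex strings. Because the decision depends only on the trace ID, every service reaches the same verdict without coordination. The `has_error` override mirrors the 100% error sampling assumption; in practice errors are often known only after the request finishes, which usually calls for tail-based sampling at the collector. The function name and parameters are hypothetical.

```python
def should_sample(trace_id, rate=0.01, has_error=False):
    """Decide whether to keep a trace.

    trace_id: hex string whose low 64 bits are assumed uniformly random.
    rate: fraction of normal traces to keep (0.01 = 1%).
    has_error: keep the trace regardless of rate when True.
    """
    if has_error:
        return True
    # Map the low 64 bits of the trace ID onto [0, 1).
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < rate

print(should_sample("4bf92f3577b34da6a3ce929d0e0e4736"))
print(should_sample("4bf92f3577b34da6a3ce929d0e0e4736", has_error=True))
```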
[HIGH] Trace Tree
Hierarchical structure of spans forming the complete request path.
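
To make the trace tree concrete, the sketch below assembles spans that arrive out of order into a parent-child structure and renders an indented timeline. The dict field names and millisecond timestamps are assumptions for illustration; spans whose parent never arrived are treated as extra roots rather than dropped.

```python
from collections import defaultdict

def build_trace_tree(spans):
    """Group the spans of one trace by parent; return (roots, children)."""
    ids = {s["span_id"] for s in spans}
    children = defaultdict(list)
    roots = []
    for s in sorted(spans, key=lambda s: s["start"]):
        parent = s.get("parent_id")
        if parent is None or parent not in ids:
            roots.append(s)  # true root, or orphan whose parent is missing
        else:
            children[parent].append(s)
    return roots, children

def render(nodes, children, depth=0):
    """Return indented 'name (duration)' lines in start-time order."""
    lines = []
    for s in nodes:
        lines.append("  " * depth + f'{s["name"]} ({s["end"] - s["start"]} ms)')
        lines.extend(render(children[s["span_id"]], children, depth + 1))
    return lines

spans = [
    {"span_id": "c", "parent_id": "b", "name": "db.query", "start": 20, "end": 70},
    {"span_id": "a", "parent_id": None, "name": "GET /checkout", "start": 0, "end": 120},
    {"span_id": "b", "parent_id": "a", "name": "orders.create", "start": 10, "end": 90},
    {"span_id": "d", "parent_id": "a", "name": "payments.charge", "start": 92, "end": 118},
]
roots, children = build_trace_tree(spans)
timeline = render(roots, children)
print("\n".join(timeline))
```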

💡 Interview Tips

  • Start with the core concepts: trace, span, context propagation
  • Discuss sampling strategies, since they are critical for scale
  • Emphasize the three pillars of observability: traces, metrics, logs
  • Be prepared to discuss OpenTelemetry and its architecture
  • Know the tradeoffs between sampling and completeness
  • Understand how context propagates across service boundaries