Design a Distributed Tracing System - System Design Interview

📝 Problem Description

Design a distributed tracing system like Jaeger or Zipkin. Track requests across microservices, visualize service dependencies, identify latency bottlenecks, and enable debugging of distributed systems.

👤 Use Cases

A developer debugs a slow request and sees a trace with every service call involved.

A service makes a downstream call and the trace context is propagated with it.

The system aggregates spans and builds the complete trace tree.

An operator views the service map and sees dependencies and latencies.

✅ Functional Requirements

•Trace requests across services
•Propagate trace context
•Collect and aggregate spans
•Search traces by service, operation, tags
•Visualize trace timeline
•Generate service dependency graph

⚡ Non-Functional Requirements

•Handle 100K spans/sec
•Trace retention 7 days
•Query latency < 2 seconds
•Minimal performance overhead on instrumented services (< 5%)

⚠️ Constraints & Assumptions

•Must work with heterogeneous services
•Trace context must propagate across network calls between services
•Sampling required at high volume

📊 Capacity Estimation

👥 Users

1000 services, 1M requests/sec

💾 Storage

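10TB (7-day retention; at 100K spans/sec × 1KB the raw volume is about 60TB, so this figure presumably assumes compression)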

⚡ QPS

Spans: 100K/sec (after 1% sampling), Queries: 100/sec

📐 Assumptions

• 1000 microservices in production
• 1M requests per second total
• Average 10 spans per trace
• 1% sampling rate for normal traffic
• 100% sampling for errors
• Average span size: 1KB
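
The arithmetic behind these estimates can be checked with a short script. This is a rough sketch that uses only the assumptions listed above; it counts raw span bytes and ignores compression, indexes, replication, and the extra volume from 100% error sampling, none of which the estimate specifies.

```python
# Back-of-the-envelope capacity check using the assumptions listed above.
REQUESTS_PER_SEC = 1_000_000
SPANS_PER_TRACE = 10
SAMPLING_RATE = 0.01        # 1% of normal traffic
SPAN_SIZE_BYTES = 1_000     # 1KB, decimal units for simplicity
RETENTION_DAYS = 7
SECONDS_PER_DAY = 86_400


def sampled_spans_per_sec() -> float:
    """Spans that reach the collectors after sampling."""
    return REQUESTS_PER_SEC * SPANS_PER_TRACE * SAMPLING_RATE


def raw_storage_tb() -> float:
    """Uncompressed span bytes kept over the retention window, in TB."""
    bytes_per_sec = sampled_spans_per_sec() * SPAN_SIZE_BYTES
    return bytes_per_sec * SECONDS_PER_DAY * RETENTION_DAYS / 1e12


if __name__ == "__main__":
    print(f"{sampled_spans_per_sec():,.0f} spans/sec")
    print(f"{raw_storage_tb():.1f} TB raw over {RETENTION_DAYS} days")
```

With these inputs the ingest rate comes out at 100,000 spans/sec and the raw volume at roughly 60 TB over 7 days, so a smaller storage budget only holds if spans are compressed or retained for less time.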

💡 Key Concepts

CRITICAL

Trace Context Propagation

Pass trace_id and span_id to downstream services in HTTP headers (for example, the W3C Trace Context `traceparent` header).
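
A minimal sketch of what propagation looks like with the W3C `traceparent` header, whose version 00 layout is `00-<trace_id>-<parent span_id>-<flags>`. The helper names (`TraceContext`, `parse_traceparent`, `outgoing_headers`) are hypothetical, only version 00 is handled, and marking a brand-new trace as sampled is an assumption made to keep the example short.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str   # 32 lowercase hex characters
    span_id: str    # 16 lowercase hex characters
    sampled: bool   # lowest bit of the trace-flags byte


_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build headers for a downstream call: same trace_id, fresh span_id."""
    parent = parse_traceparent(incoming.get("traceparent", ""))
    if parent is None:
        # No valid context arrived, so this service starts a new trace.
        # Assumption: new traces are marked sampled.
        ctx = TraceContext(secrets.token_hex(16), secrets.token_hex(8), True)
    else:
        ctx = parent._replace(span_id=secrets.token_hex(8))
    return {"traceparent": format_traceparent(ctx)}
```

The point to take away is that the trace_id stays constant across every hop while each service substitutes its own span_id, which becomes the parent of the next service's span.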

CRITICAL

Span

A unit of work with a name, start and end timestamps, a reference to its parent span, and tags.
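
One way to picture a span is as a small record. The sketch below is illustrative only: the field names and the `start_span` helper are assumptions, not the data model of Jaeger, Zipkin, or OpenTelemetry.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    trace_id: str                       # shared by every span in the trace
    span_id: str                        # unique to this unit of work
    name: str                           # operation name, e.g. "GET /checkout"
    parent_span_id: Optional[str] = None  # None marks the root span
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None        # set when the work completes
    tags: Dict[str, str] = field(default_factory=dict)

    def finish(self) -> None:
        self.end_ns = time.time_ns()

    @property
    def duration_ns(self) -> Optional[int]:
        """Elapsed time, or None while the span is still open."""
        if self.end_ns is None:
            return None
        return self.end_ns - self.start_ns


def start_span(name: str, parent: Optional[Span] = None, **tags: object) -> Span:
    """Start a root span, or a child that inherits the parent's trace_id."""
    return Span(
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        name=name,
        parent_span_id=parent.span_id if parent else None,
        tags={key: str(value) for key, value in tags.items()},
    )


if __name__ == "__main__":
    root = start_span("GET /checkout", service="api")
    child = start_span("SELECT orders", parent=root, service="orders-db")
    child.finish()
    root.finish()
    print(child.parent_span_id == root.span_id)
```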

HIGH

Sampling

Collect only a subset of traces to reduce overhead.
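
A common way to sample consistently is to derive the decision from the trace_id itself, so every service reaches the same verdict without coordination. The sketch below assumes trace IDs are uniformly random hex strings and uses a hypothetical `keep_trace` function. Keeping every error trace means the error must already be known when the decision is made, which in practice tends to push that part of the decision to the collector after spans have arrived.

```python
SAMPLE_RATE = 0.01  # 1% of normal traffic


def keep_trace(trace_id: str, has_error: bool = False,
               rate: float = SAMPLE_RATE) -> bool:
    """Decide whether to keep a trace.

    Error traces are always kept. Otherwise the low 64 bits of the
    trace_id are compared against a threshold, so the same trace_id
    always yields the same decision on every service.
    """
    if has_error:
        return True
    bucket = int(trace_id[-16:], 16)   # value in [0, 2**64)
    return bucket < rate * 2**64


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    ids = [f"{rng.getrandbits(128):032x}" for _ in range(100_000)]
    kept = sum(keep_trace(trace_id) for trace_id in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

The tradeoff is completeness: at a 1% rate, rare slow requests that do not raise errors are mostly discarded before anyone can inspect them.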

HIGH

Trace Tree

A hierarchical structure of spans that forms the complete request path.
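
Spans usually arrive at the collector as a flat, unordered list, and the tree is rebuilt by following parent references. The sketch below shows one way to do that; the `SpanRecord` fields and function names are assumptions, and spans whose parent is missing (for example, lost or not sampled) are treated as extra roots rather than dropped.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple, TypedDict


class SpanRecord(TypedDict):
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: int
    duration_ms: int


def build_trace_tree(
    spans: List[SpanRecord],
) -> Tuple[List[SpanRecord], Dict[str, List[SpanRecord]]]:
    """Group spans by parent; return (roots, children keyed by span_id)."""
    known_ids = {span["span_id"] for span in spans}
    children: Dict[str, List[SpanRecord]] = defaultdict(list)
    roots: List[SpanRecord] = []
    # Sorting by start time keeps siblings in chronological order.
    for span in sorted(spans, key=lambda s: s["start_ms"]):
        parent = span["parent_span_id"]
        if parent is None or parent not in known_ids:
            roots.append(span)   # true root, or orphan with a missing parent
        else:
            children[parent].append(span)
    return roots, children


def render(spans: List[SpanRecord]) -> List[str]:
    """Return an indented text timeline, one line per span."""
    roots, children = build_trace_tree(spans)
    lines: List[str] = []

    def walk(span: SpanRecord, depth: int) -> None:
        lines.append(f"{'  ' * depth}{span['name']} ({span['duration_ms']}ms)")
        for child in children.get(span["span_id"], []):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


if __name__ == "__main__":
    example: List[SpanRecord] = [
        {"span_id": "c", "parent_span_id": "b", "name": "db.query",
         "start_ms": 15, "duration_ms": 20},
        {"span_id": "a", "parent_span_id": None, "name": "GET /checkout",
         "start_ms": 0, "duration_ms": 80},
        {"span_id": "b", "parent_span_id": "a", "name": "orders.get",
         "start_ms": 10, "duration_ms": 40},
    ]
    print("\n".join(render(example)))
```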

💡 Interview Tips

💡Start with the core concepts: trace, span, context propagation
💡Discuss sampling strategies - this is critical for scale
💡Emphasize the three pillars of observability: traces, metrics, and logs
💡Be prepared to discuss OpenTelemetry and its architecture
💡Know the tradeoffs between sampling and completeness
💡Understand how context propagates across service boundaries