📝 Problem Description
Design a web crawler that systematically browses the web to index pages, as search engines such as Google do. It should handle billions of pages, respect robots.txt, avoid duplicate content, and prioritize important pages.
👤 Use Cases
1. A crawler discovers a new URL and adds it to the frontier queue.
2. A crawler fetches a page and extracts its links and content.
3. The search engine requests fresh content, so the crawler re-crawls pages.
4. A website updates its robots.txt, and the crawler respects the new rules.
✅ Functional Requirements
- Crawl web pages starting from seed URLs
- Extract and follow links from pages
- Parse and store page content
- Respect robots.txt rules
- Handle duplicate URLs and content
- Prioritize crawling (importance, freshness)
- Support incremental re-crawling
⚡ Non-Functional Requirements
- Crawl 1 billion pages per day
- Politeness: max 1 request/second per domain
- Distributed across thousands of machines
- Fault tolerant: no lost work
- Scalable to the size of the web
⚠️ Constraints & Assumptions
- Web is constantly changing
- Many pages are duplicates or near-duplicates
- Must handle all content types (HTML, JS-rendered)
- Rate limiting and IP blocking by websites
📊 Capacity Estimation
👥 Users
N/A (internal system)
💾 Storage
1 PB for content, 100 TB for the URL frontier
⚡ QPS
~12,000 page fetches per second
🌐 Bandwidth
100 Gbps total crawl bandwidth
📐 Assumptions
- 1 billion pages per day
- Average page size: 500 KB
- Average 50 links per page
- 1,000 crawler machines
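A quick sanity check of the estimates above; every input is an assumption from this sheet, and the 100 Gbps figure is read as a budget with headroom over the sustained rate.

```python
# Back-of-envelope arithmetic behind the capacity estimates (inputs are the
# assumptions listed above, not measured values).
PAGES_PER_DAY = 1_000_000_000
AVG_PAGE_BYTES = 500 * 1024      # 500 KB average page
SECONDS_PER_DAY = 86_400
MACHINES = 1_000

qps = PAGES_PER_DAY / SECONDS_PER_DAY                    # ~11,600 fetches/sec (~12,000 rounded)
sustained_gbps = qps * AVG_PAGE_BYTES * 8 / 1e9          # ~47 Gbps sustained; 100 Gbps leaves headroom
raw_tb_per_day = PAGES_PER_DAY * AVG_PAGE_BYTES / 1e12   # ~512 TB/day raw, before dedup/compression
per_machine_qps = qps / MACHINES                         # ~12 fetches/sec per machine

print(f"{qps:,.0f} fetches/s, {sustained_gbps:.0f} Gbps, "
      f"{raw_tb_per_day:.0f} TB/day raw, {per_machine_qps:.1f} fetches/s/machine")
```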
💡 Key Concepts
CRITICAL: URL Frontier
Two-level queue: priority queues feed into per-domain politeness queues.
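A minimal sketch of that two-level structure, assuming a single in-memory process; class and method names are illustrative, and a real frontier would be sharded and persisted.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit

class URLFrontier:
    """Front priority queue feeds per-domain back queues, so prioritization
    and per-domain politeness are handled by separate stages."""

    def __init__(self):
        self.front = []                     # min-heap of (priority, url); lower = more important
        self.back = defaultdict(deque)      # one FIFO back queue per domain

    def add(self, url: str, priority: int) -> None:
        heapq.heappush(self.front, (priority, url))

    def route(self) -> None:
        # Drain the priority heap into each URL's per-domain back queue.
        while self.front:
            _, url = heapq.heappop(self.front)
            self.back[urlsplit(url).netloc].append(url)

    def next_for_domain(self, domain: str) -> str | None:
        # Workers pull from one domain at a time; politeness timing
        # (next concept) decides which domains are currently eligible.
        queue = self.back.get(domain)
        return queue.popleft() if queue else None
```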
CRITICAL: Politeness
Rate limit requests to each domain. Wait between requests. Respect Crawl-delay.
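A sketch of a per-domain gate enforcing the 1 request/second default from the non-functional requirements, with a robots.txt Crawl-delay overriding it when known; the names here are illustrative.

```python
import time
from collections import defaultdict

class PolitenessGate:
    """Tracks the earliest time each domain may be fetched again."""

    def __init__(self, default_delay: float = 1.0):
        self.default_delay = default_delay      # 1 req/s per domain by default
        self.crawl_delay = {}                   # domain -> Crawl-delay from robots.txt, if any
        self.next_allowed = defaultdict(float)  # domain -> monotonic timestamp of next allowed fetch

    def may_fetch(self, domain: str) -> bool:
        return time.monotonic() >= self.next_allowed[domain]

    def record_fetch(self, domain: str) -> None:
        # Schedule the next allowed fetch after the applicable delay.
        delay = self.crawl_delay.get(domain, self.default_delay)
        self.next_allowed[domain] = time.monotonic() + delay
```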
HIGH: robots.txt
Standard for websites to specify crawl rules. Cache per domain.
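One way to do the per-domain caching with Python's standard urllib.robotparser; the user agent string is a placeholder and cache expiry is omitted to keep the sketch short.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Fetches robots.txt once per domain and answers allow/deny and Crawl-delay queries."""

    def __init__(self, user_agent: str = "MyCrawlerBot"):   # placeholder user agent
        self.user_agent = user_agent
        self.parsers = {}                                    # domain -> parsed robots.txt

    def _parser_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        if parts.netloc not in self.parsers:
            rp = RobotFileParser()
            rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            rp.read()                                        # one network fetch per domain
            self.parsers[parts.netloc] = rp
        return self.parsers[parts.netloc]

    def allowed(self, url: str) -> bool:
        return self._parser_for(url).can_fetch(self.user_agent, url)

    def crawl_delay(self, url: str) -> int | None:
        return self._parser_for(url).crawl_delay(self.user_agent)
```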
HIGH: Content Fingerprinting
SimHash or MinHash to detect near-duplicate content.
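A toy SimHash over whitespace tokens to illustrate the idea; production crawlers use tuned tokenization, shingling, and term weighting, so treat this purely as a sketch.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash: near-duplicate documents get fingerprints
    that differ in only a few bit positions."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(vector):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Pages whose fingerprints are within a small Hamming distance (commonly ~3 bits
# for 64-bit SimHash) are treated as near-duplicates and skipped.
```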
MEDIUM: URL Normalization
Canonicalize URLs: lowercase, remove fragments, sort params.
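A sketch of those canonicalization rules with urllib.parse; exactly which rules to apply (for example stripping tracking parameters) is a policy decision, so only the ones named above are shown.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Canonicalize a URL so syntactic variants collapse to one frontier entry."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()                      # lowercase host
    if (parts.scheme == "http" and netloc.endswith(":80")) or \
       (parts.scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]              # drop default ports
    query = urlencode(sorted(parse_qsl(parts.query)))  # sort query parameters
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, query, ""))  # "" drops the fragment

# normalize_url("HTTP://Example.com:80/a?b=2&a=1#frag") -> "http://example.com/a?a=1&b=2"
```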
💡 Interview Tips
- Politeness is critical: don't overwhelm sites
- URL deduplication is a major challenge
- Discuss how to handle the long tail of the web
- Cover failure handling and checkpointing