📝 Problem Description
Design a web crawler that systematically browses the web to index pages, as search engines such as Google do. It should handle billions of pages, respect robots.txt, avoid duplicate content, and prioritize important pages.
👤 Use Cases
1. A crawler discovers a new URL and adds it to the frontier queue.
2. A crawler fetches a page and extracts its links and content.
3. The search engine requests fresh content, so the crawler re-crawls pages.
4. A website updates its robots.txt, and the crawler respects the new rules.
✅ Functional Requirements
- Crawl web pages starting from seed URLs
- Extract and follow links from pages
- Parse and store page content
- Respect robots.txt rules
- Handle duplicate URLs and content
- Prioritize crawling (importance, freshness)
- Support incremental re-crawling
⚡ Non-Functional Requirements
- Crawl 1 billion pages per day
- Politeness: max 1 request/second per domain
- Distributed across thousands of machines
- Fault tolerant: no lost work
- Scalable to the size of the web
⚠️ Constraints & Assumptions
- Web is constantly changing
- Many pages are duplicates or near-duplicates
- Must handle all content types (HTML, JS-rendered)
- Rate limiting and IP blocking by websites
📊 Capacity Estimation
👥 Users
N/A (internal system)
💾 Storage
1 PB for content, 100 TB for the URL frontier
⚡ QPS
~12,000 page fetches per second
🌐 Bandwidth
100 Gbps total crawl bandwidth
📐 Assumptions
- 1 billion pages per day
- Average page size: 500 KB
- Average 50 links per page
- 1,000 crawler machines
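A quick sanity check of the estimates above; every input is an assumption from this sheet, and the 100 Gbps figure is read as a budget with headroom over the sustained rate.

```python
# Back-of-envelope arithmetic behind the capacity estimates (inputs are the
# assumptions listed above, not measured values).
PAGES_PER_DAY = 1_000_000_000
AVG_PAGE_BYTES = 500 * 1024      # 500 KB average page
SECONDS_PER_DAY = 86_400
MACHINES = 1_000

qps = PAGES_PER_DAY / SECONDS_PER_DAY                    # ~11,600 fetches/sec (~12,000 rounded)
sustained_gbps = qps * AVG_PAGE_BYTES * 8 / 1e9          # ~47 Gbps sustained; 100 Gbps leaves headroom
raw_tb_per_day = PAGES_PER_DAY * AVG_PAGE_BYTES / 1e12   # ~512 TB/day raw, before dedup/compression
per_machine_qps = qps / MACHINES                         # ~12 fetches/sec per machine

print(f"{qps:,.0f} fetches/s, {sustained_gbps:.0f} Gbps, "
      f"{raw_tb_per_day:.0f} TB/day raw, {per_machine_qps:.1f} fetches/s/machine")
```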
💡 Key Concepts
CRITICAL: URL Frontier
Two-level queue: priority queues feed into per-domain politeness queues.
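A minimal sketch of that two-level structure, assuming a single in-memory process; class and method names are illustrative, and a real frontier would be sharded and persisted.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit

class URLFrontier:
    """Front priority queue feeds per-domain back queues, so prioritization
    and per-domain politeness are handled by separate stages."""

    def __init__(self):
        self.front = []                     # min-heap of (priority, url); lower = more important
        self.back = defaultdict(deque)      # one FIFO back queue per domain

    def add(self, url: str, priority: int) -> None:
        heapq.heappush(self.front, (priority, url))

    def route(self) -> None:
        # Drain the priority heap into each URL's per-domain back queue.
        while self.front:
            _, url = heapq.heappop(self.front)
            self.back[urlsplit(url).netloc].append(url)

    def next_for_domain(self, domain: str) -> str | None:
        # Workers pull from one domain at a time; politeness timing
        # (next concept) decides which domains are currently eligible.
        queue = self.back.get(domain)
        return queue.popleft() if queue else None
```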
CRITICAL: Politeness
Rate limit requests to each domain. Wait between requests. Respect Crawl-delay.
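A sketch of a per-domain gate enforcing the 1 request/second default from the non-functional requirements, with a robots.txt Crawl-delay overriding it when known; the names here are illustrative.

```python
import time
from collections import defaultdict

class PolitenessGate:
    """Tracks the earliest time each domain may be fetched again."""

    def __init__(self, default_delay: float = 1.0):
        self.default_delay = default_delay      # 1 req/s per domain by default
        self.crawl_delay = {}                   # domain -> Crawl-delay from robots.txt, if any
        self.next_allowed = defaultdict(float)  # domain -> monotonic timestamp of next allowed fetch

    def may_fetch(self, domain: str) -> bool:
        return time.monotonic() >= self.next_allowed[domain]

    def record_fetch(self, domain: str) -> None:
        # Schedule the next allowed fetch after the applicable delay.
        delay = self.crawl_delay.get(domain, self.default_delay)
        self.next_allowed[domain] = time.monotonic() + delay
```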
HIGH: robots.txt
Standard for websites to specify crawl rules. Cache per domain.
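One way to do the per-domain caching with Python's standard urllib.robotparser; the user agent string is a placeholder and cache expiry is omitted to keep the sketch short.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Fetches robots.txt once per domain and answers allow/deny and Crawl-delay queries."""

    def __init__(self, user_agent: str = "MyCrawlerBot"):   # placeholder user agent
        self.user_agent = user_agent
        self.parsers = {}                                    # domain -> parsed robots.txt

    def _parser_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        if parts.netloc not in self.parsers:
            rp = RobotFileParser()
            rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            rp.read()                                        # one network fetch per domain
            self.parsers[parts.netloc] = rp
        return self.parsers[parts.netloc]

    def allowed(self, url: str) -> bool:
        return self._parser_for(url).can_fetch(self.user_agent, url)

    def crawl_delay(self, url: str) -> int | None:
        return self._parser_for(url).crawl_delay(self.user_agent)
```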
HIGH: Content Fingerprinting
SimHash or MinHash to detect near-duplicate content.
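A toy SimHash over whitespace tokens to illustrate the idea; production crawlers use tuned tokenization, shingling, and term weighting, so treat this purely as a sketch.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash: near-duplicate documents get fingerprints
    that differ in only a few bit positions."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(vector):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Pages whose fingerprints are within a small Hamming distance (commonly ~3 bits
# for 64-bit SimHash) are treated as near-duplicates and skipped.
```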
MEDIUM: URL Normalization
Canonicalize URLs: lowercase, remove fragments, sort params.
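A sketch of those canonicalization rules with urllib.parse; exactly which rules to apply (for example stripping tracking parameters) is a policy decision, so only the ones named above are shown.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Canonicalize a URL so syntactic variants collapse to one frontier entry."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()                      # lowercase host
    if (parts.scheme == "http" and netloc.endswith(":80")) or \
       (parts.scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]              # drop default ports
    query = urlencode(sorted(parse_qsl(parts.query)))  # sort query parameters
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, query, ""))  # "" drops the fragment

# normalize_url("HTTP://Example.com:80/a?b=2&a=1#frag") -> "http://example.com/a?a=1&b=2"
```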
💡 Interview Tips
- Politeness is critical: don't overwhelm sites
- URL deduplication is a major challenge
- Discuss how to handle the long tail of the web
- Cover failure handling and checkpointing