Design Web Crawler

Scenario

crawl 1.6M web pages / sec
re-crawl all of them every week
10 PB of web page storage
average size of a web page = 10 KB
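
The numbers above are consistent with each other, which a quick back-of-the-envelope check can show. This is only a sketch: it assumes decimal units (10 KB = 10,000 bytes, 1 PB = 10^15 bytes), and the function name `estimate` is made up for illustration.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def estimate(pages_per_sec: float, page_size_bytes: int):
    """Return (pages crawled in one week, bytes needed to store them)."""
    total_pages = pages_per_sec * SECONDS_PER_WEEK
    total_bytes = total_pages * page_size_bytes
    return total_pages, total_bytes


# Assumption: 10 KB is taken as 10,000 bytes (decimal units).
pages, size = estimate(1.6e6, 10_000)
print(f"{pages:.2e} pages per week")  # 9.68e+11, i.e. roughly 1 trillion pages
print(f"{size / 1e15:.1f} PB")        # 9.7, i.e. roughly 10 PB
```

So 1.6M pages / sec over one week is about 1 trillion pages, and at 10 KB each that is about 10 PB.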

Service

Crawler, TaskService, StorageService

Storage

BigTable DB to store tasks
HDFS to store web pages
web page record: url, page_content

Approach 1: single threaded web crawler

Approach 2: multi-threaded web crawler

cons: context-switch overhead; limited number of ports; network bottleneck on a single machine
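
A minimal sketch of what a multi-threaded crawler on one machine might look like, using a thread pool from the Python standard library. The `crawl` function, the `fetch` callback and the toy link graph are all hypothetical; a real `fetch` would download the page and extract its links.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: fetch(url) -> (page_content, outgoing_links)
Fetch = Callable[[str], Tuple[str, List[str]]]


def crawl(seeds: List[str], fetch: Fetch,
          max_workers: int = 8, max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl that fetches each batch of URLs in parallel."""
    pages: Dict[str, str] = {}
    frontier = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    seen = set(frontier)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier and len(pages) < max_pages:
            # Take at most as many URLs as we are still allowed to crawl.
            batch = frontier[: max_pages - len(pages)]
            frontier = frontier[len(batch):]
            # pool.map runs fetch concurrently and keeps results in order;
            # an exception raised by fetch propagates to the caller here.
            for url, (content, links) in zip(batch, pool.map(fetch, batch)):
                pages[url] = content
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return pages


# Toy in-memory "web" so the sketch runs without network access.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": []}


def fake_fetch(url: str) -> Tuple[str, List[str]]:
    return f"content of {url}", graph[url]


result = crawl(["a"], fake_fetch)
print(sorted(result))  # ['a', 'b', 'c']
```

Adding threads only helps until the single machine runs into the limits listed above, which is the motivation for the distributed approach.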

Approach 3: distributed web crawler

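The concept of a queue usually comes up in algorithms, and it usually refers to an in-memory structure.
When a queue needs to be persisted to disk, one way to implement it is to use a database to simulate the queue. The task table is a table we create in this problem specifically to serve as the task queue.

So when you say "queue", the implication is that it lives in memory and can hold only a relatively small amount of data.
When the queue becomes very large, it has to be stored in a database, and that is the task table. (https://www.jiuzhang.com/qa/2678/)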
task table: id, url, state, priority, availableTime
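
A rough sketch of a task table used as a persistent queue. It uses `sqlite3` purely as a local stand-in for the real store, and the state values (`'idle'`, `'working'`), the rule that a larger `priority` is crawled first, and the helper names are assumptions made for illustration.

```python
import sqlite3
import time


def create_task_table(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS task (
            id            INTEGER PRIMARY KEY AUTOINCREMENT,
            url           TEXT    NOT NULL UNIQUE,
            state         TEXT    NOT NULL DEFAULT 'idle',
            priority      INTEGER NOT NULL DEFAULT 0,
            availableTime REAL    NOT NULL DEFAULT 0
        )""")


def add_task(conn, url, priority=0, available_time=0.0):
    # INSERT OR IGNORE skips URLs that are already in the table.
    conn.execute(
        "INSERT OR IGNORE INTO task (url, priority, availableTime) "
        "VALUES (?, ?, ?)", (url, priority, available_time))
    conn.commit()


def claim_next_task(conn, now=None):
    """Dequeue: pick the best idle task that is due and mark it 'working'.

    Returns (id, url), or None if nothing is due. With several crawlers
    sharing one table, the select-then-update would need to be atomic.
    """
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, url FROM task "
            "WHERE state = 'idle' AND availableTime <= ? "
            "ORDER BY priority DESC, availableTime ASC, id ASC LIMIT 1",
            (now,)).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE task SET state = 'working' WHERE id = ?",
                     (row[0],))
        return row


conn = sqlite3.connect(":memory:")
create_task_table(conn)
add_task(conn, "http://a.example", priority=1)
add_task(conn, "http://b.example", priority=5)
print(claim_next_task(conn, now=10.0))  # (2, 'http://b.example')
```

The `availableTime` column is what lets the same table schedule future re-crawls: a task is simply invisible to crawlers until its time arrives.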

Scale

how to handle slow select?
- shard the task table
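
One simple way to shard, sketched under the assumption that the URL is the shard key and that the number of shards is fixed. The function name `shard_for` is hypothetical. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to a shard index in [0, num_shards)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


shard = shard_for("http://a.example/index.html", 16)
print(0 <= shard < 16)  # True
```

Each crawler then selects only from its own shard, so no single table has to serve every query. Plain modulo sharding moves most keys when the shard count changes; consistent hashing is a common alternative if shards are added often.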
how to handle updates and failures?
- i.e. content updates, crawl failures
- exponential back-off
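
A minimal sketch of exponential back-off for a failed crawl: each consecutive failure doubles the wait before the task's availableTime comes up again. The base delay of 60 seconds, the one-week cap and the function name are assumptions, not values from these notes.

```python
def next_available_time(now: float, consecutive_failures: int,
                        base_delay: float = 60.0,
                        max_delay: float = 7 * 24 * 3600.0) -> float:
    """Return the new availableTime after the n-th consecutive failure.

    The delay is base_delay * 2^(n-1), capped at max_delay.
    """
    # Clamp the exponent so very large failure counts cannot overflow.
    exponent = min(max(consecutive_failures - 1, 0), 32)
    delay = min(base_delay * (2 ** exponent), max_delay)
    return now + delay


for n in (1, 2, 3):
    print(n, next_available_time(0.0, n))  # delays of 60.0, 120.0, 240.0
```

The same doubling could plausibly be applied to pages whose content has not changed since the last crawl, so that rarely updated pages are revisited less often.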
how to handle infinite loops (crawler traps)?
- use a quota
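
One possible reading of the quota idea, sketched as a cap on the number of pages accepted per host, so a site that keeps generating new links cannot consume the whole crawl. The class name `DomainQuota` and the per-host granularity are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


class DomainQuota:
    """Allow at most max_pages_per_host URLs per host."""

    def __init__(self, max_pages_per_host: int) -> None:
        self.max_pages_per_host = max_pages_per_host
        self.counts: Counter = Counter()

    def try_acquire(self, url: str) -> bool:
        """Return True and count the URL if its host is under quota."""
        host = urlsplit(url).hostname or ""
        if self.counts[host] >= self.max_pages_per_host:
            return False
        self.counts[host] += 1
        return True


quota = DomainQuota(max_pages_per_host=2)
urls = [f"http://trap.example/page/{i}" for i in range(5)]
print([quota.try_acquire(u) for u in urls])
# [True, True, False, False, False]
```

In a distributed crawler the counters would have to live in shared storage and be reset per crawl cycle; this in-memory version only illustrates the check.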
