How to store the content of billions of websites found by a search engine (how Google does it)

In their original Google paper, Sergey Brin and Lawrence Page explain that they did not save the crawled page content in the repository as raw HTML, because they wanted to save some HDD space. Here is the relevant paragraph:

4.2.2 Repository

The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.

Apparently they first compress the data with a compression algorithm (zlib in their case) and then save it in the repository. The compressed data is just binary data that can be stored directly on the file system. The metadata (page title, page size, links, etc.) can then be kept in a DB with a link to the binary file on the file system. This sounds like a good idea, but when we are talking about a search engine that crawls billions of pages, this way of storing the data may have some drawbacks.
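For concreteness, here is a minimal sketch (Python, standard library only) of that append-only record layout: each page is zlib-compressed and written prefixed by docID, length, and URL, and the whole file can be scanned sequentially to rebuild other data structures. The exact field widths and byte order are my own assumptions; the paper only names the prefix fields.

```python
import struct
import zlib

# Sketch of the repository layout described in the paper: documents stored one
# after another, each record prefixed by docID, length, and URL.
# The header layout (8-byte docID, 2-byte URL length, 4-byte payload length,
# big-endian) is an assumption for illustration only.

def append_document(repo, doc_id: int, url: str, html: str) -> None:
    """Compress the page with zlib and append one record to the repository file."""
    payload = zlib.compress(html.encode("utf-8"))
    url_bytes = url.encode("utf-8")
    repo.write(struct.pack(">QHI", doc_id, len(url_bytes), len(payload)))
    repo.write(url_bytes)
    repo.write(payload)

def read_documents(repo):
    """Scan the repository sequentially and yield (docID, URL, HTML) tuples."""
    while header := repo.read(14):  # 14 = size of the fixed header above
        doc_id, url_len, payload_len = struct.unpack(">QHI", header)
        url = repo.read(url_len).decode("utf-8")
        html = zlib.decompress(repo.read(payload_len)).decode("utf-8")
        yield doc_id, url, html

if __name__ == "__main__":
    with open("repository.bin", "wb") as repo:
        append_document(repo, 1, "http://example.com/", "<html>hello</html>")
    with open("repository.bin", "rb") as repo:
        for doc_id, url, html in read_documents(repo):
            print(doc_id, url, len(html))
```

Note that this format only supports sequential scans; random access by docID requires a separate index, which is one of the drawbacks mentioned above once you reach billions of pages.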

What is the best approach today? If you want to build a large-scale search engine that will handle the content of millions of websites, where and how would you store the HTML content of the crawled pages?

For the kind of data you are describing, the best option is a distributed file system. Google built the Google File System, a distributed, fault-tolerant file system, for exactly this purpose.
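As a rough illustration of that idea, the sketch below (Python, standard library only) splits the repository into fixed-size segment files, much like GFS splits data into 64 MB chunks, and keeps a docID -> (segment, offset) index so a single page can be fetched without scanning everything. It writes to the local disk purely for demonstration; in a real system the segment writes would go to GFS/HDFS and the index would live in a database or key-value store. All names and sizes here are my own choices, not Google's.

```python
import struct
import zlib
from pathlib import Path

SEGMENT_SIZE = 1 * 1024 * 1024  # rollover threshold; GFS uses 64 MB chunks

class SegmentedRepository:
    """Append-only store split into segment files, with a docID -> location index.
    Assumes it starts from an empty directory; the index is in-memory here but
    would be a database in production."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.segment_no = 0
        self.offset = 0
        self.index = {}  # docID -> (segment_no, offset)

    def _segment_path(self, n: int) -> Path:
        return self.root / f"segment-{n:05d}.bin"

    def put(self, doc_id: int, url: str, html: str) -> None:
        payload = zlib.compress(html.encode("utf-8"))
        url_bytes = url.encode("utf-8")
        record = (struct.pack(">QHI", doc_id, len(url_bytes), len(payload))
                  + url_bytes + payload)
        if self.offset + len(record) > SEGMENT_SIZE:
            self.segment_no += 1  # roll over to a new segment file
            self.offset = 0
        with open(self._segment_path(self.segment_no), "ab") as seg:
            seg.write(record)
        self.index[doc_id] = (self.segment_no, self.offset)
        self.offset += len(record)

    def get(self, doc_id: int) -> tuple[str, str]:
        segment_no, offset = self.index[doc_id]
        with open(self._segment_path(segment_no), "rb") as seg:
            seg.seek(offset)
            _, url_len, payload_len = struct.unpack(">QHI", seg.read(14))
            url = seg.read(url_len).decode("utf-8")
            html = zlib.decompress(seg.read(payload_len)).decode("utf-8")
        return url, html

if __name__ == "__main__":
    repo = SegmentedRepository("repo")
    repo.put(1, "http://example.com/", "<html>hello</html>")
    print(repo.get(1))
```

The point of the segmenting is that each segment file is an independent, replicable unit: a distributed file system can place and replicate segments across many machines, while the small index (in a database) tells you which segment and offset to read for any given docID.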