Nutch 1.17 web crawling with storage optimization

Question

I am using Nutch 1.17 to crawl more than a million websites. I need to do the following:

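1. Run the crawler once as a deep crawl so that it fetches as many URLs as possible from the given (1 million) domains. This first run may take up to 48 hours.
2. After that, run the crawler again on the same 1 million domains after 5 to 6 hours and select only the URLs that are new on those domains.
3. After the job completes, index the URLs in Solr.
4. The raw HTML does not need to be stored afterwards, so to save storage space (HDFS), remove only the raw data and keep each page's metadata, so that the next job avoids re-fetching a page before its scheduled time.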

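There is no other processing or post-analysis. I have the option of using a medium-sized Hadoop cluster (at most 30 machines). Each machine has 16 GB RAM, 12 cores, and 2 TB of storage. The Solr machine(s) have the same specification. Given the requirements above, I am curious about the following: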

a. How can I achieve the above crawl rate, i.e., how many machines are enough?
b. Do I need to add more machines, or is there a better solution?
c. Is it possible to remove the raw data from Nutch and keep only the metadata?
d. Is there a best strategy to achieve the above objectives?

Answer 1

a. How can I achieve the above crawl rate, i.e., how many machines are enough?

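Assume a polite delay between successive fetches to the same domain is chosen: say at most 10 pages can be fetched per domain per minute. The maximum crawl rate is then 600 million pages per hour (10^6*10*60). A cluster with 360 cores should be enough to come close to this rate. Whether the 1 million domains can be crawled exhaustively within 48 hours depends on the size of each domain. Keep in mind that at the stated rate of 10 pages per domain per minute, only 10*60*48 = 28800 pages per domain can be fetched within 48 hours.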
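The arithmetic above can be reproduced with a short Python sketch. The function and constant names are illustrative only; the figure of 10 pages per domain per minute is the assumption stated in this answer, and the 360 cores correspond to the 30 machines with 12 cores each from the question.

```python
def max_crawl_rate_per_hour(domains, pages_per_domain_per_minute):
    """Upper bound on pages fetched per hour across all domains."""
    return domains * pages_per_domain_per_minute * 60


def max_pages_per_domain(pages_per_domain_per_minute, hours):
    """Upper bound on pages fetched from a single domain in the given time."""
    return pages_per_domain_per_minute * 60 * hours


def pages_per_second_per_core(rate_per_hour, cores):
    """What the hourly rate implies for each core, as a rough sizing figure."""
    return rate_per_hour / 3600 / cores


DOMAINS = 1_000_000       # from the question
PAGES_PER_MINUTE = 10     # politeness assumption made in this answer
CORES = 30 * 12           # 30 machines with 12 cores each, from the question

rate = max_crawl_rate_per_hour(DOMAINS, PAGES_PER_MINUTE)
print(rate)                                        # 600000000
print(max_pages_per_domain(PAGES_PER_MINUTE, 48))  # 28800
print(round(pages_per_second_per_core(rate, CORES), 2))
```

Changing `PAGES_PER_MINUTE` shows how strongly the per-domain cap depends on the chosen politeness delay.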

c. Is it possible to remove the raw data from Nutch and keep only the metadata?

Once a segment has been indexed, you can delete it. The CrawlDb is sufficient to decide whether a link found on one of the 1 million home pages is new.
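To illustrate why the CrawlDb alone is enough for this decision, here is a toy model in Python. It is not Nutch's actual data structure or API: the dictionary, the `status` field, and `find_new_links` are hypothetical stand-ins for a per-URL record that survives after the segment's raw content is gone.

```python
def find_new_links(crawl_db, discovered_links):
    """Return links that have no record in the CrawlDb-like mapping.

    Duplicates within discovered_links are reported once, in first-seen order.
    """
    seen = set()
    new_links = []
    for url in discovered_links:
        if url not in crawl_db and url not in seen:
            seen.add(url)
            new_links.append(url)
    return new_links


# Hypothetical per-URL metadata; no raw HTML is stored here.
crawl_db = {
    "https://example.com/": {"status": "fetched"},
    "https://example.com/about": {"status": "fetched"},
}

# Links found on a home page during a later cycle.
links = [
    "https://example.com/about",
    "https://example.com/news/1",
    "https://example.com/news/1",
    "https://example.com/news/2",
]

print(find_new_links(crawl_db, links))
```

Only the two `news` URLs are reported, because the lookup needs the URL keys and their metadata, not the page content.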

After the job completion, index URLs in Solr

Perhaps index each segment immediately after each cycle.
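One way to picture the "index, then delete" order is the sketch below, which only builds the command lines and does not run them. The directory layout (`crawl/crawldb`, `crawl/segments/...`) and the segment name are hypothetical, and the sketch assumes the Solr index writer is already configured in Nutch; check the exact options against your installation before using them.

```python
def plan_cycle(crawl_dir, segment_name):
    """Build the commands for one cycle: index the segment, then remove it.

    Returns a list of argument lists in the order they should run.
    """
    crawldb = f"{crawl_dir}/crawldb"
    segment = f"{crawl_dir}/segments/{segment_name}"
    return [
        # Index the freshly fetched and parsed segment.
        ["bin/nutch", "index", crawldb, segment],
        # Delete the segment from HDFS; the CrawlDb is kept.
        ["hdfs", "dfs", "-rm", "-r", segment],
    ]


# Hypothetical paths, for illustration only.
for command in plan_cycle("crawl", "20200801120000"):
    print(" ".join(command))
```

If these commands are executed, for example with `subprocess.run(command, check=True)`, stopping at the first failure ensures a segment is never deleted before it has been indexed.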

b. Do I need to add more machines, or is there a better solution? d. Is there a best strategy to achieve the above objectives?

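Much depends on whether the domains are of similar size. If they follow a power-law distribution (which is likely), you have a few domains with several million pages (hardly ever crawled exhaustively) and a long tail of domains with only a few pages (a few hundred at most). In that case you need fewer resources but more time to achieve the desired result.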
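The effect of a power-law size distribution can be explored with a small simulation. This is a sketch under loud assumptions: domain sizes are drawn from a Pareto distribution whose shape parameter `alpha` is made up rather than measured, the sample is 100,000 domains to keep it fast, and the cap of 28800 pages per domain in 48 hours comes from the assumed rate of 10 pages per domain per minute.

```python
import random


def simulate_crawl(num_domains, per_domain_cap, pages_per_hour_per_domain,
                   alpha=1.0, seed=0):
    """Sample domain sizes from a Pareto distribution and apply a per-domain cap."""
    rng = random.Random(seed)
    # paretovariate returns values >= 1, so every domain has at least one page.
    sizes = [int(rng.paretovariate(alpha)) for _ in range(num_domains)]
    fetched = [min(size, per_domain_cap) for size in sizes]
    exhausted = sum(1 for size in sizes if size <= per_domain_cap)
    return {
        "total_pages": sum(sizes),
        "fetched_pages": sum(fetched),
        "exhausted_fraction": exhausted / num_domains,
        # Time the politeness limit alone imposes on the largest domain.
        "hours_for_largest": max(sizes) / pages_per_hour_per_domain,
    }


result = simulate_crawl(
    num_domains=100_000,            # scaled-down sample, not the full 1 million
    per_domain_cap=28800,           # 10 * 60 * 48
    pages_per_hour_per_domain=600,  # 10 pages per minute
)
for key, value in result.items():
    print(key, value)
```

Under these assumptions most domains are exhausted well within the cap, while the largest ones are limited by the per-domain politeness delay rather than by cluster size, which is the "fewer resources but more time" trade-off.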

Tags: solr, hadoop, nutch, hdfs, nutch2