Apache Nutch: Get list of URLs and not content from the entire web

nutch

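I am new to Apache Nutch. My goal is to start from a list of seed URLs and use Nutch to extract as many URLs (and sub-URLs) as possible within a size limit (say, no more than 1 million URLs or less than 1 TB of data). I don't need the content of the pages; I only need to save the URLs. Is there a way to do this? Is Nutch the right tool?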

Yes, you can use Nutch for this purpose; it does essentially everything you are asking for.

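Either way, you need to parse the fetched HTML (in order to discover new links and, of course, repeat the process). One way to go is to dump the LinkDB that Nutch maintains into a file; in Nutch 1.x the LinkDB is built with the invertlinks command and dumped with the readlinkdb command. Alternatively, you can use the indexer-links plugin available for Nutch 1.x to index the inlinks/outlinks into Solr/ES.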
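As a rough illustration of the dump approach, the sketch below collects every URL mentioned in a LinkDB text dump (for example, one written by bin/nutch readlinkdb <linkdb> -dump <out_dir>). The function name urls_from_linkdb_dump is hypothetical, and the line layout it expects (a "<url>, tab, Inlinks:" header followed by " fromUrl: <url> anchor: <text>" lines) is an assumption about the dump format that should be checked against the output of your Nutch version.

```python
from typing import Iterable, Set


def urls_from_linkdb_dump(lines: Iterable[str]) -> Set[str]:
    """Collect every URL mentioned in a LinkDB text dump.

    Assumed (not guaranteed) layout of the dump:
        <target URL>\tInlinks:
         fromUrl: <source URL> anchor: <anchor text>
    """
    urls: Set[str] = set()
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.lstrip()
        if stripped.startswith("fromUrl:"):
            # Inlink line: keep only the source URL, drop the anchor text.
            rest = stripped[len("fromUrl:"):].strip()
            source = rest.split(" anchor:", 1)[0].strip()
            if source:
                urls.add(source)
        elif line.endswith("Inlinks:"):
            # Record header: the URL that the following inlinks point to.
            target = line[: -len("Inlinks:")].strip()
            if target:
                urls.add(target)
    return urls
```

With a dump file at a hypothetical path such as out/part-00000, this could be used as `urls_from_linkdb_dump(open("out/part-00000", encoding="utf-8"))`. Note that the LinkDB only covers URLs that take part in at least one link, so the seed list may need to be merged in separately.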

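In Nutch you control how many URLs are processed per round, but that has little to do with the amount of data fetched. So you will have to decide for yourself when to stop.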
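One possible way to decide when to stop is to check the number of known URLs after each round and end the crawl once a budget is reached. The sketch below is only an outline under stated assumptions: parse_total_urls assumes that the crawl statistics (for example, from bin/nutch readdb <crawldb> -stats) contain a line of the form "TOTAL urls:" followed by a number, and run_round and count_urls are hypothetical callables that you would implement yourself, for instance by invoking the Nutch commands for one generate/fetch/parse/updatedb cycle.

```python
import re
from typing import Callable, Optional


def parse_total_urls(stats_text: str) -> Optional[int]:
    """Extract the URL count from CrawlDb statistics text.

    Assumes a line like "TOTAL urls:<whitespace><number>"; returns None
    if no such line is found.
    """
    match = re.search(r"TOTAL urls:\s*(\d+)", stats_text)
    return int(match.group(1)) if match else None


def crawl_until_budget(
    run_round: Callable[[], None],
    count_urls: Callable[[], int],
    max_urls: int,
    max_rounds: int,
) -> int:
    """Run crawl rounds until the URL budget or the round limit is hit.

    run_round  -- hypothetical callable that performs one crawl round
    count_urls -- hypothetical callable that returns the URLs known so far
    Returns the number of rounds that were executed.
    """
    rounds = 0
    while rounds < max_rounds and count_urls() < max_urls:
        run_round()
        rounds += 1
    return rounds
```

Because the budget is checked only between rounds, the final count can overshoot max_urls by up to one round's worth of newly discovered URLs; a smaller per-round limit keeps that overshoot small. A byte budget could be enforced the same way by swapping count_urls for a callable that measures the size of the crawl directory.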
