Where is the crawled data stored when running nutch crawler?

Question

I am new to Nutch. I need to crawl the web (say, a few hundred web pages), read the crawled data, and do some analysis on it.

I followed the tutorial at https://wiki.apache.org/nutch/NutchTutorial (and integrated Solr, since I may need to search the text in the future) and ran a crawl using a few URLs as seeds.

Now I can't find the text/HTML data on my local machine. Where can I find the data, and what is the best way to read it in text format?

Versions

apache-nutch-1.9
solr-4.10.4

Answer 1

After the crawl has finished, you can use the bin/nutch dump command to dump all the fetched URLs in plain HTML format.

The usage is as follows:

$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
   [-segment <segment>]
 -h,--help                show this help message
 -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                      all others. Defaults to all.
 -outputDir <outputDir>   output directory (which will be created) to host
                      the raw data
 -segment <segment>       the segment(s) to use

For example, you could do something like:

$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/

This creates a new directory at the -outputDir location and dumps all the crawled pages into it in HTML format.
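
Once the pages are dumped, they can be read back as plain text for analysis. The following is only a minimal sketch in Python, not part of Nutch: it assumes the dump went to crawl/dump/ as in the command above, and it treats every file it finds as HTML, so it works best if the dump was restricted to HTML pages (for instance with the -mimetype option from the usage text). The names TextExtractor, html_to_text and iter_page_text are hypothetical helpers, and the tag stripping relies on Python's standard html.parser module rather than on anything Nutch provides.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def iter_page_text(dump_dir):
    """Yield (path, text) for every regular file under dump_dir."""
    root = Path(dump_dir)
    if not root.is_dir():
        return
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Assumption: pages are UTF-8; undecodable bytes are replaced.
            html = path.read_text(encoding="utf-8", errors="replace")
            yield path, html_to_text(html)


if __name__ == "__main__":
    # "crawl/dump" matches the -outputDir used in the example command.
    for path, text in iter_page_text("crawl/dump"):
        print(path, len(text.split()), "words")
```

Run from the Nutch directory, this prints one line per dumped file with a rough word count; the loop body is where any further analysis would go.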

There are many more ways to export specific data from Nutch; take a look at https://wiki.apache.org/nutch/CommandLineOptions
