Why do my Apache Nutch warc and commoncrawldump commands fail after a crawl?

Question

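I have successfully crawled a website with Nutch, and now I want to create a WARC from the results. However, both the warc and commoncrawldump commands fail. By contrast, running bin/nutch dump -segment .... on the same segments folder succeeds.

I am using Nutch 1.17 and running: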

bin/nutch commoncrawldump -outputDir output/ -segment crawl/segments

The error in hadoop.log is ERROR tools.CommonCrawlDataDumper - No segment directories found in my/path/ even though a crawl had just been run there.

Answer 1

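The segments folder contained segments left over from earlier crawls that had thrown errors. They did not contain all of the segment data, I assume because those crawls were cancelled or finished early. This caused the whole dump to fail. Deleting all of those files and starting the crawl again fixed the problem.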
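To find such leftover segments before deleting anything, a small check like the following may help. It is only a sketch: it assumes that a segment which has been fully fetched and parsed in Nutch 1.x contains the subdirectories content, crawl_fetch, crawl_generate, crawl_parse, parse_data and parse_text, and it makes no claim about what CommonCrawlDataDumper itself checks. The function name find_incomplete_segments is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: list segment directories that lack any of the subdirectories
# a fully fetched and parsed Nutch 1.x segment is assumed to contain.

EXPECTED_SUBDIRS="content crawl_fetch crawl_generate crawl_parse parse_data parse_text"

# Hypothetical helper: print each segment under $1 that is missing a subdirectory.
find_incomplete_segments() {
  local segments_dir="$1" segment sub missing
  for segment in "$segments_dir"/*/; do
    # Skip the unexpanded glob pattern when the folder has no subdirectories.
    [ -d "$segment" ] || continue
    missing=""
    for sub in $EXPECTED_SUBDIRS; do
      [ -d "${segment}${sub}" ] || missing="$missing $sub"
    done
    if [ -n "$missing" ]; then
      echo "${segment%/}: missing$missing"
    fi
  done
}

# Example call (path taken from the question):
# find_incomplete_segments crawl/segments
```

Any segment it reports is a candidate for removal, or for re-running the fetch and parse steps, before trying the dump again.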
