Obtain the raw html of pages fetched by Nutch 2.3.1

Question

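I would like to train my NLP model on a large number of web pages to get good precision. Since I don't have the web pages, I'm considering running a web crawler on Amazon EMR. I'd like to use a distributed, scalable and extensible open source solution that respects robots.txt rules. After some research, I decided to adopt Apache Nutch.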

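I found this video by Julien Nioche, Nutch's main contributor, especially useful for getting started. Even though I used the newest available versions of Hadoop (Amazon 2.7.3) and Nutch (2.3.1), I managed to complete a small example job successfully.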

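Unfortunately, I couldn't find an easy way to retrieve the raw html files from Nutch's output. While looking for a solution to this problem, I found a few other useful resources (in addition to Nutch's own wiki and tutorial pages).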

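Some of them (like this answer or this page) suggest implementing a new plugin (or modifying an existing one): the overall idea is to add a few lines of code that save the content of every fetched html page to a file before it is sent to a segment.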

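Others (like this answer) suggest implementing a simple post-processing tool that accesses the segments, iterates over all the records they contain, and saves the content of every record that turns out to be an html page to a file.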
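The post-processing idea can be sketched independently of Nutch. The example below is only an illustration under stated assumptions: the class `RawHtmlDumper`, its methods, and the `Map<String, ByteBuffer>` that stands in for the stored records (URL to raw content) are all hypothetical and are not part of the Nutch API. It writes the raw bytes directly, so no character encoding has to be guessed when the file is saved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: dumps fetched records to one file per URL. */
final class RawHtmlDumper {

    /** Builds a file name from a URL: every character that is not a letter or digit becomes '_'. */
    static String toFileName(String url) {
        return url.replaceAll("[^\\p{L}\\p{Nd}]", "_") + ".html";
    }

    /** Writes the remaining bytes of one record to folder/fileName without moving the buffer position. */
    static Path dump(Path folder, String url, ByteBuffer content) {
        try {
            Files.createDirectories(folder);
            ByteBuffer view = content.duplicate();   // independent position, shared bytes
            byte[] bytes = new byte[view.remaining()];
            view.get(bytes);
            return Files.write(folder.resolve(toFileName(url)), bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Iterates over all records, skips empty ones, and returns the number of files written. */
    static int dumpAll(Path folder, Map<String, ByteBuffer> records) {
        int written = 0;
        for (Map.Entry<String, ByteBuffer> record : records.entrySet()) {
            ByteBuffer content = record.getValue();
            if (content != null && content.hasRemaining()) {
                dump(folder, record.getKey(), content);
                written++;
            }
        }
        return written;
    }
}
```

In a real tool, the map would be replaced by whatever reader gives access to the crawl data of the Nutch version in use; that part is deliberately left out here.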

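These resources all contain (more or less precise) instructions and code examples, but I had no luck when I tried to run them because they refer to very old versions of Nutch. Moreover, all my attempts to adapt them to Nutch 2.3.1 failed because of the lack of resources and documentation.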

For example, I appended the following code to the end of HtmlParser (the core of the parse-html plugin), but all the files saved in the specified folder are empty:

String html = root.toString();
if (html == null) {
    byte[] bytes = content.getContent();
    try {
      html = new String(bytes, encoding);
    } catch (UnsupportedEncodingException e) {
        LOG.trace(e.getMessage(), e);
    }
}
if (html != null) {
    html = html.trim();
    if (!html.isEmpty()) {
        if (dumpFolder == null) {
            String currentUsersHomeFolder = System.getProperty("user.home");
            currentUsersHomeFolder = "/Users/stefano";
            dumpFolder = currentUsersHomeFolder + File.separator + "nutch_dump";
            new File(dumpFolder).mkdir();
        }
        try {
            String filename = base.toString().replaceAll("\\P{LD}", "_");
            if (!filename.toLowerCase().endsWith(".htm") && !filename.toLowerCase().endsWith(".html")) {
                filename += ".html";
            }
            System.out.println(">> " + dumpFolder+ File.separator +filename);
            PrintWriter writer = new PrintWriter(dumpFolder + File.separator + filename, encoding);
            writer.write(html);
            writer.close();
        } catch (Exception e) {
            LOG.trace(e.getMessage(), e);
        }
    }
}

In another case, I got the following error (which I like because it mentions a prolog, but which also puzzles me):

[Fatal Error] data:1:1: Content is not allowed in prolog.
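As background on this message: it is the kind of error an XML parser reports when its input does not begin with XML markup, for instance when plain text or other non-XML bytes are handed to it. Whether that is what happened in the setup above is an assumption, not a diagnosis. The sketch below uses only the JDK's built-in XML parser; the class name `PrologErrorDemo` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

/** Hypothetical demo: shows where an XML parser rejects input that is not XML. */
final class PrologErrorDemo {

    /** Returns "line:column" of the fatal parse error, or null if the input is well-formed XML. */
    static String parseErrorPosition(byte[] data) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(data));
            return null; // parsed without a fatal error
        } catch (SAXParseException e) {
            // The default error handler also prints a "[Fatal Error] ..." line to stderr.
            return e.getLineNumber() + ":" + e.getColumnNumber();
        } catch (Exception e) {
            // Parser configuration or I/O problems are unrelated to the demo.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `parseErrorPosition` with the bytes of a plain sentence yields an error position, while the bytes of `<root/>` yield `null`.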

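So, before considering downgrading my setup to Nutch 1.x, my question is: has any of you had to face this problem with a recent version of Nutch and solved it successfully?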

If so, could you share your solution with the community, or at least provide some useful pointers?

Many thanks in advance!

PS: If you are wondering how to properly open the Nutch source code in IntelliJ, this answer might actually point you in the right direction.

Answer 1

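Glad you found the video useful. If you just need web pages to train an NLP model, why not use the CommonCrawl dataset? It contains billions of pages, is free, and would save you the hassle of large-scale web crawling.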

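Now, to answer your question: you could write a custom IndexWriter and write the content of the pages to wherever you want. I don't use Nutch 2.x because I prefer 1.x, which is faster, has more functionality and is easier to use (to be honest, I actually prefer StormCrawler even more, but I am biased). Nutch 1.x has a WARCExporter class which can generate a data dump in the same WARC format used by CommonCrawl; there is also another class for exporting in various formats.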

Answer 2

You can save the raw HTML by editing the Nutch code. First, get Nutch running in Eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse

Once Nutch runs in Eclipse, edit the file FetcherReducer.java, add the code below to the output method, and run ant eclipse again to rebuild the classes.

Finally, the raw html will be stored in the reprUrl column of your database.

if (content != null) {
    ByteBuffer raw = fit.page.getContent();
    if (raw != null) {
        ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
        Scanner scanner = new Scanner(arrayInputStream);
        scanner.useDelimiter("\\Z"); // read all of the scanner's content into one String
        String data = "";
        if (scanner.hasNext()) {
            data = scanner.next();
        }
        fit.page.setReprUrl(StringUtil.cleanField(data));
        scanner.close();
    } 
}
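The snippet above reads the buffer through a Scanner, which decodes with the platform's default charset. As an alternative sketch, not taken from the answer, the remaining bytes of a `ByteBuffer` can be decoded with an explicit charset. The class name `ByteBufferText` is hypothetical, and the use of UTF-8 in the usage line is an assumption; the real encoding of a fetched page may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

/** Hypothetical helper: decodes the unread part of a ByteBuffer into a String. */
final class ByteBufferText {

    /**
     * Decodes the bytes between the buffer's position and limit.
     * Works on a duplicate, so the caller's position and limit are left untouched.
     */
    static String decode(ByteBuffer raw, Charset charset) {
        if (raw == null) {
            return "";
        }
        return charset.decode(raw.duplicate()).toString();
    }
}
```

For example, `ByteBufferText.decode(raw, StandardCharsets.UTF_8)` could replace the Scanner-based reading, with the result passed on exactly as `data` is in the original code.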
