How to read from Nutch segments without the readseg command

Question

I am using Nutch to crawl some websites; specifically, I am crawling this site.

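I have these five segments with all the files found (about 10,000 files). Now I would like to process the content of the documents without using the readseg command, i.e. without dumping the segments as plain text.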

For this purpose, only the content subdirectory of each segment is useful to me (the tags and the content of the documents).

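I have found that inside the content directory there are two more containers: data and index. However, I have not found any explanation of them, nor of how to read them in order to process the content inside. I have also looked for some pointers on this question, but I have not understood the idea behind the algorithm.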

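How is the content stored in Nutch segments, and how can it be read? A short example would be welcome (but is not required); I have linked the crawled website and the segments above.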

Answer 1

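What do you need to do with the content? For example, you could write a custom IndexWriter. It is called during the indexing step and gives you access to the content. Alternatively, have a look at the 'dump' command (org.apache.nutch.tools.FileDumper) and modify its code.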

By the way, 'Hadoop: The Definitive Guide' by Tom White has a great chapter on the Nutch data structures.

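If you want to do further processing on the pages, such as NLP or classification, Behemoth can be used to convert Nutch segments into a 'neutral' data structure on HDFS, which can then be processed with various tools.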

Answer 2

Based on @JulienNioche's reply, here is my implementation.

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FilenameUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType;

// file is the root directory of the segments.
private static void indexSegments(File file)
        throws IOException, IllegalAccessException, InstantiationException {
    // Do not try to index files that cannot be read.
    if (file.canRead() && file.isDirectory()) {
        // List with all the segments.
        File[] segmentDirs = file.listFiles();
        if (segmentDirs == null) {
            System.err.println("No segment directories found in '" +
                                file.getAbsolutePath() + "'");
            return;
        }
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // One detector instance is enough for all documents.
        Tika tika = new Tika();
        // Index all the segments.
        for (File segment : segmentDirs) {
            /* Only the content of the documents managed in
             * the segment is useful for the system. */
            String segmentData = segment.getAbsolutePath() + "/" +
                    Content.DIR_NAME + "/part-00000/data";
            if (!new File(segmentData).exists()) {
                System.out.println("Skipping segment: '" + segment.getName() +
                                   "': no data file present.");
                continue;
            }
            // try-with-resources closes the reader even if an exception is thrown.
            try (SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path(segmentData), conf)) {
                // The key of each record is the URL of the fetched document.
                Writable key = (Writable) reader.getKeyClass().newInstance();
                // Index all the documents managed in the current segment.
                while (reader.next(key)) {
                    Content content = new Content();
                    reader.getCurrentValue(content);
                    String url = key.toString();
                    String baseName = FilenameUtils.getBaseName(url);
                    String extension = FilenameUtils.getExtension(url);
                    // Skip the document if it is not an XML file.
                    // Short-circuit || avoids a NullPointerException on a null type.
                    String mimeType = tika.detect(content.getContent());
                    if (mimeType == null ||
                            !mimeType.equals(MediaType.APPLICATION_XML.toString())) {
                        System.out.println("Skipping document: '" + baseName +
                                           "': not an XML file.");
                        continue;
                    }
                    /* Content of the document, decoded as UTF-8. */
                    String docContent =
                            new String(content.getContent(), StandardCharsets.UTF_8);
                    // TODO: Do what you want with the content.
                }
            }
        }
    }
}

Answer 3

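I know this is an old question, but I stumbled upon it while trying to find the answer to the same question. I searched through some answers and came up with this simple Java loop to get the segment contents. The key class is org.apache.hadoop.io.MapFile.Reader, which reads the index and data files. Disclaimer: I am new to Nutch and Hadoop, but this worked for me.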

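import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

// conf is assumed to be a Hadoop Configuration field of the enclosing class,
// e.g. one created with NutchConfiguration.create().
private void readContent(Path[] segmentPaths) throws Exception {

    // Segment subdirectories to read; each one holds an index and a data file.
    String[] fileTypes = {"content", "crawl_fetch", "parse_data", "parse_text"};
    // Part directory name in my segments; other setups may use "part-00000".
    String partR = "part-r-00000";

    for (Path path : segmentPaths) {
        for (String type : fileTypes) {
            Path file = new Path(path, type + "/" + partR);
            // try-with-resources closes the reader even if reading fails.
            try (MapFile.Reader reader = new MapFile.Reader(file, conf)) {
                // Instantiate key and value holders of the types stored in the file.
                WritableComparable key = (WritableComparable)
                        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable)
                        ReflectionUtils.newInstance(reader.getValueClass(), conf);
                // Print every record as "key<TAB>value".
                while (reader.next(key, value)) {
                    System.out.printf("%s\t%s%n", key, value);
                }
            }
        }
    }
}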
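
For intuition about the data and index files that MapFile.Reader opens, here is a toy model in plain Java. This is only a sketch of the idea, not Hadoop's actual code or on-disk format: the class name ToyMapFile and all of its methods are made up for illustration. The idea it shows is that data holds the entries sorted by key, while index holds only every n-th key together with its position in data, so a lookup can binary-search the small index and then scan a short stretch of data. In Hadoop the index interval is configurable (128 by default, as far as I know); the toy uses 2 to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a MapFile: sorted "data" entries plus a sparse "index". */
final class ToyMapFile {
    // "data": keys in ascending order, with values at the same positions.
    private final List<String> keys = new ArrayList<>();
    private final List<String> values = new ArrayList<>();
    // "index": every interval-th key and its position in the data lists.
    private final List<String> indexKeys = new ArrayList<>();
    private final List<Integer> indexPositions = new ArrayList<>();
    private final int interval;

    ToyMapFile(int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.interval = interval;
    }

    /** Appends an entry; in this toy, keys must arrive in strictly ascending order. */
    void append(String key, String value) {
        if (!keys.isEmpty() && key.compareTo(keys.get(keys.size() - 1)) <= 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % interval == 0) {
            indexKeys.add(key);
            indexPositions.add(keys.size());
        }
        keys.add(key);
        values.add(value);
    }

    /** Returns the value stored for key, or null if the key is absent. */
    String get(String key) {
        // Binary search the index for the last indexed key <= the wanted key.
        int lo = 0;
        int hi = indexKeys.size() - 1;
        int slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (slot < 0) {
            return null; // the key sorts before the first entry
        }
        // Scan the data forward from the indexed position.
        for (int i = indexPositions.get(slot); i < keys.size(); i++) {
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) {
                return values.get(i);
            }
            if (cmp > 0) {
                break; // passed the place where the key would be
            }
        }
        return null;
    }

    /** Number of entries kept in the sparse index. */
    int indexSize() {
        return indexKeys.size();
    }

    public static void main(String[] args) {
        ToyMapFile file = new ToyMapFile(2);
        file.append("http://example.org/a.xml", "<a/>");
        file.append("http://example.org/b.xml", "<b/>");
        file.append("http://example.org/c.xml", "<c/>");
        System.out.println(file.get("http://example.org/b.xml"));       // prints <b/>
        System.out.println(file.get("http://example.org/missing.xml")); // prints null
    }
}
```

In a real segment the keys are the page URLs and the values are the serialized records (Content, ParseData and so on), which is why reading only the data file sequentially, as in the other answer, also works when no lookup by URL is needed.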
