Hadoop mapreduce CSV as key : word

Question

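I can't find an answer to my question; if there is a similar post, please point me to it.

I have a CSV file that I am trying to run mapreduce on. The CSV has two columns: Book Title | Synopsis. I would like to run mapreduce on each book and count the words in each book, so I would like the output to be: Book Title : Token.

So far I have tried the following code to achieve this: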

    String firstBook = null;
    while (itr.hasMoreTokens()) {
        String secondBook = itr.nextToken();
        if (firstBook != null) {
            // Emits every pair of adjacent tokens as "previous:current" with a count of 1
            word.set(firstBook + ":" + secondBook);
            context.write(word, one);
        }
        firstBook = secondBook;
    }

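This sometimes produces output of the form: Word : Title.

It also limits the analysis I can do, because this is the logic I wanted to use for a bigram analysis within each synopsis.

Is there a way that I can isolate each book title and run mapreduce only on the 'synopsis' column of the CSV? If so, how would I do this and get the desired output?

Many thanks.

Update

The code is modified from Hadoop's wordcount example; the only change is in the "map" section, which is shown above. You can find the input data here.

Representation of the CSV file: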

Book title, Synopsis
A short history of nearly everything, Bill Byrson describes himself as a reluctant traveller...
Reclaiming economic development, There is no alternative to neoliberal economics - or so it appeared...

-> Note that I have shortened the synopses.

Answer 1

thus, I would like the output to be: Book Title : Token.

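If you copied the word count example, then you are only writing each pair of adjacent tokens followed by the number 1. It looks like you are not using the title at all, only the tokens in the synopsis. But you have cut out the part where you obtain the tokenizer, so it is hard to say.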

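Note: if a book title contains a comma, your current approach will treat part of the title as part of the synopsis. If possible, you should quote the title column or, better still, avoid using a comma (or any other common delimiter) between columns when that delimiter can appear in at least the first column.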
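To illustrate the quoting advice, here is a minimal sketch of splitting one line into title and synopsis. The class name `CsvLineSplitter` and the sample titles are made up for illustration, and the sketch assumes a quoted title never contains an escaped quote; a real CSV parser would be the safer choice for production data.

```java
final class CsvLineSplitter {
    /**
     * Splits one CSV line into {title, synopsis}.
     * A title wrapped in double quotes may contain commas; otherwise the
     * first comma on the line is taken as the column separator.
     * Assumption: no escaped quotes ("") inside a quoted title.
     */
    static String[] split(String line) {
        String title;
        String synopsis;
        if (line.startsWith("\"")) {
            int closing = line.indexOf('"', 1);           // end of the quoted title
            if (closing < 0) {
                throw new IllegalArgumentException("Unterminated quote: " + line);
            }
            title = line.substring(1, closing);
            int comma = line.indexOf(',', closing);       // separator after the quote
            synopsis = comma < 0 ? "" : line.substring(comma + 1);
        } else {
            int comma = line.indexOf(',');                // first comma ends the title
            if (comma < 0) {
                throw new IllegalArgumentException("No separator found: " + line);
            }
            title = line.substring(0, comma);
            synopsis = line.substring(comma + 1);
        }
        return new String[] { title.trim(), synopsis.trim() };
    }
}
```

Calling `CsvLineSplitter.split(line)` on an unquoted line only works when the title has no comma, which is exactly the limitation described above; every comma after the first one is kept as part of the synopsis.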

perform an analysis of bigrams in each synopsis.

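If you want to do that kind of analysis, I suggest cleaning the column first: remove capitalization and punctuation. Stemming may also produce better output.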
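A rough sketch of that cleaning step, followed by bigram generation, could look like the following. The class and method names are hypothetical, stemming is left out, and treating every run of non-letter, non-digit characters as a separator is only one possible rule (it splits "Bryson's" into "bryson" and "s", for example).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SynopsisBigrams {
    /** Lower-cases the text and splits it on runs of non-letter, non-digit characters. */
    static List<String> tokens(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                             .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                             .trim();
        List<String> out = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return out;                       // nothing but punctuation or whitespace
        }
        for (String token : cleaned.split(" ")) {
            out.add(token);
        }
        return out;
    }

    /** Returns every pair of adjacent cleaned tokens, joined by a single space. */
    static List<String> bigrams(String text) {
        List<String> toks = tokens(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < toks.size(); i++) {
            out.add(toks.get(i) + " " + toks.get(i + 1));
        }
        return out;
    }
}
```

In a mapper, each bigram returned for a synopsis would be written to the context with a count of 1, in place of the raw adjacent-token pairs.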

Is there a way that I can isolate each book title

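Sure: put an if statement on the first column that checks for the specific book, and only write to the context when that condition holds.

Otherwise, if your mapper simply writes the book title as the key, the books will be separated from each other when they reach the reduce function, because all values for one key are reduced together.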
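The if-statement idea can be sketched without Hadoop as a plain method that stands in for the body of `map`. The names `TitleFilter`, `mapRow`, and `wantedTitle` are assumptions made for this illustration; in a real mapper the returned strings would be written to the context instead of collected in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TitleFilter {
    /**
     * Emits "title:token" records for one row, but only when the row's
     * title matches wantedTitle (the "if statement" on the first column).
     */
    static List<String> mapRow(String title, String synopsis, String wantedTitle) {
        List<String> emitted = new ArrayList<>();
        if (!title.equalsIgnoreCase(wantedTitle)) {
            return emitted;                   // other books write nothing
        }
        StringTokenizer itr = new StringTokenizer(synopsis);
        while (itr.hasMoreTokens()) {
            emitted.add(title + ":" + itr.nextToken());
        }
        return emitted;
    }
}
```

Dropping the title check gives the second variant: every row emits records keyed by its own title, and grouping by key keeps the books apart.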

Answer 2

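This was solved by using the "KeyValueTextInputFormat" class; there are several tutorials dedicated to this class. It allowed me to split each line of the CSV file into a key : value pair (in my case, book title : synopsis). You can then process the "value" as normal and pass the result to the reduce phase as "key : token".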
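For readers who want to see the data flow, the sketch below imitates it in plain Java so that it runs without a cluster. It assumes each row already arrives as a (title, synopsis) pair, which is what `KeyValueTextInputFormat` hands to the mapper after splitting a line at the first separator. As far as I know the separator defaults to a tab and is set in the job configuration through the property `mapreduce.input.keyvaluelinerecordreader.key.value.separator`. The class name `TitleTokenCount` is hypothetical.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class TitleTokenCount {
    /**
     * Stand-in for the map and reduce phases.
     * Each row is {title, synopsis}, i.e. the (key, value) pair the
     * input format would pass to the mapper.
     */
    static Map<String, Integer> count(String[][] rows) {
        Map<String, Integer> counts = new TreeMap<>();    // sorted by composite key
        for (String[] row : rows) {
            String title = row[0];
            StringTokenizer itr = new StringTokenizer(row[1]);
            while (itr.hasMoreTokens()) {
                // "Map" step: build the composite key "title:token"
                String compositeKey = title + ":" + itr.nextToken();
                // "Reduce" step: sum the 1s emitted for each key
                counts.merge(compositeKey, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Because the title is part of the key, the same token in two different books is counted separately, which matches the "Book Title : Token" output asked for in the question.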
