Spark [Scala]: Reduce a nested tuple value by key

Question

Suppose I have a Spark Scala program with an RDD named mention_rdd that looks like this:

(name, (filename, sum))
...
(Maria, (file0, 3))
(John, (file0, 1))
(Maria, (file1, 6))
(Maria, (file2, 1))
(John, (file2, 3))
...

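For each name, we have a filename and the number of times the name occurs in that file.

I want to reduce by name and find, for each name, the filename with the most occurrences. For example: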

(name, (filename, max(sum)))
...
(Maria, (file1, 6))
(John, (file2, 3))
...

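I tried to access the RDD's (filename, sum) tuple on its own so that I could reduce by name from there (an earlier attempt failed with an error saying I could not traverse mention_rdd because (String, Int) is not a TraversableOnce type):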

val output = mention_rdd.flatMap(file_counts => file_counts._2.map(file_counts._2._1, file_counts._2._2))   
        .reduceByKey((a, b) => if (a > b) a else b)

But I get an error saying: value map is not a member of (String, Int)

Can this be done in Spark? If so, how? Was my approach flawed from the start?

Answer 1

Why not just:

val output = mention_rdd.reduceByKey {
  case ((file1, sum1), (file2, sum2)) =>
    if (sum2 >= sum1) (file2, sum2)
    else (file1, sum1)
}
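
For anyone who wants to try this end to end, here is a minimal sketch of the same reduceByKey idea, assuming a local SparkContext and the sample rows from the question. The object name MentionMax, the helpers keepLarger and mostMentionedFile, and the tie-break on file name are illustrative assumptions, not part of the answer above. The tie-break is there only so that equal counts give the same winner regardless of the order in which Spark merges values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MentionMax {
  // (filename, count) pair, matching the nested tuple in the question.
  type FileCount = (String, Int)

  // Keep the pair with the larger count. On equal counts, keep the
  // lexicographically smaller file name (an assumption made for this sketch
  // so the function is commutative and associative, as reduceByKey expects).
  def keepLarger(a: FileCount, b: FileCount): FileCount =
    if (b._2 > a._2 || (b._2 == a._2 && b._1 < a._1)) b else a

  // For each name, reduce all (filename, count) values down to one winner.
  def mostMentionedFile(mentions: RDD[(String, FileCount)]): RDD[(String, FileCount)] =
    mentions.reduceByKey((a, b) => keepLarger(a, b))

  def main(args: Array[String]): Unit = {
    // Local mode is assumed here purely for experimentation.
    val conf = new SparkConf().setAppName("MentionMax").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val mention_rdd = sc.parallelize(Seq(
        ("Maria", ("file0", 3)),
        ("John",  ("file0", 1)),
        ("Maria", ("file1", 6)),
        ("Maria", ("file2", 1)),
        ("John",  ("file2", 3))
      ))

      // Sort after collecting so the printed order is stable.
      mostMentionedFile(mention_rdd).collect().sortBy(_._1).foreach(println)
      // prints (John,(file2,3)) then (Maria,(file1,6))
    } finally {
      sc.stop()
    }
  }
}
```

As for the error in the question: file_counts._2 is a plain (String, Int) tuple, and tuples have no map method, so the flatMap attempt cannot compile. No unpacking is needed anyway, because reduceByKey already hands the reduce function two values that belong to the same name.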
