SortByValue for an RDD of tuples

Question

I was recently asked (in a class assignment) to find the 10 most frequently occurring words in an RDD. I submitted a working solution that looks like this:

wordsRdd
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .map { case (x, y) => (y, x) }
  .sortByKey(false)
  .map { case (x, y) => (y, x) }
  .take(10)

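Basically, I swap the tuple, sort by key, and then swap back again, and finally take the first 10. I don't find the repeated swapping very elegant.

So I'm wondering whether there is a more elegant way to do this.

I searched and found that some people use Scala implicits to convert the RDD into a Scala sequence and then call sortByValue, but I don't want to convert the RDD to a Scala Seq, because that would defeat the distributed nature of the RDD.

So is there a better way?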

Answer 1

How about this:

wordsRdd.
    map(x => (x, 1)).
    reduceByKey(_ + _).
    takeOrdered(10)(Ordering.by(-1 * _._2))

Or, a bit more verbosely:

object WordCountPairsOrdering extends Ordering[(String, Int)] {
    def compare(a: (String, Int), b: (String, Int)) = b._2.compare(a._2)
}

wordsRdd.
    map(x => (x, 1)).
    reduceByKey(_ + _).
    takeOrdered(10)(WordCountPairsOrdering)
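
Both variants rely on the same idea: takeOrdered returns the smallest elements under the ordering it is given, so an ordering that puts larger counts first yields the most frequent words. As a minimal sketch of that idea, the snippet below builds such an ordering with Ordering.by and reverse and tries it on a plain local Seq. The sample data and the names TopWordsOrderingSketch, counts, byCountDescending and topTwo are hypothetical, chosen only for illustration, and the local sort merely stands in for the RDD call.

```scala
object TopWordsOrderingSketch {
  // Hypothetical sample counts standing in for the output of reduceByKey(_ + _).
  val counts: Seq[(String, Int)] =
    Seq(("spark", 3), ("rdd", 7), ("scala", 5), ("tuple", 1))

  // Compare pairs by their count only, then reverse so larger counts come first.
  // Reversing the ordering avoids negating the count.
  val byCountDescending: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](_._2).reverse

  // Local stand-in for takeOrdered(2)(byCountDescending) on an RDD:
  // sort with the ordering and keep the first two elements.
  val topTwo: Seq[(String, Int)] = counts.sorted(byCountDescending).take(2)

  def main(args: Array[String]): Unit =
    topTwo.foreach(println)
}
```

On the RDD itself, the same ordering could presumably be passed as takeOrdered(10)(byCountDescending). Alternatively, top(10)(Ordering.by[(String, Int), Int](_._2)) should give the same result without the reverse, since top returns the largest elements under the given ordering.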

scala

apache-spark

rdd