Efficiently calculate top-k elements in Spark
My dataframe is similar to:
+---+-----+-----+
|key|thing|value|
+---+-----+-----+
| u1|  foo|    1|
| u1|  foo|    2|
| u1|  bar|   10|
| u2|  foo|   10|
| u2|  foo|    2|
| u2|  bar|   10|
+---+-----+-----+
and I would like to get:
+---+-----+---------+----+
|key|thing|sum_value|rank|
+---+-----+---------+----+
| u1|  bar|       10|   1|
| u1|  foo|        3|   2|
| u2|  foo|       12|   1|
| u2|  bar|       10|   2|
+---+-----+---------+----+
Currently, I have code similar to:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("u1", "foo", 1), ("u1", "foo", 2), ("u1", "bar", 10), ("u2", "foo", 10), ("u2", "foo", 2), ("u2", "bar", 10)).toDF("key", "thing", "value")
// calculate sums per key and thing
val aggregated = df.groupBy("key", "thing").agg(sum("value").alias("sum_value"))
// get top-k items per key (keep ranks 1..k)
val k = lit(10)
val topk = aggregated
  .withColumn("rank", rank().over(Window.partitionBy("key").orderBy(desc("sum_value"))))
  .filter('rank <= k)
However, this code is very inefficient. The window function generates a total ordering of the items and causes a huge shuffle.
How can the top-k items be calculated more efficiently? Maybe using approximate functions, i.e. sketches similar to https://datasketches.github.io/ or https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html
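For reference, a minimal window-free sketch (my own variant, not benchmarked; it assumes Spark 2.4+ for slice and that collecting all (sum_value, thing) pairs of a key into one array is acceptable):

// Sketch: collect (sum_value, thing) structs per key, sort the array descending
// and keep only the first kTop entries instead of ranking with a window function.
val kTop = 10
val topkNoWindow = aggregated
  .groupBy("key")
  .agg(collect_list(struct(col("sum_value"), col("thing"))).as("pairs"))
  .withColumn("topk", slice(sort_array(col("pairs"), asc = false), 1, kTop))
  .drop("pairs")

The shuffle is still grouped by key, so whether this is actually cheaper than the window version depends on the data.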
RDD to the rescue
aggregated.as[(String, String, Long)].rdd
  .groupBy(_._1)
  .map { case (key, it) =>
    // sort the (thing, sum_value) pairs of this key by descending sum and keep the top entry
    (key, it.map(e => (e._2, e._3)).toList.sortBy(-_._2).take(1))
  }
  .toDF.show
+---+----------+
| _1|        _2|
+---+----------+
| u1|[[bar,10]]|
| u2|[[foo,12]]|
+---+----------+
This can most likely be improved as suggested in the comments, i.e. by not starting from aggregated but from df instead. That could look similar to:
df.as[(String, String, Long)].rdd.groupBy(_._1).map { case (key, it) =>
  // sum the values per thing for this key
  val aggregatedInner = it.groupBy(_._2).mapValues(events => events.map(_._3).sum)
  // keep the largest sums first
  val topk = aggregatedInner.toArray.sortBy(-_._2).take(1)
  (key, topk)
}.toDF.show
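A further sketch along the lines of the comments (an assumption on my side, not benchmarked): groupBy on an RDD materializes every record of a key, whereas reduceByKey plus a bounded aggregateByKey keeps at most k entries per key at any time:

// Sketch: map-side combine the per-(key, thing) sums, then keep only the kTop largest
// (thing, sum) pairs per key while aggregating.
val kTop = 10
val topkBounded = df.as[(String, String, Long)].rdd
  .map { case (key, thing, value) => ((key, thing), value) }
  .reduceByKey(_ + _)                                   // sums per (key, thing)
  .map { case ((key, thing), sum) => (key, (thing, sum)) }
  .aggregateByKey(List.empty[(String, Long)])(
    (acc, x) => (x :: acc).sortBy(-_._2).take(kTop),    // at most kTop per partition
    (a, b) => (a ++ b).sortBy(-_._2).take(kTop))        // merge partial top-k lists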
This is a classic algorithm for recommender systems.
// reversed ordering so that .sorted puts the largest values first
case class Rating(thing: String, value: Int) extends Ordered[Rating] {
  def compare(that: Rating): Int = -this.value.compare(that.value)
}
// key is a String in the example data ("u1", "u2")
case class Recommendation(key: String, ratings: Seq[Rating]) {
  def keep(n: Int) = this.copy(ratings = ratings.sorted.take(n))
}
val TOPK = 10
df.groupBy('key)
  .agg(collect_list(struct('thing, 'value)) as "ratings")
  .as[Recommendation]
  .map(_.keep(TOPK))
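To match the sum_value semantics of the question, one would start from the pre-aggregated sums rather than the raw rows; a hedged usage sketch (the cast is needed because Rating.value is an Int while sum() yields a Long):

// Usage sketch: feed the aggregated sums so that ratings carry sum_value, not raw values.
aggregated
  .groupBy('key)
  .agg(collect_list(struct('thing, 'sum_value.cast("int") as "value")) as "ratings")
  .as[Recommendation]
  .map(_.keep(TOPK))
  .show(false)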
You can also have a look at the source code of:
- Spotify Big Data Rosetta Code / TopItemsPerUser.scala, which contains several solutions for this, for Spark or Scio
- Spark MLLib / TopByKeyAggregator.scala, considered best practice when using their recommendation algorithms, though it looks like their examples still use RDDs.
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
sc.parallelize(Array(("u1", ("foo", 1)), ("u1", ("foo", 2)), ("u1", ("bar", 10)),
    ("u2", ("foo", 10)), ("u2", ("foo", 2)), ("u2", ("bar", 10))))
  .topByKey(10)(Ordering.by(_._2))
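Note that the snippet above ranks the raw (thing, value) pairs; to rank the sums per (key, thing) as in the question, one could first reduce by (key, thing). A sketch, assuming the same MLPairRDDFunctions import:

// Sketch: compute the sums per (key, thing), re-key by key, then take the k largest per key.
sc.parallelize(Seq(("u1", "foo", 1), ("u1", "foo", 2), ("u1", "bar", 10),
    ("u2", "foo", 10), ("u2", "foo", 2), ("u2", "bar", 10)))
  .map { case (key, thing, value) => ((key, thing), value) }
  .reduceByKey(_ + _)
  .map { case ((key, thing), sum) => (key, (thing, sum)) }
  .topByKey(10)(Ordering.by(_._2))                      // RDD[(String, Array[(String, Int)])]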