Filter list to first 2 case classes per parameter value in a Scala Dataset
I have a Spark Dataset like this:
+--------+--------------------+
| uid| recommendations|
+--------+--------------------+
|41344966|[[2133, red]...|
|41345063|[[11353, red...|
|41346177|[[2996, yellow]...|
|41349171|[[8477, green]...|
res98: org.apache.spark.sql.Dataset[userItems] = [uid: int, recommendations: array<struct<iid:int,color:string>>]
I want to filter each recommendations array so that it contains only the first two entries of each color. Pseudo-example:
[(13,'red'), (4,'green'), (8,'red'), (2,'red'), (10, 'yellow')]
would become
[(13,'red'), (4,'green'), (8,'red'), (10, 'yellow')]
How can I do this efficiently in Scala with Datasets? Is there an elegant solution using something like reduceGroups?
What I have so far:
case class itemData(iid: Int, color: String)

val filterList = (recs: Array[itemData], filterAttribute: String, maxCount: Int) => {
  // filter the list somehow... using the max count and attribute
  recs // placeholder – this filtering logic is the part I'm asking about
}

dataset.map(d => filterList(d.recommendations, "color", 2))
You can explode the recommendations, assign a row number partitioned by uid and color, and finally filter out the rows whose row number is greater than 2. The code would look like the following. Hope it helps.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Creating test data
val df = Seq((13,"red"), (4,"green"), (8,"red"), (2,"red"), (10, "yellow")).toDF("iid", "color")
.withColumn("uid", lit(41344966))
.groupBy("uid").agg(collect_list(struct("iid", "color")).as("recommendations"))
df.show(false)
+--------+----------------------------------------------------+
|uid |recommendations |
+--------+----------------------------------------------------+
|41344966|[[13,red], [4,green], [8,red], [2,red], [10,yellow]]|
+--------+----------------------------------------------------+
val filterDF = df.withColumn("rec", explode(col("recommendations"))) // one row per recommendation
  .withColumn("iid", col("rec.iid"))
  .withColumn("color", col("rec.color"))
  .drop("recommendations", "rec")
  .withColumn("rownum",
    row_number().over(Window.partitionBy("uid", "color").orderBy(col("iid").desc))) // rank items within each (uid, color)
  .filter(col("rownum") <= 2) // keep at most two per color
  .groupBy("uid").agg(collect_list(struct("iid", "color")).as("recommendations")) // re-assemble the array
filterDF.show(false)
+--------+-------------------------------------------+
|uid |recommendations |
+--------+-------------------------------------------+
|41344966|[[4,green], [13,red], [8,red], [10,yellow]]|
+--------+-------------------------------------------+
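If you would rather stay in the typed Dataset API, as the filterList sketch in the question suggests, you can also do the filtering per row with a plain map and ordinary Scala collections. This is only a minimal sketch: it assumes the itemData and userItems case classes and the dataset value from the question, that recommendations is an Array[itemData], and that spark.implicits._ is in scope for the encoders.

// Keep only the first `maxCount` items of each color, preserving array order.
def firstNPerColor(recs: Array[itemData], maxCount: Int): Array[itemData] = {
  val counts = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
  recs.filter { item =>
    counts(item.color) += 1        // occurrences of this color seen so far
    counts(item.color) <= maxCount // keep the item only while under the limit
  }
}

// Apply it row by row; copy() keeps uid and swaps in the filtered array.
val filtered = dataset.map(u => u.copy(recommendations = firstNPerColor(u.recommendations, 2)))

This avoids the shuffle from explode plus the window, but note the difference in semantics: it keeps the first two items per color in array order, whereas the window version above keeps the two with the highest iid. Pick whichever matches your definition of "first".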