Combine Scala DataFrame columns into a single case class
I have a DataFrame that looks like this:
+--------+-----+--------------------+
| uid| iid| color|
+--------+-----+--------------------+
|41344966| 1305| red|
|41344966| 1305| green|
I want to turn it into this, as efficiently as possible:
+--------+--------------------+
| uid| recommendations|
+--------+--------------------+
|41344966| [[2174, red...|
|41345063| [[2174, green...|
|41346177| [[2996, orange...|
|41349171| [[2174, purple...|
res98: org.apache.spark.sql.Dataset[userRecs] = [uid: int, recommendations: array<struct<iid:int,color:string>>]
So I want to group the records by uid into an array of objects, where each object is a case class with the parameters iid and color:
case class itemData (iid: Int, color: String)
case class userRecs (uid: Int, recommendations: Array[itemData])
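For reference, the target schema implied by these case classes can be inspected without building any data, via Spark's product encoder (a small sketch; it assumes only the spark-sql dependency on the classpath):

```scala
import org.apache.spark.sql.Encoders

case class itemData(iid: Int, color: String)
case class userRecs(uid: Int, recommendations: Array[itemData])

// Derive the schema Spark will use for Dataset[userRecs] and print it.
// It should match the array<struct<iid:int,color:string>> shape shown above.
println(Encoders.product[userRecs].schema.treeString)
```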
Is this what you want?
scala> case class itemData (iid: Int, color: String)
defined class itemData
scala> case class userRecs (uid: Int, recommendations: Array[itemData])
defined class userRecs
scala> val df = spark.createDataset(Seq(
(41344966,1305,"red"),
(41344966,1305,"green"),
(41344966,2174,"red"),
(41345063,2174,"green"),
(41346177,2996,"orange"),
(41349171,2174,"purple")
)).toDF("uid", "iid", "color")
df: org.apache.spark.sql.DataFrame = [uid: int, iid: int ... 1 more field]
scala> (df.select($"uid", struct($"iid",$"color").as("itemData"))
.groupBy("uid")
.agg(collect_list("itemData").as("recommendations"))
.as[userRecs]
.show())
+--------+--------------------+
| uid| recommendations|
+--------+--------------------+
|41344966|[[1305, red], [13...|
|41345063| [[2174, green]]|
|41346177| [[2996, orange]]|
|41349171| [[2174, purple]]|
+--------+--------------------+
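For comparison, here is a fully typed alternative sketch using `groupByKey` and `mapGroups` (class and column names are assumed to match the question; `spark` is an existing SparkSession):

```scala
import org.apache.spark.sql.SparkSession

case class itemRow(uid: Int, iid: Int, color: String)
case class itemData(iid: Int, color: String)
case class userRecs(uid: Int, recommendations: Array[itemData])

val spark = SparkSession.builder.appName("recs").getOrCreate()
import spark.implicits._

val ds = Seq(
  itemRow(41344966, 1305, "red"),
  itemRow(41344966, 1305, "green"),
  itemRow(41345063, 2174, "green")
).toDS()

// Group rows by uid and build each user's recommendation array in one pass.
val recs = ds
  .groupByKey(_.uid)
  .mapGroups { (uid, rows) =>
    userRecs(uid, rows.map(r => itemData(r.iid, r.color)).toArray)
  }
```

Note that `mapGroups` deserializes every row into JVM objects, so for the "as efficiently as possible" requirement the `struct` + `collect_list` version above is generally preferable: it stays in Spark's internal row format and can use its built-in aggregation machinery.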