How to transform a few columns of a dataframe to an array based on a matching field
I have a dataframe that looks like this:
+---+------+------+---+--------+
| Id|fomrid|values|occ|comments|
+---+------+------+---+--------+
|  1|    x1|  22.0|  1|   text1|
|  1|    x1|  test|  2|   text2|
|  1|    x1|    11|  3|   text3|
|  1|    x2|    21|  0|   text4|
|  2|    p1|     1|  1|   text5|
+---+------+------+---+--------+
How can I transform it into the following dataframe? Essentially, I want to create a list of values and occ based on formId.
+---+------+----------------+---------+--------+
| Id|fomrid|     List_values| List_occ|comments|
+---+------+----------------+---------+--------+
|  1|    x1|[22.0, test, 11]|[1, 2, 3]|   text1|
|  1|    x2|            [21]|      [0]|   text4|
|  2|    p1|             [1]|      [1]|   text5|
+---+------+----------------+---------+--------+
You can achieve this with collect_list.
Using Spark SQL
Create a temporary view and run the query below in your Spark session:
input_df.createOrReplaceTempView("my_temp_table_or_view")
output_df = sparkSession.sql("<insert sql below here>")
SELECT
    Id,
    fomrid,
    collect_list(values) AS List_values,
    collect_list(occ) AS List_occ,
    MIN(comments) AS comments
FROM
    my_temp_table_or_view
GROUP BY
    Id, fomrid
Using the PySpark API
from pyspark.sql import functions as F
output_df = (
    input_df.groupBy(["Id", "fomrid"])
    .agg(
        F.collect_list("values").alias("List_values"),
        F.collect_list("occ").alias("List_occ"),
        F.min("comments").alias("comments"),
    )
)
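If you want to sanity-check the grouping logic without spinning up a Spark session, the same collect-and-min semantics can be sketched in plain Python. This is only an illustration of what the aggregation computes; the row data mirrors the sample dataframe above:

```python
from collections import defaultdict

# Rows mirroring the sample dataframe: (Id, fomrid, values, occ, comments)
rows = [
    (1, "x1", "22.0", 1, "text1"),
    (1, "x1", "test", 2, "text2"),
    (1, "x1", "11", 3, "text3"),
    (1, "x2", "21", 0, "text4"),
    (2, "p1", "1", 1, "text5"),
]

# groupBy(["Id", "fomrid"]): bucket rows by the grouping key
groups = defaultdict(list)
for id_, fomrid, value, occ, comment in rows:
    groups[(id_, fomrid)].append((value, occ, comment))

# Emulate collect_list on values/occ and MIN on comments per group
output = [
    {
        "Id": id_,
        "fomrid": fomrid,
        "List_values": [v for v, _, _ in items],
        "List_occ": [o for _, o, _ in items],
        "comments": min(c for _, _, c in items),
    }
    for (id_, fomrid), items in groups.items()
]

for row in output:
    print(row)
```

Within each group the lists line up positionally, which is why `List_values` and `List_occ` stay in step with each other.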
Using Scala
val output_df = input_df.groupBy("Id", "fomrid")
  .agg(
    collect_list("values").alias("List_values"),
    collect_list("occ").alias("List_occ"),
    min("comments").alias("comments")
  )
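One caveat: collect_list does not guarantee element order after a shuffle, so the lists may not come back sorted by occ. A common workaround (not shown above, and hedged here as a sketch) is to collect (value, occ) pairs and sort them by occ before projecting the values back out. In plain Python the idea looks like:

```python
# (value, occ) pairs as they might arrive after a shuffle, in arbitrary order
collected = [("test", 2), ("11", 3), ("22.0", 1)]

# Sort by occ, then keep only the values -- the plain-Python analogue of
# sorting an array of structs keyed by occ and extracting one field.
ordered = [value for value, occ in sorted(collected, key=lambda pair: pair[1])]
print(ordered)  # ['22.0', 'test', '11']
```

In Spark the equivalent is typically done with sort_array over an array of structs, but the sorting step above is the essential part.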
Let me know if this works for you.