How to add a column to a pyspark df, where the data format should be a list and comes from grouped data in the raw table

Question

I am new to pyspark and not sure whether there is a simple way to do this.

I have a df of people's interests, for example:

name	interest
A	gym
A	food
A	games
B	games

From this df, I want to create a new one like the following:

name	interests
A	gym;food;games
B	games

Can anyone help? Apologies in advance if my explanation of the problem is not clear enough.

Answer 1

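# Round-trip the DataFrame X through pandas and rebuild it with its original schema.
schema = X.schema
X_pd = X.toPandas()  # collects all rows to the driver
_X = spark.createDataFrame(X_pd, schema=schema)
del X_pd  # drop the pandas copy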

Answer 2

You can use concat_ws and collect_list from pyspark.sql.functions:

from pyspark.sql import functions as F

df.groupBy("name").agg(
    F.concat_ws(";", F.collect_list("interest")).alias("interest")
).show(truncate=False)

This prints:

+----+--------------+
|name|interest      |
+----+--------------+
|A   |gym;food;games|
|B   |games         |
+----+--------------+

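Remember to assign the result to a new DataFrame (leave off the .show() call when you do, since .show() returns None).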

concat_ws: Concatenates multiple input string columns into a single string column, using the given separator.
collect_list: Aggregate function that returns a list of objects, with duplicates kept.
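As a rough mental model of what the two functions do together, the same grouping can be sketched in plain Python without Spark. This is only an analogy for illustration; the names `rows`, `collected`, and `joined` are hypothetical.

```python
from collections import defaultdict

# Sample data from the question as (name, interest) pairs.
rows = [("A", "gym"), ("A", "food"), ("A", "games"), ("B", "games")]

# collect_list analogue: gather every value per key, duplicates included.
collected = defaultdict(list)
for name, interest in rows:
    collected[name].append(interest)

# concat_ws analogue: join each list with the separator.
joined = {name: ";".join(values) for name, values in collected.items()}

print(joined)  # {'A': 'gym;food;games', 'B': 'games'}
```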

Tags: python, dataframe, pyspark, apache-spark-sql