Combine text from multiple rows based on IDs but remove consecutive duplicate entries

Question

I have a DataFrame:

id	col1
1	aa
3	uy
1	bb
1	cr
1	cr
1	cr
1	qe
2	yt
2	yt
3	uy
4	po
1	cr

I was able to combine the values per id using collect_list, like this:

from pyspark.sql import functions as f
df = df.groupby("id").agg(f.concat_ws(",", f.collect_list(df.col1)).alias('col1'))

id	col1
1	aa,bb,cr,cr,cr,qe,cr
2	yt,yt
3	uy,uy
4	po

But I want the final output to drop consecutive duplicate items, like this:

id	col1
1	aa,bb,cr,qe,cr
2	yt
3	uy
4	po

Answer 1

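from pyspark.sql import Window
from pyspark.sql.functions import array_join, col, collect_list, lag, monotonically_increasing_id

w = Window.partitionBy('id').orderBy('index')

# Note: monotonically_increasing_id() only reflects the DataFrame's current row order
df = (df.withColumn('index', monotonically_increasing_id())  # create an index to order by
        .withColumn('prev', lag('col1').over(w)).na.fill('', subset=['prev'])  # previous col1 value within the same id
        .where(col('col1') != col('prev')).drop('prev')  # keep only rows that differ from the previous row
        .groupBy('id').agg(array_join(collect_list('col1'), ',').alias('col1')))  # group by id, collect the values, then join them
df.show()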

+---+--------------+
| id|          col1|
+---+--------------+
|  1|aa,bb,cr,qe,cr|
|  2|            yt|
|  3|            uy|
|  4|            po|
+---+--------------+
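
To sanity-check the expected result without a Spark session, the same idea can be sketched in plain Python. This is only an illustration and not part of the answer's PySpark code: the function name combine_without_consecutive_duplicates and the rows list are made up for the example, and it assumes the rows arrive in the order shown in the question.

```python
from itertools import groupby

# Sample data from the question, in its original row order.
rows = [
    (1, "aa"), (3, "uy"), (1, "bb"), (1, "cr"), (1, "cr"), (1, "cr"),
    (1, "qe"), (2, "yt"), (2, "yt"), (3, "uy"), (4, "po"), (1, "cr"),
]


def combine_without_consecutive_duplicates(rows, sep=","):
    """Collect values per id in input order, collapsing runs of equal values."""
    values_by_id = {}
    for row_id, value in rows:
        values_by_id.setdefault(row_id, []).append(value)
    # groupby() with no key function groups consecutive equal items,
    # so keeping each group's key leaves one value per run.
    return {
        row_id: sep.join(key for key, _ in groupby(values))
        for row_id, values in sorted(values_by_id.items())
    }


result = combine_without_consecutive_duplicates(rows)
for row_id, combined in result.items():
    print(row_id, combined)
```

Only repeats that are adjacent within the same id are collapsed, which is why the last cr for id 1 is kept while the two uy rows for id 3 merge into one.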

group-by

collect

dataframe

pyspark