Combine text from multiple rows based on IDs but remove consecutive duplicate entries
I have a DataFrame:
id | col1 |
---|---|
1 | aa |
3 | uy |
1 | bb |
1 | cr |
1 | cr |
1 | cr |
1 | qe |
2 | yt |
2 | yt |
3 | uy |
4 | po |
1 | cr |
I was able to combine them using collect_list like this:
from pyspark.sql import functions as f
df = df.groupby("id").agg(f.concat_ws(",", f.collect_list(df.col1)).alias('col1'))
id | col1 |
---|---|
1 | aa,bb,cr,cr,cr,qe,cr |
2 | yt,yt |
3 | uy,uy |
4 | po |
But I want the final output to drop consecutive duplicate items, like this:
id | col1 |
---|---|
1 | aa,bb,cr,qe,cr |
2 | yt |
3 | uy |
4 | po |
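One approach: compare each row's col1 with the previous row's value for the same id using lag over a window, keep only the rows where they differ, then group and join.
from pyspark.sql import Window
from pyspark.sql.functions import array_join, col, collect_list, lag, monotonically_increasing_id
w = Window.partitionBy('id').orderBy('index')
(df.withColumn('index', monotonically_increasing_id())  # index to order by; assumes it reflects the row order you care about
.withColumn('prev', lag('col1').over(w))  # previous col1 value within the same id (null for the first row)
.where(col('prev').isNull() | (col('col1') != col('prev')))  # keep rows that differ from the previous row
.groupBy('id').agg(array_join(collect_list('col1'), ',').alias('col1'))  # group by id, collect, then join with commas
.show())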
+---+--------------+
| id|          col1|
+---+--------------+
|  1|aa,bb,cr,qe,cr|
|  2|            yt|
|  3|            uy|
|  4|            po|
+---+--------------+
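
To sanity-check the expected result on a small sample without a Spark session, the same logic can be sketched in plain Python. This is only an illustration, not part of the Spark solution: the helper name combine_without_consecutive_duplicates is made up here, and it assumes the rows arrive as (id, col1) pairs already in the order you care about.

```python
from itertools import groupby


def combine_without_consecutive_duplicates(rows, sep=","):
    """Hypothetical helper: rows is an iterable of (id, value) pairs in order."""
    per_id = {}
    for row_id, value in rows:
        # Collect values per id, preserving the order in which they appear.
        per_id.setdefault(row_id, []).append(value)
    # itertools.groupby collapses each run of equal adjacent values to one key,
    # so non-adjacent repeats (like the trailing "cr" for id 1) are kept.
    return {
        row_id: sep.join(key for key, _ in groupby(values))
        for row_id, values in per_id.items()
    }


# Sample data copied from the question.
rows = [
    (1, "aa"), (3, "uy"), (1, "bb"), (1, "cr"), (1, "cr"), (1, "cr"),
    (1, "qe"), (2, "yt"), (2, "yt"), (3, "uy"), (4, "po"), (1, "cr"),
]
result = combine_without_consecutive_duplicates(rows)
for row_id in sorted(result):
    print(row_id, result[row_id])
```

Unlike collect_set, which would drop every repeated value and give no ordering guarantee, this removes a value only when it repeats immediately after itself within the same id.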