Need to union data within rows and get rid of duplicate data within each row in PySpark
I have a PySpark dataframe in Databricks that produces the following output:

Field Name
A, B
C,E
A,D,F
B,C,G

I need to find a way to union these rows and remove the duplicate values to create something like this:

Field Name
A, B, C, D, E, F, G

Any ideas?
You can do a grouped collect_set. Since each row holds a comma-separated string, split and explode first so that collect_set deduplicates the individual values (this assumes import pyspark.sql.functions as F):

df2 = (df.select(F.explode(F.split('Field Name', r',\s*')).alias('v'))
         .groupBy(F.lit(1))
         .agg(F.collect_set('v').alias('Field Name')))
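For reference, here is a plain-Python sketch of the same transformation, using the sample data from the question, so you can see what the Spark job computes:

```python
# sample rows from the question; each holds a comma-separated string
rows = ["A, B", "C,E", "A,D,F", "B,C,G"]

# split on commas, strip whitespace, deduplicate, and sort
values = sorted({v.strip() for row in rows for v in row.split(",")})
result = ", ".join(values)
print(result)  # A, B, C, D, E, F, G
```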
Try the following -
data = [("A,B",), ("C,E",), ("A,D,F",), ("B,C,G",)]
df = spark.createDataFrame(data=data, schema=["Field_Name"])
df.show()
+----------+
|Field_Name|
+----------+
| A,B|
| C,E|
| A,D,F|
| B,C,G|
+----------+
from pyspark.sql.functions import *
from pyspark.sql.types import *
import re

# collect the unique characters of the joined string (stringified for cleanup below)
remove_dupes = udf(lambda row: set(row), StringType())
# trim a leading or trailing comma left over after cleanup
comma_rep = udf(lambda x: re.sub(',$|^,', '', x))

(df.withColumn("Field_Name", collect_list(col("Field_Name")))
 .select(comma_rep(regexp_replace(regexp_replace(remove_dupes(array_join("Field_Name", "")), r"\]", ""), r"\[", "")).alias("Field_Name"))
 .select(split(col("Field_Name"), ", ").alias("Field_Name"))
 .select(explode("Field_Name")).filter("col != ''")
 .select(array_join(collect_list("col"), ",").alias("Field_Name"))
).show(truncate=False)
+-------------+
|Field_Name |
+-------------+
|A,B,C,D,E,F,G|
+-------------+
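Worth noting: the UDF pipeline above deduplicates the individual characters of the joined string, so it only works because every value in the sample data is a single letter. A plain-Python sketch of what it effectively computes:

```python
# sample rows from the question's dataframe
rows = ["A,B", "C,E", "A,D,F", "B,C,G"]

joined = "".join(rows)            # concatenate all rows: "A,BC,EA,D,FB,C,G"
chars = set(joined)               # unique characters, separator included
letters = sorted(chars - {","})   # drop the comma separator and sort
print(",".join(letters))          # A,B,C,D,E,F,G
```

For multi-character values, split on the delimiter instead of relying on character-level uniqueness.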