Drop all instances of duplicates in PySpark
I tried searching for this, but the closest I could find was this, and it does not give me what I want.
I want to drop all instances of duplicates in a dataframe.
For example, if I have a dataframe
Col1  Col2  Col3
Alice Girl  April
Jean  Boy   Aug
Jean  Boy   Sept
I want to drop all duplicates based on Col1 and Col2, so that I end up with
Col1  Col2  Col3
Alice Girl  April
Is there any way to do this?
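For reference, the usual dedup call is not what I am after, since it keeps one row per key instead of removing the whole group. A minimal sketch, assuming the sample table above is loaded as a DataFrame named df:
# Not what I want: dropDuplicates keeps one row per (Col1, Col2) key,
# so one of the two "Jean / Boy" rows would still survive.
df.dropDuplicates(["Col1", "Col2"])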
Also, what if I have a very large number of columns, like this:
Col1  Col2  Col3  ....  Col n
Alice Girl  April ....  Apple
Jean  Boy   Aug   ....  Orange
Jean  Boy   Sept  ....  Banana
How can I group by only Col1 and Col2 but still keep the rest of the columns?
Thanks
from pyspark.sql import functions as F

# Sample DataFrame
df = sqlContext.createDataFrame([
    ["Alice", "Girl", "April"],
    ["Jean", "Boy", "Aug"],
    ["Jean", "Boy", "Sept"]
], ["Col1", "Col2", "Col3"])

# Group by the required columns and keep only the rows whose group count is 1.
df2 = (df
       .groupBy(["col1", "col2"])
       .agg(
           F.count(F.lit(1)).alias('count'),
           F.max("col3").alias("col3"))
       .where("count = 1")
       .drop("count"))

df2.show(10, False)
Output:
+-----+----+-----+
|col1 |col2|col3 |
+-----+----+-----+
|Alice|Girl|April|
+-----+----+-----+
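A side note on the aggregation above: F.max("col3") is only there to carry Col3 through the groupBy. Because the where("count = 1") filter keeps only single-row groups, any deterministic aggregate gives the same result; for example, this hypothetical variant is equivalent:
df2 = (df
       .groupBy(["col1", "col2"])
       .agg(
           F.count(F.lit(1)).alias('count'),
           F.first("col3").alias("col3"))  # F.first instead of F.max, same output here
       .where("count = 1")
       .drop("count"))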
Response to the edited version of the question:
df = sqlContext.createDataFrame([
    ["Alice", "Girl", "April", "April"],
    ["Jean", "Boy", "Aug", "XYZ"],
    ["Jean", "Boy", "Sept", "IamBatman"]
], ["col1", "col2", "col3", "newcol"])

# Columns to group on; every other column is carried through with an aggregate.
groupingcols = ["col1", "col2"]
othercolumns = [F.max(col).alias(col) for col in df.columns if col not in groupingcols]

df2 = (df
       .groupBy(groupingcols)
       .agg(F.count(F.lit(1)).alias('count'), *othercolumns)
       .where("count = 1")
       .drop("count"))

df2.show(10, False)
Output:
+-----+----+-----+------+
|col1 |col2|col3 |newcol|
+-----+----+-----+------+
|Alice|Girl|April|April |
+-----+----+-----+------+
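As a follow-up, an alternative that avoids spelling out an aggregate for every remaining column is to count rows per (col1, col2) key with a window function and filter on that count; every other column passes through untouched. A minimal sketch, assuming the same df as above:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Count how many rows share each (col1, col2) key, then keep only the rows
# whose key occurs exactly once; no other column is aggregated or renamed.
w = Window.partitionBy("col1", "col2")
df2 = (df
       .withColumn("key_count", F.count(F.lit(1)).over(w))
       .where("key_count = 1")
       .drop("key_count"))
df2.show(10, False)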