Pyspark: remove duplicates based on 2 columns
I have the following df in pyspark:
+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname| ncf| date|salary|
+---------+----------+--------+-----+----------+------+
| James| | V|36636|2021-09-03| 3000| remove
| Michael| Rose| |40288|2021-09-10| 4000|
| Robert| |Williams|42114|2021-08-03| 4000|
| Maria| Anne| Jones|39192|2021-05-13| 4000|
| Jen| Mary| Brown| |2020-09-03| -1|
| James| | Smith|36636|2021-09-03| 3000| remove
| James| | Smith|36636|2021-09-04| 3000|
+---------+----------+--------+-----+----------+------+
I need to remove every row whose ncf and date values both match another row's. The resulting df would be:
+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname| ncf| date|salary|
+---------+----------+--------+-----+----------+------+
| Michael| Rose| |40288|2021-09-10| 4000|
| Robert| |Williams|42114|2021-08-03| 4000|
| Maria| Anne| Jones|39192|2021-05-13| 4000|
| Jen| Mary| Brown| |2020-09-03| -1|
| James| | Smith|36636|2021-09-04| 3000|
+---------+----------+--------+-----+----------+------+
You can use a window function to count, per (ncf, date) group, whether two or more rows share the same values:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
# without an orderBy, count('*') spans the whole partition
df.withColumn('duplicated', F.count('*').over(W.partitionBy('ncf', 'date')) > 1)
# +---------+----------+--------+-----+----------+------+----------+
# |firstname|middlename|lastname| ncf| date|salary|duplicated|
# +---------+----------+--------+-----+----------+------+----------+
# | Jen| Mary| Brown| |2020-09-03| -1| false|
# | James| | V|36636|2021-09-03| 3000| true|
# | James| | Smith|36636|2021-09-03| 3000| true|
# | Michael| Rose| |40288|2021-09-10| 4000| false|
# | Robert| |Williams|42114|2021-08-03| 4000| false|
# | James| | Smith|36636|2021-09-04| 3000| false|
# | Maria| Anne| Jones|39192|2021-05-13| 4000| false|
# +---------+----------+--------+-----+----------+------+----------+
You can now use the duplicated column to filter rows as needed.
The dropDuplicates method removes duplicates based on a subset of columns. Note, however, that it keeps one row from each group of duplicates, so unlike the desired output above, one of the (36636, 2021-09-03) rows would remain:
df.dropDuplicates(['ncf', 'date'])