DataFrame remove rows existing in another DataFrame
I have two DataFrames:
df1:
+----------+-------------+-------------+--------------+---------------+
|customerId| fullName| telephone1| telephone2| email|
+----------+-------------+-------------+--------------+---------------+
| 201534|MARIO JIMENEZ|01722-3500391|+5215553623333|ascencio@my.com|
| 879535| MARIO LOPEZ|01722-3500377|+5215553623333| asceloe@my.com|
+----------+-------------+-------------+--------------+---------------+
df2:
+----------+-------------+-------------+--------------+---------------+
|customerId| fullName| telephone1| telephone2| email|
+----------+-------------+-------------+--------------+---------------+
| 201534|MARIO JIMENEZ|01722-3500391|+5215553623333|ascencio@my.com|
| 201536| ROBERT MITZ|01722-3500377|+5215553623333| asceloe@my.com|
| 201537| MARY ENG|01722-3500127|+5215553623111|generic1@my.com|
| 201538| RICK BURT|01722-3500983|+5215553623324|generic2@my.com|
| 201539| JHON DOE|01722-3502547|+5215553621476|generic3@my.com|
+----------+-------------+-------------+--------------+---------------+
I need a third DataFrame containing the rows of df1 that do not exist in df2.
Like this:
+----------+-------------+-------------+--------------+---------------+
|customerId| fullName| telephone1| telephone2| email|
+----------+-------------+-------------+--------------+---------------+
| 879535| MARIO LOPEZ|01722-3500377|+5215553623333| asceloe@my.com|
+----------+-------------+-------------+--------------+---------------+
What is the correct way to do this?
I have already tried the following:
diff = df2.join(df1, df2['customerId'] != df1['customerId'],"left")
diff = df1.subtract(df2)
diff = df1[~ df1['customerId'].isin(df2['customerId'])]
But none of them work. Any suggestions?
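Using PySpark:
You can use collect() to build a list of the customerId values from df2. This pulls every distinct ID onto the driver, so it only suits a df2 small enough to fit in driver memory:
from pyspark.sql.functions import col
id_df2 = [row[0] for row in df2.select('customerId').distinct().collect()]
Then filter df1 on customerId using isin with the negation operator ~:
diff = df1.where(~col('customerId').isin(id_df2))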
With pandas, you can use merge with indicator=True:
df3 = df1.merge(df2, on=df1.columns.tolist(), how='left', indicator=True)
df3 = df3[df3['_merge'] == 'left_only'].drop(columns='_merge')
Output:
>>> df3
customerId fullName telephone1 telephone2 email
1 879535 MARIO LOPEZ 01722-3500377 5215553623333 asceloe@my.com
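To try the merge approach end to end, here is a small self-contained sketch that rebuilds the sample data from the question as pandas DataFrames. Storing the telephone columns as strings and shortening df2 to two rows are assumptions for illustration. The last line shows a key-only variant using isin, which compares customerId alone rather than the whole row.

```python
import pandas as pd

columns = ["customerId", "fullName", "telephone1", "telephone2", "email"]

df1 = pd.DataFrame(
    [
        (201534, "MARIO JIMENEZ", "01722-3500391", "+5215553623333", "ascencio@my.com"),
        (879535, "MARIO LOPEZ", "01722-3500377", "+5215553623333", "asceloe@my.com"),
    ],
    columns=columns,
)

# Shortened copy of df2 from the question (first two rows only).
df2 = pd.DataFrame(
    [
        (201534, "MARIO JIMENEZ", "01722-3500391", "+5215553623333", "ascencio@my.com"),
        (201536, "ROBERT MITZ", "01722-3500377", "+5215553623333", "asceloe@my.com"),
    ],
    columns=columns,
)

# Left merge on every column; the _merge indicator marks rows found only in df1.
merged = df1.merge(df2, on=columns, how="left", indicator=True)
df3 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Key-only variant: keep df1 rows whose customerId never appears in df2.
df3_by_id = df1[~df1["customerId"].isin(df2["customerId"])]

print(df3)
```

The two variants agree on this data, but they can differ when a customerId appears in both frames with different values in the other columns: the merge keeps such a row, while the isin filter drops it.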