How to find the complement of a DataFrame with respect to another DataFrame?

join
pyspark

I want to find all rows in df1 whose id does not appear in df2. In pandas I can do this with the following code:

df1.merge(df2, on='id', how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']

How can I do the same in PySpark?

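Use a `left_anti` join. It keeps only the rows of the left DataFrame whose join key has no match in the right DataFrame, and returns only the left DataFrame's columns.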

`df1`

df1 = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
    (2, 'd'),
    (2, 'e'),
    (3, 'f'),
], ['id', 'col'])

+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+

`df2`

df2 = spark.createDataFrame([
    (1, 'a'),
    (1, 'b'),
    (1, 'c'),
], ['id', 'col'])

+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
|  1|  c|
+---+---+

`left_anti` join

df1.join(df2, on=['id'], how='left_anti').show()

+---+---+
| id|col|
+---+---+
|  2|  d|
|  2|  e|
|  3|  f|
+---+---+
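
To make the semantics concrete, here is a minimal pure-Python sketch of what a left anti join computes on the same data. The helper `left_anti_join` and the `df1_rows` / `df2_rows` lists are hypothetical names used only for illustration and are not part of PySpark. The sketch assumes SQL-style null handling, where a null key never matches anything.

```python
def left_anti_join(left_rows, right_rows, key):
    """Return the rows of left_rows whose key value has no match in right_rows.

    Illustrative helper only (not a PySpark API). None keys are treated like
    SQL NULLs: they never match, so left rows with a None key are kept.
    """
    # Collect the non-null key values present on the right side.
    right_keys = {row[key] for row in right_rows if row[key] is not None}
    # Keep left rows (duplicates included) whose key is absent from the right.
    return [row for row in left_rows if row[key] not in right_keys]


# Same data as df1 and df2 above, as plain lists of dicts.
df1_rows = [
    {"id": 1, "col": "a"},
    {"id": 1, "col": "b"},
    {"id": 1, "col": "c"},
    {"id": 2, "col": "d"},
    {"id": 2, "col": "e"},
    {"id": 3, "col": "f"},
]
df2_rows = [
    {"id": 1, "col": "a"},
    {"id": 1, "col": "b"},
    {"id": 1, "col": "c"},
]

result = left_anti_join(df1_rows, df2_rows, "id")
for row in result:
    print(row)
```

Only the rows with id 2 and 3 remain, matching the PySpark output above. Note that the match is on `id` alone, so the `col` values in df2 play no role.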