How do I subtract N specific rows from a PySpark DataFrame?

Question

I have a dataframe, nsdf, and I want to sample 5% of it. nsdf looks like this:

col1
8
7
7
8
7
8
8
7
(... and so on)

I sample nsdf like this:

sdf = nsdf.sample(0.05)

I then want to remove the rows in sdf from nsdf. I thought I could use nsdf.subtract(sdf), but that removes every row in nsdf that matches any row in sdf. For example, if sdf contains

col1
7
8

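then every row in nsdf will be removed, because they are all either 7 or 8. Is there a way to remove only as many 7s/8s (or whatever other values) as appear in sdf? More specifically, in this example, I want to end up with an nsdf that contains the same data but with one fewer 7 and one fewer 8.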

Answer 1

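subtract removes all instances of a row from the left dataframe if that row is present in the right dataframe. What you are looking for is the behavior of exceptAll, which removes only as many copies of each row as appear in the right dataframe.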
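To make the difference concrete without a Spark session, the two semantics can be modelled in plain Python. This is only an illustrative sketch: subtract_model and except_all_model are hypothetical helper names, not PySpark functions, and rows are simplified to single integers.

```python
from collections import Counter


def subtract_model(left, right):
    """Model of set-style difference: any value present in `right`
    is removed from `left` entirely, and the result is deduplicated."""
    return sorted(set(left) - set(right))


def except_all_model(left, right):
    """Model of multiset difference: each value in `right` cancels
    at most one matching copy in `left`."""
    remaining = Counter(left) - Counter(right)  # keeps only positive counts
    return sorted(remaining.elements())


left = [7, 8, 7, 8]

print(subtract_model(left, [7, 8]))       # every 7 and 8 is gone
print(except_all_model(left, [7, 8]))     # one 7 and one 8 remain
print(except_all_model(left, [7, 7, 8]))  # only one 8 remains
```

The multiset model is the one that matches what the question asks for: the number of copies removed equals the number of copies in the right-hand side.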

Example:

Data setup

df = spark.createDataFrame([(7,), (8,), (7,), (8,)], ("col1", ))


df1 = spark.createDataFrame([(7,), (8,)], ("col1", ))

df.exceptAll(df1).show()

+----+
|col1|
+----+
|   7|
|   8|
+----+

df2 = spark.createDataFrame([(7,), (7,), (8,)], ("col1", ))

df.exceptAll(df2).show()

+----+
|col1|
+----+
|   8|
+----+