How to filter a dataframe by a flag within groups in PySpark

I have a df with the columns 'Household, Person, flag', and I want to filter the dataframe down to households where at least one row has the flag set. I understand the logic but am not sure how to code it. Can anyone help? For the example below, the output would drop household 2.

Logic: df = df.filter(keep all rows of a household if at least one row in that household has 'flag' == 1)

Example dataframe:
| Household| Person|flag|
| -------- | ----- | -- |
| 1        | Oliver|    |
| 1        | Jonny | 1  | 
| 2        | David |    |
| 2        | Mary  |    |
| 3        | Lizzie|    |
| 3        | Peter | 1  |

Filter to the flagged rows, select the distinct Household values, then inner-join back to the original dataframe to get the final result.

df.join(df.filter("flag = '1'").select('Household').distinct(), ['Household'], 'inner').show()

+---------+------+----+
|Household|Person|flag|
+---------+------+----+
|        1|Oliver|null|
|        1| Jonny|   1|
|        3|Lizzie|null|
|        3| Peter|   1|
+---------+------+----+