How to filter a dataframe by a grouped second column in PySpark
I have a df with the columns 'household, person, flag', and I want to filter the dataframe down to the households that contain at least one flag. I understand the logic but am not sure how to code it; can anyone help? For the example below, the output would drop household 2.
Logic:
df = df.filter(all rows in households where at least one row in that household has flag == 1)
Example dataframe:
| Household| Person|flag|
| -------- | ----- | -- |
| 1 | Oliver| |
| 1 | Jonny | 1 |
| 2 | David | |
| 2 | Mary | |
| 3 | Lizzie| |
| 3 | Peter | 1 |
Filter (and deduplicate) to get the wanted Household values, then inner join back to the original dataframe to get the final result.
df.join(df.filter("flag = '1'").select('Household').distinct(), ['Household'], 'inner').show()
+---------+------+----+
|Household|Person|flag|
+---------+------+----+
| 1|Oliver|null|
| 1| Jonny| 1|
| 3|Lizzie|null|
| 3| Peter| 1|
+---------+------+----+