Filter list using another list as a boolean mask in polars

Question

I have a polars DataFrame with two columns, both of which are lists.

import polars as pl

df = pl.DataFrame({
    'a': [[True, False], [False, True]],
    'b': [['name1', 'name2'], ['name3', 'name4']]
})
df
shape: (2, 2)
┌───────────────┬────────────────────┐
│ a             ┆ b                  │
│ ---           ┆ ---                │
│ list[bool]    ┆ list[str]          │
╞═══════════════╪════════════════════╡
│ [true, false] ┆ ["name1", "name2"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [false, true] ┆ ["name3", "name4"] │
└───────────────┴────────────────────┘

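I want to filter column b using column a as a boolean mask. The length of each list in column a is always the same as the length of the corresponding list in column b.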
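
To pin down the intended result, here is a minimal plain-Python sketch of the per-row masking, independent of polars. The helper name `mask_lists` is hypothetical and only illustrates the semantics; it is not a polars API.

```python
from itertools import compress

def mask_lists(masks, values):
    """Keep values[i][j] wherever masks[i][j] is True (hypothetical helper)."""
    return [list(compress(vals, mask)) for mask, vals in zip(masks, values)]

a = [[True, False], [False, True]]
b = [["name1", "name2"], ["name3", "name4"]]
result = mask_lists(a, b)
print(result)  # [['name1'], ['name4']]
```

The goal is to get the same per-row result as a polars list column, without leaving the DataFrame.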

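I can think of using explode, then filtering, aggregating, and performing a join, but in some cases a join column is not available, and I would rather avoid this approach for simplicity.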

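Is there any other way to filter a list using another list as a boolean mask? I have tried .arr.eval, but it does not seem to accept operations involving other columns.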

Any help would be appreciated!

Answer 1

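This is not the most ideal solution, because we first have to reshape the data: each list is exploded into its elements, with a row count marking which original list each element came from. We then group by that row count and apply the filter within each group.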

df = pl.DataFrame({
    'a': [[True, False], [False, True]],
    'b': [['name1', 'name2'], ['name3', 'name4']]
})

(df.with_row_count()
   .explode(["a", "b"])
   .groupby("row_nr")
   .agg([
       pl.col("b").filter(pl.col("a"))
   ])
)


shape: (2, 2)
┌────────┬───────────┐
│ row_nr ┆ b         │
│ ---    ┆ ---       │
│ u32    ┆ list[str] │
╞════════╪═══════════╡
│ 1      ┆ ["name4"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0      ┆ ["name1"] │
└────────┴───────────┘

Maybe we can come up with something better in polars. It would be nice if arr.eval could access other columns. TBD!

Edit 02-06-2022

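As of polars-0.13.41, this is not as expensive as you might think. Polars knows that the row count is sorted and keeps it sorted throughout the query. Exploding the list columns is also free.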

When polars knows that the groupby key is sorted, the groupby operation is roughly 15x faster.

In the query above, you only pay for:

exploding the row count
grouping by the sorted key (super fast)
traversing the lists (a cost we have to pay anyway).
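
The three costs above can be illustrated with a plain-Python sketch of the same explode, group, filter pipeline. This is not how polars is implemented; it only shows why a sorted key helps: `itertools.groupby` groups consecutive equal keys in a single pass without hashing, which is valid here only because the exploded row numbers are already in sorted order. The function name `filter_by_mask` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def filter_by_mask(masks, values):
    # "Explode": one (row_nr, mask, value) record per list element.
    # Row numbers come out in sorted order by construction.
    exploded = [
        (row_nr, m, v)
        for row_nr, (mask, vals) in enumerate(zip(masks, values))
        for m, v in zip(mask, vals)
    ]
    # Group on the sorted key: consecutive runs, one pass, no hash table.
    return {
        row_nr: [v for _, m, v in group if m]
        for row_nr, group in groupby(exploded, key=itemgetter(0))
    }

out = filter_by_mask(
    [[True, False], [False, True]],
    [["name1", "name2"], ["name3", "name4"]],
)
print(out)  # {0: ['name1'], 1: ['name4']}
```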

To make sure it runs fast, you can run the query with POLARS_VERBOSE=1. This writes the following text to stderr:

could fast explode column a
could fast explode column b
keys/aggregates are not partitionable: running default HASH AGGREGATION
groupby keys are sorted; running sorted key fast path
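
If setting the variable in the shell is inconvenient, it can also be set from within the Python script. The sketch below assumes polars reads `POLARS_VERBOSE` from the process environment when the query runs; setting it before importing polars is the conservative choice.

```python
import os

# Assumption: polars picks this up from the process environment.
# Set it before importing polars and before running the query.
os.environ["POLARS_VERBOSE"] = "1"

# ... import polars and run the query from the answer here ...
print(os.environ["POLARS_VERBOSE"])  # 1
```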
