Lazy filter depending on the previous line (Polars Python)

Question

I am using Python Polars and I have a table like this:

Column1	Column2
id1	1
id1	1
id1	2
id1	1
id1	1
id1	2
id1	3

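Using the Polars lazy API, I want to keep only the rows where the Column2 value differs from the Column2 value of the previous row. After the operation, the result should look like this: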

Column1	Column2
id1	1
id1	2
id1	1
id1	2
id1	3

Thanks!

Answer 1

Use the shift expression.

import polars as pl

df = pl.DataFrame(
    {"Column1": ["id1"] * 7, "Column2": [1, 1, 2, 1, 1, 2, 3]}).lazy()

df.filter(pl.col("Column2") != pl.col("Column2").shift(periods=1)).collect()

shape: (5, 2)
┌─────────┬─────────┐
│ Column1 ┆ Column2 │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ id1     ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 3       │
└─────────┴─────────┘

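You can find the documentation for these options here: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.shift.html#polars.Expr.shift

Note that the periods argument controls both the direction and the distance of the shift: a negative value shifts the other way.

There is also a variant, shift_and_fill, that fills the None values created by the shift.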

Answer 2

Let me elaborate on how shift and shift_and_fill work. Using them comes down to strategy (and knowing your data).

Using shift

Let's start with this dataset:

import polars as pl
df = pl.DataFrame({"row_num": range(1, 8),
                   "Column2": [1, 2, 3, 3, 4, 5, 4]}).lazy()
df.collect()

shape: (7, 2)
┌─────────┬─────────┐
│ row_num ┆ Column2 │
│ ---     ┆ ---     │
│ i64     ┆ i64     │
╞═════════╪═════════╡
│ 1       ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       │
└─────────┴─────────┘

Now, let's create intermediate columns to see how these functions work.

(df
    .with_column(pl.col("Column2").shift().alias("Column2_shifted"))
    .with_column((pl.col("Column2") != pl.col("Column2_shifted")).alias("not_eq_result"))
).collect()

shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ ---     ┆ ---     ┆ ---             ┆ ---           │
│ i64     ┆ i64     ┆ i64             ┆ bool          │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1       ┆ null            ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       ┆ 1               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       ┆ 2               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       ┆ 3               ┆ false         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       ┆ 3               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       ┆ 4               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       ┆ 5               ┆ true          │
└─────────┴─────────┴─────────────────┴───────────────┘

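Notice that Column2_shifted has a null (effectively None) value in the first row.

More importantly, the result of (pl.col("Column2") != pl.col("Column2_shifted")) is True for the first row.

So, as long as null values are not allowed in Column2, the first row will be included. You do not need to concatenate the first row of the dataset to your result separately.

Note: in practice you do not need these intermediate columns. You can simply use .filter(pl.col("Column2") != pl.col("Column2").shift()). The intermediate columns are here for explanation only.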

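Using shift_and_fill

If None/null values are allowed in Column2, you can use shift_and_fill instead and choose a fill_value that is not allowed in Column2.

For example, if you know that negative numbers cannot appear in Column2, you can use this logic.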

(df
    .with_column(pl.col("Column2").shift_and_fill(periods=1, fill_value=-1).alias("Column2_shifted"))
    .with_column((pl.col("Column2") != pl.col("Column2_shifted")).alias("not_eq_result"))
).collect()

shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ ---     ┆ ---     ┆ ---             ┆ ---           │
│ i64     ┆ i64     ┆ i64             ┆ bool          │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1       ┆ -1              ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       ┆ 1               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       ┆ 2               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       ┆ 3               ┆ false         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       ┆ 3               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       ┆ 4               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       ┆ 5               ┆ true          │
└─────────┴─────────┴─────────────────┴───────────────┘

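With this strategy, the first row is always included, without concatenating the first row to your result separately. That is because you deliberately chose a fill_value that will never match any value in Column2.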

Adding is_first to the expression

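If you are not sure which values are allowed in Column2 (possibly even None), then I suggest adding is_first to your expression (rather than concatenating the first row to your result dataset):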

(df
    .with_column(pl.col("Column2").shift().alias("Column2_shifted"))
    .with_column((pl.col("Column2").is_first() | (pl.col("Column2") != pl.col("Column2_shifted"))).alias("not_eq_result"))
).collect()

shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ ---     ┆ ---     ┆ ---             ┆ ---           │
│ i64     ┆ i64     ┆ i64             ┆ bool          │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1       ┆ null            ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       ┆ 1               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       ┆ 2               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       ┆ 3               ┆ false         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       ┆ 3               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       ┆ 4               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       ┆ 5               ┆ true          │
└─────────┴─────────┴─────────────────┴───────────────┘

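This forces your first row to evaluate to True, simply because it is the first row. (Be very careful with the nested parentheses in the expression, otherwise you may not get the result you expect.)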

Does that help clarify things?

python-polars