groupby result is inconsistent in polars

In the user guide, there is an example:

from datetime import date

import polars as pl

def compute_age() -> pl.Expr:
    return date(2021, 1, 1).year - pl.col("birthday").dt.year()

def avg_birthday(gender: str) -> pl.Expr:
    return compute_age().filter(
            pl.col("gender") == gender
        ).mean().alias(f"avg {gender} birthday")


q = (
    datasetn.lazy()
    .groupby(["state"])
    .agg(
        [
            avg_birthday("M"), 
            avg_birthday("F"),
            (pl.col("gender") == "M").count().alias("# male"), 
            (pl.col("gender") == "F").sum().alias("# female"),
        ]
    )
)
df = q.collect()
df

The results are inconsistent. For example, on the first run:

state  avg M birthday  avg F birthday  # male  # female
str    f64             f64             u32     u32
ME     58.0            67.5            4       2
AZ     60.375          59.666667       11      3
VT     78.333333       null            3       0
GU     40.0            null            1       0
KS     54.2            41.0            6       1
LA     58.0            40.0            8       1

And on the second run:

state  avg M birthday  avg F birthday  # male  # female
str    f64             f64             u32     u32
NC     56.181818       69.0            15      4
MA     60.0            56.25           11      4
CO     57.428571       49.5            9       2
IA     70.0            52.75           6       4
CA     57.323529       67.75           54      20
ME     58.0            67.5            4       2
NV     55.5            61.75           6       4

I guess this might be caused by parallelism? Is this a bug or a feature? How can I keep the results consistent?

Use maintain_order=True on the groupby:

maintain_order: Make sure that the order of the groups remain consistent. This is more expensive than a default groupby. Note that this only works in expression aggregations.

(As an aside, I'm not sure where you got the squeeze=True parameter in your post.)

.groupby(["state"], squeeze=True)