groupby result is inconsistent in polars
The user guide has this example:
    from datetime import date

    import polars as pl

    def compute_age() -> pl.Expr:
        # Age as of 2021, derived from the year of the "birthday" column
        return date(2021, 1, 1).year - pl.col("birthday").dt.year()

    def avg_birthday(gender: str) -> pl.Expr:
        # Mean age within each group, restricted to rows of the given gender
        return compute_age().filter(
            pl.col("gender") == gender
        ).mean().alias(f"avg {gender} birthday")

    q = (
        datasetn.lazy()
        .groupby(["state"])
        .agg(
            [
                avg_birthday("M"),
                avg_birthday("F"),
                (pl.col("gender") == "M").count().alias("# male"),
                (pl.col("gender") == "F").sum().alias("# female"),
            ]
        )
    )
    df = q.collect()
    df
The results are inconsistent.
For example, on the first run:
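state | avg M birthday | avg F birthday | # male | # female |
---|---|---|---|---|
str | f64 | f64 | u32 | u32 |
ME | 58.0 | 67.5 | 4 | 2 |
AZ | 60.375 | 59.666667 | 11 | 3 |
VT | 78.333333 | null | 3 | 0 |
GU | 40.0 | null | 1 | 0 |
KS | 54.2 | 41.0 | 6 | 1 |
LA | 58.0 | 40.0 | 8 | 1 |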
And on the second run:
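state | avg M birthday | avg F birthday | # male | # female |
---|---|---|---|---|
str | f64 | f64 | u32 | u32 |
NC | 56.181818 | 69.0 | 15 | 4 |
MA | 60.0 | 56.25 | 11 | 4 |
CO | 57.428571 | 49.5 | 9 | 2 |
IA | 70.0 | 52.75 | 6 | 4 |
CA | 57.323529 | 67.75 | 54 | 20 |
ME | 58.0 | 67.5 | 4 | 2 |
NV | 55.5 | 61.75 | 6 | 4 |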
I guess this might be caused by parallelism?
Is this a bug or a feature?
How can I keep the results consistent?
Use maintain_order=True on groupby.
maintain_order: Make sure that the order of the groups remain consistent. This is more expensive than a default groupby. Note that this only works in expression aggregations.
(By the way, I'm not sure where you got the squeeze=True parameter in your post.)
.groupby(["state"], squeeze=True)