Expanding zscore np.vectorize rather than apply
I'm trying to compute an expanding z-score in pandas for some variables within groups, and I've noticed that using apply with a custom function is quite slow.
Here is a minimal reproducible version of my data that is still representative:
import datetime
import random
import timeit

import pandas as pd
import numpy as np
df = pd.DataFrame(
{"date": pd.date_range(start=datetime.datetime(2000,1,1), periods=10000),
"ids" : ["cat"]*5000 + ["dog"]*5000,
"x" : [random.random() for _ in range(10000)],
"y" : [random.random()*100 for _ in range(10000)],
}
)
>>> df.head()
date ids x y
0 2000-01-01 cat 0.947039 76.592064
1 2000-01-02 cat 0.200761 89.646584
2 2000-01-03 cat 0.305686 38.416170
3 2000-01-04 cat 0.080183 84.889605
4 2000-01-05 cat 0.258639 9.046614
Timing things, here's what I get using the built-in .sum():
>>> timeit.timeit('df.set_index("date").groupby("ids")[["x","y"]].expanding(5).sum().reset_index()', number=10, globals=globals())
0.12283363699680194
Now, the zscore function does more work than a sum, so I expected it to be somewhat slower, but this is really slow:
>>> zscore = lambda x: (x.values[-1] - x.mean())/x.std()
>>> timeit.timeit('df.set_index("date").groupby("ids")[["x","y"]].expanding(5).apply(zscore).reset_index()', number=10, globals=globals())
91.14701056003105
Wow! That's a lot slower.
So while the output is what I'm looking for:
>>> df.set_index("date").groupby("ids")[["x","y"]].expanding(5).apply(zscore).reset_index()
ids date x y
0 cat 2000-01-01 NaN NaN
1 cat 2000-01-02 NaN NaN
2 cat 2000-01-03 NaN NaN
3 cat 2000-01-04 NaN NaN
4 cat 2000-01-05 -0.293887 -1.457395
... ... ... ... ...
9995 dog 2027-05-14 0.711095 -0.373902
9996 dog 2027-05-15 -0.929957 0.708371
9997 dog 2027-05-16 -1.668474 1.434254
9998 dog 2027-05-17 -0.059490 -1.721237
9999 dog 2027-05-18 -0.551626 1.015764
[10000 rows x 4 columns]
...I was hoping vectorization, or potentially a transform, would be faster than this.
However, my naive attempt to vectorize this code:
np.vectorize(zscore)(df.set_index("date").groupby("ids")[["x","y"]].expanding(5))
fails with:
f"'{type(self).__name__}' object has no attribute '{attr}'"
AttributeError: 'ExpandingGroupby' object has no attribute 'values'
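(For context: np.vectorize is not true vectorization; per its documentation it is essentially a Python-level for loop over scalar elements, so it can neither speed things up nor consume a pandas ExpandingGroupby object. A minimal illustration, unrelated to the DataFrame above:)

```python
import numpy as np

# np.vectorize wraps a scalar function so it broadcasts over arrays,
# but under the hood it still calls the Python function once per element.
square = np.vectorize(lambda v: v * v)
result = square(np.array([1, 2, 3]))
print(result)  # [1 4 9]
```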
How can I convert this .apply(func) into something vectorized?
Rework your transformation so it operates on whole columns per group:
(df.set_index("date")
.groupby("ids")[["x","y"]]
.transform(lambda d: (d-d.expanding(5).mean())/d.expanding(5).std())
)
Or with a named function:
def expanding_zscore(d, window=5):
    return (d - d.expanding(window).mean()) / d.expanding(window).std()
(df.set_index("date")
.groupby("ids")[["x","y"]]
.transform(expanding_zscore, window=5)
)
Output:
x y
date
2000-01-01 NaN NaN
2000-01-02 NaN NaN
2000-01-03 NaN NaN
2000-01-04 NaN NaN
2000-01-05 0.797018 0.845773
... ... ...
2027-05-14 -1.216591 -0.121771
2027-05-15 -1.550736 1.191920
2027-05-16 -1.659481 -0.304257
2027-05-17 0.295209 -0.521772
2027-05-18 1.702968 -0.462038
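As a sanity check, the column-wise transform should agree with the apply version wherever the expanding window is full, since both include the current row in the mean and std. A quick sketch (assuming a smaller DataFrame built the same way as in the question):

```python
import datetime
import random

import numpy as np
import pandas as pd

random.seed(0)
df = pd.DataFrame({
    "date": pd.date_range(start=datetime.datetime(2000, 1, 1), periods=200),
    "ids": ["cat"] * 100 + ["dog"] * 100,
    "x": [random.random() for _ in range(200)],
    "y": [random.random() * 100 for _ in range(200)],
})

# Slow path: one Python call per window per column.
zscore = lambda s: (s.values[-1] - s.mean()) / s.std()
slow = (df.set_index("date").groupby("ids")[["x", "y"]]
          .expanding(5).apply(zscore)
          .reset_index(level=0, drop=True))

# Fast path: column-wise expanding stats inside each group.
fast = (df.set_index("date").groupby("ids")[["x", "y"]]
          .transform(lambda d: (d - d.expanding(5).mean()) / d.expanding(5).std()))

# NaNs appear in the first window-1 rows of each group in both results.
print(np.allclose(slow.to_numpy(), fast.to_numpy(), equal_nan=True))
```

Both paths use the sample standard deviation (ddof=1, the pandas default), so the values match elementwise; the speedup comes from replacing the per-window Python callback with a handful of cythonized expanding operations per group.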