Expanding z-score: np.vectorize rather than apply

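I am trying to compute an expanding z-score for some variables within groups in pandas, and I have noticed that using apply with a custom function is quite slow.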

Here is a minimal reproducible version of my data that is still representative:

import datetime
import random  # needed for random.random() below
import timeit

import numpy as np
import pandas as pd


# 10,000 daily rows split into two groups of 5,000
df = pd.DataFrame(
  {"date": pd.date_range(start=datetime.datetime(2000,1,1), periods=10000),
   "ids" : ["cat"]*5000 + ["dog"]*5000,
   "x" : [random.random() for _ in range(10000)],
   "y" : [random.random()*100 for _ in range(10000)],
  }
)

>>> df.head()
        date  ids         x          y
0 2000-01-01  cat  0.947039  76.592064
1 2000-01-02  cat  0.200761  89.646584
2 2000-01-03  cat  0.305686  38.416170
3 2000-01-04  cat  0.080183  84.889605
4 2000-01-05  cat  0.258639   9.046614

Timing-wise, this is what I get using the built-in .sum():

>>> timeit.timeit('df.set_index("date").groupby("ids")[["x","y"]].expanding(5).sum().reset_index()', number=10, globals=globals())
0.12283363699680194

The zscore function does a fair bit more work than a sum, so I expected it to be somewhat slower, but this is really slow:

>>> zscore = lambda x: (x.values[-1] - x.mean())/x.std()
>>> timeit.timeit('df.set_index("date").groupby("ids")[["x","y"]].expanding(5).apply(zscore).reset_index()', number=10, globals=globals())
91.14701056003105

Wow! That is a lot slower.

So while the output is what I am looking for:

>>> df.set_index("date").groupby("ids")[["x","y"]].expanding(5).apply(zscore).reset_index()
      ids       date         x         y
0     cat 2000-01-01       NaN       NaN
1     cat 2000-01-02       NaN       NaN
2     cat 2000-01-03       NaN       NaN
3     cat 2000-01-04       NaN       NaN
4     cat 2000-01-05 -0.293887 -1.457395
...   ...        ...       ...       ...
9995  dog 2027-05-14  0.711095 -0.373902
9996  dog 2027-05-15 -0.929957  0.708371
9997  dog 2027-05-16 -1.668474  1.434254
9998  dog 2027-05-17 -0.059490 -1.721237
9999  dog 2027-05-18 -0.551626  1.015764

[10000 rows x 4 columns]

...I would hope that vectorizing it, or perhaps using a transform, could be faster than this.

However, my naive attempt to vectorize this code:

np.vectorize(zscore)(df.set_index("date").groupby("ids")[["x","y"]].expanding(5))

Fails with:

    f"'{type(self).__name__}' object has no attribute '{attr}'"
AttributeError: 'ExpandingGroupby' object has no attribute 'values'

How can I convert this .apply(func) into a vectorized form?

Rework your transformation so that it is vectorized (per group):

(df.set_index("date")
   .groupby("ids")[["x","y"]]
   .transform(lambda d: (d-d.expanding(5).mean())/d.expanding(5).std())
)

Or with a function:

def expanding_zscore(d, window=5):
    return (d-d.expanding(window).mean())/d.expanding(window).std()

(df.set_index("date")
   .groupby("ids")[["x","y"]]
   .transform(expanding_zscore, window=5)
)

Output:

                   x         y
date                          
2000-01-01       NaN       NaN
2000-01-02       NaN       NaN
2000-01-03       NaN       NaN
2000-01-04       NaN       NaN
2000-01-05  0.797018  0.845773
...              ...       ...
2027-05-14 -1.216591 -0.121771
2027-05-15 -1.550736  1.191920
2027-05-16 -1.659481 -0.304257
2027-05-17  0.295209 -0.521772
2027-05-18  1.702968 -0.462038
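
Because the data in the question is random, the numbers printed here will not match the ones in the question. As a rough sanity check, the sketch below compares the transform approach against the original apply on a small seeded frame, and then puts the ids column back so the result has the same layout as the output in the question. The frame size, the seed, and the names rng, slow, fast and out are assumptions made for illustration. It also assumes the frame is already sorted by ids and then by date, as in the question, so that both results come back in the same row order.

```python
import numpy as np
import pandas as pd

# Small seeded frame with the same layout as in the question (size and seed are arbitrary).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(
    {
        "date": pd.date_range("2000-01-01", periods=n),
        "ids": ["cat"] * (n // 2) + ["dog"] * (n // 2),
        "x": rng.random(n),
        "y": rng.random(n) * 100,
    }
)


def expanding_zscore(d, window=5):
    # Current value minus the expanding mean, divided by the expanding std.
    # expanding(window) yields NaN until `window` observations are available.
    return (d - d.expanding(window).mean()) / d.expanding(window).std()


grouped = df.set_index("date").groupby("ids")[["x", "y"]]

# Slow reference from the question: a Python-level call for every window.
zscore = lambda x: (x.values[-1] - x.mean()) / x.std()
slow = grouped.expanding(5).apply(zscore)  # indexed by (ids, date)

# Fast version: a handful of built-in expanding calls per group.
fast = grouped.transform(expanding_zscore, window=5)  # indexed by date only

# Both results are in the same row order here, so compare the raw values.
print(np.allclose(slow.to_numpy(), fast.to_numpy(), equal_nan=True))

# transform keeps the original row order, so ids can be put back positionally.
out = fast.reset_index()
out.insert(0, "ids", df["ids"].to_numpy())
print(out.head(6))
```

Both versions use the sample standard deviation (ddof=1), which is the default for Series.std() and for the expanding std(), so the values should agree up to floating-point error.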