Expanding zscore np.vectorize rather than apply

Question

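I'm trying to compute an expanding z-score for some variables within groups in pandas, and I've noticed that using apply with a custom function is quite slow.

Here's a minimal reproducible version of my data that is still representative: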

import datetime
import random
import pandas as pd
import numpy as np
import timeit


df = pd.DataFrame(
  {"date": pd.date_range(start=datetime.datetime(2000,1,1), periods=10000),
   "ids" : ["cat"]*5000 + ["dog"]*5000,
   "x" : [random.random() for _ in range(10000)],
   "y" : [random.random()*100 for _ in range(10000)],   
  }
)

>>> df.head()
        date  ids         x          y
0 2000-01-01  cat  0.947039  76.592064
1 2000-01-02  cat  0.200761  89.646584
2 2000-01-03  cat  0.305686  38.416170
3 2000-01-04  cat  0.080183  84.889605
4 2000-01-05  cat  0.258639   9.046614

Timing-wise, this is what I get using the built-in .sum():

>>> timeit.timeit('df.set_index("date").groupby("ids")[["x","y"]].expanding(5).sum().reset_index()', number=10, globals=globals())
0.12283363699680194

A z-score does a fair bit more work than a sum, so I expected it to be somewhat slower, but this is really slow:

>>> zscore = lambda x: (x.values[-1] - x.mean())/x.std()
>>> timeit.timeit('df.set_index("date").groupby("ids")[["x","y"]].expanding(5).apply(zscore).reset_index()', number=10, globals=globals())
91.14701056003105

Wow! That is roughly 740 times slower.

So while the output is what I'm looking for:

>>> df.set_index("date").groupby("ids")[["x","y"]].expanding(5).apply(zscore).reset_index()
      ids       date         x         y
0     cat 2000-01-01       NaN       NaN
1     cat 2000-01-02       NaN       NaN
2     cat 2000-01-03       NaN       NaN
3     cat 2000-01-04       NaN       NaN
4     cat 2000-01-05 -0.293887 -1.457395
...   ...        ...       ...       ...
9995  dog 2027-05-14  0.711095 -0.373902
9996  dog 2027-05-15 -0.929957  0.708371
9997  dog 2027-05-16 -1.668474  1.434254
9998  dog 2027-05-17 -0.059490 -1.721237
9999  dog 2027-05-18 -0.551626  1.015764

[10000 rows x 4 columns]

...I was hoping that vectorizing, or possibly transform, would be faster than this.

However, my naive attempt at vectorizing this code:

np.vectorize(zscore)(df.set_index("date").groupby("ids")[["x","y"]].expanding(5))

Fails with:

    f"'{type(self).__name__}' object has no attribute '{attr}'"
AttributeError: 'ExpandingGroupby' object has no attribute 'values'

How can I convert this .apply(func) into a vectorized operation?
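
For context on the error above: NumPy documents np.vectorize as a convenience wrapper that is essentially a for loop, calling the wrapped function once per element. An object that NumPy cannot interpret as an array is treated as a single element, so the function receives the object itself. The sketch below illustrates this under those assumptions; double, Opaque, and record_type are hypothetical names used only for illustration.

```python
import numpy as np

calls = []

def double(v):
    calls.append(v)  # record every scalar the wrapper hands to the function
    return v * 2.0

# otypes is given explicitly so the output dtype is fixed up front
vdouble = np.vectorize(double, otypes=[float])
result = vdouble(np.array([1.0, 2.0, 3.0]))
print(result)      # element-wise result, computed one Python call at a time
print(len(calls))  # number of Python-level calls made


class Opaque:
    """Hypothetical stand-in for an object NumPy cannot treat as an array."""


seen = []

def record_type(obj):
    seen.append(type(obj).__name__)  # note what the function actually received
    return 0.0

# The whole object arrives as one element; it is not broken into rows or windows
np.vectorize(record_type, otypes=[float])(Opaque())
print(seen)
```

This is consistent with the traceback: zscore was handed the ExpandingGroupby object itself and failed on x.values. Even in cases where it runs, np.vectorize keeps the Python-level loop, so it would not be expected to speed up a per-window function.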

Answer 1

Rework your transformation so that it is vectorized (per group):

(df.set_index("date")
   .groupby("ids")[["x","y"]]
   .transform(lambda d: (d-d.expanding(5).mean())/d.expanding(5).std())
)

Or with a function:

def expanding_zscore(d, window=5):
    return (d-d.expanding(window).mean())/d.expanding(window).std()

(df.set_index("date")
   .groupby("ids")[["x","y"]]
   .transform(expanding_zscore, window=5)
)

Output:

                   x         y
date                          
2000-01-01       NaN       NaN
2000-01-02       NaN       NaN
2000-01-03       NaN       NaN
2000-01-04       NaN       NaN
2000-01-05  0.797018  0.845773
...              ...       ...
2027-05-14 -1.216591 -0.121771
2027-05-15 -1.550736  1.191920
2027-05-16 -1.659481 -0.304257
2027-05-17  0.295209 -0.521772
2027-05-18  1.702968 -0.462038
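
Note that this output has no ids column, because transform returns a frame indexed like its input. As a sanity check, here is a minimal sketch that compares the apply-based and transform-based versions on a small seeded dataset and keeps ids by placing both keys in the index. The dataset size, the seed, and the names df_small, slow, fast, and same are assumptions made for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd

# Small seeded stand-in for the question's data (size and seed are assumptions)
rng = np.random.default_rng(0)
n = 200
df_small = pd.DataFrame({
    "date": pd.date_range("2000-01-01", periods=n),
    "ids": ["cat"] * (n // 2) + ["dog"] * (n // 2),
    "x": rng.random(n),
    "y": rng.random(n) * 100,
})

def expanding_zscore(d, window=5):
    # The first positional argument of expanding() is min_periods
    exp = d.expanding(window)
    return (d - exp.mean()) / exp.std()

def zscore(x):
    # Per-window function from the question, called once per expanding window
    return (x.values[-1] - x.mean()) / x.std()

# Slow reference: one Python call per window and column
slow = (df_small.set_index("date")
        .groupby("ids")[["x", "y"]]
        .expanding(5).apply(zscore)
        .reset_index())

# Vectorized per group; grouping on an index level keeps ids in the result
fast = (df_small.set_index(["ids", "date"])
        .groupby(level="ids")[["x", "y"]]
        .transform(expanding_zscore, window=5)
        .reset_index())

same = np.allclose(slow[["x", "y"]].to_numpy(),
                   fast[["x", "y"]].to_numpy(),
                   equal_nan=True)
print(same)  # True when the two approaches agree
print(fast.head(6))
```

Both versions use the sample standard deviation (ddof=1, the pandas default), which is why they should agree. The rows line up here because the data is already ordered by group and date; with unsorted data, sort both frames on ids and date before comparing.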
