How does Pandas compute exponential moving averages under the hood?

Question

I am trying to compare the performance of the pandas EMA against a numba implementation.

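In general, I don't write a function if pandas already has it built in, because pandas is always faster than my slow hand-written Python functions; for example quantile, sort values, etc. I believe this is because much of pandas is coded in C under the hood, and because the pandas .apply() method is much faster than an explicit Python for loop thanks to vectorization (but I am open to an explanation if this is not true). Here, however, I find that for computing EMAs, numba far outperforms pandas.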

The EMA I have coded is defined by

S_t = Y_1, t = 1

S_t = alpha*Y_t + (1 - alpha)*S_{t-1}, t > 1

where Y_t is the value of the time series at time t, S_t is the value of the moving average at time t, and alpha is the smoothing parameter.

The code is as follows

from numba import jit
import pandas as pd
import numpy as np

@jit
def ewm(arr, alpha):
    """
    Calculate the EMA of an array arr
    :param arr: numpy array of floats
    :param alpha: float between 0 and 1
    :return: numpy array of floats
    """
    # initialise ewm_arr
    ewm_arr = np.zeros_like(arr)
    ewm_arr[0] = arr[0]
    for t in range(1,arr.shape[0]):
        ewm_arr[t] = alpha*arr[t] + (1 - alpha)*ewm_arr[t-1]

    return ewm_arr

# initialize array and dataframe randomly
a = np.random.random(10000)
df = pd.DataFrame(a)

%timeit df.ewm(com=0.5, adjust=False).mean()
>>> 1000 loops, best of 3: 1.77 ms per loop

%timeit ewm(a, 0.5)
>>> 10000 loops, best of 3: 34.8 µs per loop

We see that the hand-coded ewm function is around 50 times faster than the pandas ewm method.

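It may be that numba also outperforms various other pandas methods, depending on how one codes the function. But here I am interested in how numba outperforms pandas in calculating exponential moving averages. What is pandas doing (or not doing) that makes it slow, or is numba just extremely fast in this case? How does pandas compute EMAs under the hood?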

Answer 1

But here I am interested in how numba outperforms Pandas in calculating exponential moving averages.

It seems that your version is faster simply because you are passing it a NumPy array rather than a Pandas data structure:

>>> s = pd.Series(np.random.random(10000))

>>> %timeit ewm(s, alpha=0.5)
82 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit ewm(s.values, alpha=0.5)
26 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit s.ewm(alpha=0.5).mean()
852 µs ± 5.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

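Generally, this is the same story as comparing NumPy with Pandas operations. The latter is built on top of the former and almost always trades speed for flexibility. (That said, Pandas is still fast and has come to rely more on Cython operations over time.) I'm not sure what specifically about numba/jit makes it perform better on NumPy arrays. But if you compare the two functions using a Pandas Series, Pandas itself is faster.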
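One caveat when comparing timings is that the two calls should compute the same quantity. The sketch below is a minimal check, not part of the original benchmark: `ewm_plain` is a hypothetical un-jitted copy of the question's recursion, and it is compared against pandas with `adjust=False` and with the default `adjust=True`.

```python
import numpy as np
import pandas as pd

def ewm_plain(arr, alpha):
    """Hypothetical un-jitted copy of the question's recursion, for value checks only."""
    out = np.empty(len(arr), dtype=float)
    out[0] = arr[0]
    for t in range(1, len(arr)):
        out[t] = alpha * arr[t] + (1 - alpha) * out[t - 1]
    return out

rng = np.random.default_rng(0)
s = pd.Series(rng.random(1000))

recursive = ewm_plain(s.to_numpy(), 0.5)
unadjusted = s.ewm(alpha=0.5, adjust=False).mean().to_numpy()
adjusted = s.ewm(alpha=0.5).mean().to_numpy()  # adjust=True is the pandas default

# Compare the hand-written recursion with both pandas variants.
matches_unadjusted = np.allclose(recursive, unadjusted)
matches_adjusted = np.allclose(recursive, adjusted)
print(matches_unadjusted, matches_adjusted)
```

The recursion S_t = alpha*Y_t + (1 - alpha)*S_{t-1} corresponds to `adjust=False`; with the default `adjust=True`, pandas normalizes by the sum of the weights, so the early values differ.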

How does Pandas compute EMAs under the hood?

When you call df.ewm() (without yet calling a method such as .mean() or .cov()), the intermediate result is a bona fide class, EWM, found in pandas/core/window.py.

>>> ewm = pd.DataFrame().ewm(alpha=0.1)
>>> type(ewm)
<class 'pandas.core.window.EWM'>

Whether you pass com, span, halflife, or alpha, Pandas maps this back to a com and uses that.
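To make that mapping concrete, here is a minimal sketch based on the relationships given in the pandas documentation (alpha = 1/(1 + com), alpha = 2/(span + 1), alpha = 1 - exp(ln(0.5)/halflife)). The helper names `to_com` and `to_alpha` are hypothetical and are not the pandas internals.

```python
import math

def to_com(com=None, span=None, halflife=None, alpha=None):
    """Hypothetical helper: convert exactly one decay parameter to a center of mass."""
    given = [p for p in (com, span, halflife, alpha) if p is not None]
    if len(given) != 1:
        raise ValueError("pass exactly one of com, span, halflife, alpha")
    if com is not None:
        return float(com)
    if span is not None:
        # alpha = 2 / (span + 1)  =>  com = (span - 1) / 2
        return (span - 1) / 2.0
    if halflife is not None:
        # alpha = 1 - exp(ln(0.5) / halflife)
        alpha = 1 - math.exp(math.log(0.5) / halflife)
    # alpha = 1 / (1 + com)  =>  com = (1 - alpha) / alpha
    return (1 - alpha) / alpha

def to_alpha(com):
    """Smoothing factor implied by a center of mass."""
    return 1.0 / (1.0 + com)

print(to_com(span=3), to_com(alpha=0.5), to_alpha(0.5))
```

Note that com and alpha are not interchangeable: com=0.5 implies alpha = 2/3, so a call with com=0.5 and a call with alpha=0.5 apply different amounts of smoothing.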

When you call the method itself, such as ewm.mean(), this maps to ._apply(), which in this case serves as a router to the appropriate Cython function:

cfunc = getattr(_window, func, None)

In the case of .mean(), func is "ewma". _window is the Cython module pandas/_libs/window.pyx.

That brings you to the heart of things, the function ewma(), which is where the bulk of the work takes place:

weighted_avg = ((old_wt * weighted_avg) +
                (new_wt * cur)) / (old_wt + new_wt)
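For readers who want to see how an update of this shape can sit inside a loop, the following is a simplified pure-Python sketch. It is my reading of the recursion, not a copy of the pandas source: it assumes a non-empty input with no NaN values, ignores min_periods, and the name `ewma_sketch` and the `adjust` handling around the update line are assumptions.

```python
import numpy as np

def ewma_sketch(vals, com, adjust=True):
    """Simplified sketch of the weighted-average recursion (no NaN/min_periods handling)."""
    alpha = 1.0 / (1.0 + com)
    old_wt_factor = 1.0 - alpha
    # With adjust=True each new point gets weight 1 and old weights decay;
    # with adjust=False the new point gets weight alpha.
    new_wt = 1.0 if adjust else alpha

    out = np.empty(len(vals), dtype=float)
    weighted_avg = float(vals[0])
    out[0] = weighted_avg
    old_wt = 1.0
    for i in range(1, len(vals)):
        cur = float(vals[i])
        old_wt *= old_wt_factor  # decay the accumulated weight
        weighted_avg = ((old_wt * weighted_avg) +
                        (new_wt * cur)) / (old_wt + new_wt)
        if adjust:
            old_wt += new_wt  # running sum of all weights so far
        else:
            old_wt = 1.0      # reset, giving the plain recursive form
        out[i] = weighted_avg
    return out

vals = np.array([1.0, 2.0, 4.0, 8.0])
adj = ewma_sketch(vals, com=1.0, adjust=True)     # alpha = 0.5
unadj = ewma_sketch(vals, com=1.0, adjust=False)
print(adj, unadj)
```

With adjust=False the update reduces to S_t = alpha*Y_t + (1 - alpha)*S_{t-1}, the form used in the question; with adjust=True it is the weighted sum of all past values divided by the sum of their weights.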

If you want a fairer comparison, call this function directly with the underlying NumPy values:

>>> from pandas._libs.window import ewma
>>> %timeit ewma(s.values, 0.4, 0, 0, 0)
513 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(Remember, it takes only a com; to find that, you can use pandas.core.window._get_center_of_mass().)
