Why does pandas rolling use a single-dimension ndarray

Question

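I was motivated to use the pandas rolling feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected that I could use apply after df.rolling(2), take the resulting pd.DataFrame, extract the ndarray with .values, and perform the requisite matrix multiplication. It didn't work out that way.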

Here is what I found:

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
X = np.random.rand(2, 1).round(2)

What the objects look like:

print "\ndf = \n", df
print "\nX = \n", X
print "\ndf.shape =", df.shape, ", X.shape =", X.shape

df = 
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

X = 
[[ 0.93]
 [ 0.83]]

df.shape = (5, 2) , X.shape = (2L, 1L)

Matrix multiplication behaves as normal:

df.values.dot(X)

array([[ 0.7495],
       [ 0.8179],
       [ 0.4444],
       [ 1.4711],
       [ 1.3562]])

Using apply to perform a row-by-row dot product behaves as expected:

df.apply(lambda x: x.values.dot(X)[0], axis=1)

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

Groupby -> Apply behaves as I'd expect:

df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

But when I run:

df.rolling(1).apply(lambda x: x.values.dot(X))

I get:

AttributeError: 'numpy.ndarray' object has no attribute 'values'

OK, so pandas is using a plain ndarray within its rolling implementation. I can handle this. Instead of using .values to get the ndarray, let's try:

df.rolling(1).apply(lambda x: x.dot(X))

shapes (1,) and (2,1) not aligned: 1 (dim 0) != 2 (dim 0)

Wait! What?!

So I created a custom function to see what rolling is doing.

def print_type_sum(x):
    print type(x), x.shape
    return x.sum()

Then ran:

print df.rolling(1).apply(print_type_sum)

<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

My resulting pd.DataFrame is the same, which is good. But it printed out 10 single-dimensional ndarray objects. What about rolling(2)?

print df.rolling(2).apply(print_type_sum)

<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
      A     B
0   NaN   NaN
1  0.90  0.88
2  0.92  0.49
3  1.31  0.84
4  1.63  1.58

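Same thing: the output is as expected, but it printed 8 ndarray objects. rolling is producing a single-dimensional ndarray of length window for each column, as opposed to what I expected, which was an ndarray of shape (window, len(df.columns)).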

The question is: why?

I now have no way to easily run a rolling multi-factor regression.
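To illustrate the two-dimensional windows expected above, here is a minimal sketch, assuming pandas 1.1 or later, where a Rolling object can be iterated to yield one DataFrame per window. The weight vector X below is illustrative and differs from the X in the question; the data matches the df printed earlier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=['A', 'B'],
)
X = np.array([2, 3])  # illustrative weights (an assumption, not from the question)
window = 2

# Iterating a Rolling object yields one DataFrame per window (assumes pandas >= 1.1).
# Leading windows can be shorter than `window`, so keep only the full ones.
full_windows = [w for w in df.rolling(window) if len(w) == window]

# Each full window has shape (window, len(df.columns)), so the dot product aligns.
result = np.array([w.values.dot(X) for w in full_windows])
print(result)
```

This is a plain Python loop over windows, so it trades speed for readability; it is meant only to show the window shape, not to be a fast implementation.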

Answer 1

I wanted to share what I've done to work around this problem.

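Given a pd.DataFrame and a window, I generate a stacked ndarray using np.dstack(). I then convert it to a pd.Panel and, using pd.Panel.to_frame, convert that to a pd.DataFrame. At this point, I have a pd.DataFrame whose index has one more level than the original pd.DataFrame, and the new level contains information about each rolled period. For example, if the roll window is 3, the new index level will contain [0, 1, 2], one item for each period. I can now groupby level=0 and return the groupby object. This gives me an object that I can manipulate much more intuitively.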

Roll function

import pandas as pd
import numpy as np

def roll(df, w):
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    panel = pd.Panel(roll_array, 
                     items=df.index[w-1:],
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    return panel.to_frame().unstack().T.groupby(level=0)

Demonstration

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

print df

      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

Let's sum

rolled_df = roll(df, 2)

print rolled_df.sum()

major     A     B
1      0.90  0.88
2      0.92  0.49
3      1.31  0.84
4      1.63  1.58

To peek under the hood, we can look at the structure:

print rolled_df.apply(lambda x: x)

major      A     B
  roll            
1 0     0.44  0.41
  1     0.46  0.47
2 0     0.46  0.47
  1     0.46  0.02
3 0     0.46  0.02
  1     0.85  0.82
4 0     0.85  0.82
  1     0.78  0.76

But what about the purpose I built this for, a rolling multi-factor regression? For now I'll settle for matrix multiplication.

X = np.array([2, 3])

print rolled_df.apply(lambda df: pd.Series(df.values.dot(X))) 

      0     1
1  2.11  2.33
2  2.33  0.98
3  0.98  4.16
4  4.16  3.84
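pd.Panel is no longer available in recent pandas releases, so the roll function above will not run there as written. The following is a rough sketch of the same idea without pd.Panel, assuming a modern pandas; the name roll_groups is hypothetical and this is not the original author's code. It builds one small DataFrame per window, concatenates them under a two-level index, and groups by the outer level.

```python
import numpy as np
import pandas as pd

def roll_groups(df, w):
    """Hypothetical Panel-free variant of roll(): groupby over rolling windows."""
    # One sub-frame per full window, keyed by the label of the window's last row.
    windows = {
        df.index[i + w - 1]: df.iloc[i:i + w].reset_index(drop=True)
        for i in range(len(df) - w + 1)
    }
    stacked = pd.concat(windows)
    # Outer level: window label; inner level: position inside the window.
    stacked.index = stacked.index.set_names(['major', 'roll'])
    return stacked.groupby(level=0)

df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=['A', 'B'],
)
X = np.array([2, 3])

rolled = roll_groups(df, 2)
sums = rolled.sum()                                   # per-window column sums
dots = {key: g.values.dot(X) for key, g in rolled}    # per-window matrix product
print(sums)
print(dots)
```

Each group is a full (window, len(df.columns)) block, so any matrix operation can be applied per window.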

Answer 2

Here's a vectorized approach -

get_sliding_window(df, 2).dot(X) # window size = 2

Runtime test -

In [101]: df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

In [102]: X = np.array([2, 3])

In [103]: rolled_df = roll(df, 2)

In [104]: %timeit rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
100 loops, best of 3: 5.51 ms per loop

In [105]: %timeit get_sliding_window(df, 2).dot(X)
10000 loops, best of 3: 43.7 µs per loop

Verify results -

In [106]: rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
Out[106]: 
      0     1
1  2.70  4.09
2  4.09  2.52
3  2.52  1.78
4  1.78  3.50

In [107]: get_sliding_window(df, 2).dot(X)
Out[107]: 
array([[ 2.7 ,  4.09],
       [ 4.09,  2.52],
       [ 2.52,  1.78],
       [ 1.78,  3.5 ]])

That's quite an improvement, and I hope it stays noticeable on larger arrays!
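The definition of get_sliding_window is not included above. As a hypothetical stand-in, and not necessarily the answerer's implementation, a function with the same call shape can be sketched with numpy.lib.stride_tricks.sliding_window_view, assuming NumPy 1.20 or later:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def get_sliding_window(df, w):
    """Hypothetical stand-in: return an array of shape (n - w + 1, w, n_cols)."""
    a = df.values
    # sliding_window_view appends the window axis last: (n - w + 1, n_cols, w).
    windows = sliding_window_view(a, w, axis=0)
    # Reorder so each window reads as (rows in window, columns).
    return windows.transpose(0, 2, 1)

df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=['A', 'B'],
)
X = np.array([2, 3])

out = get_sliding_window(df, 2).dot(X)  # window size = 2
print(out)
```

The result is a read-only view over the original data, so no window is copied until an operation such as dot produces a new array.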

Answer 3

I made the following modification to the above answer, since I needed to return the whole rolling window, as is done in pd.DataFrame.rolling()

def roll(df, w):
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    # Pad the first w-1 windows; note that np.empty leaves these values uninitialized.
    roll_array_full_window = np.vstack((np.empty((w-1, len(df.columns), w)), roll_array))
    panel = pd.Panel(roll_array_full_window,
                     items=df.index,
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    return panel.to_frame().unstack().T.groupby(level=0)

Answer 4

Since pandas v0.23 it is possible to pass a Series instead of an ndarray to Rolling.apply(). Just set raw=False.

raw : bool, default None

False : passes each row or column as a Series to the function.

True or None : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance. The raw parameter is required and will show a FutureWarning if not passed. In the future raw will default to False.

New in version 0.23.0.

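As stated above, if you only need a single dimension, passing it raw is clearly more efficient. This is probably the answer to your question: Rolling.apply() was initially built to pass an ndarray only because that is the most efficient option.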
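As a rough illustration of what raw=False makes possible: each window arrives as a Series that keeps its index labels, so those labels can be used to look up the matching rows of the full DataFrame. This is only a sketch; the function must still return a single number per window, and the sum reduction here is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=['A', 'B'],
)
X = np.array([2, 3])  # illustrative weights

def window_dot_sum(s):
    # s is a Series for one window of column 'A'; its index identifies the rows.
    block = df.loc[s.index].values      # shape (window, len(df.columns))
    return block.dot(X).sum()           # reduce to one scalar per window

result = df['A'].rolling(2).apply(window_dot_sum, raw=False)
print(result)
```

The first entry is NaN because the first window is incomplete. Looking rows up through the index is slower than a vectorized approach, so this fits small frames better than large ones.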
