Python: Get the k largest values of a rolling window for each column

Question

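I have a dataset stored as a pandas DataFrame, and I am trying to get the k largest values in rolling windows of several different sizes.

A simplified version of the problem: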

import pandas as pd
import numpy as np
np.random.seed(42)

def GenerateData(N=20, num_cols=2):
    X = pd.DataFrame(np.random.rand(N, num_cols))
    return X
X = GenerateData()

## >>> X.head()
##           0         1
## 0  0.971595  0.329454
## 1  0.187766  0.138250
## 2  0.573455  0.976918
## 3  0.207987  0.672529
## 4  0.271034  0.549839

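My goal is to get the k largest values in each rolling window for each column. So if k_largest=3 and the rolling window sizes are windows=[4,7], we want the 3 largest values for windows of size 4 and of size 7. The way I currently do this is: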

def GetKLargestForWindow(windows=[4,7], k_largest=3, raw=False):
    laggedVals = []
    for L in windows:
        for k in range(k_largest):
            x_k_max = X.rolling(L).apply(lambda c: sorted(c, reverse=True)[k], raw=raw)
            x_k_max = x_k_max.add_prefix( f'W{L}_{k+1}_' )
            laggedVals.append( x_k_max )
    laggedVals = pd.concat(laggedVals, axis=1).sort_index(axis=1)
    return laggedVals
laggedVals = GetKLargestForWindow()

## >>> laggedVals.shape
## (20,12)

## >>> laggedVals.columns
## Index(['W4_1_0', 'W4_1_1', 'W4_2_0', 'W4_2_1', \
##  'W4_3_0', 'W4_3_1', 'W7_1_0','W7_1_1', \
##  'W7_2_0', 'W7_2_1', 'W7_3_0', 'W7_3_1'],dtype='object')

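Note that there should be 12 columns in total in this example. The column names follow the pattern W{window_size}_{j}_{col}, where j=1, 2, 3 denotes the largest, second-largest, and third-largest value for each window size and each column.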
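To make the naming convention concrete, here is a minimal sketch on a toy column whose values are easy to check by hand. The `toy` DataFrame and its values are hypothetical and not part of the dataset above; the rolling call mirrors the one in GetKLargestForWindow for a single window size and a single rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, chosen so the result can be verified by hand.
toy = pd.DataFrame({0: [5.0, 1.0, 4.0, 2.0, 3.0]})

L, k = 4, 1  # window size 4; zero-based rank 1 means "second largest"

# Same pattern as in the question: sort each window descending, pick rank k.
second_largest = toy.rolling(L).apply(
    lambda c: np.sort(c)[::-1][k], raw=True
).add_prefix(f"W{L}_{k+1}_")

print(second_largest)
# The single column is named W4_2_0. Rows 0-2 are NaN (window not yet full),
# row 3 is 4.0 (window 5, 1, 4, 2) and row 4 is 3.0 (window 1, 4, 2, 3).
```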

However, my dataset is very large, and this code takes a long time to run, so I am looking for a more efficient way to do this. Any suggestions?

Benchmark:

import timeit
## >>> timeit.timeit('GetKLargestForWindow()', globals=globals(), number=1000)
## 15.590040199999976

## >>> timeit.timeit('GetKLargestForWindow(raw=True)', globals=globals(), number=1000)
## 6.497314199999892

Edit

I have improved the run time by setting raw=True in the apply call that picks the k-th largest value (see the second timing above).
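For context on why raw=True helps: with raw=False, pandas wraps every window in a Series before calling the function, whereas with raw=True the function receives a plain NumPy array. The sketch below only illustrates that difference on a small hypothetical Series; the helper `record_type` is made up for this illustration.

```python
import pandas as pd

seen_types = {}

def record_type(label):
    """Return a second-largest function that also records its input type."""
    def second_largest(c):
        seen_types[label] = type(c).__name__
        return sorted(c, reverse=True)[1]
    return second_largest

s = pd.Series([3.0, 1.0, 2.0, 5.0, 4.0])  # hypothetical sample data

slow = s.rolling(3).apply(record_type("raw=False"), raw=False)
fast = s.rolling(3).apply(record_type("raw=True"), raw=True)

print(seen_types)         # which object type each variant handed to the function
print(slow.equals(fast))  # the numerical results are the same either way
```

Both variants return the same values; only the per-window overhead differs.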

Answer 1

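You can use pandas' built-in rolling method (see the pandas documentation). It takes a DataFrame and applies a rolling-window calculation using either a pandas built-in function or your own function (via apply). The window argument only needs to be an integer or a BaseIndexer subclass. I believe you can specify multiple windows for multiple columns here, but I found it easier to loop over the columns.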

X = pd.DataFrame([[((-1)**i) * i*10, ((-1)**i) * -i*5] for i in range(20)])
x = pd.DataFrame()  # Empty DataFrame; the rolling-window results are stored here
windows = [4,7]
k = 3
# zip pairs each window size with one column: window 4 with column 0, window 7 with column 1
for window, colname in zip(windows,X.columns):
    x[colname] = X[colname].rolling(window).max()

print(x.nlargest(k, columns=x.columns))  # find the k largest values

Result

19  180.0  95.0
18  180.0  85.0
17  160.0  85.0
16  160.0  75.0
0     NaN   NaN
1     NaN   NaN
2     NaN   NaN

Answer 2

As always, if you want speed, use numpy as much as possible. Python loops are very slow compared with vectorized numpy code:

from numpy.lib.stride_tricks import sliding_window_view

def GetKLargestForWindow_CodeDifferent(windows=[4,7], k_largest=3):
    n_row, n_col = X.shape

    data = []
    for w in windows:
        # Create a rolling view of size w for each column in the dataframe
        view = sliding_window_view(X, w, axis=0)
        # Sort each view, reverse it (so largest first), and take the first
        # k_largest elements
        view = np.sort(view)[..., ::-1][..., :k_largest]
        # Reshape the numpy array for easy conversion into a dataframe
        view = np.reshape(view, (n_row - w + 1, -1))
        # We know the first `w - 1` rows are all NaN since there are not enough
        # data for the rolling operation
        data.append(np.vstack([
            np.zeros((w - 1, view.shape[1])) + np.nan,
            view
        ]))

    # `data` is shaped in this order
    cols_1 = [f"W{w}_{k+1}_{col}" for w in windows for col in range(n_col) for k in range(k_largest)]
    # But we want the columns in this order for easy comparison with the original code
    cols_2 = [f"W{w}_{k+1}_{col}" for w in windows for k in range(k_largest) for col in range(n_col)]
    
    return pd.DataFrame(np.hstack(data), columns=cols_1)[cols_2]
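To see what the intermediate arrays look like, here is a small sketch of the same three steps (view, sort, reshape) on a hypothetical 5x2 array. The array `arr` and its values are made up for illustration; the calls are the same ones used in the function above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical 5-row, 2-column input.
arr = np.array([[5.0, 10.0],
                [1.0, 40.0],
                [4.0, 20.0],
                [2.0, 30.0],
                [3.0, 50.0]])

w, k = 4, 3

# One window per valid starting row; the window axis is appended last.
view = sliding_window_view(arr, w, axis=0)
print(view.shape)  # (n_row - w + 1, n_col, w)

# Sort along the window axis, reverse to descending, keep the first k.
top_k = np.sort(view)[..., ::-1][..., :k]

# Flatten to one row per window: all ranks of column 0, then all ranks of column 1.
flat = top_k.reshape(arr.shape[0] - w + 1, -1)
print(flat)
```

The flattened layout groups values by column first and by rank second, which is why the function builds cols_1 in that order before reordering to cols_2.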

First, let's compare the results:

X = GenerateData(100_000, 2)
a = GetKLargestForWindow(raw=True)
b = GetKLargestForWindow_CodeDifferent()

assert a.compare(b).empty, "a and b are not the same"
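The exact comparison above works here because both functions only select existing values and perform no arithmetic on them. If a variant did introduce floating-point rounding, a tolerance-based check might be more appropriate. The sketch below shows the difference on two small hypothetical frames; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical frames that differ only by floating-point rounding.
a_demo = pd.DataFrame({"W4_1_0": [np.nan, 0.3, 0.7]})
b_demo = pd.DataFrame({"W4_1_0": [np.nan, 0.1 + 0.2, 0.7]})

# Exact comparison: reports a difference because 0.1 + 0.2 != 0.3 in floating point.
exact_match = a_demo.compare(b_demo).empty

# Tolerance-based comparison: raises AssertionError only beyond the default tolerance.
pd.testing.assert_frame_equal(a_demo, b_demo)

# NumPy equivalent, treating NaNs in the same position as equal.
close_match = np.allclose(a_demo.to_numpy(), b_demo.to_numpy(), equal_nan=True)

print(exact_match, close_match)
```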

Next, let's benchmark them:

%timeit GetKLargestForWindow(raw=True)
5.31 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit GetKLargestForWindow_CodeDifferent()
54.1 ms ± 761 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
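A possible further variation, which is not part of the answer above and has not been benchmarked here: since only the k largest values per window are needed, np.partition could replace the full sort, followed by a sort of just those k values. The helper name `top_k_by_partition` is hypothetical, and the sketch assumes k <= w and no NaNs in the input.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def top_k_by_partition(values, w, k):
    """k largest values per rolling window, in descending order (assumes k <= w)."""
    view = sliding_window_view(values, w, axis=0)  # (n_row - w + 1, n_col, w)
    # After partitioning at index w - k, the last k entries along the window
    # axis are the k largest values, in no particular order.
    largest = np.partition(view, w - k, axis=-1)[..., w - k:]
    # Sort only those k values and reverse to get descending order.
    return np.sort(largest, axis=-1)[..., ::-1]

# Check against the full-sort approach on hypothetical random data.
rng = np.random.default_rng(0)
values = rng.random((50, 2))
w, k = 7, 3

full_sort = np.sort(sliding_window_view(values, w, axis=0))[..., ::-1][..., :k]
via_partition = top_k_by_partition(values, w, k)

matches = np.array_equal(full_sort, via_partition)
print(matches)
```

Whether this is actually faster depends on the window size and k, so it would need to be timed on the real data.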
