Subtracting a rolling window mean of one column, with the window based on another column, without loops in Pandas

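I'm not sure what the right term is for what I'm doing, but I can't just use the pandas rolling (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html) function, because the window is not a fixed size in terms of the DataFrame index. Here is what I'm trying to do:

I have a dataframe with columns UT (time in hours, but not a datetime object) and WINDS. I want to add a third column that, for each row, subtracts the mean of all WINDS values whose UT is within 12 hours of that row's UT. Currently, I do it like this: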

rolsub = []

for i in df['UT']:
    df1 = df[ (df['UT'] > (i-12)) & (df['UT'] < (i+12)) ]
    df2 = df[df['UT'] == i]
    rolsub +=  [float(df2['WINDS'] - df1['WINDS'].mean())]

df['WIND_SUB'] = rolsub

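This works fine, but it takes too long because my dataframe has tens of thousands of entries. There must be a better way to do this, right? Please help!

If I understand correctly, you can create a fake DatetimeIndex and use it for rolling.

Sample data: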

import pandas as pd

df = pd.DataFrame({'UT':[0.5, 1, 2, 8, 9, 12, 13, 14, 15, 24, 60, 61, 63, 100],
                   'WINDS':[1, 1, 10, 1, 1, 1, 5, 5, 5, 5, 5, 1, 1, 10]})

print(df)
       UT  WINDS
0     0.5      1
1     1.0      1
2     2.0     10
3     8.0      1
4     9.0      1
5    12.0      1
6    13.0      5
7    14.0      5
8    15.0      5
9    24.0      5
10   60.0      5
11   61.0      1
12   63.0      1
13  100.0     10

Code:

# Fake DatetimeIndex.
df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
df = df.set_index('dt')

df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()

print(df)

This gives:

                        UT  WINDS  WINDS_SUB
dt                                          
2022-05-11 00:30:00    0.5      1  -1.500000
2022-05-11 01:00:00    1.0      1  -1.500000
2022-05-11 02:00:00    2.0     10   7.142857
2022-05-11 08:00:00    8.0      1  -2.333333
2022-05-11 09:00:00    9.0      1  -2.333333
2022-05-11 12:00:00   12.0      1  -2.333333
2022-05-11 13:00:00   13.0      5   0.875000
2022-05-11 14:00:00   14.0      5   1.714286
2022-05-11 15:00:00   15.0      5   1.714286
2022-05-12 00:00:00   24.0      5   0.000000
2022-05-13 12:00:00   60.0      5   2.666667
2022-05-13 13:00:00   61.0      1  -1.333333
2022-05-13 15:00:00   63.0      1  -1.333333
2022-05-15 04:00:00  100.0     10   0.000000

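For this small test set, the result matches the output of your code. This assumes that UT represents hours elapsed since some starting point in time, which seems to be the case judging by your solution.

Runtime:

I tested on the following df with 30,000 rows: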
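To check the match rather than eyeball it, here is a minimal sketch that runs both versions on the sample data and compares them. The anchor date `2000-01-01` is an arbitrary assumption (any date works, since only the differences between timestamps matter), and `center=True` with an offset window needs a reasonably recent pandas (1.3 or later, as far as I know).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'UT': [0.5, 1, 2, 8, 9, 12, 13, 14, 15, 24, 60, 61, 63, 100],
                   'WINDS': [1, 1, 10, 1, 1, 1, 5, 5, 5, 5, 5, 1, 1, 10]})

# Reference result: the loop from the question, kept deliberately simple.
expected = []
for t, w in zip(df['UT'], df['WINDS']):
    near = df[(df['UT'] > t - 12) & (df['UT'] < t + 12)]
    expected.append(w - near['WINDS'].mean())

# Rolling result on a fake DatetimeIndex.
# Assumption: the anchor date is arbitrary and chosen only for illustration.
anchor = pd.Timestamp('2000-01-01')
idx = pd.DatetimeIndex(anchor + pd.to_timedelta(df['UT'], unit='h'))
winds = pd.Series(df['WINDS'].to_numpy(), index=idx)
rolled = winds - winds.rolling('24h', center=True, closed='neither').mean()

# True if both approaches agree up to floating-point tolerance.
matches = np.allclose(rolled.to_numpy(), expected)
print(matches)
```

Note that `closed='neither'` is what mirrors the strict `>` and `<` comparisons in the loop; with a different `closed` setting, rows exactly 12 hours away would be included.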

import numpy as np

df = pd.DataFrame({'UT':range(30000),
                   'WINDS':np.full(30000, 1)})

def loop(df):
    rolsub = []

    for i in df['UT']:
        df1 = df[ (df['UT'] > (i-12)) & (df['UT'] < (i+12)) ]
        df2 = df[df['UT'] == i]
        rolsub +=  [float(df2['WINDS'] - df1['WINDS'].mean())]

    df['WIND_SUB'] = rolsub

def vector(df):
    df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
    df = df.set_index('dt')

    df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()

    return df

# 10.1 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit loop(df)

# 1.69 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vector(df)

So it is roughly 6,000 times faster (10.1 s versus 1.69 ms).
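If you would rather not build a fake DatetimeIndex at all, a different technique is to use `numpy.searchsorted` with a cumulative sum directly on the numeric UT column. This is only a sketch, not something I have timed against the rolling version. It assumes UT is sorted in ascending order and WINDS contains no NaN values; the function name `subtract_window_mean` and the `half_width` parameter are made up for illustration.

```python
import numpy as np
import pandas as pd

def subtract_window_mean(df, half_width=12):
    """Return WINDS minus the mean of WINDS over the open window
    (UT - half_width, UT + half_width).

    Assumes df['UT'] is sorted ascending and df['WINDS'] has no NaN.
    """
    ut = df['UT'].to_numpy(dtype=float)
    winds = df['WINDS'].to_numpy(dtype=float)

    # Index of the first row with UT strictly greater than t - half_width.
    lo = np.searchsorted(ut, ut - half_width, side='right')
    # Index of the first row with UT greater than or equal to t + half_width.
    hi = np.searchsorted(ut, ut + half_width, side='left')

    # Prefix sums, so the sum over rows [lo, hi) is csum[hi] - csum[lo].
    csum = np.concatenate(([0.0], np.cumsum(winds)))
    means = (csum[hi] - csum[lo]) / (hi - lo)
    return winds - means

sample = pd.DataFrame({'UT': [0.5, 1, 2, 8, 9, 12, 13, 14, 15, 24, 60, 61, 63, 100],
                       'WINDS': [1, 1, 10, 1, 1, 1, 5, 5, 5, 5, 5, 1, 1, 10]})
sample['WINDS_SUB'] = subtract_window_mean(sample)
print(sample)
```

Each window always contains at least the row itself, so the division never hits zero. For very long series the prefix-sum differences can pick up small floating-point error, so the rolling version may still be the safer default.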