Pandas: calculate the mean value every 30 minutes over an interval of +/-10 minutes

I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame(
    {
        "observation_time": ["2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00", "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00", "2021-11-24 10:50:02+00:00", "2021-11-24 11:00:05+00:00", "2021-11-24 11:10:03+00:00", "2021-11-24 11:20:02+00:00", "2021-11-24 11:30:03+00:00", "2021-11-24 11:40:02+00:00"], 
        "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78]
    }
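)
# parse the strings into timezone-aware timestamps (pd.Grouper and .dt need a datetime column)
df["observation_time"] = pd.to_datetime(df["observation_time"])

print(df)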
           observation_time  temp
0 2021-11-24 10:10:03+00:00  7.22
1 2021-11-24 10:20:02+00:00  7.33
2 2021-11-24 10:30:03+00:00  7.44
3 2021-11-24 10:40:02+00:00  7.50
4 2021-11-24 10:50:02+00:00  7.50
5 2021-11-24 11:00:05+00:00  7.50
6 2021-11-24 11:10:03+00:00  7.44
7 2021-11-24 11:20:02+00:00  7.61
8 2021-11-24 11:30:03+00:00  7.67
9 2021-11-24 11:40:02+00:00  7.78

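This DataFrame is only an example. There is no guarantee that it has a data point every 10 minutes: I may have data every minute, or no data at all for a long stretch.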

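I want to calculate the mean of temp within a +/-10 minute window around every 30-minute mark, counting from minute "00", which in this example means starting at "10:00:00".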

I am trying to use Grouper:

df.groupby(pd.Grouper(key="observation_time", freq="30Min", offset="0m", label="right")).mean()

This gives me the following result:

                                temp
observation_time                   
2021-11-24 10:30:00+00:00  7.275000
2021-11-24 11:00:00+00:00  7.480000
2021-11-24 11:30:00+00:00  7.516667
2021-11-24 12:00:00+00:00  7.725000

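The timestamps are what I want, but of course this computes the mean over each full 30-minute interval.

Instead, I want to compute the mean over a +/-10 minute interval around each timestamp.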

For example, for 2021-11-24 10:30:00+00:00 the mean is computed over all temp values in the interval between 2021-11-24 10:20:00+00:00 and 2021-11-24 10:40:00+00:00. In this case those values are 7.33 and 7.44, and their mean is 7.385.
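The arithmetic for that single window can be checked with a minimal sketch. The names `center`, `half_width`, `in_window` and `window_mean` are chosen here for illustration only, and the sketch assumes the timestamps have been parsed with `pd.to_datetime`:

```python
import pandas as pd

# Same sample data as above, parsed into timezone-aware timestamps.
df = pd.DataFrame({
    "observation_time": pd.to_datetime([
        "2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00",
        "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00",
        "2021-11-24 10:50:02+00:00", "2021-11-24 11:00:05+00:00",
        "2021-11-24 11:10:03+00:00", "2021-11-24 11:20:02+00:00",
        "2021-11-24 11:30:03+00:00", "2021-11-24 11:40:02+00:00",
    ]),
    "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78],
})

center = pd.Timestamp("2021-11-24 10:30:00+00:00")
half_width = pd.Timedelta(minutes=10)

# Series.between is inclusive on both ends by default.
in_window = df["observation_time"].between(center - half_width, center + half_width)

# Only the 10:20:02 and 10:30:03 rows qualify; 10:40:02 is just past the window.
window_mean = df.loc[in_window, "temp"].mean()
print(df.loc[in_window])
print(window_mean)
```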

The final result should look like this:

                               temp
observation_time                   
2021-11-24 10:30:00+00:00  7.385
2021-11-24 11:00:00+00:00  7.5
2021-11-24 11:30:00+00:00  7.64

Any ideas? Thanks.

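Edit: the answer below assumes that each row corresponds to a 10-minute interval. If you have unevenly spaced data, you have to bin the dataset manually to get the desired output: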

import numpy as np

# the sampling will be computed in +/- 10 minutes from the bin
sampling_interval = np.timedelta64(10, 'm')

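# get the 30-minute marks: round (rather than floor) each timestamp so that an
# observation just before a mark, e.g. 10:55, still produces the 11:00 bin
bins_interval = "30min"
bins = df['observation_time'].dt.round(bins_interval).unique()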

avg_values = []
for grouped_bin in bins:
    # subset the dataframe in the binned intervals
    subset = df[df['observation_time'].between(
        grouped_bin - sampling_interval, 
        grouped_bin + sampling_interval
    )]
    
    avg_values.append({
        'observation_time': grouped_bin,
        'temp': subset['temp'].mean()
    })

# marks with no observation inside their window get a NaN mean; drop them
averaged_df = pd.DataFrame(avg_values).dropna(subset=['temp'])
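The same binning can probably be done without an explicit loop. The following is only a sketch, not the method above: it rounds each timestamp to its nearest 30-minute mark, keeps the rows that lie within 10 minutes of that mark, and groups by the mark. It relies on the window half-width (10 minutes) being smaller than half the spacing between marks (15 minutes), so that each observation can belong to at most one window. The names `half_width`, `nearest_mark`, `close_enough` and the helper column `mark` are introduced here for illustration.

```python
import pandas as pd

# Same sample data as in the question, parsed into timezone-aware timestamps.
df = pd.DataFrame({
    "observation_time": pd.to_datetime([
        "2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00",
        "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00",
        "2021-11-24 10:50:02+00:00", "2021-11-24 11:00:05+00:00",
        "2021-11-24 11:10:03+00:00", "2021-11-24 11:20:02+00:00",
        "2021-11-24 11:30:03+00:00", "2021-11-24 11:40:02+00:00",
    ]),
    "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78],
})

half_width = pd.Timedelta(minutes=10)

# Nearest 30-minute mark for every observation.
nearest_mark = df["observation_time"].dt.round("30min")

# True where the observation lies within +/-10 minutes of its nearest mark.
close_enough = (df["observation_time"] - nearest_mark).abs() <= half_width

averaged_df = (
    df.assign(mark=nearest_mark)      # hypothetical helper column
    .loc[close_enough]                # drop rows outside every window
    .groupby("mark")["temp"]
    .mean()
    .rename_axis("observation_time")
    .reset_index()
)
print(averaged_df)
```

Marks that have no observation inside their window simply do not appear in the result, so no NaN rows need to be removed afterwards.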

I'm not sure this is the most "pythonic" way, but this is how I would approach the problem:

import numpy as np

# we create an empty list in which we'll store the computed averages
# to turn into a DataFrame later
avg_values = []

# we iterate over the DataFrame starting at index 1 and advancing 3 rows at a time
for idx in range(1, len(df.index), 3):
    # store the observation time in a separate variable
    observation_time = df.loc[idx, 'observation_time']
    # compute the mean between the rows before the current one, the
    # current one, and the next one
    avg_in_interval = np.nanmean([
        df.loc[idx-1, 'temp'] if idx > 0 else np.nan,
        df.loc[idx, 'temp'],
        df.loc[idx+1, 'temp'] if idx < len(df.index)-1 else np.nan
    ])
    # we append the two values to the list as a dictionary
    avg_values.append({'observation_time': observation_time, 'temp': avg_in_interval})

# new DataFrame
averaged_df = pd.DataFrame(avg_values)

Or, more compactly and generally, you can make the width of the averaging interval configurable:

interval_width = 3 # assuming it is an odd number
starting_idx = interval_width // 2
avg_values = []

for idx in range(starting_idx, len(df.index), interval_width):
    avg_values.append({
        'observation_time': df.loc[idx, 'observation_time'],
        # the slice end is exclusive, so add 1 to include the row after idx
        'temp': np.mean(df.iloc[idx-starting_idx:idx+starting_idx+1]['temp'])
    })

averaged_df = pd.DataFrame(avg_values)

You can also turn it into a function to keep your code clean:

def get_averaged_df(df, interval_width: int):
    if interval_width % 2 == 0:
        raise ValueError("interval_width must be an odd integer")

    starting_idx = interval_width // 2
    avg_values = []

    for idx in range(starting_idx, len(df.index), interval_width):
        avg_values.append({
            'observation_time': df.loc[idx, 'observation_time'],
            # the slice end is exclusive, so add 1 to include the row after idx
            'temp': np.mean(df.iloc[idx-starting_idx:idx+starting_idx+1]['temp'])
        })

    return pd.DataFrame(avg_values)


averaged_df = get_averaged_df(df, 3)
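
A position-based centered average (each selected row together with its neighbours) can likely also be written with pandas' built-in rolling window. This is a sketch under the same assumption of evenly spaced rows. `centered_mean` is a name chosen for illustration, and `min_periods=1` is an assumption made so that windows cut off at the edges of the frame still return a mean instead of NaN.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "observation_time": pd.to_datetime([
        "2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00",
        "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00",
        "2021-11-24 10:50:02+00:00", "2021-11-24 11:00:05+00:00",
        "2021-11-24 11:10:03+00:00", "2021-11-24 11:20:02+00:00",
        "2021-11-24 11:30:03+00:00", "2021-11-24 11:40:02+00:00",
    ]),
    "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78],
})

interval_width = 3  # rows per window, assumed to be odd

# Mean of each row with its neighbours; the window is centered on the row.
centered_mean = df["temp"].rolling(interval_width, center=True, min_periods=1).mean()

# Keep every interval_width-th row, starting at the middle of the first window.
averaged_df = (
    df.assign(temp=centered_mean)
    .iloc[interval_width // 2 :: interval_width]
    .reset_index(drop=True)
)
print(averaged_df)
```

Like the loops above, this works on row positions rather than on timestamps, so it is only meaningful when the rows really are 10 minutes apart.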