Pandas: calculate the mean value every 30 minutes over an interval of +/-10 minutes
I have a DataFrame like this:
import pandas as pd

df = pd.DataFrame(
    {
        "observation_time": ["2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00", "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00", "2021-11-24 10:50:02+00:00", "2021-11-24 11:00:05+00:00", "2021-11-24 11:10:03+00:00", "2021-11-24 11:20:02+00:00", "2021-11-24 11:30:03+00:00", "2021-11-24 11:40:02+00:00"],
        "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78]
    }
)
# parse the strings into timezone-aware timestamps so that Grouper and .dt work
df["observation_time"] = pd.to_datetime(df["observation_time"])
observation_time temp
0 2021-11-24 10:10:03+00:00 7.22
1 2021-11-24 10:20:02+00:00 7.33
2 2021-11-24 10:30:03+00:00 7.44
3 2021-11-24 10:40:02+00:00 7.50
4 2021-11-24 10:50:02+00:00 7.50
5 2021-11-24 11:00:05+00:00 7.50
6 2021-11-24 11:10:03+00:00 7.44
7 2021-11-24 11:20:02+00:00 7.61
8 2021-11-24 11:30:03+00:00 7.67
9 2021-11-24 11:40:02+00:00 7.78
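This DataFrame is just an example. There is no guarantee that it has a time point every 10 minutes: I might have data every minute, or no data at all for a long stretch.
I want to compute the mean over a +/-10-minute interval around every 30-minute mark, counting from "00", which in this example is "10:00:00".
I am trying to use Grouper: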
df.groupby(pd.Grouper(key="observation_time", freq="30Min", offset="0m", label="right")).mean()
This gives me the following result:
temp
observation_time
2021-11-24 10:30:00+00:00 7.275000
2021-11-24 11:00:00+00:00 7.480000
2021-11-24 11:30:00+00:00 7.516667
2021-11-24 12:00:00+00:00 7.725000
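From a time perspective this is fine, but of course it computes the mean over the whole 30-minute interval.
Instead, I want to compute the mean over a +/-10-minute interval.
For example, for 2021-11-24 10:30:00+00:00, the mean is computed over all temp values in the interval between 2021-11-24 10:20:00+00:00 and 2021-11-24 10:40:00+00:00, which in this case are 7.33 and 7.44, giving a mean of 7.385.
The final result should look like this: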
temp
observation_time
2021-11-24 10:30:00+00:00 7.385
2021-11-24 11:00:00+00:00 7.5
2021-11-24 11:30:00+00:00 7.64
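Any ideas? Thanks.
Edit: the answer below assumes that each row corresponds to a 10-minute interval. If you have unevenly spaced data, we have to bin the dataset manually to get the desired output: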
import numpy as np

# the sampling will be computed in +/- 10 minutes from the bin
sampling_interval = np.timedelta64(10, 'm')
# get 30 minutes bins
bins_interval = "30min"
bins = df['observation_time'].dt.floor(bins_interval).unique()
avg_values = []
for grouped_bin in bins:
    # subset the dataframe in the binned intervals
    subset = df[df['observation_time'].between(
        grouped_bin - sampling_interval,
        grouped_bin + sampling_interval
    )]
    avg_values.append({
        'observation_time': grouped_bin,
        'temp': subset['temp'].mean()
    })
averaged_df = pd.DataFrame(avg_values)
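One caveat with the loop above: `dt.floor` only yields marks that have data in the 30 minutes after them, so a lone observation at, say, 10:25 would never be assigned to the 10:30 mark, and a mark with nothing inside its window comes out as NaN (here, 10:00). A minimal vectorized sketch that sidesteps both points, assuming the tolerance is less than half the bin width, rounds each timestamp to its nearest mark and keeps only the rows within the tolerance. The helper name `mean_around_marks` and its parameters are illustrative, not part of pandas.

```python
import pandas as pd


def mean_around_marks(df, time_col="observation_time", value_col="temp",
                      freq="30min", tolerance="10min"):
    """Mean of value_col within +/- tolerance of each freq mark.

    Hypothetical helper; assumes tolerance < freq / 2 so that every row
    belongs to at most one mark.
    """
    times = pd.to_datetime(df[time_col])
    # nearest 30-minute mark for every observation
    marks = times.dt.round(freq)
    # keep only observations close enough to their mark (inclusive bounds)
    within = (times - marks).abs() <= pd.Timedelta(tolerance)
    # marks without any observation in their window are simply absent
    return df.loc[within, value_col].groupby(marks[within]).mean().to_frame()


sample = pd.DataFrame(
    {
        "observation_time": [
            "2021-11-24 10:10:03+00:00", "2021-11-24 10:20:02+00:00",
            "2021-11-24 10:30:03+00:00", "2021-11-24 10:40:02+00:00",
            "2021-11-24 10:50:02+00:00", "2021-11-24 11:00:05+00:00",
            "2021-11-24 11:10:03+00:00", "2021-11-24 11:20:02+00:00",
            "2021-11-24 11:30:03+00:00", "2021-11-24 11:40:02+00:00",
        ],
        "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78],
    }
)
result = mean_around_marks(sample)
print(result)
```

On the sample data this keeps three marks (10:30, 11:00 and 11:30), matching the result asked for in the question. If empty marks should appear as NaN rows instead of being dropped, the result could be reindexed against a `pd.date_range` of marks.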
I'm not sure this is the most "pythonic" way, but this is how I would approach the problem:
# we create an empty list in which we'll store the computed avgs
# to turn into a DataFrame later
avg_values = []
# we iterate over the DataFrame starting at index 1 and advancing 3 rows at a time
for idx in range(1, len(df.index), 3):
    # store the observation time in a separate variable
    observation_time = df.loc[idx, 'observation_time']
    # compute the mean of the row before the current one, the
    # current one, and the next one
    avg_in_interval = np.nanmean([
        df.loc[idx-1, 'temp'] if idx > 0 else np.nan,
        df.loc[idx, 'temp'],
        df.loc[idx+1, 'temp'] if idx < len(df.index)-1 else np.nan
    ])
    # we append the two values as one record to the list
    avg_values.append({'observation_time': observation_time, 'temp': avg_in_interval})
# new DataFrame
averaged_df = pd.DataFrame(avg_values)
Or, in a more compact and general form, where you can configure the width of the averaging interval:
interval_width = 3  # assuming it is an odd number
starting_idx = interval_width // 2
avg_values = []
for idx in range(starting_idx, len(df.index), interval_width):
    avg_values.append({
        'observation_time': df.loc[idx, 'observation_time'],
        # the slice end is exclusive, so add 1 to include the row after idx
        'temp': np.mean(df.iloc[idx-starting_idx:idx+starting_idx+1]['temp'])
    })
averaged_df = pd.DataFrame(avg_values)
You can also turn it into a function to keep your code clean:
def get_averaged_df(df, interval_width: int):
    if interval_width % 2 == 0:
        raise ValueError("interval_width must be an odd integer")
    starting_idx = interval_width // 2
    avg_values = []
    for idx in range(starting_idx, len(df.index), interval_width):
        avg_values.append({
            'observation_time': df.loc[idx, 'observation_time'],
            # the slice end is exclusive, so add 1 to include the row after idx
            'temp': np.mean(df.iloc[idx-starting_idx:idx+starting_idx+1]['temp'])
        })
    return pd.DataFrame(avg_values)

averaged_df = get_averaged_df(df, 3)
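For evenly spaced rows, the same row-based average can probably be written without an explicit loop by using a centered rolling window and then taking every `interval_width`-th row. This is only a sketch: `centered_row_mean` is a made-up name, and unlike the loop it returns NaN for a center row whose window runs past the edge of the DataFrame, because `rolling` requires a full window by default.

```python
import pandas as pd


def centered_row_mean(df, interval_width=3, value_col="temp"):
    """Row-based centered mean, sampled every interval_width rows.

    Hypothetical helper; assumes rows are evenly spaced in time.
    """
    if interval_width % 2 == 0:
        raise ValueError("interval_width must be an odd integer")
    half = interval_width // 2
    # mean over the current row and `half` rows on each side
    smoothed = df[value_col].rolling(window=interval_width, center=True).mean()
    # keep only the center rows: half, half + width, half + 2 * width, ...
    out = df.iloc[half::interval_width].copy()
    out[value_col] = smoothed.iloc[half::interval_width]
    return out.reset_index(drop=True)


rows = pd.DataFrame(
    {
        "observation_time": pd.date_range(
            "2021-11-24 10:10", periods=10, freq="10min", tz="UTC"
        ),
        "temp": [7.22, 7.33, 7.44, 7.5, 7.5, 7.5, 7.44, 7.61, 7.67, 7.78],
    }
)
rolled = centered_row_mean(rows, 3)
print(rolled)
```

Note that, like the loop, this centers the windows on rows 1, 4 and 7 (10:20, 10:50, 11:20 in the sample), so it does not reproduce the 30-minute marks from the question; it only mirrors the evenly spaced assumption of this part of the answer.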