Python - Time-weighted average in Pandas, grouped by time interval
I have a time series in a Pandas DataFrame. The timestamps can be uneven (one every 1-5 minutes), but there will always be one every 5 minutes (timestamps with minutes ending in 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55).
Example:
2017-01-01 2:05:00 32.90
2017-01-01 2:07:30 29.83
2017-01-01 2:10:00 45.76
2017-01-01 2:15:00 16.22
2017-01-01 2:20:00 17.33
2017-01-01 2:25:00 23.40
2017-01-01 2:28:45 150.12
2017-01-01 2:30:00 100.29
2017-01-01 2:35:00 38.45
2017-01-01 2:40:00 67.12
2017-01-01 2:45:00 20.00
2017-01-01 2:50:00 58.41
2017-01-01 2:55:00 58.32
2017-01-01 3:00:00 59.89
I want the time-weighted average over 15-minute chunks. Rows whose timestamps fall exactly on a 15-minute mark (minutes ending in 0, 15, 30, 45) end an interval, so the grouping is as follows:
Group 1 (interval 2017-01-01 2:00:00):
2017-01-01 2:05:00 32.90
2017-01-01 2:07:30 29.83
2017-01-01 2:10:00 45.76
2017-01-01 2:15:00 16.22
Group 2 (interval 2017-01-01 2:15:00):
2017-01-01 2:20:00 17.33
2017-01-01 2:25:00 23.40
2017-01-01 2:28:45 150.12
2017-01-01 2:30:00 100.29
Group 3 (interval 2017-01-01 2:30:00):
2017-01-01 2:35:00 38.45
2017-01-01 2:40:00 67.12
2017-01-01 2:45:00 20.00
Group 4 (interval 2017-01-01 2:45:00):
2017-01-01 2:50:00 58.41
2017-01-01 2:55:00 58.32
2017-01-01 3:00:00 59.89
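For reference, this grouping can be reproduced with resample: closed='right' makes rows exactly on a 15-minute mark close an interval, and label='left' names each group by its interval start. A small sketch with a hypothetical series x holding a few of the rows above:

```python
import pandas as pd

# Hypothetical series with a few of the rows above
x = pd.Series(
    [32.90, 16.22, 17.33],
    index=pd.to_datetime(
        ["2017-01-01 2:05:00", "2017-01-01 2:15:00", "2017-01-01 2:20:00"]
    ),
)

# closed='right': rows exactly on a 15-minute mark end their interval;
# label='left': each group is named by its interval start.
# The 2:00 group gets 32.90 and 16.22; the 2:15 group gets 17.33.
for start, group in x.resample("15min", closed="right", label="left"):
    print(start, list(group.values))
```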
The average must be time-weighted, so not just a standard average of all the values in a group.
For example, the time-weighted average for Group 2 is not 72.785, which would be the regular average of all 4 values. Instead, it should be:
(5 minutes / 15 minutes) * 17.33 = 5.776667 ==> The 5 minutes is taken from the difference between this timestamp and the previous timestamp
+(5 minutes / 15 minutes) * 23.40 = 7.8
+(3.75 minutes / 15 minutes) * 150.12 = 37.53
+(1.25 minutes / 15 minutes) * 100.29 = 8.3575
= **59.46417**
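Spelled out in code, the Group 2 calculation is just a dot product of the minute-fraction weights and the values:

```python
# Group 2: minutes since the previous timestamp, divided by the window length
weights = [5 / 15, 5 / 15, 3.75 / 15, 1.25 / 15]
values = [17.33, 23.40, 150.12, 100.29]

# Time-weighted average = sum of (weight * value)
twa = sum(w * v for w, v in zip(weights, values))
print(round(twa, 5))  # 59.46417
```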
Also, ideally the 15-minute length would be parameterized, since it may change to 60 minutes (hourly) in the future, but I don't think that's a problem.
Performance also matters quite a bit here. My dataset is around 10k rows, so iterating over each record one by one would be very slow.
I tried looking at Pandas' df.rolling() function, but couldn't figure out how to apply it directly to my particular scenario.
Thanks a lot for any help!
Update 1:
Based on Simon's great solution, I tweaked it a bit to fit my specific situation:
def func(df):
    if df.size == 0:
        return
    timestep = 15 * 60
    # Seconds elapsed since the start of the interval; the last row sits
    # exactly on the interval's right edge, so anchor on it
    indexes = df.index - (df.index[-1] - pd.Timedelta(seconds=timestep))
    seconds = indexes.seconds
    weight = [seconds[n] / timestep if n == 0 else (seconds[n] - seconds[n - 1]) / timestep
              for n, k in enumerate(seconds)]
    return np.sum(weight * df.values)
The empty-size check is there to handle 15-minute intervals with no data (missing rows in the database).
That part was tricky. I'd be happy to see another commenter do this more efficiently, as I have a hunch there's a better way.
I also skipped parameterizing the 15-minute value, though I noted in the comments how you could do it. That's left as an exercise for the reader :D It really should be parameterized, though, since right now there are random '*15' and '*60' values scattered around the place, which looks clumsy.
I'm also tired, and my wife wants to watch a movie, so I didn't tidy up the code. It's a bit messy and should be written more cleanly, which may or may not be worth doing, depending on whether someone else can redo all this in 6 lines of code. If there's still no answer by tomorrow morning, I'll revisit it and do it better.
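For what it's worth, one possible sketch of that parameterization (the helper name make_weighted_mean is hypothetical, not from Simon's answer) builds the aggregator for any interval length instead of hard-coding 15 minutes:

```python
import numpy as np
import pandas as pd

def make_weighted_mean(minutes):
    """Build a time-weighted-mean aggregator for intervals of `minutes`.

    Assumes each non-empty interval's last row sits exactly on the
    interval's right edge (as in the question's data)."""
    timestep = minutes * 60

    def func(s):
        if s.size == 0:
            return None  # empty interval: no rows in this bin
        # Seconds elapsed since the interval start, anchored on the last row
        seconds = (s.index - (s.index[-1] - pd.Timedelta(seconds=timestep))).seconds
        # Gap before each row, as a fraction of the interval length
        weight = np.diff(np.concatenate(([0], np.asarray(seconds)))) / timestep
        return np.sum(weight * s.values)

    return func

# usage, e.g.:
# df.resample('15min', closed='right').apply(make_weighted_mean(15))
# df.resample('60min', closed='right').apply(make_weighted_mean(60))
```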
Updated, better solution 1:
def func(df):
    if df.size == 0:
        return
    timestep = 15 * 60
    # Seconds elapsed since the interval start; the last row sits exactly on
    # the interval's right edge, so anchor on it (using the raw minute-of-hour
    # would break for intervals that don't start at :15, and at hour boundaries)
    seconds = (df.index - (df.index[-1] - pd.Timedelta(seconds=timestep))).seconds
    weight = [k / timestep if n == 0 else (seconds[n] - seconds[n - 1]) / timestep
              for n, k in enumerate(seconds)]
    return np.sum(weight * df.values)

df.resample('15min', closed='right').apply(func)
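Putting it together on the sample data from the question (a self-contained sketch; the weighting function is the same idea as above):

```python
import numpy as np
import pandas as pd

df = pd.Series(
    [32.90, 29.83, 45.76, 16.22, 17.33, 23.40, 150.12,
     100.29, 38.45, 67.12, 20.00, 58.41, 58.32, 59.89],
    index=pd.to_datetime([
        "2017-01-01 2:05:00", "2017-01-01 2:07:30", "2017-01-01 2:10:00",
        "2017-01-01 2:15:00", "2017-01-01 2:20:00", "2017-01-01 2:25:00",
        "2017-01-01 2:28:45", "2017-01-01 2:30:00", "2017-01-01 2:35:00",
        "2017-01-01 2:40:00", "2017-01-01 2:45:00", "2017-01-01 2:50:00",
        "2017-01-01 2:55:00", "2017-01-01 3:00:00",
    ]),
)

def func(s):
    if s.size == 0:
        return None
    timestep = 15 * 60
    # Seconds since the interval start (the last row is on the right edge)
    seconds = (s.index - (s.index[-1] - pd.Timedelta(seconds=timestep))).seconds
    weight = [k / timestep if n == 0 else (seconds[n] - seconds[n - 1]) / timestep
              for n, k in enumerate(seconds)]
    return np.sum(weight * s.values)

out = df.resample("15min", closed="right").apply(func)
print(out.round(4))
```

The interval starting at 2:15 comes out to 59.4642, matching the hand calculation for Group 2 above; the first interval includes the 2:00-2:05 stretch at the 2:05 value.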
Let the first column be labeled ts and the next column value.
from datetime import timedelta
import pandas as pd

def tws(df, length):
    df['ts'] = pd.to_datetime(df['ts'])
    interval = [0]
    df1 = df
    # Minutes elapsed since the previous row
    for i in range(1, len(df1)):
        delta = df1.loc[i, 'ts'] - df1.loc[i - 1, 'ts']
        interval.append((delta.days * 24 * 60 + delta.seconds) / 60)
    df1['time_interval'] = interval
    start = pd.to_datetime('2017-01-01 2:00:00')
    TWS = []
    ave = 0
    for i in range(1, len(df1) + 1):
        try:
            if df1.loc[i, 'ts'] <= (start + timedelta(minutes=length)):
                ave = ave + df1.loc[i, 'value'] * df1.loc[i, 'time_interval']
            else:
                TWS.append(ave / length)
                ave = df1.loc[i, 'value'] * df1.loc[i, 'time_interval']
                start = df1.loc[i - 1, 'ts']
        except KeyError:
            # Past the last row: flush the final interval
            TWS.append(ave / length)
    return TWS
tws(df,15)
The output is a list of the time-weighted averages for each interval.
Another option is to multiply the values by the fractional time between ticks and then sum the results. The following function takes a Series or DataFrame with the values, and the requested target index:
import numpy as np
import pandas as pd

def resample_time_weighted_mean(x, target_index, closed=None, label=None):
    shift = 1 if closed == "right" else -1
    fill = "bfill" if closed == "right" else "ffill"
    # Determine the length of each interval (daylight saving aware)
    extended_index = target_index.union(
        [target_index[0] - target_index.freq, target_index[-1] + target_index.freq]
    )
    interval_lengths = -extended_index.to_series().diff(periods=shift)
    # Create a combined index of the source and target indices and reindex to it
    combined_index = x.index.union(extended_index)
    x = x.reindex(index=combined_index, method=fill)
    interval_lengths = interval_lengths.reindex(index=combined_index, method=fill)
    # Determine the weight of each value and multiply the source values
    weights = -x.index.to_series().diff(periods=shift) / interval_lengths
    x = x.mul(weights, axis=0)
    # Resample to the new index; the final reindex is necessary because
    # resample might return more rows based on the frequency
    return (
        x.resample(target_index.freq, closed=closed, label=label)
        .sum()
        .reindex(target_index)
    )
Applying this to the sample data:
x = pd.Series(
    [
        32.9,
        29.83,
        45.76,
        16.22,
        17.33,
        23.4,
        150.12,
        100.29,
        38.45,
        67.12,
        20.0,
        58.41,
        58.32,
        59.89,
    ],
    index=pd.to_datetime(
        [
            "2017-01-01 2:05:00",
            "2017-01-01 2:07:30",
            "2017-01-01 2:10:00",
            "2017-01-01 2:15:00",
            "2017-01-01 2:20:00",
            "2017-01-01 2:25:00",
            "2017-01-01 2:28:45",
            "2017-01-01 2:30:00",
            "2017-01-01 2:35:00",
            "2017-01-01 2:40:00",
            "2017-01-01 2:45:00",
            "2017-01-01 2:50:00",
            "2017-01-01 2:55:00",
            "2017-01-01 3:00:00",
        ]
    ),
)

opts = dict(closed="right", label="right")
resample_time_weighted_mean(
    x, pd.DatetimeIndex(x.resample("15T", **opts).groups.keys(), freq="infer"), **opts
)
Which returns:
2017-01-01 02:15:00 18.005000
2017-01-01 02:30:00 59.464167
2017-01-01 02:45:00 41.856667
2017-01-01 03:00:00 58.873333
Freq: 15T, dtype: float64
Regarding the performance concerns raised under Simon's answer: this approach performs well on millions of rows, since the weights are computed in one go rather than in a relatively slow Python loop:
new_index = pd.date_range("2017-01-01", "2021-01-01", freq="1T")
new_index = new_index + pd.TimedeltaIndex(
    np.random.rand(*new_index.shape) * 60 - 30, "s"
)
values = pd.Series(np.random.rand(*new_index.shape), index=new_index)
print(values.shape)
(2103841,)
%%timeit
resample_time_weighted_mean(
    values, pd.date_range("2017-01-01", "2021-01-01", freq="15T"), closed="right"
)
4.93 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)