What is the fastest way to repeatedly resample timeseries data of the same shape from hourly to yearly in Python?
Problem: I have 30 years of hourly timeseries that I want to resample to yearly, aligned to the calendar year (resample rule 'AS'). I need both the mean and the sum for each year. There are no missing timestamps. I then need to do this more than 10,000 times. This resampling step takes the most time in the script I am writing and is the limiting factor for optimising the run time. Because of leap years, I cannot resample in fixed blocks of 8,760 hours, since every fourth year has 8,784 hours.
Example code:
import pandas as pd
import numpy as np
import time

hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))

# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe

start_time = time.perf_counter()
for num in range(100):  # setting as 100 so it runs faster, this is 10,000+ in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean()  # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")
>>> Ran in 3.0516 seconds
Solutions I have explored:
- I gained a speed-up by aggregating several timeseries as columns of a single dataframe and resampling them together; however, the setup of the wider problem I am solving limits me to 10 timeseries per dataframe. So the question remains: is there a way to significantly speed up resampling of timeseries data when you know the shape of the array is always the same?
- I also looked into numba, but it does not make pandas functions faster.
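The column-batching described in the first bullet can be sketched as follows (a minimal illustration, not the author's actual script; the series names are made up, and "YS" is the year-start alias equivalent to the question's 'AS'):

```python
import numpy as np
import pandas as pd

# Ten unique hourly series sharing one DatetimeIndex, batched as columns
# (the maximum the wider problem allows).
idx = pd.date_range("2020-01-01", "2021-12-31 23:00", freq="h")
rng = np.random.default_rng(0)
batch = pd.DataFrame(rng.random((len(idx), 10)), index=idx,
                     columns=[f"series_{i}" for i in range(10)])

# One resample call aggregates all ten columns at once, which is cheaper
# than ten separate calls because the grouping work is shared.
yearly_mean = batch.resample("YS").mean()
yearly_sum = batch.resample("YS").sum()
```

The per-call grouping cost is paid once for all ten columns, which is why batching helps even though the aggregation work itself is unchanged.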
Possible solutions that sound plausible but that I could not find after researching:
- Resampling a 3D array of timeseries data with numpy
- Caching the index being resampled, then somehow making every resample after the first one faster
Thanks for your help :)
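For what it's worth, the "cache the index" idea in the last bullet can be sketched in plain numpy: because the DatetimeIndex is sorted, each year occupies a contiguous block, so the start position of every year can be found once with np.searchsorted and reused, and np.add.reduceat then sums each block in a single call. This is a hedged sketch with made-up data, not the answer below:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", "2022-12-31 23:00", freq="h")
values = np.random.default_rng(0).random(len(idx))

# One-off preparation: find where each year's contiguous block starts.
years = idx.year.to_numpy()
starts = np.searchsorted(years, np.unique(years))
counts = np.diff(np.append(starts, len(values)))

# Reusable fast path: reduceat sums values[starts[i]:starts[i+1]] for
# each i (and the final segment through to the end of the array).
yearly_sum = np.add.reduceat(values, starts)
yearly_mean = yearly_sum / counts
```

Only the two lines after the comment "Reusable fast path" need to run per iteration; the boundary computation is done once.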
As I wrote in the comments, I prepared the indices for each year up front and used them to compute the yearly sums faster. I then also removed the redundant summing hidden inside the mean, computing each year's mean as sum / length_of_indices instead.
For N=1000 it is about 9x faster:
import pandas as pd
import numpy as np
import time

hourly_timeseries = pd.DataFrame(
    index=pd.date_range(
        pd.Timestamp(2020, 1, 1, 0, 0),
        pd.Timestamp(2050, 12, 31, 23, 30),
        freq="60min")
)
hourly_timeseries['value'] = np.random.rand(len(hourly_timeseries))

# Constraints imposed by wider problem:
# 1. each hourly_timeseries is unique
# 2. each hourly_timeseries is the same shape and has the same datetimeindex
# 3. a maximum of 10 timeseries can be grouped as columns in dataframe

# Baseline: pandas resample
start_time = time.perf_counter()
for num in range(100):  # setting as 100 so it runs faster, this is 10,000+ in practice
    yearly_timeseries_mean = hourly_timeseries.resample('AS').mean()  # resample by calendar year
    yearly_timeseries_sum = hourly_timeseries.resample('AS').sum()
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")

# Cached-index version: compute each year's row indices once, reuse them
start_time = time.perf_counter()
events_years = hourly_timeseries.index.year
unique_years = np.sort(np.unique(events_years))
indices_per_year = [np.where(events_years == year)[0] for year in unique_years]
len_indices_per_year = np.array([len(year_indices) for year_indices in indices_per_year])
for num in range(100):  # setting as 100 so it runs faster, this is 10,000+ in practice
    temp = hourly_timeseries.values
    yearly_timeseries_sum2 = np.array([np.sum(temp[year_indices]) for year_indices in indices_per_year])
    yearly_timeseries_mean2 = yearly_timeseries_sum2 / len_indices_per_year
finish_time = time.perf_counter()
print(f"Ran in {finish_time - start_time:0.4f} seconds")

assert np.allclose(yearly_timeseries_sum.values.flatten(), yearly_timeseries_sum2)
assert np.allclose(yearly_timeseries_mean.values.flatten(), yearly_timeseries_mean2)
Ran in 0.9950 seconds
Ran in 0.1386 seconds
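The answer's code flattens a single-column frame, but the same cached indices generalise to the 10-column frames the constraints allow by summing over axis=0. A sketch with made-up data, not part of the original answer:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", "2022-12-31 23:00", freq="h")
df = pd.DataFrame(np.random.default_rng(0).random((len(idx), 10)), index=idx)

# Prepared once, exactly as in the answer above
years = idx.year.to_numpy()
indices_per_year = [np.where(years == y)[0] for y in np.unique(years)]
lengths = np.array([len(ix) for ix in indices_per_year])

# Per-iteration work: one row of sums per year, one column per series
vals = df.to_numpy()
yearly_sum = np.vstack([vals[ix].sum(axis=0) for ix in indices_per_year])
yearly_mean = yearly_sum / lengths[:, None]
```

The broadcast divide by lengths[:, None] gives every column its yearly mean without a second pass over the data.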