A more efficient way to split timeseries data (pd.Series) at gaps?
I'm trying to split a pd.Series of sorted dates where the gap between consecutive dates is sometimes larger than normal. To do this I computed the gap sizes with pd.Series.diff() and then walked over all elements of the series with a while loop. Unfortunately this is computationally very expensive. Is there a better way (other than parallelization)?
A minimal example of my function:
import pandas as pd
import time

def get_samples_separated_at_gaps(data: pd.Series, normal_gap) -> list:
    diff = data.diff()
    # create the list that should contain all samples
    samples_list = [pd.Series(data[0])]
    i = 1
    while i < len(data):
        if diff[i] == normal_gap:
            # normal gap: add data[i] to the last sample in samples_list
            samples_list[-1] = samples_list[-1].append(pd.Series(data[i]))
        else:
            # larger gap: start a new sample in samples_list
            samples_list.append(pd.Series(data[i]))
        i += 1
    return samples_list
# build sample data for the example
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
# start sampling
start_time = time.time()
my_list_with_samples = get_samples_separated_at_gaps(data_with_samples, normal_distance)
print(f"Duration: {time.time() - start_time}")
The real data contains more than 150k values and the calculation takes several minutes... :/
Your code is a bit unclear about how the two different lists are stored. Specifically, I'm not sure what the correct structure of samples_list is supposed to look like.
In any case, with Series.pct_change and np.unique() you should be able to get roughly what you want:
import numpy as np

uniques, indices = np.unique(data_with_samples.diff()[1:].pct_change(),
                             return_index=True)
Now indices points you to the start and the end of the wrong-sized gap.
If your data can contain more than one gap, you will want to take diff()[1:].pct_change() and use where() to find all values that differ from 0.
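For illustration, a minimal sketch of that idea (my addition, assuming data_with_samples from the question):

import numpy as np

# ratio change between consecutive time differences; it is 0 wherever the
# spacing stays the same and non-zero where a gap starts or ends
changes = data_with_samples.diff()[1:].pct_change()

# positional offsets of the abnormal gap (the first value is NaN, so fill it)
gap_positions = np.where(changes.fillna(0) != 0)[0]
print(gap_positions)  # e.g. [ 9999 10000] for the example data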
- Same sample data as in the question
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])
# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
data_with_samples = first_sample.append(second_sample, ignore_index=True)
- Compare the time differences against normal_distance.seconds
- Create a helper column tag to separate the gap groups
import numpy as np

# start sampling
start_time = time.time()

df = data_with_samples.to_frame()
# .dt.seconds is the seconds component of each time difference, which is
# sufficient here because the gaps are shorter than a day
df['time_diff'] = df[0].diff().dt.seconds
cond = (df['time_diff'] > normal_distance.seconds) | (df['time_diff'].isnull())
df['tag'] = np.where(cond, 1, 0)
df['tag'] = df['tag'].cumsum()

my_list_with_samples = []
for _, group in df.groupby('tag'):
    my_list_with_samples.append(group[0])

print(f"Duration: {time.time() - start_time}")
I'm not sure I fully understand what you want, but I think this works:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
idx = data_with_samples[data_with_samples.diff(1) > normal_distance].index
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
idx collects the indices that come directly after a gap; all that remains is to split the series at these indices and pack the pieces into the list samples_list.
If the index is non-standard, you need a bit of overhead (reset the index and later set it back to the original) to make sure iloc can be used:
...
data_with_samples = first_sample.append(second_sample, ignore_index=True)
data_with_samples = data_with_samples.reset_index(drop=False).rename(columns={0: 'data'})
idx = data_with_samples.data[data_with_samples.data.diff(1) > normal_distance].index
data_with_samples.set_index('index', drop=True, inplace=True)
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])
(Your example doesn't need this.)
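A rough check (only an illustration, for the example data from the question): either variant should split the series into pieces that together cover it exactly once.

assert sum(len(s) for s in samples_list) == len(data_with_samples)
print([len(s) for s in samples_list])  # [10000, 10000] for the example data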