A more efficient way to split timeseries data (pd.Series) at gaps?

Question

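I am trying to split a pd.Series of sorted dates that sometimes contains gaps larger than the normal spacing. To do this, I computed the size of each gap with pd.Series.diff() and then iterated over all elements of the series in a while loop. Unfortunately, this is computationally very expensive. Is there a better way (other than parallelization)?

A minimal example of my function: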

import pandas as pd
import time


def get_samples_separated_at_gaps(data: pd.Series, normal_gap) -> list:
    diff = data.diff()
    # create the list that will hold all samples
    samples_list = [pd.Series(data[0])]
    i = 1
    while i < len(data):
        if diff[i] == normal_gap:
            # normal gap: add data[i] to the last sample in samples_list
            # (pd.concat replaces Series.append, which was removed in pandas 2.0)
            samples_list[-1] = pd.concat([samples_list[-1], pd.Series(data[i])])
        else:
            # not a normal gap: start a new sample in samples_list
            samples_list.append(pd.Series(data[i]))
        i += 1
    return samples_list


# make sample data as example
normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])

# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
data_with_samples = pd.concat([first_sample, second_sample], ignore_index=True)
# start sampling
start_time = time.time()
my_list_with_samples = get_samples_separated_at_gaps(data_with_samples, normal_distance)
print(f"Duration: {time.time() - start_time}")

The real data has more than 150k entries and takes several minutes to compute... :/

Answer 1

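Your code is a bit unclear about how you want to store the two separate lists. Specifically, I am not sure what structure you have in mind for samples_list.

In any case, Series.pct_change and np.unique() should get you roughly what you are asking for.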

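import numpy as np

# relative change between consecutive differences; 0 wherever the spacing is unchanged
uniques, indices = np.unique(
    data_with_samples.diff().iloc[1:].pct_change(),
    return_index=True)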

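indices now holds the position of the first occurrence of each distinct value, which points you to the start and end of the anomalous gap.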

If your data has more than one gap, use only diff()[1:].pct_change() and combine it with where() to find all values that differ from 0.
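A minimal sketch of that multi-gap idea, on hypothetical toy data that is not from the question. It converts the differences to seconds before calling pct_change (my own assumption, to keep the arithmetic in plain floats) and uses a boolean mask rather than where(). Treating a positive change as "first element after a gap" only holds if two gaps never sit directly next to each other.

```python
import pandas as pd

# Hypothetical toy data: three runs of 10-minute steps separated by two larger gaps.
stamps = pd.Series(pd.to_datetime([
    "2020-01-01 00:00", "2020-01-01 00:10", "2020-01-01 00:20",
    "2020-01-01 05:00", "2020-01-01 05:10",
    "2020-01-02 00:00", "2020-01-02 00:10",
]))

# Relative change between consecutive differences (in seconds).
# 0 means the spacing is the same as one step earlier.
change = stamps.diff().dt.total_seconds().iloc[1:].pct_change()

# Labels where the spacing changed; the leading NaN is excluded explicitly.
# Each gap appears twice: once entering it and once leaving it.
changed = change.index[(change != 0) & change.notna()]

# A positive change marks the first element after a gap
# (assumes gaps are never directly adjacent).
gap_starts = change.index[change > 0]
```

The pieces can then be cut out of stamps at the labels in gap_starts, for example with iloc slices when the index is the default RangeIndex.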

Answer 2

Using the same setup as in the question:

import time

import numpy as np
import pandas as pd

normal_distance = pd.Timedelta(minutes=10)
first_sample = pd.Series([pd.Timestamp(2020, 1, 1) + normal_distance * i for i in range(10000)])
gap = pd.Timedelta(hours=10)
second_sample = pd.Series([first_sample.iloc[-1] + gap + normal_distance * i for i in range(10000)])

# the example data with two samples and one bigger gap of 10 hours instead of 10 minutes
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
data_with_samples = pd.concat([first_sample, second_sample], ignore_index=True)

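Compare the time differences with normal_distance (both in seconds), then create a helper column tag that separates the groups between gaps:

# start sampling
start_time = time.time()
df = data_with_samples.to_frame()
# total_seconds() also counts whole days; .dt.seconds would silently drop them
df['time_diff'] = df[0].diff().dt.total_seconds()
# a new group starts at the first row (NaN diff) and after every oversized gap
cond = (df['time_diff'] > normal_distance.total_seconds()) | (df['time_diff'].isnull())
df['tag'] = np.where(cond, 1, 0)
# the running sum gives every group its own integer label
df['tag'] = df['tag'].cumsum()
my_list_with_samples = []
for _, group in df.groupby('tag'):
    my_list_with_samples.append(group[0])
print(f"Duration: {time.time() - start_time}")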
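The same tag-and-group idea can probably be written without the helper DataFrame. The sketch below is my own condensation, not the answerer's code; the function name split_at_gaps is hypothetical. Note that it splits only where the step is larger than normal_gap, whereas the function in the question starts a new sample on any step that is not exactly equal to it.

```python
import pandas as pd


def split_at_gaps(data: pd.Series, normal_gap: pd.Timedelta) -> list:
    """Split a sorted datetime Series wherever the step exceeds normal_gap."""
    if data.empty:
        return []
    # True for every element that follows an oversized gap.
    # The first diff is NaT, which compares as False.
    starts_new_run = data.diff() > normal_gap
    # Counting the gaps seen so far gives each run its own integer label.
    run_id = starts_new_run.cumsum()
    return [chunk for _, chunk in data.groupby(run_id)]


# Small usage example with one 10-hour gap between two runs.
step = pd.Timedelta(minutes=10)
first = [pd.Timestamp(2020, 1, 1) + step * i for i in range(5)]
second = [first[-1] + pd.Timedelta(hours=10) + step * i for i in range(3)]
pieces = split_at_gaps(pd.Series(first + second), step)
print([len(piece) for piece in pieces])
```

Because diff, cumsum and groupby are all vectorized, the cost should grow roughly linearly with the length of the series instead of quadratically as in the append loop.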

Answer 3

I am not sure I fully understand what you want, but I think this works:

...
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
data_with_samples = pd.concat([first_sample, second_sample], ignore_index=True)

idx = data_with_samples[data_with_samples.diff(1) > normal_distance].index
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])

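idx collects the indices directly after each gap; the rest is just splitting the series at these indices and packing the pieces into the list samples_list.

If the index is non-standard, you need some extra overhead (reset the index, then set it back to the original one) to make sure iloc can be used.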

...
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
data_with_samples = pd.concat([first_sample, second_sample], ignore_index=True)

# move the original index into a column so the new default index is positional
data_with_samples = data_with_samples.reset_index(drop=False).rename(columns={0: 'data'})
idx = data_with_samples['data'][data_with_samples['data'].diff(1) > normal_distance].index
# restore the original index; idx still holds positions usable with iloc
data_with_samples.set_index('index', drop=True, inplace=True)
samples_list = [data_with_samples]
if len(idx) > 0:
    samples_list = ([data_with_samples.iloc[:idx[0]]]
                    + [data_with_samples.iloc[idx[i-1]:idx[i]] for i in range(1, len(idx))]
                    + [data_with_samples.iloc[idx[-1]:]])

(Your example does not need this.)
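One way to sidestep the reset/restore step might be to take the cut points as integer positions straight from the boolean mask, so that iloc works whatever the index looks like. This is a sketch under that assumption, not part of the original answer; split_by_position is a hypothetical name.

```python
import numpy as np
import pandas as pd


def split_by_position(data: pd.Series, normal_gap: pd.Timedelta) -> list:
    """Split at gaps using positions, so any index type is fine."""
    # Positions (not labels) of the first element after each oversized gap.
    cut_points = np.flatnonzero((data.diff() > normal_gap).to_numpy())
    # Add the outer boundaries so every piece is a [start, stop) slice.
    bounds = [0, *cut_points, len(data)]
    return [data.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


# Usage with a non-standard (string) index and a single gap.
step = pd.Timedelta(minutes=10)
t0 = pd.Timestamp(2020, 1, 1)
labelled = pd.Series(
    [t0, t0 + step, t0 + 2 * step,
     t0 + pd.Timedelta(hours=5), t0 + pd.Timedelta(hours=5) + step],
    index=list("abcde"),
)
parts = split_by_position(labelled, step)
print([list(part.index) for part in parts])
```

If there is no gap at all, the list contains the whole series as its only element, which matches the fallback samples_list = [data_with_samples] above.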
