Pandas: drop rows within a consecutive time range that share the same pair of features

I have a dataset that looks like this:

    id1  id2                time
0    56   99 2007-04-06 15:49:21
1    56  104 2007-04-06 18:26:13
2    56  104 2007-04-13 11:27:52
3    56  104 2007-04-13 11:28:41
4    56  104 2007-04-13 11:28:52
5    56  104 2007-04-13 11:33:25
6    56  104 2007-04-13 14:35:52
7   104   56 2007-04-13 11:28:23
8   104   56 2007-04-13 11:29:46
9   128  105 2007-03-27 18:39:45
10  217  256 2007-03-29 14:55:57
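
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes the time column should be parsed as datetime64 via pd.to_datetime, which the table itself does not state.

```python
import pandas as pd

# Sample data copied from the table above; parsing `time` as datetime is an assumption.
df = pd.DataFrame({
    "id1": [56, 56, 56, 56, 56, 56, 56, 104, 104, 128, 217],
    "id2": [99, 104, 104, 104, 104, 104, 104, 56, 56, 105, 256],
    "time": pd.to_datetime([
        "2007-04-06 15:49:21", "2007-04-06 18:26:13", "2007-04-13 11:27:52",
        "2007-04-13 11:28:41", "2007-04-13 11:28:52", "2007-04-13 11:33:25",
        "2007-04-13 14:35:52", "2007-04-13 11:28:23", "2007-04-13 11:29:46",
        "2007-03-27 18:39:45", "2007-03-29 14:55:57",
    ]),
})
print(df.dtypes)
```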

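I want to drop every observation whose time value is within 5 minutes of the previous row for the same pair of IDs. It should also be "rolling": if there are three observations, where the second is 4 minutes after the first and the third is 4 minutes after the second, I keep only the first row. Also, it does not matter whether an ID appears in the id1 or the id2 column.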

So the output for the DataFrame above should be:

   id1  id2                time
0   56   99 2007-04-06 15:49:21
1   56  104 2007-04-06 18:26:13
2   56  104 2007-04-13 11:27:52
3   56  104 2007-04-13 14:35:52
4  128  105 2007-03-27 18:39:45
5  217  256 2007-03-29 14:55:57

The best I could come up with is:

for i in range(1, len(df)):
    if df['time'].iloc[i] <= df['time'].iloc[i-1] + pd.Timedelta(minutes=5):
        df = df.drop(i)
        df = df.reset_index(drop=True)
    else:
        continue

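But: 1. It raises an indexer out-of-bounds error. 2. It is not "rolling". 3. It distinguishes whether an ID is in the id1 or the id2 column.

Thanks in advance for your help!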

m = df.groupby(['id1', 'id2'], as_index=False)['time'].transform(lambda g: g.diff()).le(pd.Timedelta(minutes=5))['time']
df = df.loc[~(m | m.shift(-1))]

Step by step:

You can group by the id1 and id2 columns and take the diff of time:

diff = df.groupby(['id1', 'id2'], as_index=False)['time'].transform(lambda g: g.diff())
print(diff)

              time
0              NaT
1              NaT
2  6 days 16:55:39
3  0 days 00:06:49
4  0 days 00:00:11
5  0 days 00:05:33
6  0 days 03:01:27
7              NaT
8  0 days 00:01:23
9              NaT
10             NaT

Then compare against 5 minutes:

m = diff.le(pd.Timedelta(minutes=5))
print(m)

     time
0   False
1   False
2   False
3   False
4    True
5   False
6   False
7   False
8    True
9   False
10  False

Then also mark the row before each True as True:

m = m | m.shift(-1)
print(m)

     time
0   False
1   False
2   False
3    True
4    True
5   False
6   False
7    True
8    True
9   False
10  False
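
To see what the shift step does in isolation, here is a small sketch on a made-up boolean mask (the values are illustrative, not taken from the dataset). The fill_value=False argument is my addition so that the shifted Series stays boolean; the answer's code does not use it.

```python
import pandas as pd

# Toy mask: True marks a row that is within 5 minutes of the previous row.
m = pd.Series([False, False, True, True, False])

# shift(-1) moves every flag one row up, so OR-ing also flags the row
# immediately before each True.
widened = m | m.shift(-1, fill_value=False)
print(widened.tolist())  # [False, True, True, True, False]
```

Note that this flags the first row of each close cluster as well, so that row is dropped too, whereas the question asks to keep it.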

Finally, use boolean indexing to select the False rows:

df = df.loc[~m['time']]
print(df)

    id1  id2                time
0    56   99 2007-04-06 15:49:21
1    56  104 2007-04-06 18:26:13
2    56  104 2007-04-13 11:21:52
5    56  104 2007-04-13 11:34:25
6    56  104 2007-04-13 14:35:52
9   128  105 2007-03-27 18:39:45
10  217  256 2007-03-29 14:55:57

Here is a solution using groupby and diff:

import numpy as np
import pandas as pd

MIN_LIMIT = 5  # minutes

def remove_duplicated_entries(df):
    time_diff = df['time'].diff()
    # Compare the full Timedelta: .dt.seconds would ignore the days component.
    return df[(time_diff > pd.Timedelta(minutes=MIN_LIMIT)) | (time_diff.isna())]

# Created to sort ids to ignore the order in the groupby. You can reuse id1 and id2 instead if you don't care.
df[['id_min', 'id_max']] = np.sort(df[['id1', 'id2']], axis=1)
clean_df = df.sort_values('time').groupby(['id_min', 'id_max'], as_index=False).apply(remove_duplicated_entries).reset_index(drop=True).drop(columns=['id_min', 'id_max'])

This results in:

   id1  id2                time
0   56   99 2007-04-06 15:49:21
1   56  104 2007-04-06 18:26:13
2   56  104 2007-04-13 11:27:52
3   56  104 2007-04-13 11:34:25
4   56  104 2007-04-13 14:35:52
5  128  105 2007-03-27 18:39:45
6  217  256 2007-03-29 14:55:57

Note that using diff assumes the time column is sorted in ascending order (hence the sort_values call, to make sure that is the case).
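
One caveat when thresholding time gaps: Series.dt.seconds returns only the seconds component of each Timedelta (0 to 86399) and ignores whole days, so a gap of 1 day and 1 minute looks like 60 seconds. The sketch below uses made-up gaps (illustrative values, not from the dataset) to contrast it with comparing the full Timedelta:

```python
import pandas as pd

# Illustrative gaps: 4 minutes, 1 day + 1 minute, and about 3 hours.
gaps = pd.Series(pd.to_timedelta(["0 days 00:04:00", "1 days 00:01:00", "0 days 03:01:27"]))
limit = pd.Timedelta(minutes=5)

# Seconds component only: the 1-day gap is reduced to 60 seconds.
by_component = gaps.dt.seconds > limit.total_seconds()
# Full duration: the 1-day gap is correctly seen as longer than 5 minutes.
by_duration = gaps > limit

print(by_component.tolist())  # [False, False, True]
print(by_duration.tolist())   # [False, True, True]
```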

Edit after discussion (see comments): the following test set also breaks @Ynjxsjmh's answer (note the new items at the end):


df = pd.DataFrame({'id1':[56, 56, 56, 56, 56, 56, 56, 128, 217, 104, 104], 
                   'id2':[99, 104, 104, 104, 104, 104, 104, 105, 256, 56, 56],
                   'time': pd.to_datetime(['2007-04-06 15:49:21', '2007-04-06 18:26:13', '2007-04-13 11:27:52', '2007-04-13 11:28:41',
                            '2007-04-13 11:28:52', '2007-04-13 11:34:25', '2007-04-13 14:35:52', 
                            '2007-03-27 18:39:45', '2007-03-29 14:55:57', '2007-04-13 11:34:35', '2007-04-13 14:36:35'])})

The result of their answer is:

    id1  id2                time
0    56   99 2007-04-06 15:49:21
1    56  104 2007-04-06 18:26:13
5    56  104 2007-04-13 11:34:25
6    56  104 2007-04-13 14:35:52
7   128  105 2007-03-27 18:39:45
8   217  256 2007-03-29 14:55:57
9   104   56 2007-04-13 11:34:35
10  104   56 2007-04-13 14:36:35
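
For comparison, here is a sketch of the order-insensitive approach (sorted id pair plus a per-group diff) run on the same test set. The helper name drop_close_rows and the id_min/id_max columns are my own illustrative choices. It assumes "within 5 minutes" means a gap of at most 5 minutes from the previous row of the same unordered pair, whether or not that previous row was itself kept.

```python
import numpy as np
import pandas as pd

# Same test set as above.
df = pd.DataFrame({
    "id1": [56, 56, 56, 56, 56, 56, 56, 128, 217, 104, 104],
    "id2": [99, 104, 104, 104, 104, 104, 104, 105, 256, 56, 56],
    "time": pd.to_datetime([
        "2007-04-06 15:49:21", "2007-04-06 18:26:13", "2007-04-13 11:27:52",
        "2007-04-13 11:28:41", "2007-04-13 11:28:52", "2007-04-13 11:34:25",
        "2007-04-13 14:35:52", "2007-03-27 18:39:45", "2007-03-29 14:55:57",
        "2007-04-13 11:34:35", "2007-04-13 14:36:35",
    ]),
})

def drop_close_rows(frame, minutes=5):
    """Keep rows whose gap to the previous row of the same unordered id pair exceeds `minutes`."""
    # Sort each (id1, id2) pair so that (104, 56) and (56, 104) share one key.
    ids = np.sort(frame[["id1", "id2"]].to_numpy(), axis=1)
    ordered = frame.assign(id_min=ids[:, 0], id_max=ids[:, 1]).sort_values("time")
    # Gap to the previous row within each unordered pair (NaT for the first row).
    gap = ordered.groupby(["id_min", "id_max"])["time"].diff()
    keep = gap.isna() | (gap > pd.Timedelta(minutes=minutes))
    # Restore the original row order and drop the helper columns.
    return ordered.loc[keep, ["id1", "id2", "time"]].sort_index()

result = drop_close_rows(df)
print(result.index.tolist())  # [0, 1, 2, 5, 6, 7, 8]
```

Rows 9 and 10 are dropped here because (104, 56) is treated as the same pair as (56, 104), and each falls within 5 minutes of the preceding row for that pair.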