Pandas: drop rows in a consecutive time range that share the same pair of features
I have a dataset that looks like this:
| | id1 | id2 | time |
|---|---|---|---|
| 0 | 56 | 99 | 2007-04-06 15:49:21 |
| 1 | 56 | 104 | 2007-04-06 18:26:13 |
| 2 | 56 | 104 | 2007-04-13 11:27:52 |
| 3 | 56 | 104 | 2007-04-13 11:28:41 |
| 4 | 56 | 104 | 2007-04-13 11:28:52 |
| 5 | 56 | 104 | 2007-04-13 11:33:25 |
| 6 | 56 | 104 | 2007-04-13 14:35:52 |
| 7 | 104 | 56 | 2007-04-13 11:28:23 |
| 8 | 104 | 56 | 2007-04-13 11:29:46 |
| 9 | 128 | 105 | 2007-03-27 18:39:45 |
| 10 | 217 | 256 | 2007-03-29 14:55:57 |
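I want to drop every observation whose time value, for the same pair of IDs, is within 5 minutes of the previous row. This should also be "rolling": if there are three observations, where the second is 4 minutes after the first and the third is 4 minutes after the second, I keep only the first row. It also does not matter whether an ID appears in the id1 or the id2 column.
So the output for the dataframe above should be: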
| | id1 | id2 | time |
|---|---|---|---|
| 0 | 56 | 99 | 2007-04-06 15:49:21 |
| 1 | 56 | 104 | 2007-04-06 18:26:13 |
| 2 | 56 | 104 | 2007-04-13 11:27:52 |
| 3 | 56 | 104 | 2007-04-13 14:35:52 |
| 4 | 128 | 105 | 2007-03-27 18:39:45 |
| 5 | 217 | 256 | 2007-03-29 14:55:57 |
The best I could come up with is:
for i in range(1, len(df)):
    if df['time'].iloc[i] <= df['time'].iloc[i-1] + pd.Timedelta(minutes=5):
        df = df.drop(i)
        df = df.reset_index(drop=True)
    else:
        continue
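However: 1. it raises an indexer out-of-bounds error; 2. it is not "rolling"; 3. it distinguishes between an ID being in the id1 column and being in the id2 column.
Thanks in advance for your help!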
m = df.groupby(['id1', 'id2'], as_index=False)['time'].transform(lambda g: g.diff()).le(pd.Timedelta(minutes=5))['time']
df = df.loc[~(m | m.shift(-1))]
Step by step:
You can group by the `id1` and `id2` columns and take the `diff` of the `time` column
diff = df.groupby(['id1', 'id2'], as_index=False)['time'].transform(lambda g: g.diff())
print(diff)
time
0 NaT
1 NaT
2 6 days 16:55:39
3 0 days 00:06:49
4 0 days 00:00:11
5 0 days 00:05:33
6 0 days 03:01:27
7 NaT
8 0 days 00:01:23
9 NaT
10 NaT
Then compare against 5 minutes
m = diff.le(pd.Timedelta(minutes=5))
print(m)
time
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 False
10 False
And also mark the row before each True as True
m = m | m.shift(-1)
print(m)
time
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 True
8 True
9 False
10 False
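To see what `m | m.shift(-1)` does in isolation, here is a small sketch on a made-up boolean mask (the values are hypothetical, not taken from the data above). It uses `shift(-1, fill_value=False)` so the last slot stays boolean rather than becoming NaN; that is a variation on the step above, not the exact call used in this answer.

```python
import pandas as pd

# Hypothetical mask: True means "within 5 minutes of the previous row".
m = pd.Series([False, False, True, False, False, True, False])

# shift(-1) moves each value up one position, so a True at row i lands on row i-1.
# fill_value=False keeps the dtype boolean for the vacated last position.
row_before = m.shift(-1, fill_value=False)

# A row is flagged if it is close to its predecessor OR its successor is close to it.
flagged = m | row_before
print(flagged.tolist())  # [False, True, True, False, True, True, False]
```

Note that this flags both rows of every close pair, so filtering on `~flagged` removes the first row of a cluster as well as the later ones.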
Finally, use boolean indexing to select the rows where the mask is False
df = df.loc[~m['time']]
print(df)
id1 id2 time
0 56 99 2007-04-06 15:49:21
1 56 104 2007-04-06 18:26:13
2 56 104 2007-04-13 11:21:52
5 56 104 2007-04-13 11:34:25
6 56 104 2007-04-13 14:35:52
9 128 105 2007-03-27 18:39:45
10 217 256 2007-03-29 14:55:57
Note: rows 2 and 5 in this output show 11:21:52 and 11:34:25, which differ from the values in the question's table (11:27:52 and 11:33:25). The diffs printed above are consistent with the timestamps shown here.
Here is a solution using `groupby` and `diff`:
import numpy as np
import pandas as pd

MIN_LIMIT = 5  # minutes

def remove_duplicated_entries(df):
    time_diff = df['time'].diff()
    # Compare the full timedelta; .dt.seconds would ignore the days component.
    return df[(time_diff > pd.Timedelta(minutes=MIN_LIMIT)) | (time_diff.isna())]

# Created to sort ids to ignore the order in the groupby. You can reuse id1 and id2 instead if you don't care.
df[['id_min', 'id_max']] = np.sort(df[['id1', 'id2']], axis=1)
clean_df = df.sort_values('time').groupby(['id_min', 'id_max'], as_index=False).apply(remove_duplicated_entries).reset_index(drop=True).drop(columns=['id_min', 'id_max'])
This results in:
id1 id2 time
0 56 99 2007-04-06 15:49:21
1 56 104 2007-04-06 18:26:13
2 56 104 2007-04-13 11:27:52
3 56 104 2007-04-13 11:34:25
4 56 104 2007-04-13 14:35:52
5 128 105 2007-03-27 18:39:45
6 217 256 2007-03-29 14:55:57
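A side note on comparing a gap against the 5-minute threshold: `Series.dt.seconds` returns only the seconds component of each timedelta (0 to 86399) and ignores whole days, whereas `Series.dt.total_seconds()` and a direct comparison with a `pd.Timedelta` use the full duration. The sketch below uses made-up gaps to show the difference; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical gaps: 4 minutes, 1 day + 2 minutes, and 6 days 17:01:39.
gaps = pd.Series(pd.to_timedelta(["0 days 00:04:00", "1 days 00:02:00", "6 days 17:01:39"]))
limit = pd.Timedelta(minutes=5)

# Seconds component only: the "1 day + 2 minutes" gap looks like 120 seconds.
print(gaps.dt.seconds.tolist())          # [240, 120, 61299]

# Full duration in seconds.
print(gaps.dt.total_seconds().tolist())  # [240.0, 86520.0, 579699.0]

# Comparing the timedeltas directly also uses the full duration.
print((gaps > limit).tolist())           # [False, True, True]
```

With `.dt.seconds > 300`, the second gap would be treated as shorter than 5 minutes even though it is longer than a day.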
Note that using `diff` assumes the `time` column is sorted in ascending order (hence the `sort_values` call, to make sure that is the case).
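A minimal illustration of why the sort matters, using three made-up timestamps that are deliberately out of order: without sorting, `diff` can produce a negative gap, which then passes a "within 5 minutes" check by accident.

```python
import pandas as pd

# Hypothetical timestamps, deliberately out of order.
times = pd.Series(pd.to_datetime([
    "2007-04-13 11:34:25",
    "2007-04-13 11:27:52",
    "2007-04-13 11:28:41",
]))
limit = pd.Timedelta(minutes=5)

# Unsorted: the second gap is negative (-6 min 33 s), so it counts as "<= 5 minutes".
unsorted_gaps = times.diff()
print((unsorted_gaps <= limit).tolist())             # [False, True, True]

# Sorted: each gap is the time since the previous observation.
sorted_gaps = times.sort_values().diff()
print((sorted_gaps <= limit).sort_index().tolist())  # [False, False, True]
```

After sorting, only the 11:28:41 row (49 seconds after 11:27:52) is flagged; the 11:34:25 row is 5 min 44 s after its predecessor and is not.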
Edit after discussion (see comments): the following test set also breaks @Ynjxsjmh's answer (note the new items at the end):
df = pd.DataFrame({'id1':[56, 56, 56, 56, 56, 56, 56, 128, 217, 104, 104],
'id2':[99, 104, 104, 104, 104, 104, 104, 105, 256, 56, 56],
'time': pd.to_datetime(['2007-04-06 15:49:21', '2007-04-06 18:26:13', '2007-04-13 11:27:52', '2007-04-13 11:28:41',
'2007-04-13 11:28:52', '2007-04-13 11:34:25', '2007-04-13 14:35:52',
'2007-03-27 18:39:45', '2007-03-29 14:55:57', '2007-04-13 11:34:35', '2007-04-13 14:36:35'])})
The result of their answer is:
id1 id2 time
0 56 99 2007-04-06 15:49:21
1 56 104 2007-04-06 18:26:13
5 56 104 2007-04-13 11:34:25
6 56 104 2007-04-13 14:35:52
7 128 105 2007-03-27 18:39:45
8 217 256 2007-03-29 14:55:57
9 104 56 2007-04-13 11:34:35
10 104 56 2007-04-13 14:36:35
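For comparison, here is a minimal sketch of the sorted-pair `groupby`/`diff` idea run on this same test set, written so that the original index labels are preserved. It is one possible formulation under stated assumptions, not the exact code from either answer: the names `work`, `gap`, `keep` and `result` are illustrative, and it uses `groupby(...).diff()` directly instead of `apply`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id1": [56, 56, 56, 56, 56, 56, 56, 128, 217, 104, 104],
    "id2": [99, 104, 104, 104, 104, 104, 104, 105, 256, 56, 56],
    "time": pd.to_datetime([
        "2007-04-06 15:49:21", "2007-04-06 18:26:13", "2007-04-13 11:27:52",
        "2007-04-13 11:28:41", "2007-04-13 11:28:52", "2007-04-13 11:34:25",
        "2007-04-13 14:35:52", "2007-03-27 18:39:45", "2007-03-29 14:55:57",
        "2007-04-13 11:34:35", "2007-04-13 14:36:35",
    ]),
})

# Order-insensitive pair key: (56, 104) and (104, 56) end up in the same group.
pair = np.sort(df[["id1", "id2"]].to_numpy(), axis=1)
work = df.assign(id_min=pair[:, 0], id_max=pair[:, 1]).sort_values("time")

# Gap to the previous observation of the same pair, in time order.
gap = work.groupby(["id_min", "id_max"])["time"].diff()

# Keep the first row of each pair (NaT gap) and rows more than 5 minutes after
# their predecessor. Comparing against the immediately preceding row, kept or
# not, is what makes the rule "rolling".
keep = gap.isna() | (gap > pd.Timedelta(minutes=5))
result = df[keep.sort_index()]

print(result.index.tolist())  # [0, 1, 2, 5, 6, 7, 8]
```

On this test set the sketch keeps rows 0, 1, 2, 5, 6, 7 and 8; rows 9 and 10 are dropped because they fall 10 and 43 seconds after rows 5 and 6 of the same unordered pair.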