How can I drop duplicate rows based on a time delta while keeping the latest occurrence of each record?
I have a table of the form:
ID | DATE_ENCOUNTER | LOAD |
---|---|---|
151336 | 2017-08-22 | 40 |
151336 | 2017-08-23 | 40 |
151336 | 2017-08-24 | 40 |
151336 | 2017-08-25 | 40 |
151336 | 2017-09-05 | 50 |
151336 | 2017-09-06 | 50 |
151336 | 2017-10-16 | 51 |
151336 | 2017-10-17 | 51 |
151336 | 2017-10-18 | 51 |
151336 | 2017-10-30 | 50 |
151336 | 2017-10-31 | 50 |
151336 | 2017-11-01 | 50 |
151336 | 2017-12-13 | 62 |
151336 | 2018-01-03 | 65 |
151336 | 2018-02-09 | 60 |
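Although the dates differ, some records are duplicates of one another because they fall within a 4-day delta. When the timestamps/dates are close (within 4 days) but not identical, how can I drop the duplicates from the DataFrame, removing the earlier records and keeping the latest one? The result should look like the following table: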
ID | DATE_ENCOUNTER | LOAD |
---|---|---|
151336 | 2017-08-25 | 40 |
151336 | 2017-09-06 | 50 |
151336 | 2017-10-18 | 51 |
151336 | 2017-11-01 | 50 |
151336 | 2017-12-13 | 62 |
151336 | 2018-01-03 | 65 |
151336 | 2018-02-09 | 60 |
I tried:
m = df.groupby('ID').DATE_ENCOUNTER.apply(lambda x: x.diff().dt.days < 4)
m2 = df.ID.duplicated(keep=False) & (m | m.shift(-1))
df_dedup2 = df[~m2]
Here is some code to generate the DataFrame:
import pandas as pd
details = {
'ID':[151336,151336,151336,151336,151336,151336,151336,151336,151336,151336,151336,151336,151336,151336,151336],
'DATE_ENCOUNTER':['2017-08-22','2017-08-23','2017-08-24','2017-08-25','2017-09-05','2017-09-06','2017-10-16','2017-10-17','2017-10-18','2017-10-30','2017-10-31','2017-11-01','2017-12-13','2018-01-03','2018-02-09'],
'LOAD':[40,40,40,40,50,50,51,51,51,50,50,50,62,65,60]
}
df=pd.DataFrame(details)
Note that the real data has more fields and more IDs.
You can use:
# Convert to datetime format if not already in datetime
# df['DATE_ENCOUNTER'] = pd.to_datetime(df['DATE_ENCOUNTER'])
# Sort DATE_ENCOUNTER within the same ID if not already in this sequence
# df = df.sort_values(by=['ID', 'DATE_ENCOUNTER'])
# reuse your mask `m`; group_keys=False keeps the original row index
# (on pandas >= 2.0, apply would otherwise prepend `ID` to the index)
m = df.groupby('ID', group_keys=False).DATE_ENCOUNTER.apply(lambda x: x.diff().dt.days < 4)
# set grouping of consecutive entries within 4 days difference within the same `ID`
g = (~m).groupby(df['ID']).cumsum()
# pick the last entry in each group
df.groupby(['ID', g], as_index=False).last()
Result:
       ID DATE_ENCOUNTER  LOAD
0  151336     2017-08-25    40
1  151336     2017-09-06    50
2  151336     2017-10-18    51
3  151336     2017-11-01    50
4  151336     2017-12-13    62
5  151336     2018-01-03    65
6  151336     2018-02-09    60
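To see how the grouping key `g` comes about, here is a minimal sketch that prints the intermediate values on a shortened version of the sample data. It assumes `DATE_ENCOUNTER` is already a datetime column and the rows are sorted by date within each `ID`. The names `gap_days`, `starts_group`, `steps` and `latest` are introduced here only for illustration, and `groupby(...).diff()` is swapped in for the `apply` call as another way to get the per-ID gaps.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [151336] * 6,
    "DATE_ENCOUNTER": pd.to_datetime([
        "2017-08-22", "2017-08-23", "2017-08-25",
        "2017-09-05", "2017-09-06", "2017-10-16",
    ]),
    "LOAD": [40, 40, 40, 50, 50, 51],
})

# Days since the previous row of the same ID (NaN for the first row of each ID).
gap_days = df.groupby("ID")["DATE_ENCOUNTER"].diff().dt.days

# A row starts a new cluster unless it is within 4 days of the previous row.
# NaN < 4 is False, so the first row of each ID always starts a cluster.
starts_group = ~(gap_days < 4)

# A running count of cluster starts per ID gives every cluster its own label.
g = starts_group.astype(int).groupby(df["ID"]).cumsum()

steps = df.assign(gap_days=gap_days, starts_group=starts_group, g=g)
print(steps)

# Keep the last row of every (ID, g) cluster; the original index is preserved.
latest = steps.groupby(["ID", "g"]).tail(1)[["ID", "DATE_ENCOUNTER", "LOAD"]]
print(latest)
```

Here the six rows get the labels 1, 1, 1, 2, 2, 3, so the rows kept are 2017-08-25, 2017-09-06 and 2017-10-16.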
You can use:
df[(df.groupby('ID')
      ['DATE_ENCOUNTER']
      .diff(-1).dt.days.mul(-1)  # calculate the difference
      .fillna(float('inf'))      # make sure last row is kept
      .ge(4)                     # select diff >= 4
   )]
Output:
        ID DATE_ENCOUNTER  LOAD
3   151336     2017-08-25    40
5   151336     2017-09-06    50
8   151336     2017-10-18    51
11  151336     2017-11-01    50
12  151336     2017-12-13    62
13  151336     2018-01-03    65
14  151336     2018-02-09    60
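Since the question mentions more fields and more IDs, the same "look at the next row" idea can be wrapped in a small function. This is only a sketch under stated assumptions: `keep_latest_in_cluster`, its `max_gap_days` parameter and the extra `SITE` column are hypothetical names, not part of the question or the answers. It compares timedeltas directly instead of going through `.dt.days`, so dates with a time-of-day component are not rounded, and it converts and sorts the data first, so unsorted string dates are accepted.

```python
import pandas as pd


def keep_latest_in_cluster(df, max_gap_days=4):
    """Keep the last row of each run of rows spaced < max_gap_days apart, per ID.

    Hypothetical helper; assumes columns named ID and DATE_ENCOUNTER
    as in the question.
    """
    out = df.copy()
    out["DATE_ENCOUNTER"] = pd.to_datetime(out["DATE_ENCOUNTER"])
    out = out.sort_values(["ID", "DATE_ENCOUNTER"])
    # Time until the next row of the same ID (NaT for the last row of each ID).
    to_next = out.groupby("ID")["DATE_ENCOUNTER"].shift(-1) - out["DATE_ENCOUNTER"]
    # Keep a row when the next one is far enough away, or when there is none.
    keep = to_next.isna() | (to_next >= pd.Timedelta(days=max_gap_days))
    return out[keep]


df = pd.DataFrame({
    "ID": [2, 1, 1, 1, 2, 2],
    "DATE_ENCOUNTER": ["2017-08-22", "2017-08-22", "2017-08-23",
                       "2017-09-05", "2017-08-24", "2017-08-30"],
    "LOAD": [7, 40, 40, 50, 7, 8],
    "SITE": ["a", "b", "b", "c", "a", "a"],  # extra field, carried through unchanged
})

result = keep_latest_in_cluster(df)
print(result)
```

Because rows are filtered rather than aggregated, every other column is returned untouched and the original index labels survive, which makes it easy to trace the kept rows back to the source data.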