Analysis of duplicate rows in a DataFrame under specific conditions
I have the following DataFrame:
ID | code | date | direction |
---|---|---|---|
GER | 0000 | 2021-02-05 | OUT |
USA | 1234 | 2021-04-03 | IN |
USA | 7283 | 2021-03-11 | OUT |
GER | 7384 | 2021-02-05 | OUT |
FRA | 6523 | 2021-04-12 | IN |
ITL | 1111 | 2021-04-05 | IN |
USA | 1234 | 2021-04-03 | IN |
GER | 2222 | 2021-02-05 | OUT |
ITL | 0392 | 2021-04-05 | IN |
First, I want to get the rows that are duplicated on date, ID and direction, but whose code values differ.
df = df[df.duplicated(['date', 'ID', 'direction'], keep=False)]
This gives the following table:
ID | code | date | direction |
---|---|---|---|
USA | 1234 | 2021-04-03 | IN |
USA | 1234 | 2021-04-03 | IN |
GER | 7384 | 2021-02-05 | OUT |
GER | 0000 | 2021-02-05 | OUT |
GER | 2222 | 2021-02-05 | OUT |
ITL | 0392 | 2021-04-05 | IN |
ITL | 1111 | 2021-04-05 | IN |
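To reproduce this step, here is a minimal sketch that rebuilds the sample data and applies the same `duplicated` mask. Storing `code` as strings is my assumption (otherwise values like `0000` and `0392` would lose their leading zeros), and `mask` and `dups` are illustrative names. Note that boolean indexing keeps the original row order, so the rows may come out in a different order than in the table above.

```python
import pandas as pd

# Sample data from the question; `code` is stored as strings (assumption)
# so that leading zeros such as "0000" are preserved.
df = pd.DataFrame({
    "ID": ["GER", "USA", "USA", "GER", "FRA", "ITL", "USA", "GER", "ITL"],
    "code": ["0000", "1234", "7283", "7384", "6523", "1111", "1234", "2222", "0392"],
    "date": ["2021-02-05", "2021-04-03", "2021-03-11", "2021-02-05", "2021-04-12",
             "2021-04-05", "2021-04-03", "2021-02-05", "2021-04-05"],
    "direction": ["OUT", "IN", "OUT", "OUT", "IN", "IN", "IN", "OUT", "IN"],
})

# keep=False marks every member of a duplicate group, not only the repeats.
mask = df.duplicated(["date", "ID", "direction"], keep=False)
dups = df[mask]
print(dups)
```

With `keep=False`, the first occurrence in each group is flagged too, which is why all three GER rows appear rather than just two.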
Then I want to drop the rows whose code value is duplicated. That is, I want to obtain the following table:
ID | code | date | direction |
---|---|---|---|
GER | 0000 | 2021-02-05 | OUT |
GER | 7384 | 2021-02-05 | OUT |
GER | 2222 | 2021-02-05 | OUT |
ITL | 0392 | 2021-04-05 | IN |
ITL | 1111 | 2021-04-05 | IN |
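One way this second step could be sketched is with `drop_duplicates(..., keep=False)`. Be aware that `subset=["code"]` compares codes across the whole frame, so a code shared by two different groups would also be removed. Including the grouping columns in `subset` restricts the comparison to rows that match on all four columns. On the sample data both variants give the same five rows. The names `by_code` and `by_all` are illustrative only.

```python
import pandas as pd

# Sample data from the question; `code` kept as strings (assumption).
df = pd.DataFrame({
    "ID": ["GER", "USA", "USA", "GER", "FRA", "ITL", "USA", "GER", "ITL"],
    "code": ["0000", "1234", "7283", "7384", "6523", "1111", "1234", "2222", "0392"],
    "date": ["2021-02-05", "2021-04-03", "2021-03-11", "2021-02-05", "2021-04-12",
             "2021-04-05", "2021-04-03", "2021-02-05", "2021-04-05"],
    "direction": ["OUT", "IN", "OUT", "OUT", "IN", "IN", "IN", "OUT", "IN"],
})

keys = ["date", "ID", "direction"]
dups = df[df.duplicated(keys, keep=False)]

# Variant 1: drop every row whose code occurs more than once anywhere in `dups`.
by_code = dups.drop_duplicates(subset=["code"], keep=False)

# Variant 2: drop only rows that repeat on the keys AND the code together.
by_all = dups.drop_duplicates(subset=keys + ["code"], keep=False)

print(by_code)
```

Which variant is appropriate depends on whether the same code can legitimately appear under different IDs or dates in the real data.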
I hope you can help me do this in a few simple lines of code.
Finally, for each date, I want to show how many rows match:
2021-02-05: 3 matches
2021-04-05: 2 matches
Maybe you want something like this:
out = (
    df[df.duplicated(['date', 'ID', 'direction'], keep=False)]
    .drop_duplicates(subset=['code'], keep=False)
    .pivot_table(columns=['date'], aggfunc='size')
)
Output:
date
2021-02-05 3
2021-04-05 2
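As an alternative sketch (not part of the answer above), the per-date count can also be obtained with `groupby("date").size()`, which on this sample yields the same counts as the output shown. The final loop prints them in the `date: n matches` form asked for in the question. Storing `code` as strings and the variable names are my assumptions.

```python
import pandas as pd

# Sample data from the question; `code` kept as strings (assumption).
df = pd.DataFrame({
    "ID": ["GER", "USA", "USA", "GER", "FRA", "ITL", "USA", "GER", "ITL"],
    "code": ["0000", "1234", "7283", "7384", "6523", "1111", "1234", "2222", "0392"],
    "date": ["2021-02-05", "2021-04-03", "2021-03-11", "2021-02-05", "2021-04-12",
             "2021-04-05", "2021-04-03", "2021-02-05", "2021-04-05"],
    "direction": ["OUT", "IN", "OUT", "OUT", "IN", "IN", "IN", "OUT", "IN"],
})

filtered = (
    df[df.duplicated(["date", "ID", "direction"], keep=False)]
    .drop_duplicates(subset=["code"], keep=False)
)

# Count the remaining rows per date; the result is a Series indexed by date.
counts = filtered.groupby("date").size()

for date, n in counts.items():
    print(f"{date}: {n} matches")
```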