Analysis of duplicate rows in a DataFrame under specific conditions

I have the following DataFrame:

ID code date direction
GER 0000 2021-02-05 OUT
USA 1234 2021-04-03 IN
USA 7283 2021-03-11 OUT
GER 7384 2021-02-05 OUT
FRA 6523 2021-04-12 IN
ITL 1111 2021-04-05 IN
USA 1234 2021-04-03 IN
GER 2222 2021-02-05 OUT
ITL 0392 2021-04-05 IN

First, I want to get the rows that are duplicated on date, ID, and direction (their code values may differ):

 df = df[df.duplicated(['date', 'ID', 'direction'], keep=False)]

This gives the following table:

ID code date direction
USA 1234 2021-04-03 IN
USA 1234 2021-04-03 IN
GER 7384 2021-02-05 OUT
GER 0000 2021-02-05 OUT
GER 2222 2021-02-05 OUT
ITL 0392 2021-04-05 IN
ITL 1111 2021-04-05 IN
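
For anyone who wants to reproduce this step, here is a minimal sketch of the sample frame and the filter. It assumes that code is stored as a string (so the leading zeros in "0000" and "0392" survive) and that date is a plain string as well; the name dupes is only illustrative.

```python
import pandas as pd

# Sample data from the question. `code` and `date` are kept as strings
# (an assumption about the dtypes, not something the question states).
df = pd.DataFrame(
    {
        "ID": ["GER", "USA", "USA", "GER", "FRA", "ITL", "USA", "GER", "ITL"],
        "code": ["0000", "1234", "7283", "7384", "6523",
                 "1111", "1234", "2222", "0392"],
        "date": ["2021-02-05", "2021-04-03", "2021-03-11", "2021-02-05",
                 "2021-04-12", "2021-04-05", "2021-04-03", "2021-02-05",
                 "2021-04-05"],
        "direction": ["OUT", "IN", "OUT", "OUT", "IN", "IN", "IN", "OUT", "IN"],
    }
)

# keep=False flags every member of a duplicated group, not only the repeats.
dupes = df[df.duplicated(["date", "ID", "direction"], keep=False)]
print(dupes)
```

The filter keeps the original row order, so the rows may print in a different order than in the table above unless they are sorted afterwards.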

Then I want to drop the rows whose code value is duplicated. That is, I want to end up with the following table:

ID code date direction
GER 0000 2021-02-05 OUT
GER 7384 2021-02-05 OUT
GER 2222 2021-02-05 OUT
ITL 0392 2021-04-05 IN
ITL 1111 2021-04-05 IN

I hope you can help me do this with a few simple lines of code.

Finally, for each date, I want to display how many rows are left:

2021-02-05: 3 matches
2021-04-05: 2 matches

Maybe you want something like this:

    out = (
        df[df.duplicated(['date', 'ID', 'direction'], keep=False)]
        .drop_duplicates(subset=['code'], keep=False)
        .pivot_table(columns=['date'], aggfunc='size')
    )

Output:

date
2021-02-05    3
2021-04-05    2
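
If it helps, the same three steps can also be written out one at a time, with a plain groupby(...).size() swapped in for pivot_table in the counting step. This is only a sketch under a few assumptions: the table is read from text with every column as a string, and the names keys, dupes, unique_codes, and counts are made up for illustration.

```python
import io

import pandas as pd

text = """ID code date direction
GER 0000 2021-02-05 OUT
USA 1234 2021-04-03 IN
USA 7283 2021-03-11 OUT
GER 7384 2021-02-05 OUT
FRA 6523 2021-04-12 IN
ITL 1111 2021-04-05 IN
USA 1234 2021-04-03 IN
GER 2222 2021-02-05 OUT
ITL 0392 2021-04-05 IN
"""

# dtype=str keeps the leading zeros in `code` (assumed to be wanted).
df = pd.read_csv(io.StringIO(text), sep=r"\s+", dtype=str)

keys = ["date", "ID", "direction"]

# Step 1: rows whose (date, ID, direction) combination occurs more than once.
dupes = df[df.duplicated(keys, keep=False)]

# Step 2: drop every row whose code occurs more than once among those rows.
unique_codes = dupes.drop_duplicates(subset=["code"], keep=False)

# Step 3: count the remaining rows per date.
counts = unique_codes.groupby("date").size()

for date, n in counts.items():
    print(f"{date}: {n} matches")
```

Note that step 2 compares code values across the whole filtered frame rather than within each (date, ID, direction) group. On the sample data both readings give the same result, but they could differ if the same code appears under two different groups.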