Python pandas: Compare rows of a DataFrame based on some columns and drop the row with the lowest value

Question

I have a DataFrame df:

       first_seen              last_seen             uri
0   2015-05-11 23:08:46     2015-05-11 23:08:50 http://11i-ssaintandder.com/
1   2015-05-11 23:08:46     2015-05-11 23:08:46 http://11i-ssaintandder.com/
2   2015-05-02 18:27:10     2015-06-06 03:52:03 http://goo.gl/NMqjd1
3   2015-05-02 18:27:10     2015-06-08 08:44:53 http://goo.gl/NMqjd1

I want to drop rows that share the same "first_seen" and "uri", keeping only the row with the latest last_seen.

Here is an example of the expected dataset:

       first_seen              last_seen             uri
0   2015-05-11 23:08:46     2015-05-11 23:08:50 http://11i-ssaintandder.com/
3   2015-05-02 18:27:10     2015-06-08 08:44:53 http://goo.gl/NMqjd1

Does anyone know how to do this without writing a for loop?

Answer 1

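Call drop_duplicates, pass the columns to consider when matching duplicates as the subset argument, and set take_last=True (deprecated in pandas 0.17 in favor of keep='last', and later removed):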

In [295]:

df.drop_duplicates(subset=['first_seen','uri'], take_last=True)
Out[295]:
  index          first_seen            last_seen                           uri
1     1 2015-05-11 23:08:46  2015-05-11 23:08:46  http://11i-ssaintandder.com/
3     3 2015-05-02 18:27:10  2015-06-08 08:44:53          http://goo.gl/NMqjd1
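
For newer pandas versions, where take_last is no longer accepted, the same call can be sketched with keep="last". This is a minimal sketch, assuming the question's frame is rebuilt by hand with datetime columns; the names df and deduped are only illustrative. It shows that, without sorting, the row kept for each group is simply the last one by position, not necessarily the one with the latest last_seen.

```python
import pandas as pd

# Rebuild the frame from the question (values copied from the post).
df = pd.DataFrame({
    "first_seen": pd.to_datetime([
        "2015-05-11 23:08:46", "2015-05-11 23:08:46",
        "2015-05-02 18:27:10", "2015-05-02 18:27:10",
    ]),
    "last_seen": pd.to_datetime([
        "2015-05-11 23:08:50", "2015-05-11 23:08:46",
        "2015-06-06 03:52:03", "2015-06-08 08:44:53",
    ]),
    "uri": [
        "http://11i-ssaintandder.com/", "http://11i-ssaintandder.com/",
        "http://goo.gl/NMqjd1", "http://goo.gl/NMqjd1",
    ],
})

# keep="last" retains the final row, by position, of each
# (first_seen, uri) group; it does not look at last_seen at all.
deduped = df.drop_duplicates(subset=["first_seen", "uri"], keep="last")
print(deduped.index.tolist())  # [1, 3]
```

Row 1 survives here even though row 0 has the later last_seen, which is why a sort is needed first.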

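Edit

To get the latest date, you first need to sort df on 'first_seen' and 'last_seen', so that the row with the latest last_seen comes last within each group of duplicates (in pandas 0.17 and later, df.sort is spelled df.sort_values(by=...)):

In [317]: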
df = df.sort(columns=['first_seen','last_seen'], ascending=[0,1])
df.drop_duplicates(subset=['first_seen','uri'], take_last=True)

Out[317]:
  index          first_seen            last_seen                           uri
0     0 2015-05-11 23:08:46  2015-05-11 23:08:50  http://11i-ssaintandder.com/
3     3 2015-05-02 18:27:10  2015-06-08 08:44:53          http://goo.gl/NMqjd1
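
On a current pandas install the sort-then-deduplicate approach might look like the sketch below. It assumes the same hand-built frame as in the question and swaps in sort_values and keep="last" for the older sort and take_last spellings; the name latest is only illustrative.

```python
import pandas as pd

# Rebuild the frame from the question (values copied from the post).
df = pd.DataFrame({
    "first_seen": pd.to_datetime([
        "2015-05-11 23:08:46", "2015-05-11 23:08:46",
        "2015-05-02 18:27:10", "2015-05-02 18:27:10",
    ]),
    "last_seen": pd.to_datetime([
        "2015-05-11 23:08:50", "2015-05-11 23:08:46",
        "2015-06-06 03:52:03", "2015-06-08 08:44:53",
    ]),
    "uri": [
        "http://11i-ssaintandder.com/", "http://11i-ssaintandder.com/",
        "http://goo.gl/NMqjd1", "http://goo.gl/NMqjd1",
    ],
})

# Sort so that, within each first_seen value, last_seen is ascending;
# the newest last_seen then sits last in its group and keep="last" picks it.
latest = (
    df.sort_values(by=["first_seen", "last_seen"], ascending=[False, True])
      .drop_duplicates(subset=["first_seen", "uri"], keep="last")
)
print(latest.index.tolist())  # [0, 3]
```

The original index labels are preserved, so rows 0 and 3 match the expected dataset in the question; call reset_index(drop=True) afterwards if a fresh 0..n-1 index is preferred.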
