Python Pandas: drop duplicate rows on a column if there is no duplicate on another column

Question

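I have this df of email headers. I need to remove all duplicates where the Subject is the same and the Source is different. I have spent hours trying to work out a solution or find a similar case...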

Date	From	Subject	Source
12/06/21	Sender1	Test123	Inbox
12/06/21	Sender2	Confirm	Inbox
12/06/21	Sender1	Test123	Sent
12/06/21	Sender3	Test_on	Inbox
12/06/21	Sender3	Test_on	Inbox

In effect, from the table above, the rows with Subject == 'Test123' should be removed, leaving:

Date	From	Subject	Source
12/06/21	Sender2	Confirm	Inbox
12/06/21	Sender3	Test_on	Inbox
12/06/21	Sender3	Test_on	Inbox

Answer 1

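You can use a set to check whether each sender has more than one Source. If so, drop that sender's rows. Note that this groups by From, which gives the expected result on the sample data; to apply the condition exactly as stated in the question, group by Subject instead.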

>>> df.loc[df.groupby('From')['Source'].transform(lambda x: len(set(x)) == 1)]

       Date     From  Subject Source
1  12/06/21  Sender2  Confirm  Inbox
3  12/06/21  Sender3  Test_on  Inbox
4  12/06/21  Sender3  Test_on  Inbox
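
For a self-contained check, the sketch below rebuilds the sample frame from the question and applies the same idea grouped by Subject, using transform("nunique") in place of the lambda. The frame construction and the names sources_per_subject and result are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": ["12/06/21"] * 5,
    "From": ["Sender1", "Sender2", "Sender1", "Sender3", "Sender3"],
    "Subject": ["Test123", "Confirm", "Test123", "Test_on", "Test_on"],
    "Source": ["Inbox", "Inbox", "Sent", "Inbox", "Inbox"],
})

# Number of distinct Source values within each Subject group,
# broadcast back so it lines up with df's index
sources_per_subject = df.groupby("Subject")["Source"].transform("nunique")

# Keep only the rows whose Subject is tied to a single Source
result = df[sources_per_subject == 1]
print(result)
```

Grouping by Subject rather than From follows the wording of the question; on this sample both choices keep rows 1, 3 and 4.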

Answer 2

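# True for every row whose Subject appears more than once
duplicated_subject = df.duplicated('Subject', keep=False)
# True for every row whose (Subject, Source) pair appears more than once
duplicated_subject_and_source = df.duplicated(['Subject', 'Source'], keep=False)
# Keep rows with a unique Subject, or whose (Subject, Source) pair is repeated
df[~duplicated_subject | duplicated_subject_and_source]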

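Removing all duplicates where "the Subject is the same and the Source is different"

is equivalent to

keeping the rows where "the Subject is not duplicated, or the Subject is duplicated and the Source is the same".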
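
The equivalence can be checked directly on the sample data. The sketch below rebuilds the frame from the question, writes the "rows to drop" condition explicitly, and confirms that its negation matches the keep-mask. The shortened names dup_subject, dup_pair, to_drop and to_keep are illustrative assumptions.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": ["12/06/21"] * 5,
    "From": ["Sender1", "Sender2", "Sender1", "Sender3", "Sender3"],
    "Subject": ["Test123", "Confirm", "Test123", "Test_on", "Test_on"],
    "Source": ["Inbox", "Inbox", "Sent", "Inbox", "Inbox"],
})

dup_subject = df.duplicated("Subject", keep=False)
dup_pair = df.duplicated(["Subject", "Source"], keep=False)

# Rows to drop: the Subject repeats, but this exact (Subject, Source) pair does not
to_drop = dup_subject & ~dup_pair

# De Morgan: not (A and not B) is the same as (not A) or B
to_keep = ~dup_subject | dup_pair

# The two formulations select complementary sets of rows
assert (to_keep == ~to_drop).all()

result = df[to_keep]
print(result)
```

One design point worth knowing: this mask works row by row. If a Subject had two Inbox rows and one Sent row, the two Inbox rows would be kept and only the Sent row dropped, whereas a groupby-based count of distinct Source values per Subject would drop all three. Which behaviour is wanted depends on how "duplicate" is meant in the question.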


Tags: python, pandas, duplicates, drop