如何使用 pandas groupby 来过滤此数据框?

How to use a pandas groupby to filter this dataframe?

使用 Python 如何使用分组依据来过滤此数据集

开始

First     Last     Location      ID1              ID2             First3   Last3
John      Smith    Toronto       JohnToronto      SmithToronto    Joh      Smi
Joh       Smith    Toronto       JohToronto       SmithToronto    Joh      Smi
Steph     Sax      Vancouver     StephVancouver   SaxVancouver    Ste      Sax
Steph     Sa       Vancouver     StephVancouver   SaxeVancouver   Ste      Sax 
Stacy     Lee      Markham       StacyMarkham     LeeMarkham      Sta      Lee
Stac      Lee      Markham       StacMarkham      LeeMarkham      Sta      Lee
Stac      Wong     Aurora        StacAurora       LeeAurora       Sta      Won
Stac      Lee      Newmarket     StacNewmarket    LeeNewmarket    Sta      Lee
Steve     Smith    Toronto       SteveToronto     SmithToronto    Ste      Smi
John      Jones    Toronto       JohnToronto      JonesToronto    Joh      Jon

我怎样才能做到在接受这两个条件的情况下,过滤掉不符合这两个条件的所有其他内容

结束

 First     Last     Location      ID1              ID2             First3   Last3
 John      Smith    Toronto       JohnToronto      SmithToronto    Joh      Smi
 Joh       Smith    Toronto       JohToronto       SmithToronto    Joh      Smi
 Steph     Sax      Vancouver     StephVancouver   SaxVancouver    Ste      Sax
 Steph     Sa       Vancouver     StephVancouver   SaxeVancouver   Ste      Sax 
 Stacy     Lee      Markham       StacyMarkham     LeeMarkham      Sta      Lee
 Stac      Lee      Markham       StacMarkham      LeeMarkham      Sta      Lee

您可以使用:

df = pd.DataFrame({
    'First':['John', 'Joh', 'Steph', 'Steph', 'Stacy', 'Stac', 'Stac', 'Stac', 'Steve', 'John'],
    'Last':['Smith', 'Smith', 'Sax', 'Sa', 'Lee', 'Lee', 'Wong', 'Lee', 'Smith', 'Jones'],
    'Location':['Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Markham', 
                'Markham', 'Aurora', 'Newmarket', 'Toronto', 'Toronto'],
    'ID1':['JohnToronto', 'JohnToronto', 'StephVancouver', 'StephVancouver', 'StacyMarkham',
          'StacyMarkham','StacAurora', 'StacNewmarket','SteveToronto','JohnToronto'],
    'ID2':['SmithToronto','SmithToronto','SaxVancouver','SaxVancouver',
          'LeeMarkham','LeeMarkham','LeeAurora','LeeNewmarket','SmithToronto','JonesToronto'],
    
    'First3':['Joh','Joh','Ste','Ste','Sta','Sta','Sta','Sta','Ste','Joh'],
    'Last3':['Smi','Smi','Sax','Sax','Lee','Lee','Won','Lee','Smi','Jon']
})
m1 = df.duplicated(subset=['ID1','Last3'],keep=False)
m2 = df[m1].duplicated(subset=['ID2','First3'],keep=False)
df = df[m1 & m2]

基于对问题陈述的澄清评论 -

trying to groupby ID1 or ID2. And then depending which ID filter if Last3 col and First3 Col are the same respectively

试试这个方法 -

#group by ID1 and check if duplicates in last3. Then extract the index number that satisfies condition
c1 = df.groupby('ID1').apply(pd.DataFrame.duplicated, subset=['Last3'], keep=False)
c1_idx = c1[c1].droplevel(0).index

#group by ID2 and check if duplicates in first3. Then extract the index number that satisfies condition
c2 = df.groupby('ID2').apply(pd.DataFrame.duplicated, subset=['First3'], keep=False)
c2_idx = c2[c2].droplevel(0).index

#take a union of the 2 indexes and then ..
#filter dataframe for the indexes that meet the 2 independent conditions
output = df.iloc[c1_idx.union(c2_idx)]
print(output)
   First   Last   Location             ID1            ID2 First3 Last3
0   John  Smith    Toronto     JohnToronto   SmithToronto    Joh   Smi
1    Joh  Smith    Toronto      JohToronto   SmithToronto    Joh   Smi
2  Steph    Sax  Vancouver  StephVancouver   SaxVancouver    Ste   Sax
3  Steph     Sa  Vancouver  StephVancouver  SaxeVancouver    Ste   Sax
4  Stacy    Lee    Markham    StacyMarkham     LeeMarkham    Sta   Lee
5   Stac    Lee    Markham     StacMarkham     LeeMarkham    Sta   Lee

编辑: 修改@SomeDude 提供的上述答案,您可以 运行 这是 2 个没有 groupby 的独立条件,并在它们之间进行 OR -

m1 = df.duplicated(subset=['ID1','Last3'],keep=False)
m2 = df.duplicated(subset=['ID2','First3'],keep=False)
df[m1 | m2]