How to use a pandas groupby to filter this dataframe?
Using Python, how can I use a groupby to filter this dataset?
Start:
First Last Location ID1 ID2 First3 Last3
John Smith Toronto JohnToronto SmithToronto Joh Smi
Joh Smith Toronto JohToronto SmithToronto Joh Smi
Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
Stac Lee Markham StacMarkham LeeMarkham Sta Lee
Stac Wong Aurora StacAurora LeeAurora Sta Won
Stac Lee Newmarket StacNewmarket LeeNewmarket Sta Lee
Steve Smith Toronto SteveToronto SmithToronto Ste Smi
John Jones Toronto JohnToronto JonesToronto Joh Jon
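How can I accept both of these conditions and filter out every row that meets neither of them?
- ID1 - matches another row's ID1, and Last3 is the same
- ID2 - matches another row's ID2, and First3 is the same
End: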
First Last Location ID1 ID2 First3 Last3
John Smith Toronto JohnToronto SmithToronto Joh Smi
Joh Smith Toronto JohToronto SmithToronto Joh Smi
Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
Stac Lee Markham StacMarkham LeeMarkham Sta Lee
You can use the following (note that some ID1 and ID2 values here differ from the table in the question, e.g. 'JohnToronto' instead of 'JohToronto' in the second row):
import pandas as pd

df = pd.DataFrame({
    'First': ['John', 'Joh', 'Steph', 'Steph', 'Stacy', 'Stac', 'Stac', 'Stac', 'Steve', 'John'],
    'Last': ['Smith', 'Smith', 'Sax', 'Sa', 'Lee', 'Lee', 'Wong', 'Lee', 'Smith', 'Jones'],
    'Location': ['Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Markham',
                 'Markham', 'Aurora', 'Newmarket', 'Toronto', 'Toronto'],
    'ID1': ['JohnToronto', 'JohnToronto', 'StephVancouver', 'StephVancouver', 'StacyMarkham',
            'StacyMarkham', 'StacAurora', 'StacNewmarket', 'SteveToronto', 'JohnToronto'],
    'ID2': ['SmithToronto', 'SmithToronto', 'SaxVancouver', 'SaxVancouver',
            'LeeMarkham', 'LeeMarkham', 'LeeAurora', 'LeeNewmarket', 'SmithToronto', 'JonesToronto'],
    'First3': ['Joh', 'Joh', 'Ste', 'Ste', 'Sta', 'Sta', 'Sta', 'Sta', 'Ste', 'Joh'],
    'Last3': ['Smi', 'Smi', 'Sax', 'Sax', 'Lee', 'Lee', 'Won', 'Lee', 'Smi', 'Jon']
})
# Rows whose (ID1, Last3) pair occurs more than once
m1 = df.duplicated(subset=['ID1', 'Last3'], keep=False)
# Among those rows, keep the ones whose (ID2, First3) pair also repeats.
# Filtering step by step avoids combining two masks with different indexes.
filtered = df[m1]
m2 = filtered.duplicated(subset=['ID2', 'First3'], keep=False)
df = filtered[m2]
Based on the clarifying comment on the problem statement -
trying to groupby ID1 or ID2. And then depending which ID filter if Last3 col and First3 Col are the same respectively
Try this approach -
# Group by ID1 and flag rows whose Last3 is duplicated within the group,
# then extract the index labels of the rows that satisfy the condition
c1 = df.groupby('ID1').apply(pd.DataFrame.duplicated, subset=['Last3'], keep=False)
c1_idx = c1[c1].droplevel(0).index
# Group by ID2 and flag rows whose First3 is duplicated within the group,
# then extract the index labels of the rows that satisfy the condition
c2 = df.groupby('ID2').apply(pd.DataFrame.duplicated, subset=['First3'], keep=False)
c2_idx = c2[c2].droplevel(0).index
# Take the union of the 2 indexes and select the rows that meet either
# of the 2 independent conditions. These are index labels, so use .loc
# (.iloc only gives the same rows when the index is the default RangeIndex)
output = df.loc[c1_idx.union(c2_idx)]
print(output)
First Last Location ID1 ID2 First3 Last3
0 John Smith Toronto JohnToronto SmithToronto Joh Smi
1 Joh Smith Toronto JohToronto SmithToronto Joh Smi
2 Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
3 Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
4 Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
5 Stac Lee Markham StacMarkham LeeMarkham Sta Lee
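The same per-group check can also be sketched with `groupby(...).transform('size')`, which returns a result aligned with the original rows, so no index bookkeeping is needed. This is only an illustration under a few assumptions: the data is copied from the table in the question (not the normalized version), the `Last` and `Location` columns are left out for brevity, and the names `size1`, `size2` and `result` are made up here.

```python
import pandas as pd

# Data copied from the table in the question; only the columns needed here.
df = pd.DataFrame({
    'First': ['John', 'Joh', 'Steph', 'Steph', 'Stacy', 'Stac', 'Stac', 'Stac', 'Steve', 'John'],
    'ID1': ['JohnToronto', 'JohToronto', 'StephVancouver', 'StephVancouver', 'StacyMarkham',
            'StacMarkham', 'StacAurora', 'StacNewmarket', 'SteveToronto', 'JohnToronto'],
    'ID2': ['SmithToronto', 'SmithToronto', 'SaxVancouver', 'SaxeVancouver', 'LeeMarkham',
            'LeeMarkham', 'LeeAurora', 'LeeNewmarket', 'SmithToronto', 'JonesToronto'],
    'First3': ['Joh', 'Joh', 'Ste', 'Ste', 'Sta', 'Sta', 'Sta', 'Sta', 'Ste', 'Joh'],
    'Last3': ['Smi', 'Smi', 'Sax', 'Sax', 'Lee', 'Lee', 'Won', 'Lee', 'Smi', 'Jon'],
})

# Size of each (ID1, Last3) group, broadcast back onto the original rows.
size1 = df.groupby(['ID1', 'Last3'])['First'].transform('size')
# Size of each (ID2, First3) group, broadcast back the same way.
size2 = df.groupby(['ID2', 'First3'])['First'].transform('size')

# Keep a row if it shares either pair with at least one other row.
result = df[(size1 > 1) | (size2 > 1)]
print(result)
```

On this data the first condition selects the two Steph rows and the second selects the John/Joh and Stacy/Stac rows, so the first six rows are kept.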
Edit: Adapting the answer above from @SomeDude, you can run these as 2 independent conditions without a groupby and OR them together -
m1 = df.duplicated(subset=['ID1','Last3'],keep=False)
m2 = df.duplicated(subset=['ID2','First3'],keep=False)
df[m1 | m2]
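If more than two conditions are needed, the OR of `duplicated` masks might be wrapped in a small helper. The sketch below is hypothetical: `keep_any_duplicate` and its `column_sets` parameter are names invented for illustration, and the data is again copied from the table in the question, limited to the columns the conditions use.

```python
from functools import reduce
import operator

import pandas as pd


def keep_any_duplicate(df, column_sets):
    """Keep rows that repeat on at least one of the given column sets.

    Hypothetical helper: `column_sets` is a list of column-name lists,
    e.g. [['ID1', 'Last3'], ['ID2', 'First3']].
    """
    masks = [df.duplicated(subset=cols, keep=False) for cols in column_sets]
    # OR all masks together; they share df's index, so they align row by row.
    return df[reduce(operator.or_, masks)]


# Data copied from the table in the question; only the columns needed here.
df = pd.DataFrame({
    'ID1': ['JohnToronto', 'JohToronto', 'StephVancouver', 'StephVancouver', 'StacyMarkham',
            'StacMarkham', 'StacAurora', 'StacNewmarket', 'SteveToronto', 'JohnToronto'],
    'ID2': ['SmithToronto', 'SmithToronto', 'SaxVancouver', 'SaxeVancouver', 'LeeMarkham',
            'LeeMarkham', 'LeeAurora', 'LeeNewmarket', 'SmithToronto', 'JonesToronto'],
    'First3': ['Joh', 'Joh', 'Ste', 'Ste', 'Sta', 'Sta', 'Sta', 'Sta', 'Ste', 'Joh'],
    'Last3': ['Smi', 'Smi', 'Sax', 'Sax', 'Lee', 'Lee', 'Won', 'Lee', 'Smi', 'Jon'],
})

result = keep_any_duplicate(df, [['ID1', 'Last3'], ['ID2', 'First3']])
print(result)
```

Passing a single column set reduces this to a plain `duplicated` filter, which makes each condition easy to check on its own.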