Locate rows where values in columns do not match values from a list (Python)
I have a dataframe with an ID and some email addresses:
personid sup1_email sup2_email sup3_email sup4_email
1 evan.o@abc.com jon.k@abc.com kelm.q@abc.com john.d@abc.com
5 evan.o@abc.com polly.u@abc.com jim.e@ABC.COM nan
11 jim.y@abc.com manfred.a@abc.com greg.s@Abc.com adele.a@abc.com
52 jim.y@abc.com manfred.a@abc.com greg.s@Abc.com adele.a@abc.com
65 evan.o@abc.com lenny.t@yahoo.com john.s@abc.com sally.j@ABC.com
89 dom.q@ABC.com laurie.g@Abc.com topher.u@abc.com ross.k@qqpower.com
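To make the example reproducible, the table above could be built roughly as follows. This is only a sketch: it assumes the `nan` entry is a real missing value (`np.nan`) rather than the string "nan", and that `personid` is an integer column.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above; the missing value is assumed to be np.nan.
df = pd.DataFrame({
    "personid": [1, 5, 11, 52, 65, 89],
    "sup1_email": ["evan.o@abc.com", "evan.o@abc.com", "jim.y@abc.com",
                   "jim.y@abc.com", "evan.o@abc.com", "dom.q@ABC.com"],
    "sup2_email": ["jon.k@abc.com", "polly.u@abc.com", "manfred.a@abc.com",
                   "manfred.a@abc.com", "lenny.t@yahoo.com", "laurie.g@Abc.com"],
    "sup3_email": ["kelm.q@abc.com", "jim.e@ABC.COM", "greg.s@Abc.com",
                   "greg.s@Abc.com", "john.s@abc.com", "topher.u@abc.com"],
    "sup4_email": ["john.d@abc.com", np.nan, "adele.a@abc.com",
                   "adele.a@abc.com", "sally.j@ABC.com", "ross.k@qqpower.com"],
})
print(df)
```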
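I want to find the rows that do not match a list of accepted email values (i.e. anything that is not '@abc.com', '@ABC.COM' or '@Abc.com'). What I would like to get is this: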
personid sup1_email sup2_email sup3_email sup4_email
65 evan.o@abc.com lenny.t@yahoo.com john.s@abc.com sally.j@ABC.com
89 dom.q@ABC.com laurie.g@Abc.com topher.u@abc.com ross.k@qqpower.com
I wrote the following code and it works, but I have to check each sup_email column by hand and repeat the process, which is inefficient:
#list down all the variations of accepted email domains
email_adds = ['@abc.com','@ABC.COM','@Abc.com']
#combine the variations of email addresses in the list
accepted_emails = '|'.join(email_adds)
not_accepted = df.loc[~df['sup1_email'].str.contains(accepted_emails, na=False)]
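I would like to know whether there is a more efficient way to do this with a for loop. What I have managed so far only prints the columns that contain non-accepted emails, not the rows that contain them. I appreciate any help I can get, thank you.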
sup_emails = df[['sup1_email','sup2_email', 'sup3_email', 'sup4_email']]
#for each sup column, check if the accepted email addresses are not in it
for col in sup_emails:
    if any(x not in col for x in accepted_emails):
        print(col)
One idea:
#list down all the variations of accepted email domains
email_adds = ['@abc.com','@ABC.COM','@Abc.com']
#combine the variations of email addresses in the list
accepted_emails = '|'.join(email_adds)
#columns for test
c = ['sup1_email','sup2_email', 'sup3_email', 'sup4_email']
#reshape and test all values, if `nan` pass `True`
m = df[c].stack(dropna=False).str.contains(accepted_emails, na=True).unstack().all(axis=1)
df = df[~m]
print (df)
personid sup1_email sup2_email sup3_email \
4 65 evan.o@abc.com lenny.t@yahoo.com john.s@abc.com
5 89 dom.q@ABC.com laurie.g@Abc.com topher.u@abc.com
sup4_email
4 sally.j@ABC.com
5 ross.k@qqpower.com
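The same test can also be sketched column by column, without reshaping via `stack`/`unstack`. This is an illustrative variant rather than the answer's own code; it uses a three-row subset of the sample data, and the names `ok` and `rejected` are made up for the example.

```python
import numpy as np
import pandas as pd

# Three rows taken from the sample data (personid 5, 65 and 89).
df = pd.DataFrame({
    "personid": [5, 65, 89],
    "sup1_email": ["evan.o@abc.com", "evan.o@abc.com", "dom.q@ABC.com"],
    "sup2_email": ["polly.u@abc.com", "lenny.t@yahoo.com", "laurie.g@Abc.com"],
    "sup3_email": ["jim.e@ABC.COM", "john.s@abc.com", "topher.u@abc.com"],
    "sup4_email": [np.nan, "sally.j@ABC.com", "ross.k@qqpower.com"],
})

email_adds = ['@abc.com', '@ABC.COM', '@Abc.com']
accepted_emails = '|'.join(email_adds)
c = ['sup1_email', 'sup2_email', 'sup3_email', 'sup4_email']

# Apply str.contains to each column; missing values count as accepted (na=True).
ok = df[c].apply(lambda s: s.str.contains(accepted_emails, na=True))

# Keep the rows where at least one email is not accepted.
rejected = df[~ok.all(axis=1)]
print(rejected)
```

Each column is tested as a whole Series, so the result is a boolean frame with the same shape as `df[c]`, and `all(axis=1)` collapses it to one flag per row.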
Your solution with a generator and any:
c = ['sup1_email','sup2_email', 'sup3_email', 'sup4_email']
f = lambda y: any(x in y for x in email_adds) if isinstance(y, str) else True
df = df[~df[c].applymap(f).all(axis=1)]
print (df)
personid sup1_email sup2_email sup3_email \
4 65 evan.o@abc.com lenny.t@yahoo.com john.s@abc.com
5 89 dom.q@ABC.com laurie.g@Abc.com topher.u@abc.com
sup4_email
4 sally.j@ABC.com
5 ross.k@qqpower.com
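Since the question asked for a for loop, the check can also be written as an explicit loop that accumulates one boolean flag per row. This is a minimal sketch on a three-row subset of the sample data; `bad_rows` and `not_accepted` are names chosen for the illustration.

```python
import numpy as np
import pandas as pd

# Three rows taken from the sample data (personid 5, 65 and 89).
df = pd.DataFrame({
    "personid": [5, 65, 89],
    "sup1_email": ["evan.o@abc.com", "evan.o@abc.com", "dom.q@ABC.com"],
    "sup2_email": ["polly.u@abc.com", "lenny.t@yahoo.com", "laurie.g@Abc.com"],
    "sup3_email": ["jim.e@ABC.COM", "john.s@abc.com", "topher.u@abc.com"],
    "sup4_email": [np.nan, "sally.j@ABC.com", "ross.k@qqpower.com"],
})

email_adds = ['@abc.com', '@ABC.COM', '@Abc.com']
accepted_emails = '|'.join(email_adds)
c = ['sup1_email', 'sup2_email', 'sup3_email', 'sup4_email']

# Start with "no row is rejected" and flag rows one column at a time.
bad_rows = pd.Series(False, index=df.index)
for col in c:
    matches = df[col].str.contains(accepted_emails, na=True)  # NaN counts as accepted
    bad_rows |= ~matches

not_accepted = df[bad_rows]
print(not_accepted)
```

The loop iterates over column names and indexes `df[col]` to get the values; iterating over the DataFrame itself, as in the question, only yields the column labels.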
You can do it like this:
# list down all the variations of accepted email domains
email_adds = ['@abc.com','@ABC.COM','@Abc.com']
# combine the variations of email addresses in the list
accepted_emails = '|'.join(email_adds)
# create a single email column
melted = df.melt('personid')
# check the matching emails
mask = melted['value'].str.contains(accepted_emails, na=True)
# filter out the ones that do not match
mask = df['personid'].isin(melted.loc[~mask, 'personid'])
print(df[mask])
Output:
personid sup1_email ... sup3_email sup4_email
4 65 evan.o@abc.com ... john.s@abc.com sally.j@ABC.com
5 89 dom.q@ABC.com ... topher.u@abc.com ross.k@qqpower.com
[2 rows x 5 columns]
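Let's check whether the characters after @ are ABC, abc or Abc in every column. We can temporarily leave out personid. After the check, use ~ to invert the result and mask the rows: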
df[~(df.iloc[:,1:].apply(lambda x: x.str.contains(r'@(?=ABC|abc|Abc)', na=True).all(), axis=1))]
personid sup1_email sup2_email sup3_email \
4 65.0 evan.o@abc.com lenny.t@yahoo.com john.s@abc.com
5 89.0 dom.q@ABC.com laurie.g@Abc.com topher.u@abc.com
sup4_email
4 sally.j@ABC.com
5 ross.k@qqpower.com
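If the intent is "any casing of abc.com", a case-insensitive pattern anchored at the end of the address may be simpler than listing variants. The sketch below rests on that assumption, which the question does not state explicitly. A mixed-case address such as sally.j@ABC.com matches none of the three listed variants under a case-sensitive search, and an unescaped `.` in a regex matches any character, so the pattern here escapes the dot and passes `case=False`.

```python
import numpy as np
import pandas as pd

# Three rows taken from the sample data (personid 5, 65 and 89).
df = pd.DataFrame({
    "personid": [5, 65, 89],
    "sup1_email": ["evan.o@abc.com", "evan.o@abc.com", "dom.q@ABC.com"],
    "sup2_email": ["polly.u@abc.com", "lenny.t@yahoo.com", "laurie.g@Abc.com"],
    "sup3_email": ["jim.e@ABC.COM", "john.s@abc.com", "topher.u@abc.com"],
    "sup4_email": [np.nan, "sally.j@ABC.com", "ross.k@qqpower.com"],
})

# Assumption: an address is accepted when it ends in @abc.com, in any casing.
pattern = r'@abc\.com$'

emails = df.iloc[:, 1:]  # every column except personid
ok = emails.apply(lambda s: s.str.contains(pattern, case=False, na=True))

not_accepted = df[~ok.all(axis=1)]
print(not_accepted)
```

With this pattern sally.j@ABC.com counts as accepted, while the yahoo.com and qqpower.com addresses still cause their rows to be returned.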