How to isolate duplicates based on partial match in a pandas dataframe

Question

I have a pandas dataframe that looks like this:

email                   col2  col3
email@example.com       John  Doe
xxxemail@example.com    John  Doe
xxemail@example.com     John  Doe
xxxxxemail@example.com  John  Doe
xxxemail@example2.com   Jane  Doe

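I want to go through every email address that starts with at least two "x" characters and check whether the same email address also exists without those leading "x" characters.

Desired result: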

email                   col2  col3  exists_in_valid_form
email@example.com       John  Doe   False
xxxemail@example.com    John  Doe   True
xxemail@example.com     John  Doe   True
xxxxxemail@example.com  John  Doe   True
xxxemail@example2.com   Jane  Doe   False

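I can get a sub-dataframe of the rows whose email starts with 'xx' using df[df['email'].str.contains("xx")], and I can get the email addresses without the leading "x" characters using str.lstrip('x'), but neither seems to help me find out whether the same email appears elsewhere without those x's.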
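As a side note on the two string methods mentioned above, here is a minimal sketch of how they behave. The sample addresses (including "maxx@example.com" and "xemail@example.com") are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical sample addresses, chosen to show the edge cases.
emails = pd.Series([
    "email@example.com",
    "xxemail@example.com",
    "maxx@example.com",
    "xemail@example.com",
])

# str.contains("xx") matches "xx" anywhere in the string,
# so it also flags "maxx@example.com".
contains_xx = emails.str.contains("xx")

# str.startswith("xx") only matches at the beginning of the string.
starts_xx = emails.str.startswith("xx")

# str.lstrip("x") removes every leading "x", however many there are,
# including the single one in "xemail@example.com".
stripped = emails.str.lstrip("x")

print(pd.DataFrame({
    "email": emails,
    "contains_xx": contains_xx,
    "starts_xx": starts_xx,
    "stripped": stripped,
}))
```

If the intent is "starts with at least two x's", str.startswith("xx") is probably the closer fit than str.contains("xx").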

Answer 1

You can use duplicated() to check whether a value also occurs in other rows.

df['exists_in_valid_form'] = df.email.str.lstrip('x').duplicated(keep=False) & df.email.str.startswith('xx')

I added df.email.str.startswith('xx') to make sure the address starts with at least two "x" characters, so the result is False for "xemail@example.com".
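To try this end to end, here is a sketch that rebuilds the sample dataframe from the question and applies the expression above. It also includes a stricter variant, which is an assumption about what may be wanted and not part of the original answer: duplicated(keep=False) reports True whenever the stripped address occurs more than once, even if every occurrence carried an "xx" prefix, whereas the variant only counts a match among rows that do not start with "xx". The helper names flag_with_duplicated and flag_with_isin are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "email": [
        "email@example.com",
        "xxxemail@example.com",
        "xxemail@example.com",
        "xxxxxemail@example.com",
        "xxxemail@example2.com",
    ],
    "col2": ["John", "John", "John", "John", "Jane"],
    "col3": ["Doe"] * 5,
})


def flag_with_duplicated(emails: pd.Series) -> pd.Series:
    """The expression from the answer, wrapped in a function."""
    stripped = emails.str.lstrip("x")
    return stripped.duplicated(keep=False) & emails.str.startswith("xx")


def flag_with_isin(emails: pd.Series) -> pd.Series:
    """Stricter variant (assumption): the stripped address must appear
    among the rows that do not start with "xx"."""
    starts_xx = emails.str.startswith("xx")
    valid_emails = set(emails[~starts_xx])
    return starts_xx & emails.str.lstrip("x").isin(valid_emails)


df["exists_in_valid_form"] = flag_with_duplicated(df["email"])
print(df)

# The two approaches agree on the sample data but differ when only
# prefixed copies of an address exist.
edge = pd.Series(["xxa@b.com", "xxxa@b.com"])
print(flag_with_duplicated(edge).tolist())
print(flag_with_isin(edge).tolist())
```

Both versions rely on lstrip('x'), which strips all leading "x" characters, so an address that legitimately begins with "x" would need separate handling.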

Tags: duplicates, dataframe, python-3.x, pandas, partial-matches