How do I find the strings that contain a word or group of words from a list?

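I have a long list of strings (or a column in a pandas DataFrame), and I want to separate out strings from it based on values in a separate reference list. I'd like to do this in a Pythonic way rather than just iterating and matching.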

Input:
my_list_or_column = ["this is a test", "blank text", "another test", "do not select this" ]
ref_list = ["test", "conduct"]

Now I should be able to separate out the sentences that contain words from ref_list.

Output:
match = ["this is a test" .... ]
did_not_match = ["do not select this"]

Any help?

How about:

my_list_or_column = ["this is a test", "blank text", "another test", "do not select this" ]
ref_list = ["test", "conduct"]

def is_contain(col):
  for ref in ref_list:
    if ref in col:
      return True
  return False

print(list(filter(is_contain, my_list_or_column)))
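
The same substring check can also be written with any(), which makes it easy to build both lists the question asks for. This is only a minimal sketch, assuming substring matching is what is wanted; contains_ref is a helper name introduced here for illustration:

```python
my_list_or_column = ["this is a test", "blank text", "another test", "do not select this"]
ref_list = ["test", "conduct"]

def contains_ref(text, refs):
    # Hypothetical helper: True if any reference word occurs as a substring of text.
    return any(ref in text for ref in refs)

# Split the input into the strings that match and those that do not.
match = [s for s in my_list_or_column if contains_ref(s, ref_list)]
did_not_match = [s for s in my_list_or_column if not contains_ref(s, ref_list)]

print(match)
print(did_not_match)
```

any() stops at the first reference word that is found, so it behaves like the early return in the loop above.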

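You can convert ref_list to a set and look words up in it instead of iterating over the list. This can be useful, especially when ref_list is large. Note that this compares whole whitespace-separated words rather than substrings.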

did_not_match = []
match = []
my_set = set(ref_list)
for string in my_list_or_column:
    set_string = set(string.split())
    if not set_string.isdisjoint(my_set):  # at least one word of the string is in ref_list
        match.append(string)
    else:
        did_not_match.append(string)
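
One thing to be aware of is that the set-based check and the substring check can disagree. The sketch below illustrates the difference; has_whole_word and the sample string "unit testing" are made up for this example and are not part of the question:

```python
ref_set = {"test", "conduct"}

def has_whole_word(text, refs):
    # Hypothetical helper: True if any whitespace-separated word of text is in refs.
    return not refs.isdisjoint(text.split())

# "test" is a complete word here, so both approaches agree.
whole_word_hit = has_whole_word("another test", ref_set)

# "testing" only contains "test" as a substring, so the set lookup misses it.
whole_word_miss = has_whole_word("unit testing", ref_set)
substring_hit = any(ref in "unit testing" for ref in ref_set)

print(whole_word_hit, whole_word_miss, substring_hit)
```

Words with attached punctuation (for example "test.") are also missed by split(), so which approach is right depends on what should count as a match.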

Since you mentioned that my_list_or_column may be a pandas DataFrame column, you could also create a boolean mask and filter for the relevant text, like so:

import pandas as pd

my_Series = pd.Series(my_list_or_column)
mask = my_Series.str.contains('|'.join(ref_list))
match = my_Series[mask].tolist()
did_not_match = my_Series[~mask].tolist()

Output:

>>> print(match)
['this is a test', 'another test']

>>> print(did_not_match)
['blank text', 'do not select this']
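
A caveat with the '|'.join(ref_list) pattern: str.contains treats the pattern as a regular expression by default, so a reference word containing characters such as "." or "+" would not be matched literally. A minimal sketch of building an escaped pattern with the standard re module follows; the same pattern string could be passed to str.contains. The word-boundary variant is an assumption about the intended behavior, not something the question states:

```python
import re

my_list_or_column = ["this is a test", "blank text", "another test", "do not select this"]
ref_list = ["test", "conduct"]

# Escape each reference word so regex metacharacters are matched literally.
pattern = "|".join(re.escape(ref) for ref in ref_list)
regex = re.compile(pattern)

match = [s for s in my_list_or_column if regex.search(s)]
did_not_match = [s for s in my_list_or_column if not regex.search(s)]

# Optional (assumption): wrap the alternatives in \b to match whole words only.
whole_word_regex = re.compile(r"\b(?:" + pattern + r")\b")

print(match)
print(did_not_match)
```

With the word-boundary version, a string like "unit testing" would no longer match, which brings the regex approach in line with the set-based one.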