Identify strings having words from two different lists

Question

I have a dataframe with three columns, as shown below:

index   string                                         Result
1       The quick brown fox jumps over the lazy dog 
2       fast and furious was a good movie

I have two lists of words like these:

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

I want to identify the strings that contain at least one word from list1 AND at least one word from list2, giving a result like this:

index   string                                      Result
1   The quick brown fox jumps over the lazy dog     TRUE
2   fast and furious was a good movie               FALSE

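Explanation: the first sentence contains words from both lists, so its result is TRUE. The second sentence contains only one word, from list1, so its result is FALSE.

Can this be done in Python? I tried the search techniques in NLTK, but I don't know how to combine the results for the two lists. Thanks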

Answer 1

If your dataframe (the first two columns) is named df, you can do the following:

df['Result'] = (df['string'].str.contains('|'.join(list1)) 
 & df['string'].str.contains('|'.join(list2)))

Result:

                                        string  Result
0  The quick brown fox jumps over the lazy dog    True
1            fast and furious was a good movie   False
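
One caveat: `str.contains` with a joined pattern matches substrings, so "sun" would also match inside "sunny". If whole-word matching is what you need, something like the following may work. This is only a sketch: the `word_pattern` helper, the extra third row, and the choice of `case=False` are my own assumptions, not part of the question.

```python
import re
import pandas as pd

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

# The third row is a made-up example to show the substring problem.
df = pd.DataFrame({"string": [
    "The quick brown fox jumps over the lazy dog",
    "fast and furious was a good movie",
    "a sunny day with my dog",
]})

def word_pattern(words):
    # Hypothetical helper: escape each word and wrap the alternation in \b
    # so that only whole words match ("sun" does not match "sunny").
    return r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"

# case=False is an assumption; drop it if matching should be case-sensitive.
df["Result"] = (
    df["string"].str.contains(word_pattern(list1), case=False, regex=True)
    & df["string"].str.contains(word_pattern(list2), case=False, regex=True)
)
print(df)
```

The third row contains "dog" from list1 but only "sunny" (not "sun") from list2, so it comes out False here, whereas the plain `'|'.join(...)` pattern would mark it True.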

In response to your comment, the following might do what you want:

words = set(list1).union(set(list2))
df['Result_2'] = [[*words.intersection(s.split())] for s in df['string'].tolist()]

Result:

...   Result                   Result_2
...    True  [dog, quick, brown, over]
...   False                    [movie]
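
Note that a set has no defined order, so the words in Result_2 may come out in a different order than shown. If you would rather have them in the order they appear in the sentence, a list comprehension over the split string is one way to do it. This is a sketch under the same assumptions as above (whitespace splitting, case-sensitive matching); repeated words would be listed once per occurrence.

```python
import pandas as pd

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

df = pd.DataFrame({"string": [
    "The quick brown fox jumps over the lazy dog",
    "fast and furious was a good movie",
]})

words = set(list1) | set(list2)

# Walk through each sentence in order and keep only the words found in either list.
df["Result_2"] = [[w for w in s.split() if w in words] for s in df["string"]]
print(df["Result_2"].tolist())
```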

Answer 2

Another option is to split the strings and use set.intersection and all in a list comprehension:

s_lists = [set(list1), set(list2)]
df['Result'] = [all(s_lst.intersection(s.split()) for s_lst in s_lists) for s in df['string'].tolist()]

Output:

   index                                       string  Result
0      1  The quick brown fox jumps over the lazy dog    True
1      2            fast and furious was a good movie   False
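
Keep in mind that `s.split()` only splits on whitespace and is case-sensitive, so "Over" or "dog." would not be matched. If that matters for your data, you could normalize the tokens first. The sketch below is one possible way; the `tokens` helper and the third row are my own additions for illustration.

```python
import re
import pandas as pd

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

# The third row is a made-up example with capitals and punctuation.
df = pd.DataFrame({
    "index": [1, 2, 3],
    "string": [
        "The quick brown fox jumps over the lazy dog",
        "fast and furious was a good movie",
        "Over there is a Book.",
    ],
})

def tokens(text):
    # Hypothetical helper: lowercase the text and keep runs of word characters,
    # which drops punctuation such as the trailing period.
    return set(re.findall(r"\w+", text.lower()))

word_sets = [set(list1), set(list2)]

# A row is True only if every word set shares at least one token with the sentence
# (an empty intersection is falsy).
df["Result"] = [all(ws & tokens(s) for ws in word_sets) for s in df["string"]]
print(df)
```

This assumes the words in list1 and list2 are already lowercase, as they are in the question.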

python

nlp

list

dataframe