Identify strings having words from two different lists

I have a dataframe with three columns, as shown below:

index   string                                         Result
1       The quick brown fox jumps over the lazy dog 
2       fast and furious was a good movie   

I have two lists of words like these:

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

I want to identify the strings that contain at least one word from list1 AND at least one word from list2, so that the result looks like this:

index   string                                         Result
1       The quick brown fox jumps over the lazy dog    TRUE
2       fast and furious was a good movie              FALSE

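Explanation: the first sentence contains words from both lists, so its result is TRUE. The second sentence only contains a word from list1, so its result is FALSE.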

Can this be done in Python? I tried the search techniques from NLTK, but I don't know how to combine the results for the two lists. Thanks.

If your dataframe (the first two columns) is named df, you can do the following:

df['Result'] = (df['string'].str.contains('|'.join(list1)) 
 & df['string'].str.contains('|'.join(list2)))

Result:

                                        string  Result
0  The quick brown fox jumps over the lazy dog    True
1            fast and furious was a good movie   False
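One caveat with the str.contains approach is that it matches substrings, so "sun" would also match "sunny" and "over" would match "overcast". A minimal sketch of a whole-word variant follows, using only the standard re module. The helper names build_pattern and has_both are made up for illustration, and case-insensitive matching is an assumption that the question does not state:

```python
import re

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

def build_pattern(words):
    # \b restricts matches to whole words; re.escape guards against
    # list entries that happen to contain regex metacharacters.
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)

pat1 = build_pattern(list1)
pat2 = build_pattern(list2)

def has_both(text):
    # True only if at least one word from each list occurs as a whole word.
    return bool(pat1.search(text)) and bool(pat2.search(text))

strings = [
    "The quick brown fox jumps over the lazy dog",
    "fast and furious was a good movie",
]
results = [has_both(s) for s in strings]
```

With a dataframe, the same function could be applied per row, for example df['Result'] = df['string'].map(has_both).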

In response to your comment, the following may be what you are looking for:

words = set(list1).union(set(list2))
df['Result_2'] = [[*words.intersection(s.split())] for s in df['string'].tolist()]

Result:

...   Result                   Result_2
...    True  [dog, quick, brown, over]
...   False                    [movie]
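Because Result_2 is built from a set, the order of the matched words is not guaranteed and may differ between runs. If the order of appearance in the sentence matters, a sketch along these lines might help. The function name matched_in_order is hypothetical, and it keeps the same exact, case-sensitive, whitespace-split matching as the snippet above:

```python
list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]
words = set(list1) | set(list2)

def matched_in_order(text):
    # Walk the tokens left to right and keep the first occurrence
    # of every word that belongs to either list.
    seen = set()
    matched = []
    for token in text.split():
        if token in words and token not in seen:
            seen.add(token)
            matched.append(token)
    return matched

first = matched_in_order("The quick brown fox jumps over the lazy dog")
second = matched_in_order("fast and furious was a good movie")
```

It could be used in place of the set expression, for example df['Result_2'] = df['string'].map(matched_in_order).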

Another option is to split each string and use set.intersection together with all in a list comprehension:

s_lists = [set(list1), set(list2)]
df['Result'] = [all(s_lst.intersection(s.split()) for s_lst in s_lists) for s in df['string'].tolist()]

Output:

   index                                       string  Result
0      1  The quick brown fox jumps over the lazy dog    True
1      2            fast and furious was a good movie   False
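Note that s.split() only splits on whitespace, so a token such as "dog." or "Movie" would not match the list entries. If the real data contains punctuation or mixed case, a hedged variant could normalize the tokens first. The helpers tokens and matches_all are illustrative names, and lowercasing both sides is an assumption about the desired behavior:

```python
import re

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

def tokens(text):
    # Lowercase the text and keep runs of word characters only,
    # which drops surrounding punctuation such as "dog." or "movie!".
    return set(re.findall(r"\w+", text.lower()))

def matches_all(text, word_lists):
    # True if the text shares at least one token with every list.
    toks = tokens(text)
    return all(toks & {w.lower() for w in words} for words in word_lists)

results = [
    matches_all(s, [list1, list2])
    for s in [
        "The quick brown fox jumps over the lazy dog.",
        "Fast and Furious was a good Movie!",
    ]
]
```

As with the list comprehension above, the result could be assigned back with df['Result'] = [matches_all(s, [list1, list2]) for s in df['string'].tolist()].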