Identify strings having words from two different lists
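I have a dataframe with three columns, like this:
index  string                                        Result
1      The quick brown fox jumps over the lazy dog
2      fast and furious was a good movie
I have two lists of words, like this:
list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]
I want to identify the strings that contain at least one word from list1 AND at least one word from list2, giving this result:
index  string                                        Result
1      The quick brown fox jumps over the lazy dog   TRUE
2      fast and furious was a good movie             FALSE
Explanation: the first sentence contains words from both lists, so the result is TRUE. The second sentence has only one word from list1, so the result is FALSE.
Can we do this in Python? I used the search techniques from NLTK, but I don't know how to combine the results from the two lists. Thanks.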
If your dataframe (the first two columns) is named df, you can do the following:
df['Result'] = (df['string'].str.contains('|'.join(list1))
                & df['string'].str.contains('|'.join(list2)))
Result:
   string                                        Result
0  The quick brown fox jumps over the lazy dog   True
1  fast and furious was a good movie             False
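One caveat with the pattern above: '|'.join(list1) matches substrings, so "over" would also be found inside "recovery". If whole-word matching is what you need, a minimal sketch is to wrap the alternation in word boundaries and escape each word. The helper name word_pattern and the third sample row are assumptions added for illustration; they are not part of the original question.

```python
import re
import pandas as pd

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

df = pd.DataFrame({"string": [
    "The quick brown fox jumps over the lazy dog",
    "fast and furious was a good movie",
    "a quick recovery",  # illustrative row: "over" occurs only inside "recovery"
]})

def word_pattern(words):
    # Hypothetical helper: builds \b(?:w1|w2|...)\b with each word regex-escaped.
    # The non-capturing group keeps str.contains from warning about match groups.
    return r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

# A row is True only if it has a whole-word hit from list1 AND from list2.
df["Result"] = (df["string"].str.contains(word_pattern(list1))
                & df["string"].str.contains(word_pattern(list2)))
print(df)
```

With the plain '|'.join pattern the third row would come out True; with word boundaries it comes out False.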
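In response to your comment, the following might be what you are after:
words = set(list1).union(set(list2))
df['Result_2'] = [[*words.intersection(s.split())] for s in df['string'].tolist()]
Result (the order of the words within each list may vary, since sets are unordered):
...  Result  Result_2
...  True    [dog, quick, brown, over]
...  False   [movie]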
Another option is to split the strings and use set.intersection and all in a list comprehension:
s_lists = [set(list1), set(list2)]
df['Result'] = [all(s_lst.intersection(s.split()) for s_lst in s_lists) for s in df['string'].tolist()]
Output:
   index  string                                        Result
0  1      The quick brown fox jumps over the lazy dog   True
1  2      fast and furious was a good movie             False
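Note that s.split() is case-sensitive and keeps punctuation attached to words, so "Dog" or "dog." would not match "dog". If your real data has mixed case or punctuation, a possible variant is to tokenize with a regular expression first. This is only a sketch: the function name has_word_from_each is hypothetical, and the capitalization and trailing period in the first sample sentence are added here to show the difference.

```python
import re

list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

def has_word_from_each(text, *word_lists):
    """Return True if text contains at least one whole word from every list."""
    # Lowercase the text and keep only runs of word characters,
    # so "Quick" and "dog." become "quick" and "dog".
    tokens = set(re.findall(r"\w+", text.lower()))
    # Each list must share at least one word with the tokens.
    return all(tokens & {w.lower() for w in words} for words in word_lists)

print(has_word_from_each("The Quick brown fox jumps over the lazy dog.", list1, list2))
print(has_word_from_each("fast and furious was a good movie", list1, list2))
```

To fill the column you could then write df['Result'] = [has_word_from_each(s, list1, list2) for s in df['string']].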