Python - Find matching string(s) between DataFrame column (scraped text) and list of strings
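I am having trouble comparing the strings in a DataFrame column against a list of strings.
Let me explain:
I collect data from social media for a personal project, and alongside that I created a list of strings like this: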
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
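There are more words in the list, but this should give you the idea.
My goal is to compare every entry in this list against 2 existing DF columns that contain the title and the post message (from reddit). To be clear, I want to create a new column that shows which words from my list occur in the column containing the post.
Here is what I have done so far: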
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
df['matched text'] = df.text_lemmatized.str.extract('({0})'.format('|'.join(the_list)), flags = re.IGNORECASE)
df = df[~pd.isna(df['matched text'])]
df
>>Output:
title_lemmatized text_lemmatized matched_word(s)
0 Title1 'claim thorough vet...' 'ai'
1 Title2 'Yeaaah today iota...' 'IoT'
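Problem: the main issue is that it returns letters that match a list entry, not actual whole words.
Example:
--> the_list contains 'ai' (artificial intelligence) or 'IoT' (Internet of Things)
--> if the text in df['text_lemmatized'] contains the word 'claim', then 'ai' is returned as a match. Likewise, 'Iota' matches 'IoT'.
What I want: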
title_lemmatized text_lemmatized matched_word(s)
0 Title1 'AI claim that Iot devises...' 'AI', 'IoT'
1 Title2 'The claim story about...'
2 Title3 'augmented reality and ai are...' 'augmented reality', 'ai'
3 Title4 'AI ai or artificial intelligence' 'AI', 'ai', 'artificial intelligence'
Thanks a lot :)
You have to add word boundaries '\b' to your regex pattern. From the re module docs:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
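To see what the boundaries change for the cases in the question, here is a minimal sketch that uses only the standard re module. The two-entry term list and the sample sentences are illustrative assumptions, not the asker's real data.

```python
import re

# Hypothetical two-term list and sample texts, chosen to mirror the question
terms = ['AI', 'IoT']
texts = ['claim thorough vet', 'Yeaaah today iota', 'AI claim that Iot devices']

alternation = '|'.join(terms)
loose = re.compile('({0})'.format(alternation), flags=re.IGNORECASE)
strict = re.compile(r'\b({0})\b'.format(alternation), flags=re.IGNORECASE)

# Without \b the terms also match inside longer words ('claim', 'iota');
# with \b only whole words match.
loose_hits = [loose.findall(t) for t in texts]
strict_hits = [strict.findall(t) for t in texts]

print(loose_hits)   # [['ai'], ['iot'], ['AI', 'ai', 'Iot']]
print(strict_hits)  # [[], [], ['AI', 'Iot']]
```

Note that the raw-string prefix (r'...') matters here: in a normal string literal, '\b' is a backspace character rather than a regex word boundary.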
Besides that, you want to use Series.str.findall (or Series.str.extractall) instead of Series.str.extract to find all the matches.
This should work:
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
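import re
# Wrap the alternation in word boundaries so that only whole words or phrases match
pat = r'\b({0})\b'.format('|'.join(the_list))
# findall returns a list of all matches per row; join each list into one string
df['matched text'] = df.text_lemmatized.str.findall(pat, flags=re.IGNORECASE).map(", ".join)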
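Two details can trip this up on real scraped data: list entries that contain regex metacharacters, and rows with missing text (for which findall yields NaN, so ", ".join raises a TypeError). The following is a slightly more defensive sketch, not part of the original answer; the sample frame and the shortened list are assumptions made for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data; the real frame comes from the asker's scraper
df = pd.DataFrame({
    'text_lemmatized': [
        'AI claim that IoT devices',
        'The claim story about',
        'augmented reality and ai are',
        None,  # a post with no text
    ]
})
the_list = ['AI', 'IoT', 'augmented reality']  # shortened for the example

# re.escape guards against list entries that contain regex metacharacters;
# the non-capturing group (?:...) makes findall return the whole match.
pat = r'\b(?:{0})\b'.format('|'.join(re.escape(term) for term in the_list))

# Replace missing text with '' so every row yields a list (possibly empty)
matches = df['text_lemmatized'].fillna('').str.findall(pat, flags=re.IGNORECASE)
df['matched text'] = matches.map(', '.join)

# Keep only the rows where at least one term matched
matched_only = df[df['matched text'] != '']
print(matched_only)
```

Because the matches are joined into a string, rows without a match end up with an empty string rather than NaN, so the filter compares against '' instead of using pd.isna as in the question.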