Python - Find matching string(s) between DataFrame column (scraped text) and list of strings

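I am having trouble comparing the strings in a DataFrame column against a list of strings.

Let me explain: I collect data from social media for a personal project, and alongside that I created a list of strings like the following: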

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

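There are more words, but this is just to give you an idea.

My goal is to compare every word in this list against 2 existing DataFrame columns that contain the titles and post messages (from Reddit). To be clear, I want to create a new column that shows the words that match between my list and the column containing the posts.

Here is what I have done so far: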

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

df['matched text'] = df.text_lemmatized.str.extract('({0})'.format('|'.join(the_list)), flags = re.IGNORECASE)
df = df[~pd.isna(df['matched text'])]

df

>>Output:

      title_lemmatized   text_lemmatized        matched_word(s)
0         Title1       'claim thorough vet...'      'ai'
1         Title2       'Yeaaah today iota...'       'IoT'

Here is the output result for more details.

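The problem: it returns substrings that happen to match a list entry, not actual whole words.

Example:

--> the_list = 'ai' (artificial intelligence) or 'IoT' (Internet of Things)

--> If the df['text_lemmatized'] text contains the word 'claim', then 'ai' is reported as a match. Likewise, 'Iota' matches 'IoT'.

What I want: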

   title_lemmatized       text_lemmatized             matched_word(s)
0    Title1         'AI claim that Iot devises...'      'AI', 'IoT'
1    Title2         'The claim story about...'
2    Title3         'augmented reality and ai are...'   'augmented reality', 'ai'
3    Title4         'AI ai or artificial intelligence'  'AI', 'ai', 'artificial intelligence'

Thank you very much :)

You have to add word boundaries '\b' to your regex pattern. From the re module docs:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
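To make the effect of the boundary concrete, here is a minimal sketch using only the standard re module. The sample sentence and the two-term pattern are made up for illustration; they are not taken from the asker's data.

```python
import re

# Hypothetical sample text, chosen only to illustrate the boundary behaviour.
text = "AI claim that IoT devices beat iota"

# Without boundaries, 'ai' also matches inside 'claim' and 'iot' inside 'iota'.
loose = re.findall(r'(ai|iot)', text, flags=re.IGNORECASE)

# With \b on both sides, only whole words match.
strict = re.findall(r'\b(ai|iot)\b', text, flags=re.IGNORECASE)

print(loose)   # ['AI', 'ai', 'IoT', 'iot']
print(strict)  # ['AI', 'IoT']
```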

Besides that, you want to use Series.str.findall (or Series.str.extractall) instead of Series.str.extract to find all the matches.

This should work:
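The difference between the three methods can be seen on a tiny Series. This is only a sketch: the two sample strings and the shortened pattern are assumptions standing in for df.text_lemmatized and the full list.

```python
import re
import pandas as pd

# Hypothetical two-row Series standing in for df.text_lemmatized.
s = pd.Series(["augmented reality and ai are here", "the claim story"])
pat = r'\b(augmented reality|ai)\b'

# extract: a DataFrame holding only the first match per row (NaN if none).
first_only = s.str.extract(pat, flags=re.IGNORECASE)

# findall: a Series holding a list of every match per row (empty list if none).
all_lists = s.str.findall(pat, flags=re.IGNORECASE)

# extractall: a DataFrame with one row per match, indexed by (row, match number).
all_rows = s.str.extractall(pat, flags=re.IGNORECASE)

print(first_only[0].tolist())
print(all_lists.tolist())   # [['augmented reality', 'ai'], []]
print(all_rows[0].tolist()) # ['augmented reality', 'ai']
```

findall is probably the most convenient here, since it keeps one entry per original row and so can be assigned straight back to a new column.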

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

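import re

# Escape each term so it is matched literally, then wrap the alternation in word boundaries.
pat = r'\b({0})\b'.format('|'.join(map(re.escape, the_list)))
# findall returns every match per row; join them into one comma-separated string.
df['matched text'] = df.text_lemmatized.str.findall(pat, flags = re.IGNORECASE).map(", ".join)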
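For anyone who wants to try this end to end, below is a self-contained sketch. The four-term list and the sample rows are hypothetical, loosely modelled on the desired output in the question. Two details go beyond the answer and are my own assumptions about what real scraped data may need: missing texts are replaced with an empty string before matching, and the terms are sorted longest first so a longer phrase is tried before a shorter one that starts the same way.

```python
import re
import pandas as pd

# Hypothetical shortened term list and sample rows (not the asker's real data).
terms = ['AI', 'IoT', 'augmented reality', 'artificial intelligence']
df = pd.DataFrame({
    'title_lemmatized': ['Title1', 'Title2', 'Title3', 'Title4'],
    'text_lemmatized': [
        'AI claim that IoT devices',
        'The claim story about',
        'augmented reality and ai are',
        None,  # a post with no body text
    ],
})

# Longest terms first, each escaped so it is matched literally.
ordered = sorted(terms, key=len, reverse=True)
pat = r'\b(?:{0})\b'.format('|'.join(re.escape(t) for t in ordered))

df['matched text'] = (
    df['text_lemmatized']
    .fillna('')                              # avoid errors on missing text
    .str.findall(pat, flags=re.IGNORECASE)   # list of all whole-word matches
    .map(', '.join)                          # one comma-separated string per row
)

print(df['matched text'].tolist())
# ['AI, IoT', '', 'augmented reality, ai', '']
```

Rows without any match end up with an empty string rather than NaN, so filter on `df['matched text'] != ''` if you only want the rows that matched.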