Python - Check if keyword from list is in string (as a whole word) & return found keyword

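I haven't found a solution that addresses this specific case, so here is my question.

I have a list of keywords that I want to match against strings scraped from a website. The list is stored in its own Python file, Keywords, and looks like this: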

keywords = [
    "FDA",
    "Contract",
    "Vaccine",
    "Efficacy",
    "SARS",
    "COVID-19",
    "Cancer",
    "Exclusive",
    "Explosive",
    "Hydrogen",
    "Positive",
    "Phase"
]

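The file is imported, and I can access the list with Keywords.keywords.

#1 Match keywords against a string:

I want to check whether the scraped string article_title = item.select_one('h3 small').find_next_sibling(text=True).strip() contains one of these keywords. If it does, I want to scrape further content (I already have the code for that). Otherwise, I want to return to the top of my for loop and move on to the next title.

Here are sample values of the string article_title: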

Global Water and Sewage Market Report (2021 to 2030) - COVID-19 Impact and Recovery
Blackbaud CEO Mike Gianoni Named One of 50 Most Influential by Charleston Business Magazine
Statement from Judy R. McReynolds on Signing of HR1319, the American Rescue Plan Act of 2021

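What is the best way to match the keyword list against a string while matching whole words only? I found several approaches on SO, but all of them seem to have flaws that commenters pointed out, which left me confused.

#2 Store the found keyword in a variable:

When a keyword matches, I store article_title and some other variables in a database. I also want to store the keyword that produced the entry, so that I know how many times each keyword has been found. The variable holding the found keyword should be called article_keyword. Is there a way not only to match keywords against the string but also to store the keyword that was found? If so, I'd appreciate help with it.

If the information provided is not sufficient, let me know in a comment and I will add the full code. I left it out only to keep the question short.

Here is one way to do it with regex: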

import re

keywords = [
    "FDA",
    "Contract",
    "Vaccine",
    "Efficacy",
    "SARS",
    "COVID-19",
    "Cancer",
    "Exclusive",
    "Explosive",
    "Hydrogen",
    "Positive",
    "Phase"
]

titles = [
    "Global Water and Sewage Market Report (2021 to 2030) - COVID-19 Impact and Recovery",
    "Blackbaud CEO Mike Gianoni Named One of 50 Most Influential by Charleston Business Magazine",
    "Statement from Judy R. McReynolds on Signing of HR1319, the American Rescue Plan Act of 2021",
]

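# Whole words only: the raw f-string keeps \b as a word boundary (not a backspace),
# and re.escape protects keywords that contain regex metacharacters.
pattern = '|'.join(rf"\b{re.escape(k)}\b" for k in keywords)
matches = {k: 0 for k in keywords}  # keyword -> number of occurrences
for title in titles:
    for match in re.findall(pattern, title):
        matches[match] += 1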
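
To also capture which keyword matched (the question's article_keyword) and skip titles without a match, a minimal sketch along the same regex lines might look like this. The helper name find_keyword, the shortened keyword list, and the stored list standing in for the database are assumptions made for illustration.

```python
import re

keywords = ["FDA", "Vaccine", "COVID-19", "Phase"]  # shortened list for illustration

# Raw f-string so \b is a regex word boundary; re.escape guards metacharacters.
pattern = re.compile("|".join(rf"\b{re.escape(k)}\b" for k in keywords))


def find_keyword(title):
    """Return the keyword that occurs earliest in title as a whole word, or None."""
    match = pattern.search(title)
    return match.group(0) if match else None


titles = [
    "Global Water and Sewage Market Report (2021 to 2030) - COVID-19 Impact and Recovery",
    "Blackbaud CEO Mike Gianoni Named One of 50 Most Influential by Charleston Business Magazine",
]

stored = []  # hypothetical stand-in for the database insert
for article_title in titles:
    article_keyword = find_keyword(article_title)
    if article_keyword is None:
        continue  # no keyword found: move on to the next title
    stored.append((article_title, article_keyword))
```

Note that search returns the keyword that appears earliest in the title, which is not necessarily the one listed first in keywords.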

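You can iterate over the list and use the 'in' operator to check whether each keyword occurs in the string. Note that 'in' tests for substrings, so it also matches a keyword inside a longer word (for example, "Phase" inside "Phases"):

strings = [
    "Global Water and Sewage Market Report (2021 to 2030) - COVID-19 Impact and Recovery",
    "Blackbaud CEO Mike Gianoni Named One of 50 Most Influential by Charleston Business Magazine",
    "Statement from Judy R. McReynolds on Signing of HR1319, the American Rescue Plan Act of 2021",
]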

keywords = [
    "FDA",
    "Contract",
    "Vaccine",
    "Efficacy",
    "SARS",
    "COVID-19",
    "Cancer",
    "Exclusive",
    "Explosive",
    "Hydrogen",
    "Positive",
    "Phase"
]

article_keywords = {}

for string in strings:
    for word in keywords:
        if word in string:
            article_keywords[string] = word
            break

print(article_keywords)

In the dictionary (article_keywords), each key is a title string and each value is the first keyword from the list that was found in it.
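
If whole-word matching and per-keyword counts are both needed, one possible sketch combines a word-boundary regex with collections.Counter. The helper keywords_in, the made-up sample titles, and the choice of case-insensitive matching are assumptions, not part of the original answers.

```python
import re
from collections import Counter

keywords = ["FDA", "Vaccine", "COVID-19", "Phase"]  # shortened list for illustration

# Map lower-cased matches back to the canonical keyword spelling.
canonical = {k.lower(): k for k in keywords}
pattern = re.compile(
    "|".join(rf"\b{re.escape(k)}\b" for k in keywords),
    re.IGNORECASE,  # assumption: "vaccine" should count as "Vaccine"
)


def keywords_in(title):
    """Return the distinct keywords that occur as whole words in title."""
    return {canonical[m.lower()] for m in pattern.findall(title)}


# Hypothetical titles, chosen to show the whole-word behaviour.
titles = [
    "FDA approves vaccine after Phase 3 trial",
    "New phases of covid-19 recovery",
]

counts = Counter()
for title in titles:
    counts.update(keywords_in(title))  # each keyword counted once per title
```

Here "phases" in the second title is not counted as "Phase", whereas a plain substring check such as "Phase" in "Phases" would report a match.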