Count the number of times a group of words appears in a text

I have 4 lists of words and a text that has already been split into words.

animals = ["cat", "dog", "fish"]
colours = ["blue", "red", "green"]
food = ["pasta", "chips", "beef"]
sport = ["football", "basketball", "tennis"]

text = ["Once","upon","a","time",.......]

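I want to count how many times the words from these lists appear in a given text, summed per list. The result would show, for example, that the whole text contains 10 animal words, 20 colour words, 6 food words and 13 sport words.

The data I am actually working with is very large, so I need something that runs fast.

Thanks for your help!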

animalOccurences = 0

for word in text:
    if word in animals:
        animalOccurences += 1

Here I loop over every word in the text list and check whether that word is in the animals list. If it is, I add 1 to the animalOccurences variable.
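Since the question stresses speed, one small variation on the loop above is worth sketching: `word in animals` scans the list on every check, whereas a set lookup takes constant time on average. The sample sentence and the names `animal_set` and `animal_occurrences` are made up for illustration and do not come from the question.

```python
animals = ["cat", "dog", "fish"]

# Illustrative sample text (an assumption; the real text is elided in the question).
text = "the cat chased the dog past another cat".split()

# Build the set once, outside the loop, so each membership test is O(1) on average.
animal_set = set(animals)

# Count the words of the text that belong to the animals category.
animal_occurrences = sum(1 for word in text if word in animal_set)

print(animal_occurrences)
```

The same pattern would have to be repeated for each of the other three lists.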

You could change your categories into a dict of set objects (which allows O(1) membership tests):

categories = {'animals': {'cat', 'dog', 'fish'},
              'colours': {'blue', 'green', 'red'},
              'food': {'beef', 'chips', 'pasta'},
              'sport': {'basketball', 'football', 'tennis'}}

Then iterate over the words and run a membership test against each category set:

def count_words(text, categories):
    # Start every category at zero.
    counts = dict.fromkeys(categories, 0)
    for word in text:
        for cat_name, cat_words in categories.items():
            # bool is a subclass of int, so True adds 1 and False adds 0.
            counts[cat_name] += word in cat_words
    return counts

Usage:

In [19]: text = "Once upon a time there was a proper minimal reproducible example given by the OP without anybody having to ask for it".split()

In [20]: count_words(text, categories)
Out[20]: {'animals': 0, 'colours': 0, 'food': 0, 'sport': 0}

In [21]: text = ("cat dog fish "*3).split()

In [22]: count_words(text, categories)
Out[22]: {'animals': 9, 'colours': 0, 'food': 0, 'sport': 0}
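For very large texts, a possible variation (not part of the answer above, and not benchmarked here) is to count each distinct word once with `collections.Counter` and then look its categories up in an inverted word-to-category mapping. That way the per-category work is done once per distinct word rather than once per token. The function name `count_words_inverted` and the helper mapping `word_to_cats` are hypothetical names chosen for this sketch.

```python
from collections import Counter

categories = {'animals': {'cat', 'dog', 'fish'},
              'colours': {'blue', 'green', 'red'},
              'food': {'beef', 'chips', 'pasta'},
              'sport': {'basketball', 'football', 'tennis'}}


def count_words_inverted(text, categories):
    # Invert the mapping: word -> list of category names containing it.
    # A list is used because a word could belong to more than one category.
    word_to_cats = {}
    for cat_name, cat_words in categories.items():
        for word in cat_words:
            word_to_cats.setdefault(word, []).append(cat_name)

    counts = dict.fromkeys(categories, 0)
    # Counter tallies each distinct word in a single pass over the text.
    for word, n in Counter(text).items():
        for cat_name in word_to_cats.get(word, ()):
            counts[cat_name] += n
    return counts


text = ("cat dog fish " * 3).split()
result = count_words_inverted(text, categories)
print(result)
```

Whether this is actually faster depends on how repetitive the text is and how many categories there are, so it is worth timing both versions on the real data.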