Count the occurrences of a wordlist within a string observation

Question

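I have a list of the 10 most frequently occurring words in the abstracts of academic articles. I want to count how many times these words occur in each observation of my dataset.

The top 10 words are: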

top10 = ['model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance']

An example of the first 3 observations:

column[0:3] = ['The models are showing a great performance.',
'The information and therefor the data in the text are good enough to fulfill the task.',
'Data in this way results in the best information and thus performance.']

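The code should return a list with the total number of occurrences of all the top 10 words in each observation. I tried the following code, but it gives the error: count() takes at most 3 arguments (10 given).

My code: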

count = 0
for sentence in column:
    for word in sentence.split():
        count += word.lower().count('model','language','models','task', 'data', 'paper', 'results', 'information', 'text','performance')

I also want to lowercase all words and remove punctuation. So the output should look like this:

output = (2, 4, 4)

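The first observation counts 2 words from the top10 list, namely models and performance.

The second observation counts 4 words from the top10 list, namely information, data, text and task.

The third observation counts 4 words, namely data, results, information and performance.

Hope you can help me out!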

Answer 1

You can use a regular expression to split each sentence into words, then simply check whether each word is in the top 10.

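import re

count = []
for sentence in column:
    c = 0
    # \w+ matches runs of word characters, so punctuation is dropped
    for word in re.findall(r'\w+', sentence):
        # add 1 when the lowercased word is one of the top 10 words
        c += int(word.lower() in top10)
    count.append(c)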

count = [2, 4, 4]
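As background on the original error: str.count accepts a single substring plus optional start and end positions, which is why passing ten words fails. The same idea as the loop above can be wrapped in a reusable function; the following is only a sketch, and the function name count_top_words and the choice of a set for lookups are illustrative assumptions, not part of the original answer.

```python
import re

def count_top_words(sentences, words):
    """Return, for each sentence, how many of its tokens appear in `words`.

    Hypothetical helper name, used here only for illustration.
    """
    wanted = {w.lower() for w in words}  # a set makes membership checks cheap
    counts = []
    for sentence in sentences:
        # lowercase first, then keep only runs of word characters (no punctuation)
        tokens = re.findall(r"\w+", sentence.lower())
        # True counts as 1, so sum() gives the number of matching tokens
        counts.append(sum(token in wanted for token in tokens))
    return counts

top10 = ['model', 'language', 'models', 'task', 'data', 'paper',
         'results', 'information', 'text', 'performance']
column = ['The models are showing a great performance.',
          'The information and therefor the data in the text are good enough to fulfill the task.',
          'Data in this way results in the best information and thus performance.']

print(count_top_words(column, top10))  # prints [2, 4, 4]
```

Repeated words are counted each time they appear, which matches the "total occurrences" requirement in the question.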

python

string

find-occurrences

multiple-occurrence