How to create a function that scores ngrams before unigrams in Python?

Question

Suppose I want to score text using a dictionary named dictionary:

import pandas as pd

text = "I would like to reduce carbon emissions"

dictionary = pd.DataFrame({'text': ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"], 'score': [1, -1, -1, -1, 1]})

I want to write a function that sums the scores of every dictionary entry that appears in text. The rule needs one nuance, however: ngrams take priority over unigrams.

Specifically, if I sum the unigrams from dictionary that appear in text, I get 1+(-1)+(-1)+(-1)=-2, since like=1, reduce=-1, carbon=-1, emissions=-1. This is not what I want. The function must follow these rules:

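Consider ngrams first (reduce carbon emissions in the example). If the set of matched ngrams is not empty, assign their scores; if it is empty, fall back to unigrams.
If the set of matched ngrams is not empty, ignore the single words (unigrams) that are part of a matched ngram (for example, ignore "reduce", "carbon" and "emissions", which are already in "reduce carbon emissions").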

Such a function should give me this output: +2, since like = 1 and reduce carbon emissions = 1.

I am new to Python and I am stuck. Can anyone help me with this?

Thanks!

Answer 1

I would sort the keywords by length in descending order, which guarantees that re tries ngrams before unigrams:

import re

pat = '|'.join(sorted(dictionary.text, key=len, reverse=True))

found = re.findall(fr'\b({pat})\b', text)

Output:

['like', 'reduce carbon emissions']
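To see why the sort matters, here is a small comparison sketch. It uses a plain list named terms (a stand-in for dictionary.text, introduced only for this illustration) so that it runs without pandas. A regex alternation takes the first alternative that matches at a given position, so the order of the keywords decides whether the trigram or the single word wins.

```python
import re

text = "I would like to reduce carbon emissions"
# Hypothetical stand-in for dictionary.text, in its original order.
terms = ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"]

# Unsorted: "reduce" is listed before the trigram, so it matches first
# and the trigram never gets a chance.
unsorted_pat = "|".join(terms)
unsorted_found = re.findall(fr"\b({unsorted_pat})\b", text)
print(unsorted_found)  # ['like', 'reduce', 'carbon', 'emissions']

# Sorted by length, longest first: the trigram is tried before its parts.
sorted_pat = "|".join(sorted(terms, key=len, reverse=True))
sorted_found = re.findall(fr"\b({sorted_pat})\b", text)
print(sorted_found)  # ['like', 'reduce carbon emissions']
```

Because re.findall returns non-overlapping matches, the words consumed by the trigram are not matched again as unigrams.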

To get the expected output:

scores = dictionary.set_index('text')['score']

scores.reindex(found).sum()
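Since the question asks for a function, the steps above could be wrapped up roughly as follows. This is a minimal sketch, not the only way to do it: the name score_text is made up for this example, and it assumes the scores are passed as a plain mapping from term to score rather than as the DataFrame. It also adds re.escape, which the snippet above does not use, in case a dictionary entry contains regex metacharacters.

```python
import re


def score_text(text, scores):
    """Sum the scores of dictionary entries found in text, longest entries first.

    scores is assumed to be a mapping such as {"like": 1, "reduce": -1}.
    """
    if not scores:
        return 0
    # Longest entries first, so the alternation tries ngrams before unigrams.
    terms = sorted(scores, key=len, reverse=True)
    # re.escape guards against entries that contain regex metacharacters.
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    # findall returns non-overlapping matches, so words inside a matched
    # ngram are not counted again as unigrams.
    return sum(scores[match] for match in re.findall(pattern, text))


text = "I would like to reduce carbon emissions"
scores = {
    "like": 1,
    "reduce": -1,
    "carbon": -1,
    "emissions": -1,
    "reduce carbon emissions": 1,
}
print(score_text(text, scores))  # 2
```

With the DataFrame from the question, the mapping can be built with dict(zip(dictionary["text"], dictionary["score"])). Note that matching is case-sensitive, as in the original snippet.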


python

module

sentiment-analysis

pandas

vader