PorterStemmer() trims the last word in a sentence differently

I have the following code, which runs in an offline environment:

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams':  ['First value because one does two THREE', 'Second value because three and three four', 'Third donkey three']}
test = pd.DataFrame(test, columns = ['grams'])
STOPWORDS = {'and', 'does', 'because'}

def rower(x):
    cleanQ = []  
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())
    
    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ[:] = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    splitQ = list(map(' '.join, splitQ))
    print(splitQ)
    
    originQ = []    
    for i in splitQ: 
        originQ.append(PorterStemmer().stem(i))
    print(originQ)
    
rower(test.grams)

It produces this:

['first value one two three', 'second value three three four', 'third donkey three']
['first value one two thre', 'second value three three four', 'third donkey thre']

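The first list shows the sentences before PorterStemmer() is applied. The second list shows the sentences after PorterStemmer() is applied.

As you can see, PorterStemmer() trims the word three to thre only when it is the last word in the sentence. When three is not the last word, it stays three. I can't figure out why it does this. I am also worried that if I apply the rower(x) function to other sentences, it might produce similar results without my noticing.

How can I prevent PorterStemmer from treating the last word differently?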

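The main mistake here is that you are passing several words at once to the stemmer instead of one word at a time. The whole string (third donkey three) is treated as a single word, so only its ending gets stemmed.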

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams': ['First value because one does two THREE', 'Second value because three and three four',
                  'Third donkey three']}
test = pd.DataFrame(test, columns=['grams'])
STOPWORDS = {'and', 'does', 'because'}

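# Create the stemmer once and reuse it for every word
ps = PorterStemmer()

def rower(x):
    # Replace punctuation and whitespace with spaces, then lowercase
    cleanQ = []
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())

    # Split each sentence into words and drop the stopwords
    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    print('IN:', splitQ)
    # Stem one word at a time instead of the joined sentence
    originQ = [[ps.stem(word) for word in sent] for sent in splitQ]
    print('OUT:', originQ)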


rower(test.grams)

Output:

IN: [['first', 'value', 'one', 'two', 'three'], ['second', 'value', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
OUT: [['first', 'valu', 'one', 'two', 'three'], ['second', 'valu', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
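
If you need each sentence back as a single string, as in the original rower, the stemmed words can be joined again afterwards. A minimal sketch, assuming the input is already cleaned and lowercased; stem_sentence is a hypothetical helper name, not part of NLTK:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_sentence(sentence):
    # Hypothetical helper: stem every whitespace-separated word on its
    # own, then join the stems back into one string.
    return ' '.join(ps.stem(word) for word in sentence.split())

whole = ps.stem('third donkey three')           # whole string treated as one word
per_word = stem_sentence('third donkey three')  # each word stemmed separately
print(whole)     # third donkey thre
print(per_word)  # third donkey three
```

The two printed lines match the outputs shown in the question and in this answer: the trailing e is lost only when the whole sentence is handed to stem() as one word.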

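There is a good explanation of why stemming drops the final 'e' from some words. If the output is not what you expect, consider using a lemmatizer instead.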

How to stop NLTK stemmer from removing the trailing “e”?
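
To make the trailing 'e' behaviour more concrete, here is a rough pure-Python sketch of the rule from the published Porter algorithm that removes a final e (step 5a). It is a simplification for illustration, not NLTK's code: NLTK runs several earlier steps first and adds its own extensions, so results can differ for some words. The names is_consonant, measure, ends_cvc and drop_final_e are made up for this sketch.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    # a, e, i, o, u are vowels; y is a vowel only when it follows a consonant.
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True  # anything else, including a space, counts as a consonant

def measure(stem):
    # Count vowel-run -> consonant-run transitions: the m in [C](VC)^m[V].
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def ends_cvc(stem):
    # True if the stem ends consonant-vowel-consonant and the last
    # letter is not w, x or y.
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )

def drop_final_e(word):
    # Step 5a only: drop a trailing e if m > 1, or if m == 1 and the
    # remaining stem does not end in consonant-vowel-consonant.
    if not word.endswith("e"):
        return word
    stem = word[:-1]
    m = measure(stem)
    if m > 1 or (m == 1 and not ends_cvc(stem)):
        return stem
    return word

print(measure("thre"), drop_final_e("three"))
print(measure("valu"), drop_final_e("value"))
print(measure("third donkey thre"), drop_final_e("third donkey three"))
```

On its own, three leaves the stem thre with measure 0, so the e is kept. Joined into third donkey three, the earlier words raise the measure to 3, the string looks like one long word, and the e is removed. value has measure 1 and its stem does not end consonant-vowel-consonant, which is why it becomes valu even when stemmed correctly.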