Stemming process not working in Python

I have a text file that I am trying to stem after removing stopwords, but when I run it nothing seems to change. My file is called data0.

Here is my code:

## Removing stopwords and tokenizing by words (split each word)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data0 = word_tokenize(data0)
data0 = ' '.join([word for word in data0 if word not in (stopwords.words('english'))])
print(data0)

## Stemming the data
from nltk.stem import PorterStemmer

ps = PorterStemmer()
data0 = ps.stem(data0)
print(data0)

I then get the following (wrapped here for readability):

For us around Aberdeen , question `` What oil industry ? ( Evening Express , October 26 ) touch deja vu . That question asked almost since day first drop oil pumped North Sea . In past 30 years seen constant cycle ups downs , booms busts industry . I predict happen next . There period worry uncertainty scrabble find something keep local economy buoyant oil gone . Then upturn see jobs investment oil , everyone breathe sigh relief quest diversify go back burner . That downfall . Major industries prone collapse . Look nation 's defunct shipyards extinct coal steel industries . That 's vital n't panic downturns , start planning sensibly future . Our civic business leaders need constantly looking something secure prosperity - tourism , technology , bio-science emerging industries . We need economically strong rather waiting see happens oil roller coaster hits buffers . N JonesEllon

The first part of the code works fine (removing stopwords and tokenizing), but the second part (stemming) does not seem to work. Any idea why?

According to the stemmer documentation at http://www.nltk.org/howto/stem.html, the stemmer is designed to be called on one word at a time.
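A quick illustration of the difference (a minimal sketch; the outputs in the comments are what the Porter rules produce):

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Called per word, the stemmer behaves as expected:
print(ps.stem('industries'))  # -> 'industri'
print(ps.stem('pumped'))      # -> 'pump'

# Called on a whole sentence, the string is treated as one long token:
# only a suffix at the very end of the string can match a rule, so
# everything before it comes back unchanged (apart from lowercasing).
print(ps.stem('The oil industries'))  # -> 'the oil industri'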

Try running the stemmer on each word in

[word for word in data0 if word not in (stopwords.words('english'))]

i.e. before calling join. For example:

stemmed_list = []
for word in [word for word in data0 if word not in (stopwords.words('english'))]:
    stemmed_list.append(ps.stem(word))
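To get back a single string like your original code produced, join the stemmed words afterwards:

data0 = ' '.join(stemmed_list)
print(data0)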

Edit: in reply to the comment, I ran the following and it appears to be correct:

>>> from nltk.stem import PorterStemmer
>>> ps = PorterStemmer()
>>> data0 = '''<Your Data0 string>'''
>>> words = data0.split(" ")
>>> stemmed_words = map(ps.stem, words)
>>> print(list(stemmed_words))  # list cast needed because of 'map'
[..., 'industri', ..., 'diversifi']

I don't think there is a stemmer that can be applied directly to a full text, but you can wrap one in your own function that takes the text and 'ps':

def my_stem(text, stemmer):
    # Stem each whitespace-separated word, then rejoin into one string
    words = text.split(" ")
    stemmed_words = map(stemmer.stem, words)
    result = " ".join(stemmed_words)
    return result
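Usage would then look something like this (assuming the ps and data0 from the question):

print(my_stem(data0, ps))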

Here is what I have done in the past with NLTK:

import nltk
from nltk.stem import PorterStemmer

st = PorterStemmer()

def stem_tokens(tokens):
    # Lazily yield the stem of each token
    for item in tokens:
        yield st.stem(item)

def go(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join(stem_tokens(tokens))
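This tokenizes, stems each token, and rejoins with spaces. For example (assuming data0 holds the original text):

print(go(data0))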