Why does PorterStemmer in NLTK convert my "string" into u"string"?

Question

import nltk
import string
from nltk.corpus import stopwords


from collections import Counter

def get_tokens():
    with open('comet_interest.xml','r') as bookmark:
        text=bookmark.read()
        lowers=text.lower()

        no_punctuation=lowers.translate(None,string.punctuation)
        tokens=nltk.word_tokenize(no_punctuation)
        return tokens
#remove stopwords
tokens=get_tokens()
filtered = [w for w in tokens if not w in stopwords.words('english')]
count = Counter(filtered)
print count.most_common(10)

#stemming
from nltk.stem.porter import *

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

stemmer = PorterStemmer()
stemmed = stem_tokens(filtered, stemmer)
count = Counter(stemmed)
print count.most_common(10)

The output is:

[('analysis', 13), ('spatial', 11), ('feb', 8), ('cdata', 8), ('description', 7), ('item', 6), ('many', 6), ('pm', 6), ('link', 6), ('research', 5)]

[(u'analysi', 13), (u'spatial', 11), (u'use', 11), (u'feb', 8), (u'cdata', 8), (u'scienc', 7), (u'descript', 7), (u'item', 6), (u'includ', 6), (u'mani', 6)]

What is wrong with the second result, the one after stemming, and why does every word have a "u" prefix?

Answer 1

As @kindall pointed out, this is because the strings are unicode strings.

More specifically, it is because NLTK uses from __future__ import unicode_literals, which turns ALL string literals in that module into unicode by default; see https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L87

So let's try an experiment in Python 2.x:

$ python
>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> word = "analysis"
>>> word
'analysis'
>>> porter.stem(word)
u'analysi'

We see that the stemmed word has suddenly become unicode.

Next, let's try importing unicode_literals ourselves:

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> word = "analysis"
>>> word
'analysis'
>>> porter.stem(word)
u'analysi'
>>> from __future__ import print_function, unicode_literals
>>> word
'analysis'
>>> word2 = "analysis"
>>> word2
u'analysis'

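Note that strings created before the import remain plain byte strings, but any new string variable created after importing unicode_literals is unicode by default.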
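In practice the u prefix is only part of how Python 2 displays a unicode string inside a list or tuple (its repr()); the text itself is unchanged. As a minimal sketch, assuming you only want to print the counts without the prefix, you can format each word individually instead of printing the whole list. The helper name format_counts and the hard-coded stemmed list are hypothetical stand-ins for the stemmer output, so the snippet runs without NLTK.

```python
from collections import Counter


def format_counts(pairs):
    """Return one 'word count' string per (word, count) pair.

    Formatting each word on its own uses its text rather than its repr(),
    so no u'' prefix appears under Python 2.
    """
    return ["%s %d" % (word, n) for word, n in pairs]


# Hypothetical stemmer output, hard-coded for illustration.
stemmed = [u'analysi', u'analysi', u'analysi', u'spatial', u'spatial', u'use']
top = Counter(stemmed).most_common(3)

# An ASCII-only unicode string compares equal to the matching plain string.
assert top[0][0] == 'analysi'

for line in format_counts(top):
    print(line)
```

The same idea applies to the question's code: looping over count.most_common(10) and printing each word and count separately avoids the repr() of the list, so nothing about the stemming needs to change.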
