Why does PorterStemmer in NLTK convert my "string" into u"string"?

import nltk
import string
from nltk.corpus import stopwords


from collections import Counter

def get_tokens():
    with open('comet_interest.xml','r') as bookmark:
        text=bookmark.read()
        lowers=text.lower()

        no_punctuation=lowers.translate(None,string.punctuation)
        tokens=nltk.word_tokenize(no_punctuation)
        return tokens
#remove stopwords
tokens=get_tokens()
filtered = [w for w in tokens if not w in stopwords.words('english')]
count = Counter(filtered)
print count.most_common(10)

#stemming
from nltk.stem.porter import *

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

stemmer = PorterStemmer()
stemmed = stem_tokens(filtered, stemmer)
count = Counter(stemmed)
print count.most_common(10)

The output is:

[('analysis', 13), ('spatial', 11), ('feb', 8), ('cdata', 8), ('description', 7), ('item', 6), ('many', 6), ('pm', 6), ('link', 6), ('research', 5)]

[(u'analysi', 13), (u'spatial', 11), (u'use', 11), (u'feb', 8), (u'cdata', 8), (u'scienc', 7), (u'descript', 7), (u'item', 6), (u'includ', 6), (u'mani', 6)]

What is wrong with the second (stemmed) result? Why does every word have a "u" prefix?

As @kindall pointed out, it is because the strings are unicode strings.

More specifically, it is because NLTK uses from __future__ import unicode_literals, which makes ALL string literals in that module unicode by default; see https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L87

So let's try an experiment in Python 2.x:

$ python
>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> word = "analysis"
>>> word
'analysis'
>>> porter.stem(word)
u'analysi'

We see that the stemmed word has suddenly become unicode, even though the input was a plain str.

Then let's try importing unicode_literals ourselves:

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> word = "analysis"
>>> word
'analysis'
>>> porter.stem(word)
u'analysi'
>>> from __future__ import print_function, unicode_literals
>>> word
'analysis'
>>> word2 = "analysis"
>>> word2
u'analysis'

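Note that strings that already exist stay as str (word is still 'analysis'), but any new string literal created after importing unicode_literals becomes unicode by default (word2 is u'analysis').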
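
As a side note, the u prefix is only part of how Python 2 displays a unicode string's repr, which is what you get when you print a whole list of tuples. It is not part of the word itself. Here is a minimal sketch of printing the counts one item at a time so the prefix does not show up. The helper name format_counts and the sample counts (copied from the output in the question) are just for illustration, and the snippet is written so that it should also run unchanged on Python 3.

```python
from __future__ import print_function

from collections import Counter


def format_counts(counter, n=10):
    """Return the n most common items as 'word count' lines.

    %s formatting uses the string's text rather than its repr,
    so a Python 2 unicode word is shown without the u'' wrapper.
    """
    return ['%s %d' % (word, freq) for word, freq in counter.most_common(n)]


# Sample counts taken from the stemmed output in the question.
count = Counter({u'analysi': 13, u'spatial': 11, u'feb': 8})

for line in format_counts(count, 3):
    print(line)
```

For plain ASCII words like these, the unicode and str versions also compare equal in Python 2, so the Counter results are unaffected; only the display differs.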