How to provide (or generate) tags for the nltk lemmatizer

Question

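I have a set of documents that I want to convert into a form that lets me compute tf-idf for the words in them, so that each document is represented by a vector of tf-idf values.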

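I thought calling WordNetLemmatizer.lemmatize(word) and then a PorterStemmer would be enough, but 'have', 'has', 'had', etc. are not all reduced to 'have' by the lemmatizer, and the same happens with other words. Then I read that I should give the lemmatizer a hint: a tag for the word's part of speech, such as noun, verb, or adjective.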

My question is: how do I get these tags? What do I need to run on the documents to obtain them?

I am using python3.4 and lemmatizing + stemming one word at a time. I have tried WordNetLemmatizer, nltk's EnglishStemmer, and stem() from stemming.porter2.

Answer 1

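OK, I googled some more and found out how to get these tags. First, some preprocessing is needed to make sure the file can be tokenized (in my case, this meant removing some debris left over from converting pdf to txt).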

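Then the file has to be split into sentences, and each sentence into an array of words, which can then be tagged by the nltk tagger. With those tags, lemmatization can be done, and stemming applied on top of it.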

from nltk.tokenize import sent_tokenize, word_tokenize
# use sent_tokenize to split text into sentences, and word_tokenize
# to split sentences into words
from nltk.tag import pos_tag
# use this to generate an array of tuples (word, tag)
# the tag can then be translated into a wordnet tag as in
# [this response][1].
from nltk.corpus import wordnet
# wordnet provides the ADJ / VERB / NOUN / ADV constants used below
from nltk.stem.wordnet import WordNetLemmatizer
from stemming.porter2 import stem

# code from response mentioned above
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''    


# myInput is the path to the preprocessed text file
with open(myInput, 'r') as f:
    data = f.read()
    sentences = sent_tokenize(data)
    ignoreTypes = ['TO', 'CD', '.', 'LS', ''] # my choice
    lmtzr = WordNetLemmatizer()
    for sent in sentences:
        words = word_tokenize(sent)
        tags = pos_tag(words)
        for (word, treebank_tag) in tags:
            if treebank_tag in ignoreTypes:
                continue
            tag = get_wordnet_pos(treebank_tag)
            if tag == '':
                # no wordnet equivalent for this tag, skip the word
                continue
            lema = lmtzr.lemmatize(word, tag)
            stemW = stem(lema)
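
To see which tokens actually reach the lemmatizer, here is a minimal sketch of the tag translation and filtering step on its own, without nltk installed. The single-letter codes are the values behind wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV; the names to_wordnet_pos and keep_lemmatizable and the hand-tagged sample sentence are made up for illustration.

```python
# WordNet's single-letter POS codes, keyed by the first letter of a
# Penn Treebank tag (J* = adjective, V* = verb, N* = noun, R* = adverb).
TREEBANK_PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}


def to_wordnet_pos(treebank_tag):
    """Return the WordNet POS code for a Treebank tag, or None if there is none."""
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1])


def keep_lemmatizable(tagged_words):
    """Keep (word, wordnet_pos) pairs for words whose tag maps to a WordNet POS."""
    result = []
    for word, treebank_tag in tagged_words:
        pos = to_wordnet_pos(treebank_tag)
        if pos is not None:
            result.append((word, pos))
    return result


# Hand-written tags in the style of pos_tag output, for illustration only.
sample = [('She', 'PRP'), ('has', 'VBZ'), ('two', 'CD'), ('cats', 'NNS'), ('.', '.')]
print(keep_lemmatizable(sample))  # [('has', 'v'), ('cats', 'n')]
```

Tags such as CD, TO, LS and '.' do not start with J, V, N or R, so they already fall through to "no WordNet tag" here; a separate ignore list only matters if you want to drop tags that do have a WordNet equivalent.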

At this point I have the stem stemW, which I can write to a file and then use to compute the tf-idf vector for each document.
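
For the last step, a rough sketch of how the tf-idf vectors could be computed once each document has been reduced to a list of stems. This is not part of the original answer: it assumes one common tf-idf variant (term frequency as count divided by document length, idf as log(N / df)), and the function name tfidf_vectors and the tiny sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of lists of stems. Returns one {stem: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each stem occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Two tiny "documents", already lemmatized and stemmed.
docs = [['have', 'cat'], ['have', 'dog']]
for vector in tfidf_vectors(docs):
    print(vector)
```

A stem that occurs in every document (here 'have') gets weight 0 under this variant, since log(N / df) is log(1). Libraries often use smoothed versions of idf, so their numbers may differ from this sketch.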

python

stemming

nltk

lemmatization