Stemming and lemmatizing words

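I have a text document that I need to apply stemming and lemmatization to. I have already cleaned the data, tokenized it, and removed the stop words.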

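What I need to do is take a list as input and return a dictionary with the keys 'original', 'stem' and 'lemma'. The values are the nth word transformed in each of those ways.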

  The Snowball stemmer is defined as stemmer
  and WordNetLemmatizer is defined as lemmatizer

This is the code I wrote, but it gives me an error:

def find_roots(token_list, n):
    n = 2
    original = tokens
    stem = [ele for sub in original for idx, ele in
            enumerate(sub.split()) if idx == (n - 1)]
    stem = stemmer(stem)
    lemma = [ele for sub in original for idx, ele in
             enumerate(sub.split()) if idx == (n - 1)]
    lemma = lemmatizer()
    return

Any help would be appreciated.

I don't really understand what you are trying to do in the list comprehensions, so I'll just write how I would do it:

from nltk import WordNetLemmatizer, SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")


def find_roots(token_list, n):
    token = token_list[n]
    stem = stemmer.stem(token)
    lemma = lemmatizer.lemmatize(token)
    return {"original": token, "stem": stem, "lemma": lemma}


roots_dict = find_roots(["said", "talked", "walked"], n=2)
print(roots_dict)
> {'original': 'walked', 'stem': 'walk', 'lemma': 'walked'}
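
Note that `token_list[n]` counts from 0, while the `idx == (n - 1)` in the question suggests the nth word was meant to be counted from 1. If that is the case, a minimal sketch of a 1-based variant could look like the one below. The name `find_nth_roots` and the `stem_fn` / `lemma_fn` parameters are my own additions for illustration; they are not part of the original code.

```python
def find_nth_roots(token_list, n, stem_fn, lemma_fn):
    """Return the original, stem and lemma of the n-th token, counting from 1.

    stem_fn and lemma_fn are hypothetical parameters: any callables that map
    a word to its stem / lemma, for example stemmer.stem and
    lemmatizer.lemmatize from the code above.
    """
    # Reject positions outside the list instead of silently wrapping around
    # (a plain token_list[n - 1] would accept n = 0 and return the last token).
    if not 1 <= n <= len(token_list):
        raise IndexError(f"n must be between 1 and {len(token_list)}, got {n}")

    token = token_list[n - 1]  # convert the 1-based position to a 0-based index
    return {"original": token, "stem": stem_fn(token), "lemma": lemma_fn(token)}
```

With the objects defined above it would be called as `find_nth_roots(["said", "talked", "walked"], 2, stemmer.stem, lemmatizer.lemmatize)`, which picks the second token ("talked") rather than the third.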

You can do what you want with spaCy, as shown below. (In many cases spaCy performs better than nltk.)

# $ pip install -U spacy
# $ python -m spacy download en_core_web_sm

import spacy
from nltk import WordNetLemmatizer, SnowballStemmer

sp = spacy.load('en_core_web_sm')
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")


words = ['compute', 'computer', 'computed', 'computing', 'said', 'talked', 'walked']
for word in words:
    print(f'Original Word : {word}')
    print(f'Stemmer with nltk : {stemmer.stem(word)}')
    print(f'Lemmatization with nltk : {lemmatizer.lemmatize(word)}')

    sp_word = sp(word)
    print(f'Lemmatization with spacy : {sp_word[0].lemma_}')

Output:

Original Word : compute
Stemmer with nltk : comput
Lemmatization with nltk : compute
Lemmatization with spacy : compute
Original Word : computer
Stemmer with nltk : comput
Lemmatization with nltk : computer
Lemmatization with spacy : computer
Original Word : computed
Stemmer with nltk : comput
Lemmatization with nltk : computed
Lemmatization with spacy : compute
Original Word : computing
Stemmer with nltk : comput
Lemmatization with nltk : computing
Lemmatization with spacy : compute
Original Word : said
Stemmer with nltk : said
Lemmatization with nltk : said
Lemmatization with spacy : say
Original Word : talked
Stemmer with nltk : talk
Lemmatization with nltk : talked
Lemmatization with spacy : talk
Original Word : walked
Stemmer with nltk : walk
Lemmatization with nltk : walked
Lemmatization with spacy : walk
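
The nltk lemmas above stay unchanged for the verb forms because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed. If no POS tagger is at hand, one rough workaround is to try several parts of speech and keep the first lemma that differs from the input. The sketch below assumes exactly that; `lemmatize_any_pos` and its `pos_order` parameter are hypothetical names, and the heuristic can pick the wrong reading for ambiguous words.

```python
def lemmatize_any_pos(word, lemmatize, pos_order=("v", "n", "a", "r")):
    """Try each part of speech in turn and return the first lemma that changes the word.

    `lemmatize` is assumed to be a callable with the same shape as
    WordNetLemmatizer().lemmatize, i.e. lemmatize(word, pos=...) -> str.
    The codes "v", "n", "a", "r" are the WordNet tags for verb, noun,
    adjective and adverb.
    """
    for pos in pos_order:
        lemma = lemmatize(word, pos=pos)
        if lemma != word:
            # The lemmatizer found a base form under this part of speech.
            return lemma
    # No part of speech produced a different form; keep the word as it is.
    return word
```

With the `lemmatizer` from the code above it would be called as `lemmatize_any_pos("walked", lemmatizer.lemmatize)`, which tries the verb reading first. Tagging the tokens properly (for example with `nltk.pos_tag`) and passing the matching `pos` is likely to be more reliable than this fallback.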