Stemming and lemmatizing words
I have a text document that I need to stem and lemmatize. I've already cleaned the data, tokenized it, and removed the stopwords.
What I need to do is take the list as input and return a dictionary. The dictionary should have the keys 'original', 'stem' and 'lemma', and the values should be the nth word transformed in that way.
The Snowball stemmer is defined as Stemmer()
and the WordNetLemmatizer is defined as lemmatizer()
Here is the code I wrote, but it gives me an error:
def find_roots(token_list, n):
    n = 2
    original = tokens
    stem = [ele for sub in original for idx, ele in
            enumerate(sub.split()) if idx == (n - 1)]
    stem = stemmer(stem)
    lemma = [ele for sub in original for idx, ele in
             enumerate(sub.split()) if idx == (n - 1)]
    lemma = lemmatizer()
    return
Any help would be greatly appreciated.
I don't really understand what you're trying to do in those list comprehensions, so I'll just write down how I would do it:
from nltk import WordNetLemmatizer, SnowballStemmer
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")
def find_roots(token_list, n):
    token = token_list[n]
    stem = stemmer.stem(token)
    lemma = lemmatizer.lemmatize(token)
    return {"original": token, "stem": stem, "lemma": lemma}
roots_dict = find_roots(["said", "talked", "walked"], n=2)
print(roots_dict)
> {'original': 'walked', 'stem': 'walk', 'lemma': 'walked'}
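Note that the lemma for 'walked' comes back unchanged: WordNetLemmatizer treats every word as a noun unless you pass a part-of-speech tag. A minimal sketch of the fix, passing pos="v" so verb inflections are reduced to their base form:
from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# By default lemmatize() assumes the word is a noun, so verb
# inflections like "walked" pass through untouched.
print(lemmatizer.lemmatize("walked"))           # walked
# Telling it the word is a verb ("v") gives the expected lemma.
print(lemmatizer.lemmatize("walked", pos="v"))  # walk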
You can do what you want with spacy as follows (in many cases spacy performs better than nltk):
# $ pip install -U spacy
import spacy
from nltk import WordNetLemmatizer, SnowballStemmer
sp = spacy.load('en_core_web_sm')
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")
words = ['compute', 'computer', 'computed', 'computing', 'said', 'talked', 'walked']
for word in words:
    print(f'Original Word : {word}')
    print(f'Stemmer with nltk : {stemmer.stem(word)}')
    print(f'Lemmatization with nltk : {lemmatizer.lemmatize(word)}')
    sp_word = sp(word)
    print(f'Lemmatization with spacy : {sp_word[0].lemma_}')
Output:
Original Word : compute
Stemmer with nltk : comput
Lemmatization with nltk : compute
Lemmatization with spacy : compute
Original Word : computer
Stemmer with nltk : comput
Lemmatization with nltk : computer
Lemmatization with spacy : computer
Original Word : computed
Stemmer with nltk : comput
Lemmatization with nltk : computed
Lemmatization with spacy : compute
Original Word : computing
Stemmer with nltk : comput
Lemmatization with nltk : computing
Lemmatization with spacy : compute
Original Word : said
Stemmer with nltk : said
Lemmatization with nltk : said
Lemmatization with spacy : say
Original Word : talked
Stemmer with nltk : talk
Lemmatization with nltk : talked
Lemmatization with spacy : talk
Original Word : walked
Stemmer with nltk : walk
Lemmatization with nltk : walked
Lemmatization with spacy : walk
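To plug spacy into the original task, you can keep the same find_roots contract and only swap out the lemmatizer. A sketch under the assumption that en_core_web_sm is installed (find_roots_spacy is a hypothetical name mirroring the function above):
import spacy
from nltk import SnowballStemmer

sp = spacy.load('en_core_web_sm')
stemmer = SnowballStemmer("english")

def find_roots_spacy(token_list, n):
    # Pick the nth token, stem it with nltk, lemmatize it with spacy.
    token = token_list[n]
    stem = stemmer.stem(token)
    lemma = sp(token)[0].lemma_
    return {"original": token, "stem": stem, "lemma": lemma}

print(find_roots_spacy(["said", "talked", "walked"], n=2))
# > {'original': 'walked', 'stem': 'walk', 'lemma': 'walk'}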