Lemmatization of words using spacy and nltk not giving correct lemma
I want to get the lemmatized form of each word in the list given below:
(e.g.)
words = ['Funnier','Funniest','mightiest','tighter']
When I use spaCy:
import spacy
nlp = spacy.load('en')
words = ['Funnier','Funniest','mightiest','tighter','biggify']
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for items in doc:
    print(items.lemma_)
the lemmas I get are as follows:
Funnier
Funniest
mighty
tight
When I use the nltk WordNetLemmatizer:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token))
I get:
Funnier --> Funnier
Funniest --> Funniest
mightiest --> mightiest
tighter --> tighter
Can anyone help?
Thanks.
Lemmatization depends entirely on the part-of-speech (POS) tag you use when looking up the lemma of a particular word.
import nltk
from nltk.stem import WordNetLemmatizer

# (run nltk.download('punkt') once if the tokenizer data is missing)
lemmatizer = WordNetLemmatizer()

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"
# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best
The code above is a simple example of how to use the WordNet lemmatizer on words and sentences.
Notice that it did not do a good job: 'are' was not converted to 'be' and 'hanging' was not converted to 'hang' as expected. This can be corrected by passing the right part-of-speech (POS) tag as the second argument to lemmatize().
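As a minimal sketch of that fix (my addition, not part of the original answer; get_wordnet_pos is a hypothetical helper name): tag the sentence with nltk.pos_tag, map the Penn Treebank tags to WordNet's POS constants, and pass the mapped tag to lemmatize():

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (from nltk.pos_tag) to WordNet POS constants;
    # WordNet's own default is NOUN.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

sentence = "The striped bats are hanging on their feet for best"
# (run nltk.download('averaged_perceptron_tagger') once if the tagger data is missing)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(' '.join(lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged))
#> 'are' should now come back as 'be' and 'hanging' as 'hang'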
Sometimes the same word can have more than one lemma, depending on its meaning/context:
print(lemmatizer.lemmatize("stripes", 'v'))
#> strip
print(lemmatizer.lemmatize("stripes", 'n'))
#> stripe
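To see every reading at once, you can loop over the four WordNet POS constants; this is a small sketch of my own, reusing the lemmatizer defined above:

from nltk.corpus import wordnet

for pos in (wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV):
    # the constants are the one-letter strings 'n', 'v', 'a', 'r'
    print(pos, '->', lemmatizer.lemmatize('stripes', pos))
#> the noun and verb readings differ ('stripe', 'strip'); the adjective
#> and adverb readings should come back unchanged as 'stripes'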
For the example above, specify the corresponding POS tag:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token, wordnet.ADJ_SAT))
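One caveat worth flagging (my addition, not part of the original answer): WordNet lookups are case-sensitive, so capitalized tokens such as 'Funnier' may come back unchanged even with the right POS tag. Lowercasing before the lookup avoids that; a minimal sketch (wordnet.ADJ is used here, and the answer's wordnet.ADJ_SAT should behave the same for this lookup):

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    # lowercase before the WordNet lookup; 'biggify' is not in WordNet,
    # so it should come back unchanged
    print(token + ' --> ' + lemmatizer.lemmatize(token.lower(), wordnet.ADJ))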