How to use a Stemmer or Lemmatizer to stem a specific word

Question

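I am currently trying to stem a large corpus (about 800k sentences). So far I have only managed basic stemming. The problem now is that I want to stem only specific words: stemming should apply only when the lemma is a substring of the original word. For example, the word apples splits into apple plus the suffix 's'. But when the lemma is not a substring, the word should not be split, as with teeth, whose lemma is tooth.

I have also read about the WordNet lemmatizer, where we can pass a pos parameter such as verb, noun, or adjective. Is there a way to apply the method above?

Thanks in advance!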

Answer 1

Here is a complete example:

import nltk
from nltk.corpus import wordnet
from difflib import get_close_matches as gcm
from itertools import chain
from nltk.stem.porter import PorterStemmer

texts = [ " apples are good. My teeth will fall out.",
          " roses are red. cars are great to have"]

lmtzr = nltk.WordNetLemmatizer()
stemmer = PorterStemmer()

for text in texts:
    tokens = nltk.word_tokenize(text) # ideally, sentence-tokenize the text first
    # Take your pick here between the lemmatizer and the WordNet synsets.
    token_lemma = [ lmtzr.lemmatize(token) for token in tokens ]
    # For each token, the synset lemma names closest to it by string similarity.
    wn_lemma = [ gcm(word, list(set(chain.from_iterable(s.lemma_names() for s in wordnet.synsets(word))))) for word in tokens ]
    #print(wn_lemma) # works for unconventional words like 'teeth' --> tooth. You might want to take a closer look
    # Stem only when the lemma is shorter than the token; otherwise keep the lemma.
    tokens_final = [ stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i] for i in range(len(tokens)) ]
    print(tokens_final)

Output

['appl', 'are', 'good', '.', 'My', 'teeth', 'will', 'fall', 'out', '.']
['rose', 'are', 'red', '.', 'car', 'are', 'great', 'to', 'have']

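Explanation

Note the expression stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]: this is where the magic happens. If the lemma is shorter than the original token, which is used here as a proxy for the lemma being a substring of the token, the token is stemmed; otherwise the lemma is kept as is.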
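The length comparison only approximates the substring condition from the question. A minimal sketch of an explicit check follows; choose_form and toy_stem are hypothetical names introduced for illustration, and toy_stem is a stand-in for PorterStemmer().stem so the snippet runs without NLTK.

```python
def choose_form(token, lemma, stem):
    """Stem the token only when its lemma is a proper substring of it.

    `stem` is any callable mapping a word to its stem, for example
    PorterStemmer().stem (assumed here, not required by this sketch).
    """
    if lemma != token and lemma in token:
        return stem(token)
    # Lemma is not contained in the token (e.g. teeth/tooth): keep the lemma.
    return lemma


def toy_stem(word):
    # Hypothetical stand-in stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") else word


print(choose_form("apples", "apple", toy_stem))
print(choose_form("teeth", "tooth", toy_stem))
```

Passing the stemmer as an argument keeps the decision rule separate from the choice of stemmer, so the same check works with Porter, Snowball, or any other implementation.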

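Note

The lemmatization you are attempting has some edge cases. WordNetLemmatizer is not smart enough to handle exceptional cases such as 'teeth' --> 'tooth'. In those cases you may want to look at wordnet.synsets, which may come in handy.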
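On the pos parameter raised in the question: WordNetLemmatizer.lemmatize(word, pos) treats every word as a noun unless told otherwise, so passing the part of speech can change the result. A rough sketch of one way to supply it, assuming the tokens have been tagged with Penn Treebank tags (for example by nltk.pos_tag); penn_to_wordnet_pos and lemmatize_tagged are hypothetical helper names, not part of NLTK.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS code: 'a', 'v', 'n' or 'r'.

    Unknown tags fall back to 'n', the lemmatizer's own default.
    """
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[:1].upper(), "n")


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos) for every (word, tag) pair."""
    return [lemmatize(word, penn_to_wordnet_pos(tag)) for word, tag in tagged_tokens]


# With NLTK and its data installed, usage would look like this (assumption):
#   lmtzr = nltk.WordNetLemmatizer()
#   tagged = nltk.pos_tag(nltk.word_tokenize("My teeth will fall out."))
#   lemmas = lemmatize_tagged(tagged, lmtzr.lemmatize)
```

The mapping relies only on the first letter of the tag, which is a common shortcut but a coarse one; tags outside the four groups are simply treated as nouns.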

I have left a small case in the code, the commented-out print(wn_lemma) line, for you to investigate.
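To make that commented case easier to explore, here is a small sketch of the matching step on its own, using only difflib. The candidate list is hypothetical example data standing in for the lemma names gathered from wordnet.synsets(word); closest_lemma, the cutoff of 0.5, and the choice to drop the word itself from the candidates are assumptions of this sketch.

```python
from difflib import get_close_matches


def closest_lemma(word, candidates, cutoff=0.5):
    """Return the candidate most similar to `word`, or `word` if none is close.

    The word itself is removed from the pool so an identical entry
    does not trivially win the comparison.
    """
    pool = sorted({c for c in candidates if c != word})
    matches = get_close_matches(word, pool, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical candidates, standing in for WordNet lemma names of "teeth".
print(closest_lemma("teeth", ["teeth", "tooth", "dentition"]))
```

String similarity is only a heuristic here: it picks the candidate that looks most like the input, which is not guaranteed to be its lemma, so the results are worth checking by hand.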
