WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

Question

I am lemmatizing the TED dataset transcripts, and I noticed something strange: not all words are being lemmatized. For example,

selected -> select

which is right.

However, involved !-> involve and horsing !-> horse unless I explicitly pass the 'v' (verb) attribute.

On the Python terminal I get the right output, but not in my code:

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved', 'v')
'involve'
>>> lem.lemmatize('horsing', 'v')
'horse'

The relevant part of the code is this:

for term in LDA_Row[0].split('+'):
    w = str(term.split('*')[1])
    word = lmtzr.lemmatize(w)
    wordv = lmtzr.lemmatize(w, 'v')
    print(wordv, word)
    # if word != wordv:
    #     print(word, wordv)

The whole code is here.

What is the problem?

Answer 1

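The lemmatizer needs the correct POS tag to be accurate. If you call WordNetLemmatizer.lemmatize() with its default settings, the default tag is noun, so 'involved' and 'horsing' are looked up as nouns and come back unchanged; see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39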

To resolve the problem, always POS-tag your data before lemmatizing, e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...         lemma = word
...     else:
...         lemma = wnl.lemmatize(word, wntag)
...     print(lemma)
... 
This
be
a
foo
bar
sentence

Note that 'is -> be', i.e.

>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
'be'

To answer the question with the words from your examples:

>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print(lemma)
... 
These
sentence
involve
some
horse
around
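
One detail in the snippets above is worth noting: `tag[0].lower()` turns the Penn Treebank adjective tags (JJ, JJR, JJS) into 'j', which is not in the ['a', 'r', 'n', 'v'] list, so adjectives pass through without being lemmatized. A minimal sketch of an explicit mapping might look like the following. The names `PENN_PREFIX_TO_WORDNET`, `penn_to_wordnet` and `lemmatize_tagged` are made up for illustration and are not part of NLTK, and the lemmatizer is passed in as a plain callable so the helper itself does not depend on NLTK being installed.

```python
# Penn Treebank tag prefix -> WordNet POS letter. The letters are the same
# values as nltk.corpus.wordnet.ADJ / VERB / NOUN / ADV ('a', 'v', 'n', 'r').
# These helper names are hypothetical, not part of NLTK.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if unmapped."""
    # tag[:1] (rather than tag[0]) also tolerates an empty tag string.
    return PENN_PREFIX_TO_WORDNET.get(tag[:1].upper())


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize. Words whose tag has no WordNet
    counterpart (determiners, prepositions, ...) are returned unchanged.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, pos) if pos else word)
    return lemmas
```

With NLTK and its tagger and WordNet data installed, this could be called as `lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize)`, reusing the `wnl` and `sent` names from the session above.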

Note that WordNetLemmatizer has some quirks:

wordnet lemmatization and pos tagging in python
Python NLTK Lemmatization of the word 'further' with wordnet

Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy.

For an out-of-the-box / off-the-shelf lemmatizer solution, you can take a look at https://github.com/alvations/pywsd and at how I've used some try-excepts to catch words that are not in WordNet; see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66

python

nlp

nltk

wordnet

lemmatization