WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK
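I am lemmatizing the TED dataset transcripts, and I noticed something strange: not all words are being lemmatized. For instance,
selected -> select
which is right.
However, involved !-> involve
and horsing !-> horse
unless I explicitly pass the 'v' (verb) attribute.
In the Python terminal I get the right output, but not in my code: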
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'
The relevant part of the code is this:
for l in LDA_Row[0].split('+'):
    w=str(l.split('*')[1])
    word=lmtzr.lemmatize(w)
    wordv=lmtzr.lemmatize(w,'v')
    print wordv, word
    # if word is not wordv:
    #     print word, wordv
The entire code is here.
What is the problem?
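The lemmatizer needs the correct POS tag to be accurate. If you use the default settings of WordNetLemmatizer.lemmatize(), the default tag is noun, see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39
To resolve the problem, always POS-tag your data before lemmatizing, e.g.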
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...         lemma = word
...     else:
...         lemma = wnl.lemmatize(word, wntag)
...     print lemma
...
This
be
a
foo
bar
sentence
Note that 'is -> be', i.e.
>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
u'be'
To answer the question with words from your examples:
>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print lemma
...
These
sentence
involve
some
horse
around
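One caveat about the tag[0].lower() shortcut above: Penn Treebank adjective tags start with 'J' (JJ, JJR, JJS), not 'A', so that shortcut never passes adjectives to the lemmatizer, and it also treats the particle tag RP as an adverb. A slightly more explicit mapping could look like the sketch below. The names penn_to_wordnet and lemmatize_tagged are made up for illustration and are not part of NLTK; the lemmatizer is passed in as a plain callable so the mapping logic stays independent of it.

```python
# Sketch: map Penn Treebank tags (as returned by nltk.pos_tag) to the
# single-letter POS codes WordNet uses: 'n', 'v', 'a', 'r'.
# Helper names here are hypothetical, not NLTK API.
PENN_PREFIX_TO_WORDNET = {
    'N': 'n',  # NN, NNS, NNP, NNPS
    'V': 'v',  # VB, VBD, VBG, VBN, VBP, VBZ
    'J': 'a',  # JJ, JJR, JJS
}


def penn_to_wordnet(tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if unmapped."""
    if tag.startswith('RB'):  # RB, RBR, RBS; deliberately excludes RP
        return 'r'
    return PENN_PREFIX_TO_WORDNET.get(tag[:1])


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs, leaving unmapped words untouched.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, pos) if pos else word)
    return lemmas
```

With the names from the session above, this would be called as lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize).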
Note that there are some quirks with the WordNetLemmatizer:
- wordnet lemmatization and pos tagging in python
- Python NLTK Lemmatization of the word 'further' with wordnet
Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy.
For an out-of-the-box / off-the-shelf lemmatizer solution, you can take a look at https://github.com/alvations/pywsd and how I've made some try-excepts to catch words that are not in WordNet, see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66
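The words in the question appear to be isolated topic terms rather than running text, so a POS tagger has little context to work with. One possible fallback, sketched below, is to try each WordNet POS in turn and keep the first lemma that differs from the input. This is only a heuristic and can pick the wrong reading for ambiguous words; the function name lemmatize_any_pos and the default POS order are assumptions made for illustration, not something taken from NLTK or pywsd.

```python
def lemmatize_any_pos(word, lemmatize, pos_order=('n', 'v', 'a', 'r')):
    """Return the first lemma that differs from `word`, else `word` itself.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize. The order in `pos_order` decides
    which reading wins when several POS values change the word.
    """
    for pos in pos_order:
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    return word
```

For example, lemmatize_any_pos('involved', wnl.lemmatize) would try the noun reading first and then fall through to the verb reading.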