Why are there different Lemmatizers in the NLTK library?
>>> from nltk.stem import WordNetLemmatizer as lm1
>>> from nltk import WordNetLemmatizer as lm2
>>> from nltk.stem.wordnet import WordNetLemmatizer as lm3
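All three work the same way for me, but I just want to confirm: is there any difference in what they provide?
No, they are not different. All three names refer to the same class.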
>>> from nltk.stem import WordNetLemmatizer as lm1
>>> from nltk import WordNetLemmatizer as lm2
>>> from nltk.stem.wordnet import WordNetLemmatizer as lm3
>>> lm1 == lm2
True
>>> lm2 == lm3
True
>>> lm1 == lm3
True
As erip corrected, the reason this happens is:
The class (WordNetLemmatizer) is originally written in nltk.stem.wordnet, so you can do from nltk.stem.wordnet import WordNetLemmatizer as lm3
It is also imported in nltk's __init__.py file, so you can do from nltk import WordNetLemmatizer as lm2
And it is also imported in the __init__.py of the nltk.stem module, so you can do from nltk.stem import WordNetLemmatizer as lm1
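The same re-export pattern exists in the standard library, so it can be checked without NLTK installed. The sketch below uses json.JSONDecoder purely as a stand-in (it is not part of the original question), and describe is a hypothetical helper name. It compares with `is`, which tests object identity, and reads `__module__` to see where the class was actually defined. The same two checks should apply to lm1, lm2 and lm3.

```python
# Stand-in example: json re-exports JSONDecoder from json.decoder,
# much like nltk and nltk.stem re-export WordNetLemmatizer.
from json import JSONDecoder as d1
from json.decoder import JSONDecoder as d2


def describe(cls):
    """Return (module where the class was defined, class name)."""
    return cls.__module__, cls.__qualname__


# `is` checks that both names are bound to the very same class object.
print(d1 is d2)      # True
# __module__ points at the defining module, not the one it was imported from.
print(describe(d1))  # ('json.decoder', 'JSONDecoder')
```

With NLTK, lm1.__module__ would be expected to name the module that defines the class, regardless of which of the three imports was used.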
Short answer: they are all the same.
The inspect module is a useful tool for checking whether objects are the same:
>>> import inspect
>>> from nltk.stem import WordNetLemmatizer as wnl1
>>> from nltk.stem.wordnet import WordNetLemmatizer as wnl2
>>> inspect.getfile(wnl1)
'/Library/Python/2.7/site-packages/nltk/stem/wordnet.pyc'
# They come from the same file:
>>> inspect.getfile(wnl1) == inspect.getfile(wnl2)
True
>>> print(inspect.getdoc(wnl1))
WordNet Lemmatizer
Lemmatize using WordNet's built-in morphy function.
Returns the input word unchanged if it cannot be found in WordNet.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> print(wnl.lemmatize('dogs'))
dog
>>> print(wnl.lemmatize('churches'))
church
>>> print(wnl.lemmatize('aardwolves'))
aardwolf
>>> print(wnl.lemmatize('abaci'))
abacus
>>> print(wnl.lemmatize('hardrock'))
hardrock
You can also look at the source code:
>>> print(inspect.getsource(wnl1))
class WordNetLemmatizer(object):
    """
    WordNet Lemmatizer
    Lemmatize using WordNet's built-in morphy function.
    Returns the input word unchanged if it cannot be found in WordNet.
        >>> from nltk.stem import WordNetLemmatizer
        >>> wnl = WordNetLemmatizer()
        >>> print(wnl.lemmatize('dogs'))
        dog
        >>> print(wnl.lemmatize('churches'))
        church
        >>> print(wnl.lemmatize('aardwolves'))
        aardwolf
        >>> print(wnl.lemmatize('abaci'))
        abacus
        >>> print(wnl.lemmatize('hardrock'))
        hardrock
    """
    def __init__(self):
        pass
    def lemmatize(self, word, pos=NOUN):
        lemmas = wordnet._morphy(word, pos)
        return min(lemmas, key=len) if lemmas else word
    def __repr__(self):
        return '<WordNetLemmatizer>'
# They have the same source code too:
>>> print(inspect.getsource(wnl1) == inspect.getsource(wnl2))
True
The NLTK import structure for WordNetLemmatizer looks like this:
\nltk
    __init__.py
    \stem
        __init__.py
        wordnet.py  # This is where the WordNetLemmatizer code resides.
We start at the lowest level, where WordNetLemmatizer resides, in nltk.stem.wordnet.py (https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L15), so you can do:
from nltk.stem.wordnet import WordNetLemmatizer
In nltk.stem.__init__.py there is an import (https://github.com/nltk/nltk/blob/develop/nltk/stem/__init__.py#L30) that gives nltk.stem access to WordNetLemmatizer, so you can do:
from nltk.stem import WordNetLemmatizer
In nltk.__init__.py, we see:
from nltk.stem import *
This gives the top-level nltk import access to everything that nltk.stem has access to. So at the top level of nltk, we can do:
from nltk import WordNetLemmatizer
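To see the three-level chain in isolation, here is a minimal sketch that builds a throwaway package with the same layout in a temporary directory and imports the class through each level. Everything named demo_pkg or DemoLemmatizer is hypothetical and only mirrors the shape of nltk / nltk.stem / nltk.stem.wordnet; it is not NLTK's actual code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root):
    """Write demo_pkg/stem/wordnet.py plus the two re-exporting __init__.py files."""
    stem = root / "demo_pkg" / "stem"
    stem.mkdir(parents=True)
    # Lowest level: the class is defined here, and only here.
    (stem / "wordnet.py").write_text("class DemoLemmatizer:\n    pass\n")
    # Middle level: the sub-package re-exports the class.
    (stem / "__init__.py").write_text(
        "from demo_pkg.stem.wordnet import DemoLemmatizer\n"
    )
    # Top level: a star import pulls in everything the sub-package exposes.
    (root / "demo_pkg" / "__init__.py").write_text("from demo_pkg.stem import *\n")


def load_three_ways():
    """Import the class via each level and return the three bound objects."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            from demo_pkg import DemoLemmatizer as top
            from demo_pkg.stem import DemoLemmatizer as mid
            from demo_pkg.stem.wordnet import DemoLemmatizer as low
        finally:
            sys.path.remove(tmp)
    return top, mid, low


top, mid, low = load_three_ways()
print(top is mid is low)  # True
print(top.__module__)     # demo_pkg.stem.wordnet
```

Each `__init__.py` only binds an extra name to the existing class object, so no copy is made at any level.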
One thing to note, though: it is NOT always the case that objects/modules with the same name in NLTK refer to the same object, e.g.:
>>> from nltk.corpus import wordnet as wn1
>>> from nltk.corpus.reader import wordnet as wn2
>>> wn1 == wn2
False
>>> wn1.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> wn2.synsets('dog')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'synsets'
The first wordnet, wn1, is a LazyCorpusLoader object that opens the wordnet files in nltk_data and lets you access the synsets: https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L246
The second, wn2, is the wordnet.py file itself, which resides in nltk.corpus.reader.wordnet.py: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py
It gets even trickier in the following case:
>>> from nltk.corpus import wordnet as wn1
>>> from nltk.corpus.reader import wordnet as wn2
>>> from nltk.stem import wordnet as wn3
>>> wn3 == wn1
False
>>> wn3 == wn2
False
In the case of wn3, it refers to the file nltk.stem.wordnet.py, which contains WordNetLemmatizer and has nothing to do with the wordnet corpus object or the corpus reader for wordnet.
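When two imports share a name, a quick way to tell them apart is to ask what kind of object each name is bound to. The sketch below uses the standard library's datetime (a module and a class with the same name) as a stand-in, since it runs without NLTK; kind is a hypothetical helper, not an NLTK or inspect function.

```python
import inspect
import datetime as dt_module
from datetime import datetime as dt_class


def kind(obj):
    """Classify an object as a module, a class, or an instance of some type."""
    if inspect.ismodule(obj):
        return "module"
    if inspect.isclass(obj):
        return "class"
    return "instance of " + type(obj).__name__


print(kind(dt_module))        # module
print(kind(dt_class))         # class
print(dt_module is dt_class)  # False
```

Applied to the NLTK names above, this kind of check would be expected to report wn2 and wn3 as modules (two different ones), while wn1 is a corpus loader object rather than a module.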