POS-Tagger 非常慢
POS-Tagger is incredibly slow
我正在使用 nltk
通过首先删除给定的停用词从句子中生成 n-gram。然而,nltk.pos_tag()
在我的 CPU (Intel i7) 上非常慢,最多需要 0.6 秒。
输出:
['The first time I went, and was completely taken by the live jazz band and atmosphere, I ordered the Lobster Cobb Salad.']
0.620481014252
["It's simply the best meal in NYC."]
0.640982151031
['You cannot go wrong at the Red Eye Grill.']
0.644664049149
代码:
for sentence in source:
nltk_ngrams = None
if stop_words is not None:
start = time.time()
sentence_pos = nltk.pos_tag(word_tokenize(sentence))
print time.time() - start
filtered_words = [word for (word, pos) in sentence_pos if pos not in stop_words]
else:
filtered_words = ngrams(sentence.split(), n)
这真的那么慢还是我做错了什么?
使用pos_tag_sents
标记多个句子:
>>> import time
>>> from nltk.corpus import brown
>>> from nltk import pos_tag
>>> from nltk import pos_tag_sents
>>> sents = brown.sents()[:10]
>>> start = time.time(); pos_tag(sents[0]); print time.time() - start
0.934092998505
>>> start = time.time(); [pos_tag(s) for s in sents]; print time.time() - start
9.5061340332
>>> start = time.time(); pos_tag_sents(sents); print time.time() - start
0.939551115036
如果您正在寻找另一个在 Python 中具有快速性能的词性标注器,您可能想尝试 RDRPOSTagger. For example, on English POS tagging, the tagging speed is 8K words/second for a single threaded implementation in Python, using a computer of Core 2Duo 2.4GHz. You can get faster tagging speed by simply using the multi-threaded mode. RDRPOSTagger obtains very competitive accuracies in comparison to state-of-the-art taggers and now supports pre-trained models for 40 languages. See experimental results in this paper。
nltk pos_tag is defined as:
from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
tagger = PerceptronTagger()
return _pos_tag(tokens, tagset, tagger)
所以每次调用 pos_tag 都会实例化 perceptrontagger 模块,它需要大量的计算 time.You 可以通过直接调用 tagger.tag 自己来节省这次时间:
from nltk.tag.perceptron import PerceptronTagger
tagger=PerceptronTagger()
sentence_pos = tagger.tag(word_tokenize(sentence))
我正在使用 nltk
通过首先删除给定的停用词从句子中生成 n-gram。然而,nltk.pos_tag()
在我的 CPU (Intel i7) 上非常慢,最多需要 0.6 秒。
输出:
['The first time I went, and was completely taken by the live jazz band and atmosphere, I ordered the Lobster Cobb Salad.']
0.620481014252
["It's simply the best meal in NYC."]
0.640982151031
['You cannot go wrong at the Red Eye Grill.']
0.644664049149
代码:
for sentence in source:
nltk_ngrams = None
if stop_words is not None:
start = time.time()
sentence_pos = nltk.pos_tag(word_tokenize(sentence))
print time.time() - start
filtered_words = [word for (word, pos) in sentence_pos if pos not in stop_words]
else:
filtered_words = ngrams(sentence.split(), n)
这真的那么慢还是我做错了什么?
使用pos_tag_sents
标记多个句子:
>>> import time
>>> from nltk.corpus import brown
>>> from nltk import pos_tag
>>> from nltk import pos_tag_sents
>>> sents = brown.sents()[:10]
>>> start = time.time(); pos_tag(sents[0]); print time.time() - start
0.934092998505
>>> start = time.time(); [pos_tag(s) for s in sents]; print time.time() - start
9.5061340332
>>> start = time.time(); pos_tag_sents(sents); print time.time() - start
0.939551115036
如果您正在寻找另一个在 Python 中具有快速性能的词性标注器,您可能想尝试 RDRPOSTagger. For example, on English POS tagging, the tagging speed is 8K words/second for a single threaded implementation in Python, using a computer of Core 2Duo 2.4GHz. You can get faster tagging speed by simply using the multi-threaded mode. RDRPOSTagger obtains very competitive accuracies in comparison to state-of-the-art taggers and now supports pre-trained models for 40 languages. See experimental results in this paper。
nltk pos_tag is defined as:
from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
tagger = PerceptronTagger()
return _pos_tag(tokens, tagset, tagger)
所以每次调用 pos_tag 都会实例化 perceptrontagger 模块,它需要大量的计算 time.You 可以通过直接调用 tagger.tag 自己来节省这次时间:
from nltk.tag.perceptron import PerceptronTagger
tagger=PerceptronTagger()
sentence_pos = tagger.tag(word_tokenize(sentence))