NLTK - statistics count extremely slow with big corpus
I want to look at some basic statistics about my corpus, such as word/sentence counts, distributions, etc. I have a file tokens_corpus_reader_ready.txt containing 137,000 lines of tagged example sentences in this format:
Zur/APPRART Zeit/NN kostenlos/ADJD aber/KON auch/ADV nur/ADV 11/CARD kW./NN
Zur/APPRART Zeit/NN anscheinend/ADJD kostenlos/ADJD ./$.
...
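For reference, this word/TAG layout is the format NLTK's TaggedCorpusReader expects by default (each token is split on '/' into a word/tag pair); a minimal loading sketch, with a placeholder corpus directory:

from nltk.corpus.reader import TaggedCorpusReader

# '/path/to/corpus' is a placeholder directory; by default the reader
# splits each token on '/' into a (word, tag) pair.
reader = TaggedCorpusReader('/path/to/corpus', 'tokens_corpus_reader_ready.txt')

print(reader.tagged_sents()[0])
# [('Zur', 'APPRART'), ('Zeit', 'NN'), ('kostenlos', 'ADJD'), ...]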
I also have a TaggedCorpusReader() with a describe() method on it:
import time

import nltk
from nltk.corpus.reader import TaggedCorpusReader


class CSCorpusReader(TaggedCorpusReader):
    def __init__(self):
        # raw_corpus_path is defined elsewhere
        TaggedCorpusReader.__init__(self, raw_corpus_path, 'tokens_corpus_reader_ready.txt')

    def describe(self):
        """
        Performs a single pass of the corpus and
        returns a dictionary with a variety of metrics
        concerning the state of the corpus.

        Modified method from https://github.com/foxbook/atap/blob/master/snippets/ch03/reader.py
        """
        started = time.time()

        # Structures to perform counting.
        counts = nltk.FreqDist()
        tokens = nltk.FreqDist()

        # Perform single pass over paragraphs, tokenize and count
        for sent in self.sents():
            print(time.time())
            counts['sents'] += 1
            for word in self.words():
                counts['words'] += 1
                tokens[word] += 1

        return {
            'sents': counts['sents'],
            'words': counts['words'],
            'vocab': len(tokens),
            'lexdiv': float(counts['words']) / float(len(tokens)),
            'secs': time.time() - started,
        }
If I run the describe method in IPython like this:

>>> corpus = CSCorpusReader()
>>> print(corpus.describe())

there is a delay of roughly 7 seconds between each sentence:
1543770777.502544
1543770784.383989
1543770792.2057862
1543770798.992075
1543770805.819034
1543770812.599932
...
If I run the same thing with only a few sentences in tokens_corpus_reader_ready.txt, the output times are perfectly reasonable:
1543771884.739753
1543771884.74035
1543771884.7408729
1543771884.7413561
{'sents': 4, 'words': 212, 'vocab': 42, 'lexdiv': 5.0476190476190474, 'secs': 0.002869129180908203}
Where does this behavior come from, and how can I fix it?
Edit 1
By operating on lists instead of accessing the corpus itself on every iteration, the time per sentence drops to about 3 seconds, which is still far too long:
sents = list(self.sents())
words = list(self.words())

# Perform single pass over paragraphs, tokenize and count
for sent in sents:
    print(time.time())
    counts['sents'] += 1
    for word in words:
        counts['words'] += 1
        tokens[word] += 1
Here is your problem: for every sentence, you read the entire corpus with the words() method, so the total work is (number of sentences) × (number of words) token reads instead of a single pass. No wonder it takes so long.
for sent in self.sents():
    print(time.time())
    counts['sents'] += 1
    for word in self.words():  # re-reads the whole corpus for every sentence
        counts['words'] += 1
        tokens[word] += 1
In fact, a sentence is already tokenized into words, so this is what you meant:
for sent in self.sents():
    print(time.time())
    counts['sents'] += 1
    for word in sent:
        counts['words'] += 1
        tokens[word] += 1
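For completeness, a minimal self-contained sketch of the corrected single-pass count, run over a tiny in-memory sample instead of the reader (the sample sentences are invented; each one mirrors the list of word tokens that sents() yields):

import time

import nltk

# Tiny stand-in for corpus.sents(): each sentence is already a list of tokens.
sample_sents = [
    ['Zur', 'Zeit', 'kostenlos', 'aber', 'auch', 'nur', '11', 'kW.'],
    ['Zur', 'Zeit', 'anscheinend', 'kostenlos', '.'],
]

started = time.time()
counts = nltk.FreqDist()
tokens = nltk.FreqDist()

# Single pass: iterate only the words of the current sentence.
for sent in sample_sents:
    counts['sents'] += 1
    for word in sent:
        counts['words'] += 1
        tokens[word] += 1

print({
    'sents': counts['sents'],
    'words': counts['words'],
    'vocab': len(tokens),
    'lexdiv': counts['words'] / len(tokens),
    'secs': time.time() - started,
})

Each word is now visited exactly once, so the runtime is linear in the corpus size instead of quadratic.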