Create Dictionary from Penn Treebank Corpus sample from NLTK?
I know the Treebank corpus is already tagged, but unlike with the Brown corpus, I can't figure out how to get a dictionary of tags. For example:
>>> import nltk
>>> from nltk.corpus import brown
>>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())
Does this approach not work for the Treebank corpus?
Quick solution:
>>> from nltk.corpus import treebank
>>> from nltk import ConditionalFreqDist as cfd
>>> from itertools import chain
>>> treebank_tagged_words = list(chain(*(tree.pos() for tree in treebank.parsed_sents())))
>>> wordcounts = cfd(treebank_tagged_words)
>>> treebank_tagged_words[0]
(u'Pierre', u'NNP')
>>> wordcounts[u'Pierre']
FreqDist({u'NNP': 1})
>>> treebank_tagged_words[100]
(u'asbestos', u'NN')
>>> wordcounts[u'asbestos']
FreqDist({u'NN': 11})
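Since each value in the ConditionalFreqDist is a FreqDist, you can query it directly; continuing the session above, FreqDist.max() gives the most frequent tag for a word and FreqDist.N() its total count:
>>> wordcounts[u'asbestos'].max()
u'NN'
>>> wordcounts[u'asbestos'].N()
11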
For more details, see https://en.wikipedia.org/wiki/User:Alvations/NLTK_cheatsheet/CorporaReaders#Penn_Tree_Bank
Note that the Penn Treebank sample from NLTK contains only a little over 3,000 sentences, while the Brown corpus has around 50,000 sentences.
To split the tagged sentences into a training set and a test set:
from nltk.corpus import treebank
from nltk import ConditionalFreqDist as cfd
from itertools import chain

# Each sentence becomes a list of (word, tag) pairs.
treebank_tagged_sents = [tree.pos() for tree in treebank.parsed_sents()]

total_len = len(treebank_tagged_sents)
train_len = int(90 * total_len / 100)  # 90% train / 10% test split

train_set = treebank_tagged_sents[:train_len]
print(len(train_set))
train_treebank_tagged_words = cfd(chain(*train_set))

test_set = treebank_tagged_sents[train_len:]
print(len(test_set))
test_treebank_tagged_words = cfd(chain(*test_set))
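As a quick sanity check on the split (a minimal sketch using the variables above), conditions() lists the distinct words each ConditionalFreqDist has seen:
# Distinct word types observed in each portion of the split.
print(len(train_treebank_tagged_words.conditions()))
print(len(test_treebank_tagged_words.conditions()))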
If you are going to use the Brown corpus (which does not include parsed sentences), you can use tagged_sents():
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents()
>>> len(brown_tagged_sents)
57340
>>> brown_tagged_sents[0]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> total_len = len(brown_tagged_sents)
>>> train_len = int(90 * total_len/100)
>>> train_set = brown_tagged_sents[:train_len]
>>> train_brown_tagged_words = cfd(chain(*train_set))
>>> train_brown_tagged_words['asbestos']
FreqDist({u'NN': 1})
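Because the split is at the sentence level, some test words will never appear in training; a minimal sketch (reusing the session above and defining the held-out portion the same way as for the Treebank) that measures this out-of-vocabulary rate:
>>> test_set = brown_tagged_sents[train_len:]
>>> test_words = [word for word, tag in chain(*test_set)]
>>> oov_rate = sum(w not in train_brown_tagged_words for w in test_words) / float(len(test_words))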
As @alexis noted, unless you need to split the corpus at the sentence level, the tagged_words() function is also available in NLTK's Penn Treebank API:
>>> from nltk.corpus import treebank
>>> from nltk.corpus import brown
>>> treebank.tagged_words()
[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), ...]
>>> brown.tagged_words()
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]
>>> type(treebank.tagged_words())
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
>>> type(brown.tagged_words())
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
>>> from nltk import ConditionalFreqDist as cfd
>>> cfd(brown.tagged_words())
<ConditionalFreqDist with 56057 conditions>
>>> cfd(treebank.tagged_words())
<ConditionalFreqDist with 12408 conditions>
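Finally, if what you want is a plain dictionary mapping each word to its most likely tag, you can collapse the ConditionalFreqDist with FreqDist.max() (a minimal sketch; the asbestos lookup matches the counts shown earlier):
>>> wordcounts = cfd(treebank.tagged_words())
>>> tag_dict = {word: wordcounts[word].max() for word in wordcounts.conditions()}
>>> tag_dict[u'asbestos']
u'NN'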