Read my own dataset for NLTK Part of Speech tagging using PerceptronTagger
I am new to NLTK and still fairly new to Python. I want to train and test NLTK's perceptron tagger with my own dataset. The training and test data have the following format (each is just saved in a txt file):
Pierre NNP
Vinken NNP
, ,
61 CD
years NNS
old JJ
, ,
will MD
join VB
the DT
board NN
as IN
a DT
nonexecutive JJ
director NN
Nov. NNP
29 CD
. .
I want to call these functions on the data:
perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
perceptron_tagger.train(train_data)
accuracy = perceptron_tagger.evaluate(test_data)
I have tried a few things, but I just cannot figure out what format the data should be in. Any help would be greatly appreciated! Thanks.
The train() and evaluate() functions of PerceptronTagger expect a list of lists of tuples, where each inner list is one sentence and each tuple is a (word, tag) pair of strings.
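For example, the Penn Treebank snippet from the question would become a single inner list (a minimal sketch; train_data is just an illustrative name):

train_data = [
    # one inner list per sentence, one (word, tag) tuple per token
    [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
     ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
     ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
     ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
     ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')],
]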
Given train.txt and test.txt:
$ cat train.txt
This foo
is foo
a foo
sentence bar
. .
That foo
is foo
another foo
sentence bar
in foo
conll bar
format bar
. .
$ cat test.txt
What foo
is foo
this foo
sentence bar
? ?
How foo
about foo
that foo
sentence bar
? ?
Read the CoNLL-format files into lists of tuples:
# Using https://github.com/alvations/lazyme
# Note: split('\t') assumes the columns are tab-separated;
# use token.split() instead if they are separated by spaces.
>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]
# Or otherwise
>>> def per_section(it, is_delimiter=lambda x: x.isspace()):
... """
... From
... """
... ret = []
... for line in it:
... if is_delimiter(line):
... if ret:
... yield ret # OR ''.join(ret)
... ret = []
... else:
... ret.append(line.rstrip()) # OR ret.append(line)
... if ret:
... yield ret
...
>>>
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> tagged_test_sentences
[[('What', 'foo'), ('is', 'foo'), ('this', 'foo'), ('sentence', 'bar'), ('?', '?')], [('How', 'foo'), ('about', 'foo'), ('that', 'foo'), ('sentence', 'bar'), ('?', '?')]]
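If you'd rather not depend on lazyme or define the generator yourself, splitting the whole file on blank lines works too (a minimal sketch, assuming tab-separated columns and one blank line between sentences; read_conll is just an illustrative name):

>>> def read_conll(path):
...     # Split the file into sentence sections on blank lines,
...     # then split each line into a (word, tag) tuple.
...     with open(path) as fin:
...         sections = fin.read().strip().split('\n\n')
...     return [[tuple(line.split('\t')) for line in section.splitlines()]
...             for section in sections]
...
>>> tagged_train_sentences = read_conll('train.txt')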
Now you can train/evaluate the tagger:
>>> from lazyme import per_section
>>> tagged_train_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('train.txt'))]
>>> from nltk.tag.perceptron import PerceptronTagger
>>> pct = PerceptronTagger(load=False)
>>> pct.train(tagged_train_sentences)
>>> pct.tag('Where do I find a foo bar sentence ?'.split())
[('Where', 'foo'), ('do', 'foo'), ('I', '.'), ('find', 'foo'), ('a', 'foo'), ('foo', 'bar'), ('bar', 'foo'), ('sentence', 'bar'), ('?', '.')]
>>> tagged_test_sentences = [[tuple(token.split('\t')) for token in sent] for sent in per_section(open('test.txt'))]
>>> pct.evaluate(tagged_test_sentences)
0.8
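Note that in recent NLTK releases TaggerI.evaluate() is deprecated in favor of accuracy(); if you get a DeprecationWarning, the same computation runs under the new name:

>>> accuracy = pct.accuracy(tagged_test_sentences)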