Capture words and rewrite
I built a word classifier with nlpnet (http://nilc.icmc.usp.br/nlpnet/index.html). The goal is to extract individual words using the given tagger.
Code
import nlpnet
import codecs
import itertools

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')

def TAGGER_txt(text):
    return (list(TAGGER.tag(text)))

with codecs.open('document.txt', encoding='utf8') as original_file:
    with codecs.open('document_teste.txt', 'w') as output_file:
        for line in original_file.readlines():
            print (line)
            words = TAGGER_txt(line)
            # Flatten the per-sentence lists into one list of (word, tag) pairs
            all_words = list(itertools.chain(*words))
            # 'N' is the noun tag; the result below lists nouns
            nouns = [word[0] for word in all_words if word[1]=='N']
            print (nouns)
Result
O gato esta querendo comer o ratão
['gato', 'ratão']
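I think this may be the essence of what you need. Please see the edited version.
As you said in your question, tagging Sentence gives a result like tagged. If you only want the nouns from Sentence, you can recover them with the expression that follows nouns =.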
Sentence = " O gato esta querendo comer o rato "
tagged = [('O', 'ADJ'), ('gato', 'N'), ('esta', 'V'), ('querendo', 'V'), ('comer', 'V'), ('o', 'ADJ'), ('rato', 'N')]
nouns = [t[0] for t in tagged if t[1]=='N']
print (nouns)
Output:
['gato', 'rato']
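The same filter can be wrapped in a small helper so that the tag becomes a parameter. This is only a minimal sketch, assuming the tagged list has the shape shown above; words_with_tag is a hypothetical name and not part of nlpnet.

```python
def words_with_tag(tagged, tag):
    """Return the words in `tagged` whose POS tag equals `tag`."""
    return [word for word, pos in tagged if pos == tag]


# Tagged sentence copied from the example above.
tagged = [('O', 'ADJ'), ('gato', 'N'), ('esta', 'V'), ('querendo', 'V'),
          ('comer', 'V'), ('o', 'ADJ'), ('rato', 'N')]

nouns = words_with_tag(tagged, 'N')
verbs = words_with_tag(tagged, 'V')
print(nouns)
print(verbs)
```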
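Edit: It wasn't clear to me what you wanted. Here is another possibility.
- I have not installed nlpnet, because that would be a hassle and I would not use it myself.
- I simulate TAGGER.tag with TAGGER_txt.
- I changed the encoding to Latin-1. It is used in the header and in codecs.open.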
# -*- coding: Latin-1 -*-
import codecs
import itertools

def TAGGER_txt(text):  # simulate TAGGER.tag
    return [[(u'O', u'ART'), (u'gato', u'N'), (u'esta', u'PROADJ'), (u'querendo', u'V'), (u'comer', u'V'), (u'o', u'ART'), (u'ratão', u'N')]]

with codecs.open('document.txt', encoding='Latin-1') as original_file:
    with codecs.open('document_test.txt', 'w') as output_file:
        for line in original_file.readlines():
            print (line)
            words = TAGGER_txt(line)
            # Flatten the per-sentence lists into one list of (word, tag) pairs
            all_words = list(itertools.chain(*words))
            nouns = [word[0] for word in all_words if word[1]=='N']
            print (nouns)
Output:
O gato esta querendo comer o ratão
['gato', 'ratão']
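The code above opens an output file but only prints the nouns. If the goal is to actually rewrite them to that file, something along these lines might work. It is a sketch under assumptions: fake_tagger is a hypothetical stand-in that mimics the shape returned by TAGGER.tag (one list of (word, tag) pairs per sentence), write_nouns is an invented name, and UTF-8 is assumed for both files.

```python
# -*- coding: utf-8 -*-
import io
import itertools


def fake_tagger(text):
    # Hypothetical stand-in for TAGGER.tag: ignores `text` and returns
    # one list of (word, tag) pairs per sentence.
    return [[(u'O', u'ART'), (u'gato', u'N'), (u'esta', u'PROADJ'),
             (u'querendo', u'V'), (u'comer', u'V'), (u'o', u'ART'),
             (u'ratão', u'N')]]


def write_nouns(in_path, out_path, tagger, encoding='utf-8'):
    """For each input line, write its nouns (space-separated) to out_path."""
    with io.open(in_path, encoding=encoding) as src, \
            io.open(out_path, 'w', encoding=encoding) as dst:
        for line in src:
            # Flatten the per-sentence lists into a stream of (word, tag) pairs.
            tagged = itertools.chain.from_iterable(tagger(line))
            nouns = [word for word, tag in tagged if tag == u'N']
            dst.write(u' '.join(nouns) + u'\n')
```

Calling write_nouns('document.txt', 'document_test.txt', fake_tagger) writes one line of nouns per input line; passing TAGGER.tag instead of fake_tagger should work the same way if the real tagger returns the structure assumed here.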
Question: ... dump to a file the sentences that contain more than N occurrences of a particular POS
Note: Assuming 'document.txt' contains one sentence per line!
import nlpnet

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')

def is_worth_saving(tags, pos, pos_count):
    """
    :param tags: nlpnet tags from ONE Sentence
    :param pos: The POS to filter
    :param pos_count: Minimum number of occurrences of 'pos'
    :return:
        True if 'tags' contain at least 'pos_count' occurrences of 'pos'
        False otherwise
    """
    pos_found = 0
    # Iterate tags
    for word, _pos in tags:
        if _pos == pos:
            pos_found += 1
    return pos_found >= pos_count

if __name__ == '__main__':
    with open('document.txt') as in_fh, open('document_test.txt', 'w') as out_fh:
        for sentence in in_fh:
            print('Sentence:{}'.format(sentence[:-1]))
            tags = TAGGER.tag(sentence)
            # As your Example Sentence has only **2** Verbs,
            # pass 'pos_count=2'
            if is_worth_saving(tags[0], 'V', 2):
                out_fh.write(sentence)
                print (tags[0])
Output:
Sentence:O gato esta querendo comer o ratão
[(u'O', u'ART'), (u'gato', u'N'), (u'esta', u'PROADJ'), (u'querendo', u'V'), (u'comer', u'V'), (u'o', u'ART'), (u'rat\xe3o', u'N')]
Tested with Python: 3.4.2 and 2.7.9
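As a possible variation, the manual counter can be replaced with collections.Counter, which also gives the counts for every tag at once. This is a sketch only: pos_counts and has_at_least are hypothetical names, and the tags list is the simulated tagger output used earlier in this thread rather than a real nlpnet result.

```python
# -*- coding: utf-8 -*-
from collections import Counter


def pos_counts(tags):
    """Count how often each POS tag occurs in one tagged sentence."""
    return Counter(pos for _word, pos in tags)


def has_at_least(tags, pos, minimum):
    """True if `pos` occurs at least `minimum` times in `tags`."""
    # A Counter returns 0 for tags that never occur, so no KeyError here.
    return pos_counts(tags)[pos] >= minimum


# Simulated tags for ONE sentence (assumed shape of TAGGER.tag(sentence)[0]).
tags = [(u'O', u'ART'), (u'gato', u'N'), (u'esta', u'PROADJ'),
        (u'querendo', u'V'), (u'comer', u'V'), (u'o', u'ART'),
        (u'ratão', u'N')]

counts = pos_counts(tags)
print(counts[u'V'])
print(has_at_least(tags, u'V', 2))
```

Strict "more than N" filtering, as worded in the question, would use > instead of >= in has_at_least.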