Word-level POS tagging in Python

Question

I am trying to POS-tag every word in each line (each line contains several sentences).

I have this code:

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

f = open('C:\Users\test_data.txt')
data = f.readlines()

#Parse the text file for NER with POS Tagging
for line in data:
    tokens = nltk.word_tokenize(line)
    tagged = nltk.pos_tag(tokens)
    entities = nltk.chunk.ne_chunk(tagged)
    print entities
f.close()

But the code gives one tag per line instead of one per word, and the output looks like this:

[('The apartment is brand new and pristine in its cleanliness.', 'NNP'), ('"Awesome little place in the mountains."', 'NNP'), ('Very comfortable place close to the fatima luas stop. I love this place. \njose and vadym are very welcoming and treated me very well. \nwill stay again hopefully.', 'NNP'), ('Very helpful and communicative host. Excellent location, well connected to public transport . Room was a bit too small for a couple and the lack of cupboards was sorely felt.\n\nOtherwise quite clean and well maintained.', 'NNP'), ('Everything was exactly as described. It is beautiful. ', 'NNP')]

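My code does use a tokenizer, so I don't know what is wrong with it. I need a POS tag for each word rather than for each line, but each line should still be chunked (or otherwise kept distinct), with brackets or something similar.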

Answer 1

(Pure copy-paste of what ran on my machine.)

Running your code (note the simplified import statement):

#!/usr/bin/env python3
# encoding: utf-8
import nltk
f = open('/home/matthieu/Téléchargements/testtext.txt')
data = f.readlines()

for line in data:
    tokens = nltk.word_tokenize(line)
    tagged = nltk.pos_tag(tokens)
    entities = nltk.chunk.ne_chunk(tagged)
    print(entities)
f.close()

on the following plain Unicode text file (3 lines):

(this is a first example.)(Another sentence in another parentheses.)
(onlyone in that line)
this is a second one wihtout parenthesis. (Another sentence in another parentheses.)

I get the following result:

(S
(/(
this/DT
is/VBZ
a/DT
first/JJ
example/NN
./.
)/)
(/(
Another/DT
sentence/NN
in/IN
another/DT
parentheses/NNS
./.
)/))
(S (/( onlyone/NN in/IN that/DT line/NN )/))
(S
this/DT
...

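As you can see, there is no particular problem. Are you parsing your CSV data correctly? Is CSV actually useful to you here? Have you tried with a simple text file?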
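If the data does come from a CSV file, one possible cause of the output in the question is that whole cells reach the tagger as if each were a single token, since nltk.pos_tag tags every element of the list it is given. The sketch below is only an illustration under that assumption: the helper names (read_text_column, tag_rows, tag_rows_with_nltk) and the choice of column 0 are hypothetical and do not come from the question. It tokenizes each row first and keeps one list of (word, tag) pairs per row, so the rows stay distinct.

```python
import csv
import io


def read_text_column(csv_file, column=0):
    """Yield the text cell of each non-empty CSV row (column 0 is an assumption)."""
    for row in csv.reader(csv_file):
        if row:
            yield row[column]


def tag_rows(rows, tokenize, tag):
    """Return one list of (word, tag) pairs per row, keeping rows separate."""
    tagged_rows = []
    for text in rows:
        tokens = tokenize(text)          # split the row's text into words first
        tagged_rows.append(tag(tokens))  # then tag the words, not the whole row
    return tagged_rows


def tag_rows_with_nltk(rows):
    """Same as tag_rows, wired to NLTK's tokenizer and tagger."""
    import nltk  # imported lazily; needs the tokenizer and tagger data installed
    return tag_rows(rows, nltk.word_tokenize, nltk.pos_tag)


def demo_rows():
    """Two sample rows taken from the question's output, read as CSV cells."""
    sample = io.StringIO(
        '"Everything was exactly as described. It is beautiful."\n'
        '"Awesome little place in the mountains."\n'
    )
    return list(read_text_column(sample))
```

With NLTK and its data installed, tag_rows_with_nltk(demo_rows()) should return two lists, one per row, each holding a (word, tag) pair per token. Passing the tokenizer and tagger as arguments is just a design choice that lets the row handling be checked without NLTK.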
