How to programmatically determine the part-of-speech tag of a word?

Question

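I've been wondering how to accurately determine the part-of-speech (POS) tag of a word. I've played with POS taggers such as Stanford NLP, but they are hit or miss: a word like "respond" sometimes gets tagged NN (noun) when it is actually a verb (VB).

Would it be more accurate to query WordNet or a dictionary dump? For example, the word "respond" is a verb, and it can also be a noun. Or should I perhaps infer the tag from n-grams, or add a frequency-based sanity check?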

Answer 1

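Have you tried TextBlob? A friend of mine is taking a linguistics course, and they all use it for POS tagging. It is a Python library. You can install it with the pip package manager.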

$ pip install -U textblob

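To use it:

>>> from textblob import TextBlob
>>> TextBlob("They respond quickly.").tags  # list of (word, tag) pairs

There is also a much more detailed tutorial. You can install the NLTK corpora it relies on as well. (I won't post a link; just search, there are plenty of tutorials.)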

Answer 2

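POS taggers are traditionally based on the probability distribution of words in a corpus. Applying one to a new body of text therefore usually produces a higher error rate, because the words are distributed differently.

Other models, such as neural networks, are not strictly probability distributions and need to be trained, but the same logic applies to both.

For example, if I built a POS tagger for Shakespeare texts by defining my probability distribution from tagged sentences in Hamlet, and then tried to tag biomedical texts with it, it probably would not perform well.

So, to improve accuracy, you should train on a body of text similar to your specific domain.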
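To make the domain-mismatch point concrete, here is a minimal sketch of the simplest corpus-based tagger: it assigns each word the tag it most often carried in the training data. The function names, the toy sentences, and the NOUN fallback for unknown words are all made up for illustration; real taggers also use context.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(model, words, default="NOUN"):
    """Tag each word from the model; unseen words get the default tag."""
    return [(word, model.get(word.lower(), default)) for word in words]


# Toy "literary" training data (hypothetical, hand-tagged for this example).
literary_sentences = [
    [("they", "PRON"), ("respond", "VERB"), ("well", "ADV")],
    [("the", "DET"), ("players", "NOUN"), ("respond", "VERB")],
    [("a", "DET"), ("swift", "ADJ"), ("respond", "NOUN")],
]

model = train_unigram_tagger(literary_sentences)

# In-domain text: every word was seen during training.
in_domain = tag_words(model, ["the", "players", "respond"])

# Out-of-domain text: most words are unseen and fall back to the default tag.
biomedical_words = ["patients", "respond", "to", "chemotherapy"]
out_of_domain = tag_words(model, biomedical_words)
unknown_words = [w for w in biomedical_words if w.lower() not in model]

print(in_domain)
print(out_of_domain)
print(unknown_words)
```

On the out-of-domain sentence, three of the four words were never seen in training, so "to" is wrongly tagged with the fallback. Any tagger trained on the wrong kind of text runs into the same kind of failure.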

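The best-performing POS tagger currently in NLTK is the default Perceptron tagger, which uses a pretrained model. Here is how you would train your own model to improve accuracy.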

import math
import nltk

# one-time setup if the corpora are missing:
# nltk.download('brown'); nltk.download('universal_tagset')

# get tagged data to train and test on
tagged_sentences = list(nltk.corpus.brown.tagged_sents(categories='news', tagset='universal'))
# hold out the first 20% for testing, get index for the 20% split
split_idx = math.floor(len(tagged_sentences) * 0.2)
# testing sentences are words only, list(list(word))
testing_sentences = [[word for word, _ in test_sent] for test_sent in tagged_sentences[:split_idx]]
# training sentences are words and tags, list(list((word, tag)))
training_sentences = tagged_sentences[split_idx:]
# create an untrained perceptron POS tagger (load=False skips the pretrained model)
perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
perceptron_tagger.train(training_sentences)
# testing_sentences already hold bare words, so pass each one to tag() directly
pos_tagged_sentences = [perceptron_tagger.tag(test_sentence) for test_sentence in testing_sentences]

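Once perceptron_tagger.train() has finished on training_sentences, you can call perceptron_tagger.tag() to get pos_tagged_sentences. A tagger trained this way on text from your own domain should be more useful there and give higher accuracy.

If done right, it will produce high-accuracy results. From my basic tests, it shows the following results: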

Metrics for <nltk.tag.perceptron.PerceptronTagger object at 0x7f34904d1748>
 Accuracy : 0.965636914654
 Precision: 0.965271747376
 Recall   : 0.965636914654
 F1-Score : 0.965368188021
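The answer does not say how these metrics were computed. Accuracy and recall being identical is consistent with support-weighted averaging over tags, so the sketch below assumes that scheme. The function name weighted_metrics and the toy tag lists are hypothetical.

```python
from collections import Counter


def weighted_metrics(gold, predicted):
    """Accuracy plus support-weighted precision, recall and F1 over flat tag lists."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and the same length")
    total = len(gold)
    support = Counter(gold)              # how often each tag occurs in the gold data
    predicted_counts = Counter(predicted)
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)

    accuracy = sum(true_pos.values()) / total
    precision = recall = f1 = 0.0
    for tag, n in support.items():
        tp = true_pos[tag]
        p = tp / predicted_counts[tag] if predicted_counts[tag] else 0.0
        r = tp / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total               # weight each tag by its gold frequency
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: one VERB mis-tagged as NOUN.
gold_tags = ["NOUN", "VERB", "NOUN", "DET"]
predicted_tags = ["NOUN", "NOUN", "NOUN", "DET"]
metrics = weighted_metrics(gold_tags, predicted_tags)
print(metrics)
```

To score the tagger trained above, you would flatten the gold tags from tagged_sentences[:split_idx] and the predicted tags from pos_tagged_sentences into two lists of equal length and pass them in.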

Answer 3

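POS tagging is a surprisingly hard problem, considering how easy it is for humans. POS taggers have been written using many different approaches, and the Stanford tagger is among the best general-purpose taggers for English. (See here for a fairly authoritative comparison.) So if the approaches you suggest are any good, and some of them are, they are already in use.

If you think you can build a better tagger, by all means give it a shot; it will be a great learning experience. But don't be surprised if you can't beat a state-of-the-art POS tagger at what it does.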

Tags: grammar, nlp, nltk, pos-tagger