Entities in my gazette are not recognized

I want to create a custom NER model. This is what I did:

Training data (stanford-ner.tsv):

Hello    O
!    O
My    O
name    O
is    O
Damiano    PERSON
.    O

Properties (stanford-ner.prop):

trainFile = stanford-ner.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
gazette=gazzetta.txt
cleanGazette=true

Gazette (gazzetta.txt):

PERSON John
PERSON Andrea

I build the model from the command line:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -prop stanford-ner.prop

And test it with:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -loadClassifier ner-model.ser.gz -textFile test.txt

I ran two tests with the following texts:

>>> TEST 1 <<<

>>> TEST 2 <<<

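As you can see, only the "Damiano" entity is found. That entity is in my training data, whereas "John" (second test) is in the gazette. So here is my question:

Why is the John entity not recognized?

Thank you in advance.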

As the Stanford FAQ says:

If a gazette is used, this does not guarantee that words in the gazette are always used as a member of the intended class, and it does not guarantee that words outside the gazette will not be chosen. It simply provides another feature for the CRF to train against. If the CRF has higher weights for other features, the gazette features may be overwhelmed.

If you want something that will recognize text as a member of a class if and only if it is in a list of words, you might prefer either the regexner or the tokensregex tools included in Stanford CoreNLP. The CRF NER is not guaranteed to accept all words in the gazette as part of the expected class, and it may also accept words outside the gazette as part of the class.
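To make the "if and only if it is in a list" behaviour concrete, here is a minimal Python sketch of a pure lookup tagger. It is not regexner or tokensregex, just an illustration of deterministic matching as opposed to a CRF feature. The function names `load_gazette` and `tag_by_lookup` are made up for this example, and the case-insensitive matching is my own assumption.

```python
def load_gazette(lines):
    """Parse 'CLASS entry words...' lines into {lowercased entry tuple: CLASS}."""
    gazette = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            gazette[tuple(w.lower() for w in parts[1:])] = parts[0]
    return gazette


def tag_by_lookup(tokens, gazette, default="O"):
    """Label tokens with a class if and only if they match a gazette entry.

    Assumption: matching is case-insensitive and prefers the longest entry.
    """
    tags = [default] * len(tokens)
    max_len = max((len(key) for key in gazette), default=0)
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazette:
                tags[i:i + n] = [gazette[key]] * n
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return tags


gazette = load_gazette(["PERSON John", "PERSON Andrea"])
print(tag_by_lookup("Hello ! My name is John .".split(), gazette))
# ['O', 'O', 'O', 'O', 'O', 'PERSON', 'O']
```

With this kind of matcher "John" is always tagged and "Damiano" never is, which is the opposite trade-off from the CRF, where the gazette is only one feature among many.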

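By the way, it is not good practice to test a machine learning pipeline in a 'unit-test' fashion, i.e. with only one or two examples, because it is meant to work on a much larger volume of data and, more importantly, it is probabilistic in nature.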
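As a rough sketch of what I mean by testing on more data: score the tagger over a whole labelled test set rather than eyeballing one sentence. The function name `precision_recall` and the toy tag sequences below are made up for illustration; in practice the gold and predicted tags would come from a held-out file.

```python
def precision_recall(gold, predicted, label="PERSON"):
    """Token-level precision and recall for one label over aligned tag lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data only: one missed PERSON and one spurious PERSON.
gold = ["O", "PERSON", "O", "PERSON", "O", "PERSON"]
pred = ["O", "PERSON", "O", "O", "PERSON", "PERSON"]
p, r = precision_recall(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")
# precision=0.67 recall=0.67
```

A single missed "John" tells you very little; a drop in recall over a few hundred names does.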

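If you want to check whether your gazette file is actually being used, it is better to take an existing example (see the austen.gaz.prop and austen.gaz.txt example at the bottom of the page linked above), replace several of its names with your own, and then check. If that fails, first try changing your test, e.g. add more names, reorganize the text, and so on.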

Why is the John entity not recognized?

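It seems to me that your minimal example should most probably add "Damiano" to the gazette under the PERSON category. Currently, the training data lets the model learn that "Damiano" has the PERSON label, but I think this is unrelated to the gazette categories (i.e. having PERSON on both sides is not sufficient).

The gazette only helps to extract additional features from the training data. If the gazette words never occur in your training data, or have no connection to the labeled tokens, your model will not benefit from it. One experiment I would suggest is adding Damiano to your gazette.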
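A quick way to see the problem is to check how many gazette entries actually show up in the training data. The sketch below is a hypothetical helper (the names `read_training` and `gazette_coverage` are mine, not part of Stanford NER); it assumes whitespace-separated "word label" training lines as in your `map = word=0,answer=1` setting, and it only looks at single-word gazette entries.

```python
def read_training(lines):
    """Yield (word, label) pairs from 'word<whitespace>label' lines."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            yield parts[0], parts[1]


def gazette_coverage(gazette_lines, training_lines):
    """Map each single-word gazette entry to the labels it has in the training data."""
    seen = {}
    for word, label in read_training(training_lines):
        seen.setdefault(word, set()).add(label)
    coverage = {}
    for line in gazette_lines:
        parts = line.split()
        if len(parts) == 2:  # 'CLASS word'; multi-word entries skipped for brevity
            coverage[parts[1]] = seen.get(parts[1], set())
    return coverage


training = ["Hello\tO", "!\tO", "My\tO", "name\tO", "is\tO", "Damiano\tPERSON", ".\tO"]
gazette = ["PERSON John", "PERSON Andrea"]
for entry, labels in gazette_coverage(gazette, training).items():
    print(entry, sorted(labels) if labels else "never seen in training data")
# John never seen in training data
# Andrea never seen in training data
```

With your files, no gazette entry ever co-occurs with a PERSON label, so the CRF has no evidence that the gazette feature predicts PERSON. Adding "PERSON Damiano" to the gazette gives it at least one such example.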