How to train Stanford NLP NER Extraction model to skip the repeating words?

Question

I am trying to extract named entities from text using the .NET Framework and the Stanford NER model. I have text like this:

Hello, I am John Doe. Body Mass index is 27. And Body Surface Area is 2.3m.

For this I created a TSV file to train the model, as follows:

Hello   O
,   O
I   O
am  O
John    PERSON
Doe.    PERSON
Body    BMI
Mass    BMI
index   BMI
is  O
27. O
And O
Body    O
Surface O
Area    O
is  O
2.3m.   O

The properties file is as follows:

trainFileList = train/standford_train.tsv
serializeTo = dummy-ner-model-eng.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true

And I run the following java command:

java -mx1g -cp stanford-ner.jar;lib/* edu.stanford.nlp.ie.crf.CRFClassifier -annotators 'tokenize,ssplit,pos,lemma,ner,regexner' -prop train/prop.txt

So the problem I am facing is that Body gets the BMI tag twice, because the word is repeated in Body Mass index and in Body Surface Area.

Is there any way to omit the second Body tag?

Answer 1

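You need to generate more training data that includes examples where Body is not tagged as BMI. If you are only looking for specific patterns, you may get better results with a rule-based approach. Stanford CoreNLP has tools for building rule-based NER.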

More info: https://stanfordnlp.github.io/CoreNLP/tokensregex.html
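To make both suggestions concrete, here is a minimal sketch in plain Java. It does not use the CoreNLP or TokensRegex API; the class `PhraseLabeler` and its methods are hypothetical names made up for illustration. It tags a token as BMI only when the whole phrase Body Mass index matches, and prints the result in the same two-column word/answer layout as the training file, so it could also be used to produce extra training lines in which a standalone Body is labeled O.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Stanford NER or CoreNLP.
public class PhraseLabeler {

    // Returns one label per token: `label` for tokens inside a full,
    // case-insensitive match of `phrase`, and "O" for every other token.
    static List<String> label(List<String> tokens, List<String> phrase, String label) {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            labels.add("O");
        }
        if (phrase.isEmpty()) {
            return labels;
        }
        int i = 0;
        while (i + phrase.size() <= tokens.size()) {
            if (matchesAt(tokens, phrase, i)) {
                for (int j = 0; j < phrase.size(); j++) {
                    labels.set(i + j, label);
                }
                i += phrase.size(); // continue after the matched phrase
            } else {
                i++;
            }
        }
        return labels;
    }

    // True if every token of `phrase` matches the tokens starting at `start`.
    static boolean matchesAt(List<String> tokens, List<String> phrase, int start) {
        for (int j = 0; j < phrase.size(); j++) {
            if (!tokens.get(start + j).equalsIgnoreCase(phrase.get(j))) {
                return false;
            }
        }
        return true;
    }

    // Formats tokens and labels as tab-separated lines (word in column 0, answer in column 1).
    static String toTsv(List<String> tokens, List<String> labels) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            sb.append(tokens.get(i)).append('\t').append(labels.get(i)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "Body", "Mass", "index", "is", "27.",
            "And", "Body", "Surface", "Area", "is", "2.3m.");
        List<String> bmiPhrase = Arrays.asList("Body", "Mass", "index");
        System.out.print(toTsv(tokens, label(tokens, bmiPhrase, "BMI")));
    }
}
```

In this sketch the second Body stays O because the tokens after it do not complete the phrase. An exact phrase match like this only covers wordings you list in advance, so it is a stand-in for the kind of pattern rules TokensRegex supports, not a replacement for them.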

nlp

stanford-nlp

named-entity-recognition