Train model using Named entity

Question

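I am looking at the Stanford CoreNLP Named Entity Recognizer. I have different kinds of input text and I need to tag it with my own entity types. So I started training my own model, but it doesn't seem to be working.

For example, my input text string is "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q"

I went through the examples for training my own model, and I am only looking for a few words that I am interested in.

My jane-austen-emma-ch1.tsv looks like this: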

Toyota  PERS
Land Cruiser    PERS

From the above input text, I am only interested in those two terms. One is Toyota and the other is Land Cruiser.

My austen.prop looks like this:

trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

I run the following command to generate the ner-model.ser.gz file:

java -cp stanford-corenlp-3.4.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

import java.util.List;
import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.ling.CoreAnnotations.AnswerAnnotation;
import edu.stanford.nlp.ling.CoreLabel;

public static void main(String[] args) {
    // Stock 7-class model shipped with CoreNLP, plus the custom model trained above.
    String serializedClassifier = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";
    String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
    try {
        // The custom model is listed first, so its labels take precedence in the combiner.
        NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
                serializedClassifier2, serializedClassifier);
        String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
        System.out.println("---");
        List<List<CoreLabel>> out = classifier.classify(ss);
        // Print every token followed by the label the classifier assigned to it.
        for (List<CoreLabel> sentence : out) {
            for (CoreLabel word : sentence) {
                System.out.print(word.word() + '/' + word.get(AnswerAnnotation.class) + ' ');
            }
            System.out.println();
        }
    } catch (Exception e) {
        // Covers model loading failures as well as runtime errors during classification.
        e.printStackTrace();
    }
}

Here is the output I am getting:

Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/PERS 1956-1987/PERS Gold/O Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS

I think this is wrong. I am looking for Toyota/PERS and Land Cruiser/PERS (which is a multi-valued field).

Thanks for your help. Any help is really appreciated.

Answer 1

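NERClassifier* is word-level, that is, it labels words, not phrases. Given that, the classifier seems to be performing fine. If you want, you can join the words that form a phrase with a connector character. So in both your labeled examples and your test examples, you would turn "Land Cruiser" into "Land_Cruiser".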
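As a rough illustration of that preprocessing step, here is a minimal sketch in plain Java. The class name PhraseJoiner, the method joinPhrases, and the idea of keeping a fixed list of known phrases are my own assumptions for the example, not anything provided by Stanford NER. The same rewrite would have to be applied to the training data and to the text you classify.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: not part of Stanford NER.
public class PhraseJoiner {
    // Replaces each known multi-word phrase with a single underscore-joined token.
    public static String joinPhrases(String text, List<String> phrases) {
        String result = text;
        for (String phrase : phrases) {
            result = result.replace(phrase, phrase.replace(' ', '_'));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio";
        // "Land Cruiser" becomes the single token "Land_Cruiser"; everything else is unchanged.
        System.out.println(joinPhrases(text, Arrays.asList("Land Cruiser")));
    }
}
```

Plain string replacement is case-sensitive and only catches the exact phrases you list, so this only helps when the set of multi-word entities is known in advance.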

Answer 2

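I believe you should also put examples of O (background, i.e. non-entity) tokens in your trainFile. As you gave it, the trainFile is too simple for any learning to happen: it needs both O and PERS examples so that it doesn't annotate everything as PERS. Right now you are not teaching it anything about the tokens you are not interested in. Say, like this:

Toyota  PERS
of    O
Portfolio    O
49    O

and so on.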
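To make that concrete, here is a small sketch of how such a file could be generated from a raw sentence. It is plain Java with no Stanford classes; the class name TrainingRows and the method toRows are made up for the example. It also assumes whitespace tokenization, which is a simplification: the real pipeline tokenizes differently (punctuation is split off, for instance), so treat the output as a starting point rather than a finished training set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of Stanford NER.
public class TrainingRows {
    // Builds one "word<TAB>label" row per whitespace-separated token.
    // Tokens found in entityWords get entityLabel; all other tokens get "O".
    public static List<String> toRows(String sentence, Set<String> entityWords, String entityLabel) {
        List<String> rows = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            String label = entityWords.contains(token) ? entityLabel : "O";
            rows.add(token + "\t" + label);
        }
        return rows;
    }

    public static void main(String[] args) {
        // "Land" and "Cruiser" are listed separately because labels are assigned per token.
        Set<String> entities = new HashSet<>(Arrays.asList("Toyota", "Land", "Cruiser"));
        String sentence = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio";
        for (String row : toRows(sentence, entities, "PERS")) {
            System.out.println(row);
        }
    }
}
```

The column order (word first, label second) matches the map = word=0,answer=1 line in the properties file from the question.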

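Also, for phrase-level recognition you should look into regexner, where you can define patterns (and patterns are good for us). I am working on this through the API and have the following code: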

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.put("regexner.mapping", customLocationFilename);

with the following customLocationFilename:

Make Believe Town   figure of speech    ORGANIZATION
( /Hello/ [{ ner:PERSON }]+ )   salut   PERSON
Bachelor of (Arts|Laws|Science|Engineering) DEGREE
( /University/ /of/ [{ ner:LOCATION }] )    SCHOOL

and the text: Hello Mary Keller was born on 4th of July and took a Bachelor of Science. Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to University of London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney Weaver) says they will pay this on the usual credit terms (30 days).

The output I get is:

Hello Mary Keller is a salut
4th of July is a DATE
Bachelor of Science is a DEGREE
$ 100,000 is a MONEY
40 % is a PERCENT
15th August is a DATE
University of London is a ORGANIZATION
Make Believe Town is a figure of speech
Sigourney Weaver is a PERSON
30 days is a DURATION

For more information on how to do this, you can look at the example that got me going.
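Since the taggers label individual tokens, phrase-style lines like "Sigourney Weaver is a PERSON" have to be assembled by merging adjacent tokens that carry the same label. The sketch below shows one way to do that in plain Java. It is not the code that produced the output above; the class name EntityGrouper and the parallel words/labels arrays are assumptions for the example, and in a real pipeline the words and labels would come from the annotated tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Stanford CoreNLP.
public class EntityGrouper {
    // Merges runs of adjacent tokens with the same non-"O" label into one phrase
    // and formats each run as "<phrase> is a <label>".
    public static List<String> group(String[] words, String[] labels) {
        List<String> out = new ArrayList<>();
        StringBuilder phrase = new StringBuilder();
        String current = "O";
        for (int i = 0; i <= words.length; i++) {
            // One extra iteration with a sentinel "O" flushes a run that ends the sentence.
            String label = i < words.length ? labels[i] : "O";
            if (!label.equals(current)) {
                if (!current.equals("O")) {
                    out.add(phrase + " is a " + current);
                }
                phrase.setLength(0);
                current = label;
            }
            if (i < words.length && !label.equals("O")) {
                if (phrase.length() > 0) {
                    phrase.append(' ');
                }
                phrase.append(words[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"Customer", "contact", "Sigourney", "Weaver", "says", "30", "days"};
        String[] labels = {"O", "O", "PERSON", "PERSON", "O", "DURATION", "DURATION"};
        for (String line : group(words, labels)) {
            System.out.println(line);
        }
    }
}
```

One limitation of this approach: two different entities of the same type that sit directly next to each other are merged into a single phrase, because nothing in the per-token labels marks where one ends and the next begins.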
