How to suppress unmatched words in Stanford NER classifiers?

Question

I am new to Stanford NLP and NER, and I am trying to train a custom classifier with data sets of currencies and countries.

My training data in training-data-currency.tsv looks like this:

USD CURRENCY
GBP CURRENCY

And my training data in training-data-countries.tsv looks like this:

USA COUNTRY
UK  COUNTRY

And the classifier properties look like this:

trainFileList = classifiers/training-data-currency.tsv,classifiers/training-data-countries.tsv
ner.model=classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz
serializeTo = classifiers/my-classification-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

The Java code to find the categories is:

// Maps each predicted category to the set of words that received it
LinkedHashMap<String, LinkedHashSet<String>> map = new LinkedHashMap<String, LinkedHashSet<String>>();
NERClassifierCombiner classifier = null;
try {
    // Backslashes in a Java string literal must be escaped
    classifier = new NERClassifierCombiner(true, true,
            "C:\\Users\\perso\\Downloads\\stanford-ner-2015-04-20\\stanford-ner-2015-04-20\\classifiers\\my-classification-model.ser.gz"
            );
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
List<List<CoreLabel>> classify = classifier.classify("Zambia");
for (List<CoreLabel> coreLabels : classify) {
    for (CoreLabel coreLabel : coreLabels) {

        String word = coreLabel.word();
        String category = coreLabel
                .get(CoreAnnotations.AnswerAnnotation.class);
        if (!"O".equals(category)) {
            if (map.containsKey(category)) {
                map.get(category).add(word);
            } else {
                LinkedHashSet<String> temp = new LinkedHashSet<String>();
                temp.add(word);
                map.put(category, temp);
            }
            System.out.println(word + ":" + category);
        }

    }

}

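When I run the above code with the input "USD" or "UK", I get the expected result, "CURRENCY" or "COUNTRY". But when I give an input like "Russia", the returned value is "CURRENCY", which comes from the first training file in the properties. I expect 'O' to be returned for values like this that are not present in my training data.

How can I achieve this behavior? Any pointers on where I am going wrong would be really helpful.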

Answer 1

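Hi, I'll try to help out!

So it sounds to me like you have a list of strings that should be called "CURRENCY", a list of strings that should be called "COUNTRY", and so on.

And you want strings to be tagged based on your lists. So when you see "RUSSIA" you want it to be tagged "COUNTRY", and when you see "USD" you want it to be tagged "CURRENCY".

I think these tools will be more helpful for you (particularly the first one):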

http://nlp.stanford.edu/software/regexner/

http://nlp.stanford.edu/software/tokensregex.shtml

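NERClassifierCombiner is designed to train on large volumes of tagged sentences and to look at a variety of features, including capitalization and the surrounding words, to make a guess about a given word's NER label.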
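To make "tagged sentences" concrete, here is a rough sketch of what such training rows might look like. It assumes the two-column, tab-separated layout implied by the `map = word=0,answer=1` line in the question, with every non-entity token labeled O. The class name `TrainingRows`, the method `toRows`, and the sample sentence are made up for illustration and are not part of Stanford NER.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns a tokenized sentence into word<TAB>label rows.
class TrainingRows {

    // Tokens without an entry in entityLabels get the background label "O".
    static String toRows(List<String> tokens, Map<String, String> entityLabels) {
        StringBuilder rows = new StringBuilder();
        for (String token : tokens) {
            String label = entityLabels.getOrDefault(token, "O");
            rows.append(token).append('\t').append(label).append('\n');
        }
        return rows.toString();
    }

    public static void main(String[] args) {
        Map<String, String> entityLabels = new HashMap<>();
        entityLabels.put("USD", "CURRENCY");
        entityLabels.put("UK", "COUNTRY");

        // Invented example sentence, already split into tokens.
        List<String> tokens = Arrays.asList("Prices", "in", "the", "UK", "are", "quoted", "in", "USD", ".");
        System.out.print(toRows(tokens, entityLabels));
    }
}
```

The training files in the question contain only entity rows and no O rows at all, which may be part of why the model never predicts O.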

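But it sounds to me like in your case you just want to explicitly tag certain sequences based on your pre-defined lists. So I would explore the links I provided above.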
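As a minimal sketch of the list-based idea, the plain-Java lookup below tags a token only when it appears in one of the predefined lists and falls back to "O" otherwise. This is a stand-in to illustrate the behavior, not RegexNER or TokensRegex themselves. The class `GazetteerTagger` and its methods are hypothetical names, and matching here is exact and case-sensitive by assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical dictionary-based tagger: exact-match lookup with "O" as the default.
class GazetteerTagger {
    private final Map<String, String> labelsByToken = new LinkedHashMap<>();

    // Registers every token in the list under the given label.
    void addEntries(String label, String... tokens) {
        for (String token : tokens) {
            labelsByToken.put(token, label);
        }
    }

    // Returns the label for a known token, or "O" for anything not in the lists.
    String tag(String token) {
        return labelsByToken.getOrDefault(token, "O");
    }

    public static void main(String[] args) {
        GazetteerTagger tagger = new GazetteerTagger();
        tagger.addEntries("CURRENCY", "USD", "GBP");
        tagger.addEntries("COUNTRY", "USA", "UK");

        for (String token : Arrays.asList("USD", "UK", "Russia")) {
            System.out.println(token + ":" + tagger.tag(token));
        }
        // prints:
        // USD:CURRENCY
        // UK:COUNTRY
        // Russia:O
    }
}
```

Unlike a statistical classifier, a lookup like this never guesses a label for unseen words, which is the behavior the question asks for. It also cannot generalize to entries missing from the lists.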

Please let me know if you need any more help and I will be happy to follow up!
