Number name entity recognition in Stanford

Question

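I am trying to recognize number named entities in text using Stanford NLP. When the text contains, for example, "20 million", the result comes back as "Number":["20-5 ","million-6"]. How can I improve the result so that "20 million" is returned together as one entity? And how can I drop the index numbers, such as the 5 and 6 in the example above? I am using Java.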

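    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.util.CoreMap;

    // `pipeline`, `number`, and `n` are fields declared elsewhere in the enclosing class (not shown).
    public void extractNumbers(String text) throws IOException {
        number = new HashMap<String, ArrayList<String>>();
        n = new ArrayList<String>();
        edu.stanford.nlp.pipeline.Annotation document = new edu.stanford.nlp.pipeline.Annotation(text);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // "NUMBER".equals(...) is null-safe and already excludes the "O" (no entity) tag.
                if ("NUMBER".equals(token.get(CoreAnnotations.NamedEntityTagAnnotation.class))) {
                    // token.toString() yields the word plus its index, e.g. "20-5".
                    n.add(token.toString());
                    number.put("Number", n);
                }
            }
        }
    }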

Answer 1

To get the exact text from any object of the CoreLabel class, simply use token.originalText() instead of token.toString().

If you need anything else from these tokens, take a look at the javadoc for CoreLabel.
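
The answer above covers the index numbers but not the grouping of "20 million". One possible approach, sketched below, is to join each run of consecutive tokens tagged NUMBER into a single string. To keep the example runnable without the CoreNLP jars, `TaggedToken` is a hypothetical stand-in for CoreLabel (it is not part of Stanford NLP), and `NumberMerger` and `mergeNumbers` are illustrative names. In the real loop, the text would come from token.originalText() and the tag from token.get(CoreAnnotations.NamedEntityTagAnnotation.class).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NumberMerger {

    /** Hypothetical stand-in for a CoreLabel: the token's original text and its NER tag. */
    static final class TaggedToken {
        final String text;
        final String nerTag;

        TaggedToken(String text, String nerTag) {
            this.text = text;
            this.nerTag = nerTag;
        }
    }

    /** Joins each run of consecutive NUMBER tokens into one space-separated string. */
    static List<String> mergeNumbers(List<TaggedToken> tokens) {
        List<String> numbers = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (TaggedToken token : tokens) {
            if ("NUMBER".equals(token.nerTag)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(token.text);
            } else if (current.length() > 0) {
                // A non-NUMBER token ends the current run.
                numbers.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            // Flush a run that reaches the end of the input.
            numbers.add(current.toString());
        }
        return numbers;
    }

    public static void main(String[] args) {
        List<TaggedToken> tokens = Arrays.asList(
                new TaggedToken("It", "O"),
                new TaggedToken("cost", "O"),
                new TaggedToken("20", "NUMBER"),
                new TaggedToken("million", "NUMBER"),
                new TaggedToken("in", "O"),
                new TaggedToken("2", "NUMBER"),
                new TaggedToken("installments", "O"));
        System.out.println(mergeNumbers(tokens)); // prints [20 million, 2]
    }
}
```

Note that this sketch merges across whatever token list it is given, so calling it once per sentence keeps a number at the end of one sentence from being joined to a number at the start of the next.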

nlp

stanford-nlp

opennlp