Resolving an Edge Case While Using Java's BreakIterator

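I'm working on a side project that applies NLP to clinical data, and I'm using Java's BreakIterator to split text into sentences for further analysis. While doing so, I ran into a problem: BreakIterator does not recognize a sentence boundary when the next sentence begins with a number.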

Example:

String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";

Expected output:

1) No acute osseous abnormality.
2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.

Actual output:

1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.

Code:

import java.text.BreakIterator;
import java.util.Locale;

public class Test {
   public static void main(String[] args) {
      String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
      Locale locale = Locale.US;
      // Sentence-level iterator for the given locale.
      BreakIterator splitIntoSentences = BreakIterator.getSentenceInstance(locale);
      splitIntoSentences.setText(text);
      // Walk consecutive boundaries; each [start, end) span is one detected sentence.
      int start = splitIntoSentences.first();
      for (int end = splitIntoSentences.next(); end != BreakIterator.DONE; start = end, end = splitIntoSentences.next()) {
         System.out.println(text.substring(start, end));
      }
   }
}

Any help would be greatly appreciated. I tried to find an answer online, but without success.

I'm now using Apache OpenNLP instead of BreakIterator, and it works well!
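For anyone who would rather stay with the JDK and avoid an extra dependency, one possible workaround is to post-process each chunk that BreakIterator returns and split it again wherever sentence-ending punctuation is followed by a list marker such as "2)". The sketch below is only an illustration under that assumption: the class name NumberedSentenceSplitter, the split helper, and the regular expression are made up for this example and are not part of BreakIterator or OpenNLP. The pattern only handles markers of the form "N)".

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: BreakIterator first, then a regex pass for "N)" list markers.
public class NumberedSentenceSplitter {

    // Assumption: a new item starts where whitespace follows '.', '!' or '?'
    // and is itself followed by digits, a closing parenthesis, and a space.
    private static final Pattern BEFORE_LIST_MARKER =
            Pattern.compile("(?<=[.!?])\\s+(?=\\d+\\)\\s)");

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // A chunk may still contain several numbered items; split it further.
            String chunk = text.substring(start, end);
            for (String piece : BEFORE_LIST_MARKER.split(chunk)) {
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(trimmed);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space "
                + "narrowing at the L4-5 level. This is another sentence.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

With the example text from the question, this should print the three lines listed under the expected output. The regex pass is a heuristic: it can split in the wrong place if ordinary prose contains a number followed by ")" right after a period, and markers such as "2." or "(2)" would need their own patterns. For real clinical text, a trained sentence detector like the one in OpenNLP is likely the more robust choice.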