Resolving an edge case while using Java's BreakIterator
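I'm working on a side project that applies NLP to clinical data, and I'm using Java's BreakIterator to split text into sentences for further analysis. While using BreakIterator, I ran into a problem: it does not recognize sentences that begin with a number.
Example:
String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
Expected output: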
1) No acute osseous abnormality.
2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.
Actual output:
1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level.
This is another sentence.
Code:
import java.text.BreakIterator;
import java.util.Locale;

public class Test {
    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. 2) Mild to moderate disc space narrowing at the L4-5 level. This is another sentence.";
        Locale locale = Locale.US;
        BreakIterator splitIntoSentences = BreakIterator.getSentenceInstance(locale);
        splitIntoSentences.setText(text);
        // Walk the boundaries: each sentence spans [index, current()).
        int index = splitIntoSentences.first();
        while (splitIntoSentences.next() != BreakIterator.DONE) {
            // trim() drops the whitespace that trails each sentence.
            String sentence = text.substring(index, splitIntoSentences.current()).trim();
            System.out.println(sentence);
            index = splitIntoSentences.current();
        }
    }
}
Any help would be appreciated. I tried to find an answer online, but to no avail.
I'm now using Apache OpenNLP instead of BreakIterator, and it works well!
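For anyone who would rather stay with BreakIterator, one possible workaround is to post-process its output and split each chunk again wherever sentence-ending punctuation is followed by a numbered-list marker such as "2)". The sketch below is only a heuristic under that assumption, not how BreakIterator or OpenNLP work internally; the class name NumberedSentenceSplitter and the regular expression are hypothetical choices made for illustration, and the pattern would need adjusting for markers like "2." or "(2)".

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: BreakIterator first, then a regex pass for numbered items.
class NumberedSentenceSplitter {

    // Assumed heuristic: ".", "!" or "?", then whitespace, then digits and ")".
    // Only the whitespace is consumed, so the punctuation and marker are kept.
    private static final Pattern BEFORE_LIST_MARKER =
            Pattern.compile("(?<=[.!?])\\s+(?=\\d+\\)\\s)");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);

        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each chunk is whatever BreakIterator considers one sentence.
            String chunk = text.substring(start, end).trim();
            // Split the chunk again in front of any numbered-list marker.
            for (String piece : BEFORE_LIST_MARKER.split(chunk)) {
                if (!piece.isEmpty()) {
                    sentences.add(piece);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "1) No acute osseous abnormality. "
                + "2) Mild to moderate disc space narrowing at the L4-5 level. "
                + "This is another sentence.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

On the example text from the question this should print the three expected sentences, one per line. Because the second pass only acts when it finds a marker inside a chunk, it leaves the result unchanged in any case where BreakIterator has already broken before the number.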