Optimize Finding Words in Paragraph

Question

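I am searching for words in a paragraph, but it takes a long time on long paragraphs. So I would like to remove each word from the paragraph once it has been found, to reduce the amount of text I have to scan. Or, if there is a better way to make this more efficient, please let me know!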

List<String> list = new ArrayList<>();
for (String word : wordList) {
    // "\\b" is the regex word boundary; a single "\b" in a Java string literal is a backspace character
    String regex = ".*\\b" + Pattern.quote(word) + "\\b.*";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(paragraph);
    if (m.find()) {
        System.out.println("Found: " + word);
        list.add(word);
    }
}

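For example, say my wordList has the following values: "apple", "hungry", "pie"

And my paragraph is "I ate an apple, but I am still hungry, so I will eat pie"

I want to find the words from wordList in paragraph and eliminate them, in the hope of making the code above faster.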

Answer 1

I'm not entirely sure this is what you are asking for, but Java has a built-in String method for this kind of thing.

for (String word : wordList) {
    paragraph = paragraph.replaceAll(word,"");
}

Be sure to include a space in your word so that removing it does not leave two spaces behind. For example, use "foo " rather than "foo".
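Note that replaceAll treats its first argument as a regular expression and will also remove the word when it appears inside a longer word. The following is a minimal sketch of one way to guard against both, assuming whole-word removal is what is wanted; the class name RemoveWords and the method removeWords are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RemoveWords {
    // Removes every whole-word occurrence of each word, plus one trailing space if present.
    static String removeWords(String paragraph, List<String> wordList) {
        for (String word : wordList) {
            // Pattern.quote keeps any regex metacharacters in the word literal;
            // \\b limits the match to whole words; " ?" consumes one following space.
            String regex = "\\b" + Pattern.quote(word) + "\\b ?";
            paragraph = paragraph.replaceAll(regex, "");
        }
        return paragraph;
    }

    public static void main(String[] args) {
        String paragraph = "I ate an apple, but I am still hungry, so I will eat pie";
        List<String> wordList = Arrays.asList("apple", "hungry", "pie");
        System.out.println(removeWords(paragraph, wordList));
    }
}
```

This still scans the paragraph once per word, so it addresses correctness rather than speed.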

Answer 2

You can use

String paragraph = "I ate an apple, but I am still hungry, so I will eat pie";
List<String> wordList = Arrays.asList("apple", "hungry", "pie");
// Build one alternation so the paragraph is scanned once; note the doubled backslash in "\\b"
Pattern p = Pattern.compile("\\b(?:" + String.join("|", wordList) + ")\\b");
Matcher m = p.matcher(paragraph);
if (m.find()) {  // To find all matches, replace "if" with "while"
    System.out.println("Found " + m.group()); // => Found apple
}

See the Java demo.

The regex looks like \b(?:word1|word2|wordN)\b and matches:

\b - a word boundary
(?:word1|word2|wordN) - a non-capturing group matching any one of the alternatives
\b - a word boundary
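As the comment in the snippet suggests, swapping "if" for "while" finds every match. A minimal sketch of that variant follows, collecting the matched words into a set; the class name FindAllWords, the method findWords, and the choice of LinkedHashSet (to drop repeats while keeping first-seen order) are assumptions made for illustration, not part of the original answer.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllWords {
    // Scans the paragraph once and returns each listed word that occurs in it.
    static Set<String> findWords(String paragraph, List<String> wordList) {
        // One combined pattern instead of one pattern per word.
        // Assumes the words contain no regex metacharacters.
        Pattern p = Pattern.compile("\\b(?:" + String.join("|", wordList) + ")\\b");
        Matcher m = p.matcher(paragraph);
        Set<String> found = new LinkedHashSet<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String paragraph = "I ate an apple, but I am still hungry, so I will eat pie";
        List<String> wordList = Arrays.asList("apple", "hungry", "pie");
        for (String word : findWords(paragraph, wordList)) {
            System.out.println("Found " + word);
        }
    }
}
```

Because the paragraph is scanned a single time no matter how many words are listed, there is no need to delete found words from it to save work.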

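Since you said the characters in the words can only be uppercase letters, digits, and hyphens with slashes, none of which need escaping, Pattern.quote is not important here. Also, since the slashes and hyphens never appear at the start or end of a word, you will not run into the problems that \b word boundaries usually cause. Otherwise, replace the first "\\b" with "(?<!\\w)" and the last one with "(?!\\w)".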
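For the "otherwise" case, here is a minimal sketch that combines the lookarounds with Pattern.quote, assuming the words may begin or end with non-word characters. The class name BoundarySafeSearch, its methods, and the sample words "C++" and "C#" are hypothetical and chosen only to show where \b would fail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class BoundarySafeSearch {
    // Builds (?<!\w)(?:w1|w2|...)(?!\w) with every word quoted as a literal.
    static Pattern buildPattern(List<String> wordList) {
        String alternation = wordList.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // The lookarounds only require that no word character touches the match,
        // so they also work when a word starts or ends with a symbol.
        return Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)");
    }

    // Returns every match in the order it appears in the paragraph.
    static List<String> findWords(String paragraph, List<String> wordList) {
        Matcher m = buildPattern(wordList).matcher(paragraph);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> wordList = Arrays.asList("C++", "C#");
        System.out.println(findWords("I use C++ and C# daily", wordList));
    }
}
```

With plain \b, a word such as "C++" followed by a space would not match, because there is no word boundary between "+" and the space.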
