Java regex matches but String.replaceAll() doesn't replace matching substrings

Question

public class test {
    public static void main(String[] args) {
        String test1 = "N&oslash;rrebro, Denmark";
        String test2 = "&oslash;";
        // Backslashes must be doubled inside a Java string literal.
        String regex = "^&\\S*;$";
        String value = test1.replaceAll(regex, "");
        System.out.println(test2.matches(regex));
        System.out.println(value);
    }
}

This gives me the following output:

true
N&oslash;rrebro, Denmark

How is this possible? Why doesn't replaceAll() register a match?

Answer 1

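Your regex includes ^, so a match can only begin at the very start of the string (and $ likewise pins it to the very end).

If you try

test1.matches(regex)

you will get false.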
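
To make the point above concrete, here is a minimal sketch that runs the question's anchored pattern through both a whole-string check and a search. The class and method names (AnchorDemo, matchesWhole, foundAnywhere) are made up for illustration; replaceAll() searches the same way Matcher.find() does, which is why it finds nothing to replace.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // The anchored pattern from the question.
    static final String ANCHORED = "^&\\S*;$";

    // True only when the pattern covers the entire input.
    static boolean matchesWhole(String input) {
        return input.matches(ANCHORED);
    }

    // True when the pattern can be found somewhere in the input.
    static boolean foundAnywhere(String input) {
        return Pattern.compile(ANCHORED).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("&oslash;"));                  // true
        System.out.println(matchesWhole("N&oslash;rrebro, Denmark"));  // false
        // ^ can only match at index 0, where the input has 'N', not '&'.
        System.out.println(foundAnywhere("N&oslash;rrebro, Denmark")); // false
    }
}
```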

Answer 2

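This is possible because the ^&\S*;$ pattern matches the entire &oslash; string but does not match the entire N&oslash;rrebro, Denmark string. The ^ matches (here, requires) the start of the string right before the &, and $ requires the ; to appear at the very end of the string.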

Just removing the ^ and $ anchors might not work, because \S* is a greedy pattern and may overmatch, e.g. in N&oslash;rrebro;.
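
A small sketch of that overmatch, assuming a made-up input that has a second semicolon before the next whitespace (GreedyDemo and stripGreedy are illustrative names, not part of the question):

```java
public class GreedyDemo {
    // The question's pattern with the anchors removed; \S* is greedy.
    static String stripGreedy(String input) {
        return input.replaceAll("&\\S*;", "");
    }

    public static void main(String[] args) {
        // Works on the original input because it contains only one ';'.
        System.out.println(stripGreedy("N&oslash;rrebro, Denmark")); // Nrrebro, Denmark
        // \S* runs to the last ';' before the space and swallows "rrebro" too.
        System.out.println(stripGreedy("N&oslash;rrebro; Denmark")); // N Denmark
    }
}
```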

You can use the &\w+; or &\S+?; pattern instead, e.g.:

String test1 = "N&oslash;rrebro, Denmark";
String regex = "&\\w+;"; // the backslash is doubled in a Java string literal
String value = test1.replaceAll(regex, "");
System.out.println(value); // => Nrrebro, Denmark

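See the Java demo.

The &\w+; pattern matches a &, then one or more word characters, then a ;, anywhere inside the string. In the alternative pattern, \S+? matches one or more non-whitespace characters, as few as possible.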
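
The two patterns are not interchangeable, so here is a rough comparison. The numeric entity &#248; is an input I am assuming for illustration; it is not in the question. Since # is not a word character, &\w+; skips it while &\S+?; removes it. The class and method names are hypothetical.

```java
public class EntityPatterns {
    // Word-character variant: letters, digits and underscore only.
    static String stripWordEntities(String input) {
        return input.replaceAll("&\\w+;", "");
    }

    // Lazy variant: any non-whitespace characters, as few as possible.
    static String stripLazyEntities(String input) {
        return input.replaceAll("&\\S+?;", "");
    }

    public static void main(String[] args) {
        String named = "N&oslash;rrebro, Denmark";
        String numeric = "N&#248;rrebro, Denmark"; // assumed extra test input

        System.out.println(stripWordEntities(named));   // Nrrebro, Denmark
        System.out.println(stripLazyEntities(named));   // Nrrebro, Denmark
        System.out.println(stripWordEntities(numeric)); // N&#248;rrebro, Denmark
        System.out.println(stripLazyEntities(numeric)); // Nrrebro, Denmark
    }
}
```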

Answer 3

You need to understand what ^ and $ mean.

You probably put them there because you wanted to say:

At the start of each match, I want a &, then 0 or more non-whitespace characters, then a ; at the end of the match.

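However, ^ and $ do not mean the start and end of each match. They mean the start and end of the string.

So you should remove ^ and $ from your regex:

String regex = "&\\S*;";

Now the output is: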

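true
Nrrebro, Denmark

"What character specifies the start and end of the match then?" you might ask. Well, since your regex is basically the pattern that you are matching, the start of the regex is the start of the match (unless you have lookbehinds)!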

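If you want to see where each match starts and ends, a sketch like the following can help. It uses the standard Matcher.find(), start() and end() calls; MatchSpans and spans are names I made up, and the second input is an assumed extra example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchSpans {
    // Unanchored pattern from this answer.
    private static final Pattern ENTITY = Pattern.compile("&\\S*;");

    // Returns "start-end" for every match; end is exclusive.
    static List<String> spans(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ENTITY.matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // The match begins wherever the first token of the regex (&) matches.
        System.out.println(spans("N&oslash;rrebro, Denmark")); // [1-9]
        System.out.println(spans("a &lt; b"));                 // [2-6]
    }
}
```
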
Answer 4

You can use this regex: &(.*?);

String test1 = "N&oslash;rrebro, Denmark";
String test2 = "&oslash;";
String regex = "&(.*?);";
String value = test1.replaceAll(regex, "");
System.out.println(test2.matches(regex));
System.out.println(value);

Output:

true 
Nrrebro, Denmark
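
One thing to watch for with this pattern: . also matches spaces, so a bare & followed later by a ; gets removed along with everything in between. A quick sketch, assuming a made-up input with a stray ampersand (DotPatternDemo and its methods are illustrative names; &\w+; is included only for comparison):

```java
public class DotPatternDemo {
    // Pattern from this answer: lazy, but '.' also matches whitespace.
    static String stripWithDot(String input) {
        return input.replaceAll("&(.*?);", "");
    }

    // Stricter comparison pattern: word characters only.
    static String stripWithWord(String input) {
        return input.replaceAll("&\\w+;", "");
    }

    public static void main(String[] args) {
        String input = "Tom & Jerry; N&oslash;rrebro"; // assumed test input

        // "& Jerry;" is treated as a match and removed as well.
        System.out.println("[" + stripWithDot(input) + "]");
        // Only the real entity is removed.
        System.out.println("[" + stripWithWord(input) + "]"); // [Tom & Jerry; Nrrebro]
    }
}
```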
