正则表达式回溯直到 Java 溢出
Regular Expression backtracks until overflow in Java
下面的表达式:
^(#ifdef FEATURE)+?\s*$((\r\n.*?)*^(#endif)+\s*[\/\/]*\s*(end of)*\s*FEATURE)+?$
在 运行 我编译的 .Jar 文件时覆盖匹配缓冲区。
匹配字符串可以类似于:
this is a junk line
#ifdef FEATURE
#endif // end of FEATURE
this is a junk line
#ifdef FEATURE
this is a junk line that should be matched: HOLasduiqwhei & // FEATURE fjfefj
#endif // h
#endif FEATURE
this is a junk line
因此,粗体字符串应该匹配。错误如下:
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
欢迎回溯避免 strategy/improvement 表达式。我试过原子组 (?>)
但由于某种原因没有简化。
代码如下:
public String strip(字符串文本) {
ArrayList<String> patterns=new ArrayList<String>();
patterns=readFile("Disabled_Features.txt");
for(int i = 0; i < patterns.size(); ++i)
{
Pattern todoPattern = Pattern.compile("^#ifdef "+patterns.get(i)+"((?:\r?\n(?!#endif (?:// end of )?"+patterns.get(i)+"$).*)*)\r?\n#endif (?:// end of )?"+patterns.get(i)+"$",Pattern.MULTILINE);
Matcher m = todoPattern.matcher(text);
text = m.replaceAll("");
}
return text;
}
我试过@Wiktor写的代码,效果很好
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TestRegex {
public static void main(String[] args) {
String text = "this is a junk line\n" +
"\n" +
"#ifdef FEATURE \n" +
"#endif // end of FEATURE\n" +
"\n" +
"this is a junk line\n" +
"\n" +
"#ifdef FEATURE\n" +
"\n" +
"this is a junk line that should be matched: HOLasduiqwhei & // FEATURE fjfefj #endif // h\n" +
"\n" +
"#endif FEATURE\n" +
"\n" +
"this is a junk line";
// this version does not use Pattern.MULTILINE, this should reduce the backtraking
Matcher matcher2 = Pattern.compile("\n#ifdef FEATURE((?:\r?\n(?!#endif (?:// end of )?FEATURE).*)*)\r?\n#endif (?:// end of )?FEATURE").matcher(text);
while (matcher2.find()) {
System.out.println(matcher2.group());
}
}
}
这让我认为您的问题是由于输入文件的大小造成的。
因此,如果您的文件太大,您可以将输入实现为 CharSequence
,这样您就可以包装大文本文件。为什么?因为从 Pattern
构建 Matcher
需要 CharSequence
作为参数。
更新:
我尝试实现了 Wiktor 的解决方案:
"^#ifdef "+patterns.get(i)+"((?:\r?\n(?!#endif (?:// end of )?"+patterns.get(i)+"$).*)*)\r?\n#endif (?:// end of )?"+patterns.get(i)+"$"
并且它仅捕获第二个块,但不捕获以下块:
#ifdef FEATURE
Junk captured text
#endif // end of FEATURE
无论如何,当我 运行 罐子仍然溢出时。
下面的表达式:
^(#ifdef FEATURE)+?\s*$((\r\n.*?)*^(#endif)+\s*[\/\/]*\s*(end of)*\s*FEATURE)+?$
在 运行 我编译的 .Jar 文件时覆盖匹配缓冲区。
匹配字符串可以类似于:
this is a junk line
#ifdef FEATURE
#endif // end of FEATUREthis is a junk line
#ifdef FEATURE
this is a junk line that should be matched: HOLasduiqwhei & // FEATURE fjfefj #endif // h
#endif FEATURE
this is a junk line
因此,粗体字符串应该匹配。错误如下:
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
at java.util.regex.Pattern$GroupHead.match(Unknown Source)
at java.util.regex.Pattern$Loop.match(Unknown Source)
at java.util.regex.Pattern$GroupTail.match(Unknown Source)
at java.util.regex.Pattern$Curly.match1(Unknown Source)
at java.util.regex.Pattern$Curly.match(Unknown Source)
at java.util.regex.Pattern$Slice.match(Unknown Source)
欢迎回溯避免 strategy/improvement 表达式。我试过原子组 (?>)
但由于某种原因没有简化。
代码如下:
public String strip(字符串文本) {
ArrayList<String> patterns=new ArrayList<String>();
patterns=readFile("Disabled_Features.txt");
for(int i = 0; i < patterns.size(); ++i)
{
Pattern todoPattern = Pattern.compile("^#ifdef "+patterns.get(i)+"((?:\r?\n(?!#endif (?:// end of )?"+patterns.get(i)+"$).*)*)\r?\n#endif (?:// end of )?"+patterns.get(i)+"$",Pattern.MULTILINE);
Matcher m = todoPattern.matcher(text);
text = m.replaceAll("");
}
return text;
}
我试过@Wiktor写的代码,效果很好
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TestRegex {
public static void main(String[] args) {
String text = "this is a junk line\n" +
"\n" +
"#ifdef FEATURE \n" +
"#endif // end of FEATURE\n" +
"\n" +
"this is a junk line\n" +
"\n" +
"#ifdef FEATURE\n" +
"\n" +
"this is a junk line that should be matched: HOLasduiqwhei & // FEATURE fjfefj #endif // h\n" +
"\n" +
"#endif FEATURE\n" +
"\n" +
"this is a junk line";
// this version does not use Pattern.MULTILINE, this should reduce the backtraking
Matcher matcher2 = Pattern.compile("\n#ifdef FEATURE((?:\r?\n(?!#endif (?:// end of )?FEATURE).*)*)\r?\n#endif (?:// end of )?FEATURE").matcher(text);
while (matcher2.find()) {
System.out.println(matcher2.group());
}
}
}
这让我认为您的问题是由于输入文件的大小造成的。
因此,如果您的文件太大,您可以将输入实现为 CharSequence
,这样您就可以包装大文本文件。为什么?因为从 Pattern
构建 Matcher
需要 CharSequence
作为参数。
更新:
我尝试实现了 Wiktor 的解决方案:
"^#ifdef "+patterns.get(i)+"((?:\r?\n(?!#endif (?:// end of )?"+patterns.get(i)+"$).*)*)\r?\n#endif (?:// end of )?"+patterns.get(i)+"$"
并且它仅捕获第二个块,但不捕获以下块:
#ifdef FEATURE
Junk captured text
#endif // end of FEATURE
无论如何,当我 运行 罐子仍然溢出时。