Regex for nested brackets (ignoring inner brackets and whitespace)
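I am trying to create a regex pattern that reads a BibTeX citation file and matches everything inside the brackets. For those who don't know, BibTeX citations look like this: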
@INPROCEEDINGS{Fogel95,
AUTHOR = {L. J. Fogel and P. J. Angeline and D. B. Fogel},
TITLE = {An evolutionary programming approach to self-adaptation
on finite state machines},
BOOKTITLE = {Proceedings of the Fourth International Conference on
Evolutionary Programming},
YEAR = {1995},
pages = {355--365}
}
@ARTICLE{Goldberg91,
AUTHOR = {D. Goldberg},
TITLE = {Real-coded genetic algorithms, virtual alphabets, and blocking},
JOURNAL = {Complex Systems},
YEAR = {1991},
pages = {139--167}
}
@INPROCEEDINGS{Yao96,
AUTHOR = {X. Yao and Y. Liu},
TITLE = {Fast evolutionary programming},
BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary
Programming},
YEAR = {1996},
pages = {451--460}
}
My current pattern is as follows:
@(\w+)\{(\w+),\s*((\w+)\s*=\s*(\"|\{)?(.+)(\"|\})?,?\s*)+\}
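This pattern matches the second citation, but only parts of the first and third. I know it fails on the third citation because of the brackets inside the field value (6$^{th}$), and I have found that it also fails on citations whose field values contain whitespace/newlines, for example: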
BOOKTITLE = {Proceedings of the Fourth International Conference on
Evolutionary Programming},
//This part of the citation has a newline in the middle of it.
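I have been trying to fix my pattern, but the trouble with regex is that the longer I spend fixing the expression and adding new conditions, the more confusing it gets. I just want to know how to capture an entire citation regardless of inner brackets/parentheses. Some citations have no brackets/parentheses at all after the "=" sign. Any help, along with an explanation, would be greatly appreciated. I have looked at similar examples, and they only confused me more, because a regex is hard to decipher at a glance. Thanks.
The simplest way to capture everything between curly braces is: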
\{([^}]+)}
The negated character class [^}] matches any character other than a closing brace, including newlines.
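As a rough illustration of what that pattern does and where it stops, here is a minimal Java sketch. The class name SimpleBraceDemo and the helper capture are made up for this example; only the pattern itself comes from the answer above. Note that [^}] also accepts an opening brace, so on a nested value the match ends at the first closing brace it meets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleBraceDemo {
    // The pattern from the answer, with the backslash doubled for a Java string literal.
    static final Pattern BRACES = Pattern.compile("\\{([^}]+)}");

    // Hypothetical helper: returns group 1 of every match, in order.
    static List<String> capture(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = BRACES.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Flat values: each brace pair is captured cleanly.
        System.out.println(capture("YEAR = {1995},\npages = {355--365}"));
        // Nested value: the capture is cut off at the first '}'.
        System.out.println(capture("BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference}"));
    }
}
```

So this pattern is fine for flat fields such as YEAR or pages, but it cannot return a whole value that itself contains braces.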
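Regular expressions are not well suited to text that contains nested blocks.
If you insist on using regex, you should match the outer block first: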
@INPROCEEDINGS{Fogel95,
???
}
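capturing the ??? part, so that you can then match the fields inside it in a nested loop.
The outer regex would be something like @(\w+)\{(\w+),([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}
The inner regex would be something like (\w+)\s*=\s*\{([^}]*)\}
Since a field value may be wrapped across multiple lines, you need to unwrap it.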
Code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Backslashes are doubled because these are Java string literals.
Pattern pTag = Pattern.compile("@(\\w+)" + // tag
                               "\\{" +
                               "(\\w+)" + // name
                               "," +
                               "([^{}]*(?:\\{[^{}]*\\}[^{}]*)*)" + // content (allows one level of inner braces)
                               "\\}");
Pattern pField = Pattern.compile("(\\w+)" + // field
                                 "\\s*=\\s*" +
                                 "\\{" +
                                 "([^}]*)" + // value
                                 "\\}");
// A run of line breaks plus surrounding whitespace (\R requires Java 8+)
Pattern pNewline = Pattern.compile("\\s*(?:\\R\\s*)+");
for (Matcher mTag = pTag.matcher(input); mTag.find(); ) {
    String tag = mTag.group(1);
    String name = mTag.group(2);
    String content = mTag.group(3);
    for (Matcher mField = pField.matcher(content); mField.find(); ) {
        String field = mField.group(1);
        String value = mField.group(2);
        // Unwrap values that span several lines
        value = pNewline.matcher(value).replaceAll(" ");
        System.out.printf("%-15s %-12s %-11s %s%n", tag, name, field, value);
    }
}
Test input
String input = "@INPROCEEDINGS{Fogel95,\n" +
" AUTHOR = {L. J. Fogel and P. J. Angeline and D. B. Fogel},\n" +
" TITLE = {An evolutionary programming approach to self-adaptation\n" +
" on finite state machines},\n" +
" BOOKTITLE = {Proceedings of the Fourth International Conference on\n" +
" Evolutionary Programming},\n" +
" YEAR = {1995},\n" +
" pages = {355--365}\n" +
"}\n" +
"\n" +
"@ARTICLE{Goldberg91,\n" +
" AUTHOR = {D. Goldberg},\n" +
" TITLE = {Real-coded genetic algorithms, virtual alphabets, and blocking},\n" +
" JOURNAL = {Complex Systems},\n" +
" YEAR = {1991},\n" +
" pages = {139--167}\n" +
"}\n" +
"\n" +
"@INPROCEEDINGS{Yao96,\n" +
" AUTHOR = {X. Yao and Y. Liu},\n" +
" TITLE = {Fast evolutionary programming},\n" +
" BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n" +
" Programming},\n" +
" YEAR = {1996},\n" +
" pages = {451--460}\n" +
"}";
Output
INPROCEEDINGS   Fogel95      AUTHOR      L. J. Fogel and P. J. Angeline and D. B. Fogel
INPROCEEDINGS   Fogel95      TITLE       An evolutionary programming approach to self-adaptation on finite state machines
INPROCEEDINGS   Fogel95      BOOKTITLE   Proceedings of the Fourth International Conference on Evolutionary Programming
INPROCEEDINGS   Fogel95      YEAR        1995
INPROCEEDINGS   Fogel95      pages       355--365
ARTICLE         Goldberg91   AUTHOR      D. Goldberg
ARTICLE         Goldberg91   TITLE       Real-coded genetic algorithms, virtual alphabets, and blocking
ARTICLE         Goldberg91   JOURNAL     Complex Systems
ARTICLE         Goldberg91   YEAR        1991
ARTICLE         Goldberg91   pages       139--167
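Note that the output above has no rows for Yao96. The outer pattern allows only one level of braces inside an entry, and the BOOKTITLE value {Proceedings of the 6$^{th}$ ...} adds a second level, so that entry is never matched. One possible way around this, sketched below, is to use regex only to find where an entry or a field starts and to locate the matching closing brace by counting depth. This is a swapped-in technique, not part of the answer's code; BraceScanner, findClose and parse are hypothetical names, and the sketch assumes every value is brace-delimited (quoted or bare values and escaped braces are not handled).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceScanner {
    static final Pattern HEADER = Pattern.compile("@(\\w+)\\{(\\w+),");
    static final Pattern FIELD = Pattern.compile("(\\w+)\\s*=\\s*\\{");

    // Returns the index of the '}' that closes the '{' at position open,
    // or -1 if the braces are unbalanced.
    static int findClose(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    // Returns one {tag, name, field, value} row per field found in the input.
    static List<String[]> parse(String input) {
        List<String[]> rows = new ArrayList<>();
        Matcher h = HEADER.matcher(input);
        int pos = 0;
        while (h.find(pos)) {
            int open = h.start(2) - 1;          // the '{' just before the entry name
            int close = findClose(input, open);
            if (close < 0) break;               // unbalanced entry: stop scanning
            String body = input.substring(h.end(), close);
            Matcher f = FIELD.matcher(body);
            int from = 0;
            while (f.find(from)) {
                int vOpen = f.end() - 1;        // the '{' that opens the value
                int vClose = findClose(body, vOpen);
                if (vClose < 0) break;
                // Unwrap values that span several lines.
                String value = body.substring(vOpen + 1, vClose)
                                   .replaceAll("\\s*\\R\\s*", " ");
                rows.add(new String[] {h.group(1), h.group(2), f.group(1), value});
                from = vClose + 1;              // skip past the value, nested braces included
            }
            pos = close + 1;
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "@INPROCEEDINGS{Yao96,\n" +
                " AUTHOR = {X. Yao and Y. Liu},\n" +
                " BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n" +
                " Programming},\n" +
                " YEAR = {1996}\n" +
                "}";
        for (String[] r : parse(input)) {
            System.out.printf("%-15s %-12s %-11s %s%n", r[0], r[1], r[2], r[3]);
        }
    }
}
```

Because the scanner jumps past each value once its closing brace is found, text inside a value that happens to look like a field assignment is never mistaken for a new field.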
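As far as I can tell, Andreas's solution is probably better, but if you just want a single regex string that breaks the whole citation into an array, you could use the following:
@(.*){(.*),\s*(.*?)\s*=\s*{(.*?)},(?:\s*(.*) =\s*{([\s\S]*?)},)*?(?:\s*?(.*?) =\s*?{(.*?)})*?\s*?}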