Regex nested brackets (ignoring inside brackets and whitespace)

Question

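I am trying to create a regex pattern that reads a BibTeX citation file and matches everything inside the brackets. For those who don't know, BibTeX citations look like this: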

@INPROCEEDINGS{Fogel95,
  AUTHOR =       {L. J. Fogel and P. J. Angeline and D. B. Fogel},
  TITLE =        {An evolutionary programming approach to self-adaptation
                    on finite state machines},
  BOOKTITLE =    {Proceedings of the Fourth International Conference on
                    Evolutionary Programming},
  YEAR =         {1995},
  pages =        {355--365}
}

@ARTICLE{Goldberg91,
  AUTHOR =       {D. Goldberg},
  TITLE =        {Real-coded genetic algorithms, virtual alphabets, and blocking},
  JOURNAL =      {Complex Systems},
  YEAR =         {1991},
  pages =        {139--167}
}

@INPROCEEDINGS{Yao96,
  AUTHOR =       {X. Yao and Y. Liu},
  TITLE =        {Fast evolutionary programming},
  BOOKTITLE =    {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary
                    Programming},
  YEAR =         {1996},
  pages =        {451--460}
}

My current pattern is as follows:

@(\w+)\{(\w+),\s*((\w+)\s*=\s*(\"|\{)?(.+)(\"|\})?,?\s*)+\}

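This pattern matches the second citation, but only parts of the first and third. I know it fails on the third citation because of the braces inside the value (6$^{th}$), and I have found that it also fails on citations that have whitespace or newlines inside a field value: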

BOOKTITLE =    {Proceedings of the Fourth International Conference on
                Evolutionary Programming},
//This part of the citation has a newline in the middle of it.

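I have been trying to fix my pattern for a while, but the trouble with regex is that the longer I spend fixing the expression and adding new conditions, the more confusing it gets. I just want to know how to capture an entire citation regardless of inner brackets or parentheses. Some citations have no brackets or parentheses after the "=" sign at all. Any help, along with an explanation, would be greatly appreciated. The similar examples I have looked at only confused me more, because a regex is hard to decipher at a glance. Thank you.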

Answer 1

The simplest way to capture everything between curly braces is:

\{([^}]+)}

The negated class [^}] matches any character other than a closing brace, including newlines.
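To see how this pattern behaves on the citation data, here is a minimal Java sketch. The class and method names are illustrative and not part of the answer. It suggests that a value wrapped over several lines is captured whole, while a value that itself contains braces is cut off at the first closing brace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleBraceMatch {
    // The pattern from this answer. In a Java string literal the backslash
    // is doubled; the closing brace needs no escape.
    static final Pattern BRACES = Pattern.compile("\\{([^}]+)}");

    /** Returns the first brace-delimited capture in text, or null if there is none. */
    static String firstCapture(String text) {
        Matcher m = BRACES.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A value wrapped over two lines: the capture includes the newline.
        System.out.println(firstCapture("TITLE = {An approach\n   on machines},"));
        // A value with an inner group: the capture stops at the first '}'.
        System.out.println(firstCapture("BOOKTITLE = {the 6$^{th}$ Annual Conference},"));
    }
}
```

So the pattern fits flat values, but it is probably not enough on its own for entries like the third citation.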

Answer 2

Regular expressions are not well suited to text that contains nested blocks.

If you insist on using regex, match the outer block first:

@INPROCEEDINGS{Fogel95,
  ???
}

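Capture the ??? part so that you can match against it in a nested loop.

The outer regex would be something like @(\w+)\{(\w+),([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}

The inner regex would be something like (\w+)\s*=\s*\{([^}]*)\}

Since a field value may be wrapped across multiple lines, you need to unwrap it.

Code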

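import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Outer pattern: one whole entry. The content group allows a single level of nested braces.
Pattern pTag = Pattern.compile("@(\\w+)" + // tag
                               "\\{" +
                                  "(\\w+)" + // name
                                  "," +
                                  "([^{}]*(?:\\{[^{}]*\\}[^{}]*)*)" + // content
                               "\\}");
// Inner pattern: one brace-delimited field inside the captured content.
Pattern pField = Pattern.compile("(\\w+)" + // field
                                 "\\s*=\\s*" +
                                 "\\{" +
                                    "([^}]*)" + // value
                                 "\\}");
// Whitespace runs containing at least one line break, used to unwrap multi-line values.
Pattern pNewline = Pattern.compile("\\s*(?:\\R\\s*)+");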
for (Matcher mTag = pTag.matcher(input); mTag.find(); ) {
    String tag = mTag.group(1);
    String name = mTag.group(2);
    String content = mTag.group(3);
    for (Matcher mField = pField.matcher(content); mField.find(); ) {
        String field = mField.group(1);
        String value = mField.group(2);
        value = pNewline.matcher(value).replaceAll(" ");
        System.out.printf("%-15s %-12s %-11s %s%n", tag, name, field, value);
    }
}

Test input

String input = "@INPROCEEDINGS{Fogel95,\n" +
               "  AUTHOR =       {L. J. Fogel and P. J. Angeline and D. B. Fogel},\n" +
               "  TITLE =        {An evolutionary programming approach to self-adaptation\n" +
               "                    on finite state machines},\n" +
               "  BOOKTITLE =    {Proceedings of the Fourth International Conference on\n" +
               "                    Evolutionary Programming},\n" +
               "  YEAR =         {1995},\n" +
               "  pages =        {355--365}\n" +
               "}\n" +
               "\n" +
               "@ARTICLE{Goldberg91,\n" +
               "  AUTHOR =       {D. Goldberg},\n" +
               "  TITLE =        {Real-coded genetic algorithms, virtual alphabets, and blocking},\n" +
               "  JOURNAL =      {Complex Systems},\n" +
               "  YEAR =         {1991},\n" +
               "  pages =        {139--167}\n" +
               "}\n" +
               "\n" +
               "@INPROCEEDINGS{Yao96,\n" +
               "  AUTHOR =       {X. Yao and Y. Liu},\n" +
               "  TITLE =        {Fast evolutionary programming},\n" +
               "  BOOKTITLE =    {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n" +
               "                    Programming},\n" +
               "  YEAR =         {1996},\n" +
               "  pages =        {451--460}\n" +
               "}";

Output

INPROCEEDINGS   Fogel95      AUTHOR      L. J. Fogel and P. J. Angeline and D. B. Fogel
INPROCEEDINGS   Fogel95      TITLE       An evolutionary programming approach to self-adaptation on finite state machines
INPROCEEDINGS   Fogel95      BOOKTITLE   Proceedings of the Fourth International Conference on Evolutionary Programming
INPROCEEDINGS   Fogel95      YEAR        1995
INPROCEEDINGS   Fogel95      pages       355--365
ARTICLE         Goldberg91   AUTHOR      D. Goldberg
ARTICLE         Goldberg91   TITLE       Real-coded genetic algorithms, virtual alphabets, and blocking
ARTICLE         Goldberg91   JOURNAL     Complex Systems
ARTICLE         Goldberg91   YEAR        1991
ARTICLE         Goldberg91   pages       139--167
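
Note that the Yao96 entry is missing from the output above: its BOOKTITLE value contains {th}, a second level of nesting inside the entry, while the outer pattern permits only one. If deeper nesting, quoted values, or bare values (no brackets after the "=") matter, one option is to count brace depth by hand instead of using a regex for the structure. The following is a rough sketch under those assumptions and is not part of the original answer. The class BibFieldScanner, its methods, and the NOTE field in the sample are hypothetical, and the sketch ignores escaped braces and BibTeX string concatenation with #.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BibFieldScanner {
    /** Index of the '}' that closes the '{' at position open, or -1 if unbalanced. */
    static int findClosingBrace(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return i;
        }
        return -1;
    }

    /** Parses "NAME = value" pairs from an entry body (the text after "key,"). */
    static Map<String, String> parseFields(String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        int i = 0, n = body.length();
        while (i < n) {
            int eq = body.indexOf('=', i);
            if (eq < 0) break;
            String name = body.substring(i, eq).trim();
            int v = eq + 1;
            while (v < n && Character.isWhitespace(body.charAt(v))) v++;
            if (v >= n) break;
            int end;
            String value;
            char c = body.charAt(v);
            if (c == '{') {                       // braced value, any nesting depth
                end = findClosingBrace(body, v);
                if (end < 0) break;
                value = body.substring(v + 1, end);
                end++;
            } else if (c == '"') {                // quoted value
                end = body.indexOf('"', v + 1);
                if (end < 0) break;
                value = body.substring(v + 1, end);
                end++;
            } else {                              // bare value, e.g. YEAR = 1996
                end = body.indexOf(',', v);
                if (end < 0) end = n;
                value = body.substring(v, end).trim();
            }
            // Unwrap values that span several lines into a single line.
            fields.put(name, value.replaceAll("\\s*\\R\\s*", " "));
            int comma = body.indexOf(',', end);
            i = comma < 0 ? n : comma + 1;
        }
        return fields;
    }

    public static void main(String[] args) {
        // BOOKTITLE is taken from the question; NOTE is a made-up quoted field.
        String entry = "@INPROCEEDINGS{Yao96,\n"
                + "  BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n"
                + "                    Programming},\n"
                + "  NOTE = \"hypothetical quoted value\",\n"
                + "  YEAR = 1996\n"
                + "}";
        int open = entry.indexOf('{');
        int close = findClosingBrace(entry, open);
        String inner = entry.substring(open + 1, close);
        int comma = inner.indexOf(',');
        String key = inner.substring(0, comma).trim();
        parseFields(inner.substring(comma + 1))
                .forEach((name, value) -> System.out.println(key + " " + name + " = " + value));
    }
}
```

The depth counter replaces the fixed-depth alternation in the outer pattern, so the number of nesting levels no longer has to be known in advance.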

Answer 3

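As far as I can tell, Andreas's solution is probably better, but if you just want a single regex string that breaks the whole citation into an array, you can use the following:

@(.*){(.*),\s*(.*?)\s*=\s*{(.*?)},(?:\s*(.*) =\s*{([\s\S]*?)},)*?(?:\s*?(.*?) =\s*?{(.*?)})*?\s*?}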
