Regex (pcre) - Parse ini-like file, between different nested, unquoted delimiters

Question

View on Regex101: click here

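Given a heading, (optionally) a section name, and a field name, I want to read the field's value, somewhat like reading a value from an ini file.

For example, for the file below:
Given: heading: heading2, section: (empty), field: field1
Output: field names can be repeated among headings.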

Another example:
Given: heading: heading2, section: anothersection, field: field2
Output: Regex is harder when I add an @ in a multi-line string, or if I add backslash-escaped characters like \" and \'.

What happens if I have an empty line in a string?

Also, [this line] isn't actually a section.

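Another example:
Given: heading: aaaaaaa, section: (empty), field: bbbbb
Output: (no output; the specified heading, section, or field doesn't exist)

However, my file differs from an ini file. While ini files also have sections, mine is like a series of concatenated ini files, with @ .. lines separating them: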

@ heading1
field1: "single-line strings are quoted only sometimes."
field2: "strings that span
multiple lines
are always quoted."
field3: this single-line string is unquoted.

@ heading2
field1: field names can be repeated among headings.
field2: "Regex is harder when I add an
@ in a multi-line string, or if I add
backslash-escaped characters like \" and \'.

What happens if I have an empty line in a string?

Also,
[this line]
isn't actually a section."
field3: this field comes after field2
[sectionname]
field1: the same field name under a different section.
[anothersection]
field1: a second section under the same heading
field4: field number four

@ heading3
field1: value value value value value
field2: "quoted string
quoted string
quoted string"
unique: unique field name

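I want to be able to specify a heading, possibly a section name, and a field name. If the field exists under the specified heading, its value should end up in a capture group. I also want to match the entire heading block, whether or not a value is captured.

Here's how far I've gotten: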

^@ heading2$[\s\S]*?^(?:field2: \"?((?<=\")[^"\\]*(?:\\.[^"\\]*)*(?=\"$)|(?<!\").*(?!\")$)\"?$|)?$[\s\S]*?(?=@ [^\s]*?|\Z)
(with gm modifiers)

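This does most of what I want, including handling multi-line strings and backslash-escaped quotes.

However, I'm struggling with the following:

Capturing text between two different nested delimiters while ignoring delimiters that appear inside quotes.
In my case, I'm struggling to ignore quoted @ .. symbols and [sectionname]s.

Matching the entire text between two delimiters, but searching only the text between them.
In my case, that means searching for a field under a section/heading and, if the field doesn't exist, not overrunning into the next section/heading.

My current regex only avoids overrunning into the next section/heading because I used a lazy alternation wrapped in ^$, which matches an empty line. However, I can't rely on there being an empty line before each @ .. line. And if a quoted value contains an empty line, I can't search any of the fields that come after it.

I get the feeling I'm using regex wrong. Thanks for your help!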

Answer 1

I get the feeling I'm using regex wrong.

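Indeed! There's a much simpler way: use the regex to tokenize the input, then use some code to make sense of the tokens.

Here's a pattern for you (with the x flag; the ^ and $ anchors also need the m flag):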

^@\s*(?<heading>\w+)\s*?$
|^\[(?<section>\w+)\]\s*?$
|^(?<field>\w+):\s*(?:"(?<quotedstr>(?:\\.|[^\\"]++)*+)"|(?<barestr>.*?))\s*?$

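Now take a look at this demo:

See how the different colors line up with the different token types you can have. If you hover over the text on regex101, the tooltip even tells you the token type (the group name) and its value.

Each match can be:

a heading
a section
or a field with its value.

So just iterate over the matches while keeping track of state (the current heading and the current section), and you can easily apply whatever logic you want.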
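To make the iteration concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the only way to do it: the names TOKEN, SAMPLE and lookup are made up for this example, the named groups use Python's (?P<name>...) syntax, and the possessive quantifiers are replaced by ordinary ones because Python's re module only supports them from version 3.11. An empty section is represented as None, and backslash escapes are returned as written, without unescaping.

```python
import re

# Tokenizer adapted from the pattern above to Python's `re` syntax
# (assumption: plain quantifiers instead of possessive ones).
TOKEN = re.compile(
    r'''
      ^@\s*(?P<heading>\w+)\s*?$
    | ^\[(?P<section>\w+)\]\s*?$
    | ^(?P<field>\w+):\s*(?:"(?P<quotedstr>(?:\\.|[^"\\])*)"|(?P<barestr>.*?))\s*?$
    ''',
    re.MULTILINE | re.VERBOSE,
)

# A shortened stand-in for the file in the question.
SAMPLE = r'''@ heading1
field1: "quoted"
field3: bare value

@ heading2
field1: repeated name
field2: "has an
@ inside and \" escaped

[this line]
end"
field3: after field2
[anothersection]
field1: in a section
'''


def lookup(text, heading, section, field):
    """Return the value of `field` under `heading` / `section`, or None.

    Pass section=None for fields that sit directly under the heading.
    """
    cur_heading = cur_section = None
    for m in TOKEN.finditer(text):
        if m.group('heading') is not None:
            # A new heading resets the current section.
            cur_heading, cur_section = m.group('heading'), None
        elif m.group('section') is not None:
            cur_section = m.group('section')
        elif (cur_heading, cur_section, m.group('field')) == (heading, section, field):
            quoted = m.group('quotedstr')
            return quoted if quoted is not None else m.group('barestr')
    return None


print(lookup(SAMPLE, 'heading2', None, 'field1'))              # repeated name
print(lookup(SAMPLE, 'heading2', 'anothersection', 'field1'))  # in a section
print(lookup(SAMPLE, 'aaaaaaa', None, 'bbbbb'))                # None
```

Because a quoted field is consumed as a single token, the @ line, the empty line and the [this line] line inside field2 are never seen as delimiters, so field3 after it is still found.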


regex

parsing

pcre