RegEx to parse the MIME message bodies in between. How?

I am writing an IMAP4 backup application. After a lot of research, I found the correct IMAP command to return all messages or a range of messages.

SS01 UID FETCH 1:* BODY[]

This nice command returns data in the following format:

* 1 FETCH (UID 2 BODY[] {7765}
data to be extracted
from here! which can possibly contain
) <--- one or more prior to its final...
)
* 2 FETCH (UID 3 BODY[] {443}
data to be extracted
from here! which can possibly contain
) <--- one or more prior to its final...
)
* 3 FETCH (UID 4 BODY[] {4432}
data to be extracted
from here! which can possibly contain
) <--- one or more prior to its final...
)
* 4 FETCH (UID 5 BODY[] {123}
data to be extracted
from here! which can possibly contain
) <--- one or more prior to its final...
)
SS01 OK Success

The only distinctive patterns I can find in this text are:

The first message begins with:

* 1 FETCH (UID 2 BODY[] {7765}

Every message that is not the last one ends with:
)
* 2 FETCH (UID 3 BODY[] {443}

The last message ends with:
)
SS01 OK Success

I found the following example on a website and have been trying to adapt it, without success.

The RegEx pattern is:

(?<=This is)(.*)(?=sentence)
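For reference, here is a small sketch of how that lookaround pattern behaves. The sample strings are made up for illustration and are not from the website. It also shows that `.` does not cross line breaks unless the `s` (dotAll) flag is set, which matters for multi-line message bodies.

```javascript
// Hypothetical sample strings, chosen only to illustrate the lookarounds.
const oneLine = "This is a simple sentence";
const twoLines = "This is\nsome\nsentence";

// Lookbehind and lookahead assert the markers without consuming them,
// so group 1 holds only the text strictly between them.
const between = oneLine.match(/(?<=This is)(.*)(?=sentence)/)[1];
console.log(JSON.stringify(between)); // " a simple "

// Without the s flag, "." stops at a line break, so nothing matches here.
const multiLineDefault = twoLines.match(/(?<=This is)(.*)(?=sentence)/);
console.log(multiLineDefault); // null

// With the s flag, "." also matches line breaks.
const multiLineDotAll = twoLines.match(/(?<=This is)(.*)(?=sentence)/s);
console.log(JSON.stringify(multiLineDotAll[1])); // "\nsome\n"
```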

Here is a minimal reproducible example that does not work:

(\*\s\d+\s\w+\s\(UID\s\d+\sBODY\[\]\s\{\d+\})(.*\n)(\)\n\*\s\d+\s\w+\s\(UID\s\d+\sBODY\[\]\s\{\d+\})
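A quick check suggests why this attempt finds nothing: `(.*\n)` covers exactly one line, and since it starts right after `{7765}`, that one line is just the line break ending the header. The body lines are never matched. The sketch below assumes JavaScript and LF line endings. The `multiLine` variant is only an illustration of the idea and is not part of the original post.

```javascript
// The pattern from the question, unchanged.
const attempt = /(\*\s\d+\s\w+\s\(UID\s\d+\sBODY\[\]\s\{\d+\})(.*\n)(\)\n\*\s\d+\s\w+\s\(UID\s\d+\sBODY\[\]\s\{\d+\})/;

const twoMessages =
  "* 1 FETCH (UID 2 BODY[] {7765}\n" +
  "data to be extracted\n" +
  "from here!\n" +
  ")\n" +
  "* 2 FETCH (UID 3 BODY[] {443}\n";

// (.*\n) consumes only the newline after the header, then ")" is expected
// but "data..." follows, so the whole match fails.
const attemptMatches = attempt.test(twoMessages);
console.log(attemptMatches); // false

// Illustrative variant: let the middle group span any number of lines, lazily.
const multiLine = /(\*\s\d+\s\w+\s\(UID\s\d+\sBODY\[\]\s\{\d+\})\n((?:.*\n)*?)(\)\n\*\s\d+\s\w+\s\(UID\s\d+\sBODY\[\]\s\{\d+\})/;
const capturedBody = multiLine.exec(twoMessages)[2];
console.log(JSON.stringify(capturedBody)); // "data to be extracted\nfrom here!\n"
```

Note that this variant still consumes the next message's header as group 3, so a global search would skip every second message; a lookahead avoids that.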

You can greatly simplify your regex like this:

\{\d+\}$[\r\n]+([\s\S]+?)^\)$
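  • \{\d+\}$ - finds {digits} at the end of a line
  • [\r\n]+ - matches one or more line-break characters
  • ([\s\S]+?) - lazily captures the desired text, up to what follows (see the next bullet)
  • ^\)$ - finds a line that consists only of a closing parenthesis )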

The text you want will be in capture group #1.

https://regex101.com/r/A86eEv/1/

var regex = /\{\d+\}$[\r\n]+([\s\S]+?)^\)$/gm;

var text = `* 1 FETCH (UID 2 BODY[] {7765}
data to be extracted
from here!
)
* 2 FETCH (UID 3 BODY[] {443}
data to be extracted
from here!
)
* 3 FETCH (UID 4 BODY[] {4432}
data to be extracted
from here!
)
* 4 FETCH (UID 5 BODY[] {123}
data to be extracted
from here!
)
SS01 OK Success`;

var matches = [...text.matchAll(regex)];
console.log(Array.from(matches,x => x[1].trim()));
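One caveat: this pattern ends a body at the first line that contains only `)`, so a message that itself has such a line would be cut short. In IMAP the number in braces is the literal's size in octets, so a length-based reader is a possible alternative to a delimiter-based regex. The sketch below is not the approach of this answer. It assumes Node.js, a complete response held in a `Buffer`, CRLF line endings as sent on the wire, and the exact attribute order shown in the question. `extractBodies` is a hypothetical helper name.

```javascript
// Sketch: split a FETCH response by IMAP literal length instead of by delimiter.
// Assumption: `raw` is a Node.js Buffer containing the whole server response.
function extractBodies(raw) {
  const bodies = [];
  // latin1 maps every byte to exactly one character, so string indices equal byte offsets.
  const text = raw.toString("latin1");
  const header = /\* \d+ FETCH \(UID (\d+) BODY\[\] \{(\d+)\}\r\n/g;
  let m;
  while ((m = header.exec(text)) !== null) {
    const start = m.index + m[0].length; // first byte of the literal
    const size = Number(m[2]);           // octet count announced by the server
    bodies.push({ uid: Number(m[1]), body: raw.subarray(start, start + size) });
    header.lastIndex = start + size;     // jump over the body so its content is never scanned
  }
  return bodies;
}

// Usage with a made-up two-message response; the sizes are computed here, not taken from the question.
const body1 = "first (body)\r\n)\r\nstill the first body";
const body2 = "second body";
const raw = Buffer.from(
  `* 1 FETCH (UID 2 BODY[] {${Buffer.byteLength(body1, "latin1")}}\r\n${body1})\r\n` +
  `* 2 FETCH (UID 3 BODY[] {${Buffer.byteLength(body2, "latin1")}}\r\n${body2})\r\n` +
  "SS01 OK Success\r\n",
  "latin1"
);
const parsed = extractBodies(raw);
console.log(parsed.map(p => [p.uid, p.body.toString("latin1")]));
```

Because the reader skips exactly the announced number of bytes, a `)` line or even a fake `* n FETCH` header inside a message cannot confuse it.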

You can use

/\* \d+ FETCH \(UID \d+ BODY\[] {\d+}\s*([\s\S]*?)(?=\)[\r\n]+(?:\* \d+ FETCH \(UID \d+ BODY\[] {\d+}|SS01 OK Success))/g

See the regex demo. Alternatively, if you do not need to check all the context so thoroughly, use

/{\d+}\s*([\s\S]*?)(?=\))/g

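Details:

  • \* \d+ FETCH \(UID \d+ BODY\[] {\d+} - *, a space, one or more digits, a space, FETCH, a space, (UID, a space, one or more digits, a space, BODY[], a space, {, one or more digits, }
  • \s* - zero or more whitespace characters
  • ([\s\S]*?) - Group 1 (the value you need): any zero or more characters, as few as possible
  • (?=\)[\r\n]+(?:\* \d+ FETCH \(UID \d+ BODY\[] {\d+}|SS01 OK Success)) - a positive lookahead that requires the following sequence of patterns immediately to the right of the current location:
    • \) - a ) character
    • [\r\n]+ - one or more CR or LF characters
    • (?:\* \d+ FETCH \(UID \d+ BODY\[] {\d+}|SS01 OK Success) - either of the two:
      • \* \d+ FETCH \(UID \d+ BODY\[] {\d+} - *, a space, one or more digits, a space, FETCH, a space, (UID, a space, one or more digits, a space, BODY[], a space, {, one or more digits, }
      • | - or
      • SS01 OK Success - the literal string SS01 OK Success.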

JavaScript demo:

const rx = /\* \d+ FETCH \(UID \d+ BODY\[] {\d+}\s*([\s\S]*?)(?=\)[\r\n]+(?:\* \d+ FETCH \(UID \d+ BODY\[] {\d+}|SS01 OK Success))/g;
const text = '* 1 FETCH (UID 2 BODY[] {7765}\ndata to be extracted\nfrom here!\n)\n* 2 FETCH (UID 3 BODY[] {443}\ndata to be extracted\nfrom here!\n)\n* 3 FETCH (UID 4 BODY[] {4432}\ndata to be extracted\nfrom here!\n)\n* 4 FETCH (UID 5 BODY[] {123}\ndata to be extracted\nfrom here!\n)\nSS01 OK Success';
const matches = [...text.matchAll(rx)];
console.log(Array.from(matches,x => x[1].trim()));

// Or, with the simplified regex:
console.log(
   Array.from(text.matchAll(/{\d+}\s*([\s\S]*?)(?=\))/g), x => x[1].trim())
)
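The two patterns give the same result on the sample above, but they differ once a body contains a `)` of its own, which the question says can happen. The simplified pattern stops at the first `)` it sees, while the context-checking pattern only stops at a `)` followed by the next FETCH header or the final status line. A small sketch with a made-up input (the `tricky` string is an assumption, not real server output):

```javascript
// Context-checking pattern and simplified pattern, with braces escaped for clarity.
const strict = /\* \d+ FETCH \(UID \d+ BODY\[\] \{\d+\}\s*([\s\S]*?)(?=\)[\r\n]+(?:\* \d+ FETCH \(UID \d+ BODY\[\] \{\d+\}|SS01 OK Success))/g;
const loose = /\{\d+\}\s*([\s\S]*?)(?=\))/g;

// Made-up response whose first body contains parentheses.
const tricky =
  "* 1 FETCH (UID 2 BODY[] {40}\n" +
  "foo (bar)\n" +
  ") not the end\n" +
  ")\n" +
  "* 2 FETCH (UID 3 BODY[] {5}\n" +
  "hello\n" +
  ")\n" +
  "SS01 OK Success";

const strictBodies = Array.from(tricky.matchAll(strict), x => x[1].trim());
const looseBodies = Array.from(tricky.matchAll(loose), x => x[1].trim());

console.log(strictBodies); // [ 'foo (bar)\n) not the end', 'hello' ]
console.log(looseBodies);  // [ 'foo (bar', 'hello' ]
```

So the shorter pattern is only safe when the message bodies are known not to contain `)`.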