Parse EML text with regular expressions

Question

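Could you help me parse EML text with a regular expression?

I want to get separately:

1). The text between Content-Transfer-Encoding: base64 and --=_alternative, if the line above is Content-Type: text/html

2). The text between Content-Transfer-Encoding: base64 and --=_related, if the line two lines above is Content-Type: image/jpeg

Please see this piece of code in PowerShell: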

$text = @"
--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64

111111111111111111111111111111111111111111111111111111

--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64

222222222222222222222222222222222222222222222222222222
--=_alternative XXXXXXXXXXXXXX_=--
--=_related XXXXXXXXXXXXXX_=--_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64

333333333333333333333333333333333333333333333333333333
--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
444444444444444444444444444444444444444444444444444444

--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64

555555555555555555555555555555555555555555555555555555
--=_related XXXXXXXXXXXXXX_=--
"@

$regex1 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_alternative"
$text1 = ([regex]::Matches($text,$regex1) | foreach {$_.groups[1].value})
Write-Host "text1 : " -fore red
Write-Host  $text1

#I want to get as output elements (of array, maybe, or one after another)
#1). text between  Content-Transfer-Encoding: base64 and --=_alternative, if there is above line Content-Type: text/html
#this
#1111111111111111111111111111111111111111111111111111111
#then this
#2222222222222222222222222222222222222222222222222222222

$regex2 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_related"
$text2 = ([regex]::Matches($text,$regex2) | foreach {$_.groups[1].value})
#I want to get as output elements (of array, maybe, or one after another)
#2). text between  Content-Transfer-Encoding: base64 and --=_related, if there is two lines above line Content-Type: image/jpeg
#this
#3333333333333333333333333333333333333333333333333333333
#then this
#4444444444444444444444444444444444444444444444444444444
#then this
#5555555555555555555555555555555555555555555555555555555
Write-Host "text2 : " -fore red
Write-Host  $text2

Thank you for your help. Have a nice day.

P.S. Based on Jessie Westlake's code, here is a slightly edited version of the regex that worked for me:

$files = Get-ChildItem -Path "\<SERVER_NAME>\mailroot\Drop"
Foreach ($file in $files){
    $text = Get-Content $file.FullName

    $RegexText = '(?:Content-Type: text/html.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'
    $RegexImage = '(?:Content-Type: image/jpeg.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'

    $TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
    $ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)

    If ($TextMatches[0].Success)
    {
        Write-Host "Found $($TextMatches.Count) Text Matches:"
        Write-Output $TextMatches.ForEach({$_.Groups[1].Value})
    }
    If ($ImageMatches[0].Success)
    {
        Write-Host "Found $($ImageMatches.Count) Image Matches:"
        Write-Output $ImageMatches.ForEach({$_.Groups[1].Value})
    }
}

Answer 1

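TL;DR: Just go to the code at the bottom...

Please excuse the ugly code below.

Basically, I just created a regular expression that starts matching at Content-Type: text/html. It matches anything after that until it reaches a newline \n, a carriage return \r, or the two combined one after another, \r\n.

You have to wrap those in parentheses in order to use the or operator |. We don't actually want to capture/return any of these groups, so we use the non-capturing group syntax (?:text-to-match). As you can see, we use it in other places too. You can also nest capturing and non-capturing groups inside each other.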
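To make the grouping concrete, here is a minimal sketch (the sample strings and variable names are made up for illustration) showing that the non-capturing alternation accepts all three line-break styles and adds no numbered group to the match:

```powershell
# Hypothetical samples: "a" and "b" separated by LF, CR, and CRLF.
$samples = "a`nb", "a`rb", "a`r`nb"

# Non-capturing group: | picks one alternative, nothing is captured.
$lineBreak = '(?:\n|\r|\r\n)'
$pattern   = 'a' + $lineBreak + '+b'

$results = foreach ($s in $samples) {
    $m = [regex]::Match($s, $pattern)
    # Groups.Count counts group 0 (the whole match) plus any capturing groups.
    '{0} {1}' -f $m.Success, $m.Groups.Count
}
$results
```

Each sample should report a successful match with a group count of 1, because group 0 (the whole match) is the only group present.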

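Anyway, moving on. After matching the line break, we look for Content-Transfer-Encoding: base64. That seems to be required in every one of your examples.

After that we want to identify the next line break, just like last time. Except this time we want to match 1 or more by using +. The reason we need to match more than one is that sometimes the data you want to save has an extra line before it. But since sometimes there is no extra line before it, we make the quantifier "lazy" by putting a question mark after the plus sign: +?.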
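A small sketch of the lazy quantifier in action, using two invented inputs (one with a blank line before the data, one without). The capture group here is written as [^\r\n]+ for brevity, which is an assumption of this example rather than the exact group used in the answer:

```powershell
# Hypothetical inputs: one with a blank line before the data, one without.
$withBlank    = "base64`n`nAAAA"
$withoutBlank = "base64`nBBBB"

# +? takes as few line breaks as it can, but still expands when the
# capture group that follows cannot start yet.
$pattern = 'base64(?:\n|\r|\r\n)+?([^\r\n]+)'

$captured = foreach ($s in $withBlank, $withoutBlank) {
    [regex]::Match($s, $pattern).Groups[1].Value
}
$captured
```

In this particular pattern a greedy + would capture the same text, since the capture group cannot begin on a line-break character either way; the lazy form mainly documents the intent.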

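Next is the part where we capture your actual data. This is the first time we use an actual capturing group rather than a non-capturing group (i.e. no question mark followed by a colon).

We want to capture anything that is not a line break, because it seems that sometimes your data is followed by a blank line and sometimes it is not. Not allowing the capture to contain any line breaks also forces the previous group to gobble up any extra line breaks before the data. That capture group is ([^(?:\n|\n\r)]+)

What we did there was wrap the expression in parentheses in order to capture it. We put the expression inside square brackets because we want to create our own "class" of characters. Any of the characters inside square brackets would be what the pattern looks for. The difference with ours is that we put the caret ^ as the first character inside the brackets, which means "none of these characters". Note that group syntax has no special meaning inside square brackets, so (, ?, :, | and ) are treated as literal characters and are excluded as well; that is harmless for base64 data, and [^\r\n]+ expresses the same intent more directly. We want to match everything up to the next line, so we capture anything that is not a line break, one or more times, as many times as possible.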
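The behaviour of the bracket expression can be checked with a short sketch. The sample strings are hypothetical; the first pattern is the class used in this answer and the second is a plainer negated class shown only for comparison:

```powershell
# The class as written in the answer, and a plainer equivalent for base64 text.
$original = '[^(?:\n|\n\r)]+'
$simpler  = '[^\r\n]+'

# Hypothetical inputs: base64-looking lines, and a line containing parentheses.
$base64Lines = "QUJD`r`nREVG"
$withParens  = "abc(def)`r`nghi"

$r1 = [regex]::Match($base64Lines, $original).Value   # stops at the line break
$r2 = [regex]::Match($base64Lines, $simpler).Value    # same result
$r3 = [regex]::Match($withParens,  $original).Value   # also stops at "("
$r4 = [regex]::Match($withParens,  $simpler).Value    # keeps the parentheses
$r1, $r2, $r3, $r4
```

Both classes agree on base64 text because the base64 alphabet contains none of the extra excluded characters; they differ only on input that contains (, ?, :, | or ).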

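Then we make sure our regex is anchored to some ending text, so we keep matching. Starting with another line break, we match at least one, but as few as possible for our capture to succeed: (?:\n|\r|\r\n)+?. The pattern then requires one more line break, (?:\n|\r|\r\n), so at least two line-break characters must sit between the data and the boundary; that is the case with CRLF line endings, or when a blank line follows the data.

Finally, we identify where we can stop looking for the important data. That is --=_. I wasn't sure whether we would stumble on the word "alternative" or "related", so I didn't go that far. Now we're done.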

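The key to everything

Without the regex "Singleline" mode we would not be able to match across line breaks, because . does not match \n by default. One way to turn it on is to call the .NET Regex class directly and pass a value from the [System.Text.RegularExpressions.RegexOptions] enumeration (the inline (?s) flag used in the question does the same thing). The options relevant here are "Singleline" and "Multiline".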
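The effect of the option is easy to see on a two-line sample. This is a minimal sketch with an invented input and a shortened pattern, not the full expression from this answer:

```powershell
# Hypothetical two-line sample with a CRLF between the headers.
$sample  = "Content-Type: text/html`r`nContent-Transfer-Encoding: base64"
$pattern = 'text/html.+?base64'

# Default options: "." does not match "\n", so the match cannot cross lines.
$plain  = [regex]::Match($sample, $pattern)

# Singleline passed as a RegexOptions value.
$single = [regex]::Match($sample, $pattern,
    [System.Text.RegularExpressions.RegexOptions]::Singleline)

# The same option written inline at the start of the pattern.
$inline = [regex]::Match($sample, '(?s)' + $pattern)

$plain.Success, $single.Success, $inline.Success
```

Only the first call should fail; the other two switch "." to match every character, including \n.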

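I created a separate regular expression for the text/html search and for the image/jpeg search. We save the results of those matches to their own variables.

We can test whether a match succeeded by indexing into index 0, which holds the first match object, and reading its .Success property, which returns a Boolean. The number of matches is available through the .Count property. To reach specific groups and captures, we use dot notation once we know the right capture group index. Because we use only one capturing group and the rest are non-capturing, index [0] holds the whole matched text and [1] should contain our capture group's match. Because it is an object, we have to read its Value property.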
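As a rough illustration of those properties, here is a sketch on an unrelated toy string; the input, the pattern, and the variable names are all assumptions made for the example:

```powershell
# Hypothetical input with two "key=value;" pairs.
$sample = 'key=alpha;key=beta;'
$found  = [regex]::Matches($sample, 'key=([^;]+);')

$count   = $found.Count                 # number of matches
$success = $found[0].Success            # Boolean on the first match
$whole   = $found[0].Groups[0].Value    # group 0: the entire matched text
$first   = $found[0].Groups[1].Value    # group 1: the only capturing group

# Collect group 1 from every match.
$values = foreach ($m in $found) { $m.Groups[1].Value }

# A pattern that matches nothing yields an empty collection.
$none = [regex]::Matches($sample, 'missing')

$count, $success, $whole, $first, $none.Count
$values
```

Testing .Count -gt 0 is an alternative to reading [0].Success that does not depend on indexing into a possibly empty collection.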

Obviously the code below needs your $text variable to contain the data to search.

$RegexText = '(?:Content-Type: text/html.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'
$RegexImage = '(?:Content-Type: image/jpeg.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'

$TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
$ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)

If ($TextMatches[0].Success)
{
    Write-Host "Found $($TextMatches.Count) Text Matches:"
    Write-Output $TextMatches.ForEach({$_.Groups[1].Value})
}
If ($ImageMatches[0].Success)
{
    Write-Host "Found $($ImageMatches.Count) Image Matches:"
    Write-Output $ImageMatches.ForEach({$_.Groups[1].Value})
}

The code above results in the following screen output:

Found 2 Text Matches:
111111111111111111111111111111111111111111111111111111
222222222222222222222222222222222222222222222222222222
Found 3 Image Matches:
333333333333333333333333333333333333333333333333333333
444444444444444444444444444444444444444444444444444444
555555555555555555555555555555555555555555555555555555
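As a variation, and not the method used above, the line-break handling can be written with \r?\n so that the same pattern works for LF and CRLF input. The sketch below is only an assumption-laden example: Get-Base64Part is a hypothetical helper name, and the sample is a shortened copy of the question's text joined with LF so that the result does not depend on how a script file is saved.

```powershell
# Shortened sample from the question, joined with LF on purpose.
$text = @(
    '--=_alternative XXXXXXXXXXXXXX_=',
    'Content-Type: text/html; charset="KOI8-R"',
    'Content-Transfer-Encoding: base64',
    '',
    '1111',
    '',
    '--=_alternative XXXXXXXXXXXXXX_=',
    'Content-Type: text/html; charset="KOI8-R"',
    'Content-Transfer-Encoding: base64',
    '',
    '2222',
    '--=_alternative XXXXXXXXXXXXXX_=--',
    '--=_related XXXXXXXXXXXXXX_=',
    'Content-Type: image/jpeg',
    'Content-ID: <_2_XXXXXXXXXXXXXX>',
    'Content-Transfer-Encoding: base64',
    '3333',
    '--=_related XXXXXXXXXXXXXX_=--'
) -join "`n"

# Hypothetical helper: returns the single data line of each part whose
# Content-Type header starts with the given type.
function Get-Base64Part {
    param([string]$Text, [string]$ContentType)

    $pattern = 'Content-Type: ' + [regex]::Escape($ContentType) +
        '.+?Content-Transfer-Encoding: base64' +
        '(?:\r?\n)+([^\r\n]+)(?:\r?\n)+--=_'
    $options = [System.Text.RegularExpressions.RegexOptions]::Singleline

    foreach ($m in [regex]::Matches($Text, $pattern, $options)) {
        $m.Groups[1].Value
    }
}

# @() keeps the result an array even when only one part is found.
$html  = @(Get-Base64Part -Text $text -ContentType 'text/html')
$image = @(Get-Base64Part -Text $text -ContentType 'image/jpeg')
$html
$image
```

Like the patterns above, this captures a single line of data per part, so base64 bodies that wrap across several lines would need a different capture group.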

regex

powershell

parsing

eml

email-parsing