Understanding the Differences Between VBScript and JavaScript Regex to Solve a SubMatch Issue

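I have a regex pattern that works perfectly in Python and various other languages, but it fails to capture the submatches I need when I implement it in VBScript regex (whose engine is apparently almost identical to JavaScript's). The pattern in question is below: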

"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

A sample test case is as follows:

email received 3/30/17:

Dear Sir,

Hello

Sincerely,

Mr. Robert Thomas
1104 Madison Avenue
New York, NY 10021


email received 3/30/17:

Dear Sir,

Hello

Sincerely,

Ms. Angela Carraway
402 Arlington Drive
Concord, MA 01742

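The objective is a global regex that extracts 5 subgroups from each match in this sample, where the variable keyword here is "Sincerely,". For the second email, the subgroups should be Ms. (first subgroup), Angela (second subgroup), Carraway (third subgroup), 402 Arlington Drive (fourth subgroup), and Concord, MA 01742 (fifth subgroup). In Python, the pattern matches all 5 groups perfectly in a regex tester, but in VBScript (JavaScript engine) it matches the entire string as one match with no subgroups at all. As a result, when I call the submatches in my Excel VBA macro to write them to cells, all the text ends up jumbled into a few cells.

What am I doing wrong? Am I missing some character that disables capturing of subgroups? If so, what are the key differences between the two engines, so that I can avoid this in the future, and how can I fix the pattern for this test case? I have tried reading online about the differences, but everything I found describes only minor differences that should not cause the problem I am seeing. Any help would be appreciated, as I cannot seem to isolate the difference or the problem. Thanks!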
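
For reference, here is a minimal sketch of the Python check described above. It assumes the tester received the sample with "\n" line endings, which may not be what the VBA macro sees. The names PATTERN, SAMPLE and extract_signatures are illustrative only and are not part of the macro.

```python
import re

# Pattern copied verbatim from the question.
PATTERN = r"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

# Sample input from the question. The "\n" line endings are an assumption
# about what the online tester received.
SAMPLE = (
    "email received 3/30/17:\n\nDear Sir,\n\nHello\n\nSincerely,\n\n"
    "Mr. Robert Thomas\n1104 Madison Avenue\nNew York, NY 10021\n\n\n"
    "email received 3/30/17:\n\nDear Sir,\n\nHello\n\nSincerely,\n\n"
    "Ms. Angela Carraway\n402 Arlington Drive\nConcord, MA 01742\n"
)


def extract_signatures(text):
    """Return one tuple of the five captured groups per match."""
    return re.findall(PATTERN, text)


if __name__ == "__main__":
    for groups in extract_signatures(SAMPLE):
        print(groups)
```

With this input, each signature block yields one tuple of five strings, which is the result I expected from the VBA macro as well.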

Edit: Here is the VBA code that uses the regex:

Sub regex()
    Dim docxinput As String
    Dim keyword As Variant
    Dim patterninput As Variant
    Dim pattern As String
    Dim regex As New RegExp

    docxinput = Application.GetOpenFilename(Title:="Step #1: Enter Word Document Input File Name")
        Dim wrdApp As Word.Application
        Dim wrdDoc As Word.Document
        Dim strInput As String

        Set wrdApp = CreateObject("Word.Application")
        wrdApp.Visible = False

        Set wrdDoc = wrdApp.Documents.Open(docxinput)
        strInput = wrdDoc.Range.Text

        Debug.Print (strInput)
        wrdDoc.Close 0
        Set wrdDoc = Nothing
        wrdApp.Quit
        Set wrdApp = Nothing

    pattern = "Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

    Dim objMatches As MatchCollection

    With regex
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
    End With

    Set objMatches = regex.Execute(strInput)
    Dim row As Variant

    Dim SubMatches As Variant
    row = 2
    For Each SubMatches In objMatches
        Cells(row, 1).Value = objMatches(0).SubMatches(0)
        Cells(row, 2).Value = objMatches(0).SubMatches(1)
        Cells(row, 3).Value = objMatches(0).SubMatches(2)
        Cells(row, 4).Value = objMatches(0).SubMatches(3)
        Cells(row, 5).Value = objMatches(0).SubMatches(4)
        row = row + 1
    Next
End Sub

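Here is a picture of the results. As you can see, the first two subgroups work, but then the regex (or at least I think it is the regex) hits a grouping error and dumps almost everything else into the next column. It then moves on to the fourth column and goes wrong there as well. Is this a problem with the loop in the code or with the regex itself? I have tried troubleshooting the code, but I cannot find any reason for the text being split incorrectly other than a problem with the regex. Any ideas?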

Your regex should run in VBA without problems... (tested here)

To get the desired groups in VBA, see how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops

Edit: For the following input:

email received 3/30/17:

Dear Sir,

Hello

Sincerely,

Mr. Robert Thomas
1104 Madison Avenue
New York, NY 10021


email received 3/30/17:

Dear Sir,

Hello

Sincerely,

Ms. Angela Carraway
402 Arlington Drive
Concord, MA 01742

placed in cell A1

and the VBA code:

(Note that I had to change your For Each loop so that it works with multiple matches.)

Sub myregex()
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5".
    Dim regex As New RegExp
    Dim objMatches As MatchCollection
    Dim objMatch As Match
    Dim Myrange As Range
    Dim C As Range
    Dim strInput As String
    Dim strPattern As String
    Dim row As Long

    strPattern = "Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

    With regex
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = strPattern
    End With

    Set Myrange = ActiveSheet.Range("A1:A1")
    For Each C In Myrange
        strInput = C.Value
        If regex.Test(strInput) Then
            Set objMatches = regex.Execute(strInput)
            row = 2
            ' Write the five submatches of each match to its own row.
            For Each objMatch In objMatches
                Cells(row, 1).Value = objMatch.SubMatches(0)
                Cells(row, 2).Value = objMatch.SubMatches(1)
                Cells(row, 3).Value = objMatch.SubMatches(2)
                Cells(row, 4).Value = objMatch.SubMatches(3)
                Cells(row, 5).Value = objMatch.SubMatches(4)
                row = row + 1
            Next objMatch
        Else
            C.Offset(0, 1) = "(Not matched)"
        End If
    Next C
End Sub

I get the following result:

     A      B       C           D                    E 
  2  Mr.    Robert  Thomas      1104 Madison Avenue  New York, NY 10021
  3  Ms.    Angela  Carraway    402 Arlington Drive  Concord, MA 01742
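
The loop change mentioned above can be illustrated outside VBA. The following Python sketch is an analogy only, using a shortened, hypothetical three-group pattern and made-up helper names. It contrasts a loop that always reads the first match, as the loop in the question does with objMatches(0), with a loop that reads the match belonging to the current iteration.

```python
import re

# Shortened pattern used only for this illustration (three groups, not five).
PATTERN = r"Sincerely,\s+([\w.]+)\s+(\w+)\s+(\w+)"

TEXT = (
    "Sincerely,\n\nMr. Robert Thomas\n\n"
    "Sincerely,\n\nMs. Angela Carraway\n"
)


def find_matches(text):
    """Collect all match objects, like regex.Execute with Global = True."""
    return list(re.finditer(PATTERN, text))


def rows_always_first(matches):
    # Mirrors the question's loop: every iteration reads match 0.
    return [matches[0].groups() for _ in matches]


def rows_per_match(matches):
    # Mirrors the changed loop: every iteration reads its own match.
    return [m.groups() for m in matches]


if __name__ == "__main__":
    found = find_matches(TEXT)
    print(rows_always_first(found))
    print(rows_per_match(found))
```

The first helper repeats the first signature on every row, while the second produces one distinct row per match, which is the behaviour shown in the table above.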

Conclusion: everything works as expected.
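
The test above reads the input from a worksheet cell, whereas the macro in the question reads it from a Word document. One possible explanation for the different outcomes is the line terminator; this is offered as an assumption and was not verified in this thread. If the Word text separates lines with a carriage return ("\r") instead of "\n", then a `.` that excludes only "\n" can run across several lines. The Python sketch below shows that effect in Python's own engine, where `.` matches "\r" by default. Whether the VBScript engine behaves identically is assumed here, not shown. LINE_SAFE_PATTERN is a hypothetical variant, not the pattern from the question.

```python
import re

# Pattern copied verbatim from the question.
PATTERN = r"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

# Hypothetical variant: "." replaced by a class that excludes both
# line terminators, so a group cannot run across a line break.
LINE_SAFE_PATTERN = (
    r"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+([^\r\n]+)"
    r"[\s\n]+(\d+\s[^\r\n]+)[\s\n]+([^\r\n]+)"
)

LF_TEXT = (
    "Sincerely,\n\nMr. Robert Thomas\n1104 Madison Avenue\nNew York, NY 10021\n\n\n"
    "Sincerely,\n\nMs. Angela Carraway\n402 Arlington Drive\nConcord, MA 01742\n"
)

# Same text with carriage returns only (assumed shape of the Word text).
CR_TEXT = LF_TEXT.replace("\n", "\r")

if __name__ == "__main__":
    # With "\n" terminators the original pattern finds both signatures.
    print(len(re.findall(PATTERN, LF_TEXT)))
    # With "\r" terminators the greedy "(.+)" groups span several lines.
    for groups in re.findall(PATTERN, CR_TEXT):
        print([g.replace("\r", "<CR>") for g in groups])
    # The variant keeps each group on a single line for either terminator.
    print(re.findall(LINE_SAFE_PATTERN, CR_TEXT))
```

If the input does turn out to contain carriage returns, printing the character codes of strInput would confirm it before changing the pattern.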