Understanding the Differences in VBScript/JavaScript Regex to Solve a SubMatch Issue
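I have a regex pattern that works perfectly in Python and various other languages, but it fails to capture the submatches I need in VBScript regex (whose engine is apparently almost identical to JavaScript's). The pattern in question is: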
"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"
The sample test case is as follows:
email received 3/30/17:
Dear Sir,
Hello
Sincerely,
Mr. Robert Thomas
1104 Madison Avenue
New York, NY 10021
email received 3/30/17:
Dear Sir,
Hello
Sincerely,
Ms. Angela Carraway
402 Arlington Drive
Concord, MA 01742
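The objective is a global regex that extracts 5 subgroups from each match in this sample, where the variable keyword is "Sincerely,". For the second signature, the subgroups should be Ms. (first subgroup), Angela (second subgroup), Carraway (third subgroup), 402 Arlington Drive (fourth subgroup), and Concord, MA 01742 (fifth subgroup). In Python, a regex tester matches all 5 groups perfectly, but with VBScript (the JavaScript engine) it matches the whole string as one match with no subgroups at all. As a result, when I call the submatches in an Excel VBA macro to write them to cells, all the text ends up jumbled across a few cells. What exactly am I doing wrong? Am I missing some character that disables capturing subgroups? If so, what are the key differences between the two engines so that I can avoid this in the future, and how can the pattern be fixed for this test case? I've tried reading about the differences online, but everything mentioned seems to be a minor difference that shouldn't cause the problem I'm seeing. Any help would be greatly appreciated, as I can't seem to isolate the difference/problem. Thanks!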
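As a quick check of the Python side of this claim, here is a minimal sketch that runs the same pattern with Python's `re` module over the sample text. It assumes the lines are separated by `\n`; the names `sample`, `pattern` and `groups` are introduced here only for illustration.

```python
import re

# Sample text from the question, assuming "\n" line endings.
sample = "\n".join([
    "email received 3/30/17:",
    "Dear Sir,",
    "Hello",
    "Sincerely,",
    "Mr. Robert Thomas",
    "1104 Madison Avenue",
    "New York, NY 10021",
    "email received 3/30/17:",
    "Dear Sir,",
    "Hello",
    "Sincerely,",
    "Ms. Angela Carraway",
    "402 Arlington Drive",
    "Concord, MA 01742",
])

# The pattern exactly as posted in the question.
pattern = r"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

# With capture groups present, findall returns one tuple of groups per match.
groups = re.findall(pattern, sample)
for g in groups:
    print(g)
```

Under that line-ending assumption, each signature block yields its own five-element tuple, which is the behaviour the question expects from the VBA macro as well.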
Edit:
Here is the VBA code that uses the regex:
Sub regex()
    Dim docxinput As String
    Dim keyword As Variant
    Dim patterninput As Variant
    Dim pattern As String
    Dim regex As New RegExp
    docxinput = Application.GetOpenFilename(Title:="Step #1: Enter Word Document Input File Name")
    Dim wrdApp As Word.Application
    Dim wrdDoc As Word.Document
    Dim strInput As String
    Set wrdApp = CreateObject("Word.Application")
    wrdApp.Visible = False
    Set wrdDoc = wrdApp.Documents.Open(docxinput)
    strInput = wrdDoc.Range.Text
    Debug.Print (strInput)
    wrdDoc.Close 0
    Set wrdDoc = Nothing
    wrdApp.Quit
    Set wrdApp = Nothing
    pattern = "Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"
    Dim objMatches As MatchCollection
    With regex
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
    End With
    Set objMatches = regex.Execute(strInput)
    Dim row As Variant
    Dim SubMatches As Variant
    row = 2
    For Each SubMatches In objMatches
        Cells(row, 1).Value = objMatches(0).SubMatches(0)
        Cells(row, 2).Value = objMatches(0).SubMatches(1)
        Cells(row, 3).Value = objMatches(0).SubMatches(2)
        Cells(row, 4).Value = objMatches(0).SubMatches(3)
        Cells(row, 5).Value = objMatches(0).SubMatches(4)
        row = row + 1
    Next
End Sub
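Here is a picture of the result. As you can see, the first two subgroups work, but then the regex (or at least I think so) hits a grouping error and dumps almost everything else into the next column. It then moves on to the fourth column and goes wrong there too. Is this a problem with the loop in the code or with the regex itself? I've tried troubleshooting the code, but I can't find any reason for the text being split incorrectly other than something being wrong with the regex. Any ideas?
[Image: screenshot of the resulting worksheet]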
Your regex should run in VBA without problems... (tested here)
To get the desired groups in VBA, see how-to-use-regular-expressions-regex-in-microsoft-excel-both-in-cell-and-loops.
Edit:
For the following input:
email received 3/30/17:
Dear Sir,
Hello
Sincerely,
Mr. Robert Thomas
1104 Madison Avenue
New York, NY 10021
email received 3/30/17:
Dear Sir,
Hello
Sincerely,
Ms. Angela Carraway
402 Arlington Drive
Concord, MA 01742
placed in cell A1, and the following VBA code (note that I had to change your For Each loop so that it works for multiple matches):
Sub myregex()
    ' Early binding: needs a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim regex As New RegExp
    Dim objMatches As MatchCollection
    Dim objMatch As Match
    Dim Myrange As Range
    Dim C As Range
    Dim strInput As String
    Dim strPattern As String
    Dim row As Long
    Set Myrange = ActiveSheet.Range("A1:A1")
    For Each C In Myrange
        strInput = C.Value
        strPattern = "Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"
        With regex
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .pattern = strPattern
        End With
        If regex.Test(strInput) Then
            Set objMatches = regex.Execute(strInput)
            row = 2
            ' Each iteration writes its own match, not objMatches(0) every time
            For Each objMatch In objMatches
                Cells(row, 1).Value = objMatch.SubMatches(0)
                Cells(row, 2).Value = objMatch.SubMatches(1)
                Cells(row, 3).Value = objMatch.SubMatches(2)
                Cells(row, 4).Value = objMatch.SubMatches(3)
                Cells(row, 5).Value = objMatch.SubMatches(4)
                row = row + 1
            Next
        Else
            C.Offset(0, 1) = "(Not matched)"
        End If
    Next
End Sub
I get the following result:
     A     B        C          D                     E
2    Mr.   Robert   Thomas     1104 Madison Avenue   New York, NY 10021
3    Ms.   Angela   Carraway   402 Arlington Drive   Concord, MA 01742
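The loop change mentioned above can be illustrated outside VBA. The following is only a sketch in Python, not the answer's code: it contrasts reading the first match on every iteration, as the question's loop does with `objMatches(0)`, with reading each match in turn. The names `rows_first_only` and `rows_per_match` are hypothetical, and `\n` line endings are assumed.

```python
import re

# Sample text from the thread, assuming "\n" line endings.
sample = "\n".join([
    "email received 3/30/17:", "Dear Sir,", "Hello", "Sincerely,",
    "Mr. Robert Thomas", "1104 Madison Avenue", "New York, NY 10021",
    "email received 3/30/17:", "Dear Sir,", "Hello", "Sincerely,",
    "Ms. Angela Carraway", "402 Arlington Drive", "Concord, MA 01742",
])
pattern = r"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

matches = list(re.finditer(pattern, sample))

# Mirrors the question's loop: every row is filled from the first match.
rows_first_only = [matches[0].groups() for _ in matches]

# Mirrors the corrected loop: each row is filled from its own match.
rows_per_match = [m.groups() for m in matches]

print(rows_first_only)
print(rows_per_match)
```

With the first variant both rows repeat the first signature; with the second, each row carries a different signature, as in the result table.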
Conclusion:
Everything works as expected.
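One difference between this test and the question's macro is the source of the text: a worksheet cell here, a Word document's `Range.Text` there. Word marks paragraph ends with a carriage return (`\r`), not a line feed. The thread does not confirm that this is the cause, so treat what follows as an assumption to check. The sketch below uses Python's `re`, where `.` excludes only `\n`, to show what the pattern does when the lines are joined with `\r`. Whether the VBScript engine treats `\r` the same way would need to be verified separately. The `strict` pattern is a hypothetical variant, not part of the original thread.

```python
import re

lines = [
    "email received 3/30/17:", "Dear Sir,", "Hello", "Sincerely,",
    "Mr. Robert Thomas", "1104 Madison Avenue", "New York, NY 10021",
    "email received 3/30/17:", "Dear Sir,", "Hello", "Sincerely,",
    "Ms. Angela Carraway", "402 Arlington Drive", "Concord, MA 01742",
]
pattern = r"Sincerely,[\s\n]+([\w\.]+)\s+(\w+)\s+(.+)[\s\n]+(\d+\s.+)[\s\n]+(.+)"

# Assumption: the input uses carriage returns between lines, as Word text does.
cr_text = "\r".join(lines)

# "." can cross "\r" here, so the greedy (.+) groups run past line ends.
mixed = re.findall(pattern, cr_text)

# Hypothetical variant: exclude CR and LF explicitly instead of relying on ".".
strict = r"Sincerely,\s+([\w.]+)\s+(\w+)\s+([^\r\n]+)\s+(\d+\s[^\r\n]+)\s+([^\r\n]+)"
clean = re.findall(strict, cr_text)

print(len(mixed), repr(mixed[0][2]))
print(clean)
```

In this sketch the original pattern finds a single match whose third group swallows several lines, which resembles the symptom described in the question, while the stricter variant separates the two signatures again. Replacing `\r` with `\n` in the input before matching would be another option under the same assumption.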