Using RegEx to grab all headings to build a ToC (Classic ASP)

Question

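I'm still trying to develop a function that extracts all headings (h1, h2, h3, ...) from an HTML text, together with their assigned id, in order to build a table of contents.

I wrote a simple script using a regular expression, but for some strange reason it collects only one match (the last one).

Here is my sample code: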

Function RegExResults(strTarget, strPattern)
    dim regEx
    Set regEx = New RegExp
    regEx.Pattern = strPattern
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Multiline = True
    Set RegExResults = regEx.Execute(strTarget)
    Set regEx = Nothing
End Function

htmlstr = "<h1>Documentation</h1><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p><h3 id=""one"">How do you smurf a murf?</h3><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper.</p><h3 id=""two"">How do many licks does a giraffe?</h3><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p>"

regpattern = "<h([1-9]).*id=\""(.*)\"">(.*)</h[1-9]>"

set arrayresult = RegExResults(htmlstr,regpattern) 
For each result in arrayresult
    response.write "count: " & arrayresult.count & "<br><hr>"
    response.write "0: " & result.Submatches(0) & "<br>"
    response.write "1: " & result.Submatches(1) & "<br>"
    response.write "2: " & result.Submatches(2) & "<br>"
Next

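I need to extract every heading and, for each one, know its heading level (1..9) and the id value used to jump to the right heading paragraph (#ID_value).

I hope someone can help me figure out why this isn't working as expected.

Thanks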

Answer 1

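The .* in your pattern is greedy, but you need lazy matching to collect all possible matches. A greedy .* consumes as much as it can, so the first .* runs from the opening <h1 all the way to the last id= in the string, and the whole input ends up as a single match whose id and text come from the last heading. You should use .*? instead.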
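The difference is easy to see on a short string. The following is only a minimal sketch: the sample string and the variable names are made up for illustration and do not come from the question.

```vbscript
' Illustration only: "sample", "re" and "matches" are hypothetical names.
Dim re, matches, sample
sample = "<b>one</b><b>two</b>"

Set re = New RegExp
re.Global = True

' Greedy: .* runs to the LAST </b>, so the whole string is one match.
re.Pattern = "<b>(.*)</b>"
Set matches = re.Execute(sample)
Response.Write "greedy: " & matches.Count & " match(es)<br>"

' Lazy: .*? stops at the FIRST </b>, so each element is matched separately.
re.Pattern = "<b>(.*?)</b>"
Set matches = re.Execute(sample)
Response.Write "lazy: " & matches.Count & " match(es)<br>"

Set re = Nothing
```

With the greedy pattern the single match captures "one</b><b>two"; with the lazy pattern there are two matches capturing "one" and "two".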

With some improvements, the pattern might look like this:

regpattern = "<(h[1-9]).*?id=""(.*?)"">(.*?)</\1>"

' \1 is a backreference: it matches the same text as the 1st group
' The backslash (\) is not needed to escape double quotes, so it was removed
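To tie this back to the original goal, here is a rough sketch of how a lazy pattern with a backreference could feed a table of contents. BuildToc, sampleHtml and the toc-h class name are hypothetical and not part of the question or of any library; the heading text is written out as-is, without encoding.

```vbscript
' Sketch only: BuildToc is a hypothetical helper, not code from the question.
Function BuildToc(html)
    Dim re, m, level, out
    Set re = New RegExp
    re.Pattern = "<(h[1-9]).*?id=""(.*?)"">(.*?)</\1>"
    re.Global = True
    re.IgnoreCase = True

    out = "<ul>"
    For Each m In re.Execute(html)
        ' SubMatches(0) is the tag name, e.g. "h3"; drop the "h" to get the level.
        level = CInt(Mid(m.SubMatches(0), 2))
        ' SubMatches(1) is the id, SubMatches(2) is the heading text.
        out = out & "<li class=""toc-h" & level & """>" & _
              "<a href=""#" & m.SubMatches(1) & """>" & m.SubMatches(2) & "</a></li>"
    Next
    BuildToc = out & "</ul>"

    Set re = Nothing
End Function

' Hypothetical sample input, shortened from the question's HTML.
Dim sampleHtml
sampleHtml = "<h1>Documentation</h1><h3 id=""one"">First</h3><p>text</p><h3 id=""two"">Second</h3>"
Response.Write BuildToc(sampleHtml)
```

Because the pattern requires an id attribute, headings without one (such as the h1 in this sample) do not appear in the list. If those should be listed too, the pattern would need to make the id part optional.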

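I strongly recommend that you have a look at Repetition with Star and Plus. That article is very useful for understanding lazy and greedy repetition in regex.

Oh, and I almost forgot: You can't parse HTML with Regex. Well, at least you shouldn't.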
