VBA 解析 dom 以找到一个特定的 href 值

VBA parse dom to find one particular href value

使用 Excel VBA,因为 URL 的所有法文版本都在一个 .xls 文件中,我必须从中找到大约 400 URLs同一网站的英文版。

了解站点的 dom 结构,我知道我可以:

结构是这样的。 "English" link 变化但总是在这种结构下:

<ul class="global-links">
<li><a title="Nous joindre" href="/fr/coordonnees.html">Nous Joindre</a></li>
<li>|</li>
<li><a title="Carrières" href="/fr/carrieres.html">Carrières</a></li>
<li>|</li>        
<li><a title="English" href="/en/personal.html">English</a></li>
</ul>

我想要的 href 是 link 上有标题 "English" 的那个。

我被卡住了是因为我知道有两种方法可以找到相关文本

我设法测试了它们,但是:

由于双引号,我无法以良好的方式构造要搜索的字符串。我尝试将双引号加倍。我也试过像这样的 chr(34) 方法

"title=" & Chr(34) & "English" & Chr(34) & " href=" & Chr(34)

但我也无法让它工作,它找不到我的字符串。

所以我需要帮助找到标题为 "English" 的 link 的 href 值,可以通过 DOM 搜索或字符串搜索。

最后,既然是for循环,那么创建object/memory使用有什么最佳实践吗?需要如何处理为一个实例创建的 object/connexion 关闭以释放内存或不使其过载?

如有任何帮助,我们将不胜感激。提前致谢。

编辑

可在此处找到示例:https://www.bnc.ca/fr/particuliers.html

编辑给出起始代码

Sub testAlias()
'title="English" href="https://www.nbc.ca
Dim htm As HTMLDocument, table As Object
Set htm = New HTMLDocument

With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", "https://www.bnc.ca/fr/particuliers.html", False
    .send
    htm.body.innerHTML = .responseText

    'Code to continue here...
    'Assuming I need to target the <a> with "English" for title and retrieve its href value
End With
End Sub

编辑 - 循环代码无效

根据 David 的回答,我从工作表上一个单元格中的值创建了一个循环。我需要附加值和我的 dom 用户名。

domain 是 https://www.bnc.ca,以下是我测试过但不起作用的值:

我在 foreach 行上有一个错误,指出 #91 错误(未定义 variable/object 或带有 bloc...)

For Each e In elements(0).ChildNodes

有人可以帮我解决这个问题吗?

Sub testAlias()


For rid = 2 To 3
'Dim sh As Worksheet
'Set sh = ActiveSheet



Dim sh As Worksheet
Set sh = ActiveSheet

Dim url As String
url = "https://www.bnc.ca" & sh.Cells(rid, 1)

'title="English" href="https://www.nbc.ca
Dim http As MSXML2.XMLHTTP
Dim HTMLDoc As MSHTML.HTMLDocument
Dim DOM As Object 'MSXML2.DOMDocument
Dim elements As Object
Dim ele As Object
Dim respText As String

Set http = CreateObject("MSXML2.XMLHTTP")
Set DOM = CreateObject("MSXML2.DOMDocument")
Set HTMLDoc = New MSHTML.HTMLDocument 'for some reason, I can't use CreateObject to do this

'## Create the HTTPRequest
With http
    .Open "GET", url, False
    .send ""
    '## Load the XML to DOM
    respText = .responseText

End With

'## Put in HTML Document
HTMLDoc.body.innerHTML = respText
Debug.Print respText

'## Parse DOM
Set elements = HTMLDoc.getElementsByClassName("global-links")

'If elements Is Not Nothing Then


    '## Assume there is only one class name "global-links"
    For Each e In elements(0).ChildNodes
        If e.innerText = "English" Then
            '## Display the url:
            sh.Cells(rid, 2).Formula = ""
            sh.Cells(rid, 2).Formula = e.ChildNodes(0).href
        End If
    Next
    DoEvents
'End If



Next rid
Application.ScreenUpdating = True
End Sub

你可以用这样的东西来暴力破解它,否则我可能会尝试使用 XPath 或更强大的 DOM 解析应用程序(需要查看更多 XML协助的结构):

Sub foo()
Dim xmlString As String

xmlString = "<ul class=""global-links"">" & _
"<li><a title=""Nous joindre"" href=""/fr/coordonnees.html"">Nous Joindre</a></li>" & _
"<li>|</li>" & _
"<li><a title=""Carrières"" href=""/fr/carrieres.html"">Carrières</a></li>" & _
"<li>|</li>" & _
"<li><a title=""English"" href=""/en/personal.html"">English</a></li>" & _
"</ul>"

Dim DOM As Object
Set DOM = CreateObject("MSXML2.DOMDOCUMENT")
DOM.LoadXML xmlString


Dim elements
Dim e
Set elements = DOM.DocumentElement.GetElementsByTagName("a")

For Each e In elements
    On Error Resume Next
    If e.ParentNode.ParentNode.XML Like "<ul class=""global-links"">*" Then
        If e.XML Like "<a title=""English"" href=*" Then
            MsgBox e.XML
        End If
    End If
Next


End Sub

更新

我无法使用 DOM(尝试加载 HTML.responseText 时一直出现解析错误,所以我转而使用 HTMLDocument 对象:

Sub testAlias()
'title="English" href="https://www.nbc.ca
Dim HTTP As MSXML2.XMLHTTP
Dim HTMLDoc As MSHTML.HTMLDocument
Dim DOM As Object 'MSXML2.DOMDocument
Dim elements As Object
Dim ele As Object
Dim respText As String

Set HTTP = CreateObject("MSXML2.XMLHTTP")
Set DOM = CreateObject("MSXML2.DOMDocument")
Set HTMLDoc = New MSHTML.HTMLDocument 'for some reason, I can't use CreateObject to do this

'## Create the HTTPRequest
With HTTP
    .Open "GET", "https://www.bnc.ca/fr/particuliers.html", False
    .send ""
    '## Load the XML to DOM
    respText = .responseText

End With

'## Put in HTML Document
HTMLDoc.body.innerHTML = respText

'## I tried loading in to DOM but it would not work:
'DOM.LoadXML respText
'If DOM.parseError Then
'    MsgBox DOM.parseError.reason
'    Stop
'End If

'## Parse DOM
Set elements = HTMLDoc.getElementsByClassName("global-links")

'## Assume there is only one class name "global-links"
For Each e In elements(0).ChildNodes
    If e.innerText = "English" Then
        '## Display the url:
        MsgBox e.ChildNodes(0).href
    End If
Next


End Sub