Pull text, with unique preceding text, but without a unique class or tag, from a webpage

Question

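I am trying to pull a unique number from a set of similar webpages. The pages are all very similar, and my current code uses MSXML2.XMLHTTP and identifies the text within a given class or tag.

The problem is that the webpages vary slightly, so the code cannot reliably pull the value from all of them using the Item(...) index criteria. There are also many identical classes and tags on the page, so there is nothing unique to identify the element by.

However, there is a unique piece of text ("ISIN code:"), and the ISIN number I want follows it on the next line. I have heard of parsing by ID, but I can't find any IDs here and don't know how that approach works.

The information I want to pull is "GB00B6Y7NF43":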

<tr>
    <th class="align-left">ISIN code:</th>
    <td> GB00B6Y7NF43 </td>
</tr>

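This is most of the code I use now, which locates some other information on the page with the Item(...) method. I don't know whether the code itself is entirely correct, but so far it pulls the information correctly as long as you specify Item(0), Item(1), and so on.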

Dim request As Object
Dim response As String
Dim html As New HTMLDocument
Dim td As Object
Dim website As String
Dim charge As Variant
Dim r As Long

' Read the target URL from Sheet1
With Worksheets("Sheet1")
    website = .Range("A14").Value
End With

' Fetch the page synchronously
Set request = CreateObject("MSXML2.XMLHTTP")
request.Open "GET", website, False
request.send

' Convert the raw response bytes to a string and load it into the parser
response = StrConv(request.responseBody, vbUnicode)
html.body.innerHTML = response

Worksheets("Information").Activate

r = r + 2
Cells(r, 3) = html.getElementsByClassName("header-row").Item(0).innerText
Cells(r, 5) = html.getElementsByTagName("td").Item(0).innerText
Cells(r, 4) = html.getElementsByClassName("icon-link pdf-icon").Item(1).href

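Is there another approach, coding style, or tweak to my code that would do this?

I could use Dim ie / appIE and similar methods, but so far these have been trickier and slower on my PC than simply processing the HTML text.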

Answer 1

The ISIN row is the last child in the table, so you can chain LastChild calls:

html.querySelector("[summary='More fund information']").Children(0).LastChild.LastChild.innerText

So, in full:

Option Explicit
Public Sub test()
    Dim html As HTMLDocument

    Set html = New HTMLDocument

    ' Fetch the page and load the markup into the HTML parser
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.hl.co.uk/funds/fund-discounts,-prices--and--factsheets/search-results/f/fidelity-asia-class-w-accumulation/key-features", False
        .send
        html.body.innerHTML = .responseText
    End With

    ' Select the table by its summary attribute, then walk down its last children
    Debug.Print html.querySelector("[summary='More fund information']").Children(0).LastChild.LastChild.innerText
End Sub

A slower approach, but one that may prove more robust over time, is to gather the table headers, find the one with the required ISIN text, and then take its NextSibling (td) node.

Option Explicit
Public Sub test()
    Dim html As HTMLDocument

    Set html = New HTMLDocument

    ' Fetch the page and load the markup into the HTML parser
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.hl.co.uk/funds/fund-discounts,-prices--and--factsheets/search-results/f/fidelity-asia-class-w-accumulation/key-features", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim i As Long, nodes As Object

    ' Collect every header cell in the fund information table
    Set nodes = html.querySelectorAll("[summary='More fund information'] th")
    For i = 0 To nodes.Length - 1
        ' The value sits in the td immediately after the matching th
        If nodes.Item(i).innerText = "ISIN code:" Then
            Debug.Print nodes.Item(i).NextSibling.innerText
            Exit For
        End If
    Next i
End Sub
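
The header-matching idea above can also be wrapped in a small reusable function. The following is only a sketch under some assumptions: the names ValueAfterHeader and DemoValueAfterHeader are hypothetical, a reference to the Microsoft HTML Object Library is set (as in the code above), and the label and its value sit in adjacent th and td cells as in the question's snippet. It compares the label case-insensitively after trimming, and skips any non-element siblings before reading the value, which may help when pages differ slightly in whitespace or capitalisation.

```vba
Option Explicit

' Hypothetical helper (not part of the original answer): returns the trimmed
' text of the first element that follows the <th> whose text matches labelText.
' Returns an empty string when no matching header is found.
Public Function ValueAfterHeader(ByVal doc As HTMLDocument, ByVal labelText As String) As String
    Dim headers As Object, node As Object
    Dim i As Long

    Set headers = doc.getElementsByTagName("th")
    For i = 0 To headers.Length - 1
        ' Case-insensitive comparison on the trimmed header text
        If StrComp(Trim$(headers.Item(i).innerText), labelText, vbTextCompare) = 0 Then
            ' Walk forward past any non-element siblings (e.g. text nodes)
            Set node = headers.Item(i).NextSibling
            Do While Not node Is Nothing
                If node.NodeType = 1 Then   ' 1 = element node
                    ValueAfterHeader = Trim$(node.innerText)
                    Exit Function
                End If
                Set node = node.NextSibling
            Loop
        End If
    Next i
End Function

' Demo using the markup from the question, so no network request is needed
Public Sub DemoValueAfterHeader()
    Dim doc As HTMLDocument

    Set doc = New HTMLDocument
    doc.body.innerHTML = "<table><tr><th class=""align-left"">ISIN code:</th>" & _
                         "<td> GB00B6Y7NF43 </td></tr></table>"

    Debug.Print ValueAfterHeader(doc, "ISIN code:")
End Sub
```

In the question's code this could be called as Cells(r, 6) = ValueAfterHeader(html, "ISIN code:"), where the column number 6 is just a placeholder. Unlike the querySelectorAll version, this sketch searches every th on the page rather than only those in the 'More fund information' table, so it relies on the label text being unique.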

html

parsing

vba

dom

xml-parsing