使用 class/Tag 名称提取数据并限制结果

Pulling data with a class/Tag Name and limiting results

我正在尝试通过网络抓取以下网站:https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293#ScientificProperties

到目前为止,我有以下代码 return 是每个 h4 标签的内部文本。

Sub getContents()
        
            Dim XMLReq As New MSXML2.XMLHTTP60
            Dim HTMLDoc As New MSHTML.HTMLDocument
            
            Dim SubTag As MSHTML.IHTMLElementCollection
            Dim SubName As MSHTML.IHTMLElement
            
            XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
            XMLReq.send
            
            If XMLReq.Status <> 200 Then
            
                MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
                Exit Sub
            End If
            
            HTMLDoc.body.innerHTML = XMLReq.responseText
            
            Set SubTag = HTMLDoc.getElementsByTagName("dt")
            
            For Each SubName In SubTag
            Debug.Print SubName.innerText
            Next SubName
            
        End Sub

虽然这 return 提供了很多有用的信息,但它也 return 是一个带有标签名称 dt 的元素 有些结果不需要(突出显示),但我不确定如何限制它列表。探索 HTML 似乎没有任何 tag/id 来区分这些。这也让我想知道我是否以最佳方式提取信息?

同样,如果对于列表中的每个未突出显示的项目,我希望在页面上捕获与它们关联的值,例如

C Physical state at 20°C and 1013 hPa Solid (100%) [1]
C Form Crystalline (100%) [1]
C Odour Other (100%) [1]
C Substance type Organic (100%) [1]
    
    And so on...

此信息的标签名称是“dd”,但我不确定如何才能同时 return 这两个结果。我希望立即 window 我可以得到物理和化学性质的列表,并且每个值的右边也是 returned.

尝试这样做时,我有以下导致不匹配错误的代码,但我不明白我做错了什么。

Sub getContents()
        
            Dim XMLReq As New MSXML2.XMLHTTP60
            Dim HTMLDoc As New MSHTML.HTMLDocument
            
            Dim SubTag As MSHTML.IHTMLElementCollection
            Dim SubName As MSHTML.IHTMLElement
            Dim SubInfo As MSHTML.IHTMLElement
            
            XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
            XMLReq.send
            
            If XMLReq.Status <> 200 Then
            
                MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
                Exit Sub
            End If
            
            HTMLDoc.body.innerHTML = XMLReq.responseText
            
            Set SubTag = HTMLDoc.getElementsByTagName("dt")
            Set SubInfo = SubTag.tags("dd")
            
            For Each SubName In SubTag
            Debug.Print SubName.innerText, SubInfo.innerText
            Next SubName
            
        End Sub

我很感激它很长 post 但如果有人可以评论我做错了什么,那就太好了。

更新:

下面的代码更好的实现了即时想要的数据windows.

Sub GetContents()
    
        Dim XMLReq As New MSXML2.XMLHTTP60
        Dim HTMLDoc As New MSHTML.HTMLDocument
        
        Dim SubSectList As MSHTML.IHTMLElement
        Dim SubSects As MSHTML.IHTMLElementCollection
        Dim SubSect As MSHTML.IHTMLElement
    
        XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
        XMLReq.send
        
        If XMLReq.Status <> 200 Then
        
            MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
            Exit Sub
        End If
        
        HTMLDoc.body.innerHTML = XMLReq.responseText
        
        Set SubSectList = HTMLDoc.getElementsByClassName("col-xs-12 col-lg-10 MainContent")(1)
        Set SubSects = SubSectList.getElementsByTagName("dt")

        'Debug.Print SubSects.Length
        
        For Each SubSect In SubSects
        Debug.Print SubSect.innerText & " : "; SubSect.NextSibling.innerText
       Next SubSect
 
    End Sub

我认为您想限制那些作为具有 class EndpointContent 的元素的子元素的 dt;然后你可以链接 nextSibling 移动到相邻的 dd

Option Explicit

Public Sub GetContents()
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
        
    XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
    XMLReq.send
                  
    HTMLDoc.body.innerHTML = XMLReq.responseText

    Dim i As Long
    
    With HTMLDoc.querySelectorAll(".EndpointContent dt")
        For i = 0 To .Length - 1
            Debug.Print .Item(i).innerText & " : " & .Item(i).NextSibling.NextSibling.innerText
            Debug.Print
        Next
    End With
End Sub