使用 class/Tag 名称提取数据并限制结果
Pulling data with a class/Tag Name and limiting results
我正在尝试通过网络抓取以下网站:https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293#ScientificProperties
到目前为止,我有以下代码 return 是每个 h4
标签的内部文本。
Sub getContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim SubTag As MSHTML.IHTMLElementCollection
Dim SubName As MSHTML.IHTMLElement
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubTag = HTMLDoc.getElementsByTagName("dt")
For Each SubName In SubTag
Debug.Print SubName.innerText
Next SubName
End Sub
虽然这 return 提供了很多有用的信息,但它也 return 是一个带有标签名称 dt
的元素 有些结果不需要(突出显示),但我不确定如何限制它列表。探索 HTML 似乎没有任何 tag/id 来区分这些。这也让我想知道我是否以最佳方式提取信息?
同样,如果对于列表中的每个未突出显示的项目,我希望在页面上捕获与它们关联的值,例如
C Physical state at 20°C and 1013 hPa Solid (100%) [1]
C Form Crystalline (100%) [1]
C Odour Other (100%) [1]
C Substance type Organic (100%) [1]
And so on...
此信息的标签名称是“dd”,但我不确定如何才能同时 return 这两个结果。我希望立即 window 我可以得到物理和化学性质的列表,并且每个值的右边也是 returned.
尝试这样做时,我有以下导致不匹配错误的代码,但我不明白我做错了什么。
Sub getContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim SubTag As MSHTML.IHTMLElementCollection
Dim SubName As MSHTML.IHTMLElement
Dim SubInfo As MSHTML.IHTMLElement
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubTag = HTMLDoc.getElementsByTagName("dt")
Set SubInfo = SubTag.tags("dd")
For Each SubName In SubTag
Debug.Print SubName.innerText, SubInfo.innerText
Next SubName
End Sub
我很感激它很长 post 但如果有人可以评论我做错了什么,那就太好了。
更新:
下面的代码更好的实现了即时想要的数据windows.
Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim SubSectList As MSHTML.IHTMLElement
Dim SubSects As MSHTML.IHTMLElementCollection
Dim SubSect As MSHTML.IHTMLElement
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubSectList = HTMLDoc.getElementsByClassName("col-xs-12 col-lg-10 MainContent")(1)
Set SubSects = SubSectList.getElementsByTagName("dt")
'Debug.Print SubSects.Length
For Each SubSect In SubSects
Debug.Print SubSect.innerText & " : "; SubSect.NextSibling.innerText
Next SubSect
End Sub
我认为您想限制那些作为具有 class EndpointContent 的元素的子元素的 dt;然后你可以链接 nextSibling 移动到相邻的 dd
Option Explicit
Public Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
Dim i As Long
With HTMLDoc.querySelectorAll(".EndpointContent dt")
For i = 0 To .Length - 1
Debug.Print .Item(i).innerText & " : " & .Item(i).NextSibling.NextSibling.innerText
Debug.Print
Next
End With
End Sub
我正在尝试通过网络抓取以下网站:https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293#ScientificProperties
到目前为止,我有以下代码 return 是每个 h4
标签的内部文本。
Sub getContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim SubTag As MSHTML.IHTMLElementCollection
Dim SubName As MSHTML.IHTMLElement
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubTag = HTMLDoc.getElementsByTagName("dt")
For Each SubName In SubTag
Debug.Print SubName.innerText
Next SubName
End Sub
虽然这 return 提供了很多有用的信息,但它也 return 是一个带有标签名称 dt
的元素 有些结果不需要(突出显示),但我不确定如何限制它列表。探索 HTML 似乎没有任何 tag/id 来区分这些。这也让我想知道我是否以最佳方式提取信息?
同样,如果对于列表中的每个未突出显示的项目,我希望在页面上捕获与它们关联的值,例如
C Physical state at 20°C and 1013 hPa Solid (100%) [1]
C Form Crystalline (100%) [1]
C Odour Other (100%) [1]
C Substance type Organic (100%) [1]
And so on...
此信息的标签名称是“dd”,但我不确定如何才能同时 return 这两个结果。我希望立即 window 我可以得到物理和化学性质的列表,并且每个值的右边也是 returned.
尝试这样做时,我有以下导致不匹配错误的代码,但我不明白我做错了什么。
Sub getContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim SubTag As MSHTML.IHTMLElementCollection
Dim SubName As MSHTML.IHTMLElement
Dim SubInfo As MSHTML.IHTMLElement
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubTag = HTMLDoc.getElementsByTagName("dt")
Set SubInfo = SubTag.tags("dd")
For Each SubName In SubTag
Debug.Print SubName.innerText, SubInfo.innerText
Next SubName
End Sub
我很感激它很长 post 但如果有人可以评论我做错了什么,那就太好了。
更新:
下面的代码更好的实现了即时想要的数据windows.
Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
Dim SubSectList As MSHTML.IHTMLElement
Dim SubSects As MSHTML.IHTMLElementCollection
Dim SubSect As MSHTML.IHTMLElement
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
If XMLReq.Status <> 200 Then
MsgBox "Problem" & vbNewLine & XMLReq.Status & " - " & XMLReq.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLReq.responseText
Set SubSectList = HTMLDoc.getElementsByClassName("col-xs-12 col-lg-10 MainContent")(1)
Set SubSects = SubSectList.getElementsByTagName("dt")
'Debug.Print SubSects.Length
For Each SubSect In SubSects
Debug.Print SubSect.innerText & " : "; SubSect.NextSibling.innerText
Next SubSect
End Sub
我认为您想限制那些作为具有 class EndpointContent 的元素的子元素的 dt;然后你可以链接 nextSibling 移动到相邻的 dd
Option Explicit
Public Sub GetContents()
Dim XMLReq As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.047.293", False
XMLReq.send
HTMLDoc.body.innerHTML = XMLReq.responseText
Dim i As Long
With HTMLDoc.querySelectorAll(".EndpointContent dt")
For i = 0 To .Length - 1
Debug.Print .Item(i).innerText & " : " & .Item(i).NextSibling.NextSibling.innerText
Debug.Print
Next
End With
End Sub