How to get innerText of a tag in VBA excluding text from nested tags?
I am doing web scraping with VBA. Below are the HTML structure and my VBA code.
When I run it, I get this text: ETA : 2020-08-26 (Reference only, the date will be updated according to shipments).
But I only want to scrape the date 2020-08-26 from it.
<div style="font-size: 14px;">
<span class="label" style="font-weight: bolder; font-size: 13px;">ETA : </span>
<br>
2020-08-26
<span style="color: red; font-size: 12px;">(Reference only, the date will be updated according to
shipments).</span>
</div>
VBA code:
Dim ie As New InternetExplorer
Dim doc As New HTMLDocument
ie.navigate "http://127.0.0.1/wordpress/sample-page/"
Do
DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE
Set doc = ie.document
Set elems = doc.getElementsByTagName("div")
MsgBox elems(33).innerText
You can do this either with string manipulation or by walking the DOM path. Here is the DOM path solution.
Sub SelectFromDropdown()
Dim url As String
Dim browser As Object
Dim nodeDiv As Object
url = "Your URL Here"
'Initialize Internet Explorer, set visibility,
'call URL and wait until page is fully loaded
Set browser = CreateObject("internetexplorer.application")
browser.Visible = True
browser.navigate url
Do Until browser.readyState = 4: DoEvents: Loop
'Instead of (0) it's (33) in your code
'However, I do not recommend the use of such high indices,
'as they can lead to unstable behaviour. Just add a div tag
'before the index and the macro will not work anymore. This
'does not apply if you loop through an HTML section that has
'been selected as a container of exactly these div tags.
Set nodeDiv = browser.document.getElementsByTagName("div")(0)
'To get only the date you can go through the DOM path
'You want a text node of the DOM (Document Object Model)
'So innertext doesn't work. You need the NodeValue
MsgBox nodeDiv.FirstChild.NextSibling.NextSibling.NextSibling.NextSibling.NodeValue
End Sub
Once you have the string, a combination of InStr, Mid and Trim is enough to extract the date:
Sub test()
Dim sSource As String
Dim nStart As Integer
Dim nEnd As Integer
Dim sResult As String
Dim dtDate As Date
sSource = "ETA : 2020-08-26 (Reference only, the date will be updated according to shipments)"
nStart = InStr(sSource, ":")
nEnd = InStr(sSource, "(")
sResult = Trim$(Mid$(sSource, nStart + 1, nEnd - nStart - 1))
If IsDate(sResult) Then
dtDate = CDate(sResult)
MsgBox "Success: " & dtDate
Else
MsgBox sResult & " is not a date"
End If
End Sub
This code finds any date of the form ####-##-##.
Cells.Clear
s = "ETA : 2020-08-26 (Reference only, the date will be updated according to shipments)."
ReDim a(1 To Len(s))
For i = 1 To Len(s)
a(i) = IIf(Mid(s, i, 1) Like "#", "#", Mid(s, i, 1))
Next i
fd = "####-##-##"
Cells(1, 1) = s
aa = Join(a, "")
Cells(2, 1) = aa
Cells(3, 1) = Mid(s, InStr(aa, fd), Len(fd))
Cells(3, 1).NumberFormat = "yyyy-mm-dd"
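First it splits the string into an array and replaces every digit with #. Then it uses InStr to find where the pattern template fd matches, and uses the returned position to pull the actual date out of the original string.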
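The same "find a ####-##-## pattern" idea can also be expressed with a regular expression instead of the digit-masking array. The following is only a sketch of that alternative technique, not the answerer's method: it assumes the late-bound VBScript.RegExp object is available on the machine, and the function name FirstIsoDate is made up for illustration.

```vba
Option Explicit

' Hypothetical helper: returns the first yyyy-mm-dd style substring of s,
' or an empty string when no such pattern is present.
Function FirstIsoDate(ByVal s As String) As String
    Dim re As Object
    Dim matches As Object

    ' Late binding, so no reference to the regular expression library is needed
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\d{4}-\d{2}-\d{2}"   ' four digits, dash, two digits, dash, two digits
    re.Global = False                  ' stop after the first match

    Set matches = re.Execute(s)
    If matches.Count > 0 Then FirstIsoDate = matches(0).Value
End Function

Sub DemoFirstIsoDate()
    Dim s As String
    s = "ETA : 2020-08-26 (Reference only, the date will be updated according to shipments)."
    Debug.Print FirstIsoDate(s)
End Sub
```

Unlike the InStr version, this returns an empty string rather than raising an error when the text contains no date, so the caller can test for that case.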
Dim html, divs, d, c
Set html = CreateObject("htmlfile")
html.body.innerHTML = "<div style='font-size: 14px;'><span class='label' style='font-weight: bolder; font-size: 13px;'>ETA : </span>" & _
"<br>2020-08-26" & _
"<span style='color: red; font-size: 12px;'>(Reference only, the date will be updated according to shipments).</span>" & _
"</div>"
Set divs = html.getElementsByTagName("div")
For Each d In divs
For Each c In d.ChildNodes
Debug.Print TypeName(c), c.nodeName, c.NodeValue
Next c
Next d
Output:
HTMLSpanElement SPAN Null
HTMLBRElement BR Null
DispHTMLDOMTextNode #text 2020-08-26
HTMLSpanElement SPAN Null
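Since the listing shows that the date is the only text node directly under the div, one way to use this is to loop over the child nodes and keep the first non-blank text node, rather than counting NextSibling steps. This is a minimal sketch under assumptions: the helper name DirectText is hypothetical, and it relies on the standard DOM convention that text nodes have nodeType 3.

```vba
Option Explicit

' Hypothetical helper: returns the first non-blank text that is a direct
' child of parentNode, ignoring text inside nested elements such as <span>.
Function DirectText(ByVal parentNode As Object) As String
    Const TEXT_NODE As Long = 3   ' DOM nodeType value for text nodes
    Dim child As Object
    Dim txt As String

    For Each child In parentNode.ChildNodes
        If child.NodeType = TEXT_NODE Then
            ' Remove line breaks and tabs, then surrounding spaces
            txt = Replace(child.NodeValue, vbCr, "")
            txt = Replace(txt, vbLf, "")
            txt = Trim$(Replace(txt, vbTab, ""))
            If Len(txt) > 0 Then
                DirectText = txt
                Exit Function
            End If
        End If
    Next child
End Function

Sub DemoDirectText()
    Dim html As Object
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = "<div><span>ETA : </span><br>2020-08-26" & _
                          "<span>(Reference only)</span></div>"
    Debug.Print DirectText(html.getElementsByTagName("div")(0))
End Sub
```

Because whitespace-only text nodes are skipped, the result does not depend on whether the browser keeps the line breaks between tags as separate nodes, which is what makes a fixed NextSibling chain fragile.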