单击两个按钮后抓取数据
Scrape data after clicking on two buttons
我正在尝试从此网页中抓取每个模型的 href:https://www.aprilia.com/en_EN/index。
显示href数据的html只有点击两个按钮(一个在右上角,一个在左边叫“Models”),一个接着一个才能看到。
Sub get_info()
Dim ie As Object
Dim address, str_chk As String
Dim my_data As Object
Dim oHTML_Element As IHTMLElement
Dim i As Long
address = "https://www.aprilia.com/en_EN/index"
Set ie = CreateObject("InternetExplorer.Application")
ie.navigate address 'the one mentioned above
ie.Visible = False
Do While ie.Busy Or ie.readyState <> 4
DoEvents
Loop
For Each oHTML_Element In ie.document.getElementsByName("button")
If oHTML_Element.className = "header__menu-services__nav button button--icon" Then
oHTML_Element.Click
End If
Next
Application.Wait Now + #12:00:05 AM#
For Each oHTML_Element In ie.document.getElementsByName("Models")
oHTML_Element.Click
Next
Application.Wait Now + #12:00:05 AM#
'==>
Set my_data = html.getElementsByClassName("card-product card-product--family")
For Each elem In my_data
For i = 0 To elem.getElementsByTagName("a").Length - 1
str_chk = elem.getElementsByTagName("a")(i).href
ws.Cells(9 + j, 7).Value = str_chk
j = j + 1
Next i
Next elem
ie.Quit
Set ie = Nothing
End Sub
我得到了
Error '424' - Object Required
我设置的地方 my_data
.
我想这意味着我无法点击这两个按钮,因此 html 代码不可用。
***************** 修改代码:
Sub get_info22()
Dim address As String
Dim ie, ELE, nodes As Object
Dim i As Long
Dim t As Date
Const MAX_WAIT_SEC As Long = 10
address = "https://www.aprilia.com/en_EN/index"
Set ie = CreateObject("InternetExplorer.Application")
ie.navigate address 'the one mentioned above
ie.Visible = False
Do While ie.Busy Or ie.readyState <> 4
DoEvents
Loop
'************** click on first button
t = Timer
Do
On Error Resume Next
Set ELE = ie.document.querySelector(".header__menu-services__nav")
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ELE Is Nothing
If Not ELE Is Nothing Then
ELE.Click
End If
While ie.Busy Or ie.readyState <> 4: DoEvents: Wend
'************** click on second button
Do
On Error Resume Next
Set ELE = ie.document.querySelector("li > button")
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ELE Is Nothing
If Not ELE Is Nothing Then
ELE.Click
End If
While ie.Busy Or ie.readyState <> 4: DoEvents: Wend
'************** get href for each model
Set nodes = ie.document.querySelectorAll(".card-product--family")
ReDim hrefs(nodes.Length - 1)
For i = 0 To nodes.Length - 1
hrefs(i) = nodes.Item(i).href
ActiveSheet.Cells(9 + i, 7).Value = hrefs(i)
Next
首先尝试使用更精确的选择器。对于第一个按钮使用:
ie.document.querySelector(".header__menu-services__nav").click
通过其 classes 之一定位元素。然后暂停一下,例如
While ie.Busy Or ie.ReadyState<>4:DoEvents:Wend
或者,使用明确的等待时间或循环直到出现下一个所需元素。
然后使用 type selectors and child combinator 定位下一个元素,因为您希望 li
元素中的第一个子元素 button
:
ie.document.querySelector("li > button").click
那你还需要等一下。
最后,您可以使用目标元素中的单个 class 和链接,并提取 href
属性并存储在数组中(例如)
Dim nodes As Object, hrefs(), i As Long
Set nodes = ie.Document.querySelectorAll(".card-product")
ReDim hrefs(nodes.Length - 1)
For i = 0 To nodes.Length - 1
hrefs(i) = nodes.Item(i).href
Next
编辑:
似乎页面使用 ajax 来检索列表,这使得这更容易。我显示版本。第一个是点击两次按钮后我只抓取你描述的那些链接;第二个,我也在这里获取模型子类型链接。
在两者中,我都模仿页面为获取该信息而发出的请求。首先,我用 json 解析器解析返回的 json 并提取模型链接。第二个,我正则表达式出所有 href 信息即。所有子模型。
Json图书馆:
我用jsonconverter.bas。从 here 下载原始代码并添加到名为 JsonConverter 的标准模块。然后您需要转到 VBE > 工具 > 参考 > 添加对 Microsoft 脚本运行时的参考。从复制的代码中删除顶部属性行。
1)
Option Explicit
Public Sub ScrapeModelLinks1()
Dim data As Object, links() As Variant, s As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.aprilia.com/en_EN/aprilia/en/index?ajax=true", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
End With
Set data = JsonConverter.ParseJson(s)("pageData")("main")("component-06")("items")
ReDim links(data.Count)
Dim item As Long, base As String
base = "https://www.aprilia.com"
For item = 1 To data.Count
links(item) = base & data(item)("href")
Next
Stop
End Sub
Public Sub ScrapeModelLinks2()
'grab all href which will include model subtypes
Dim s As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.aprilia.com/en_EN/aprilia/en/index?ajax=true", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
End With
Dim re As Object, matches As Object, links() As Variant
Set re = CreateObject("VBScript.RegExp")
re.Pattern = """href"":""(.*?)"""
re.Global = True
Set matches = re.Execute(s)
ReDim links(matches.Count - 1)
Dim item As Long, base As String
base = "https://www.aprilia.com"
For item = 0 To matches.Count - 1
links(item) = base & matches(item).submatches(0)
Next
Stop
End Sub
正则表达式解释:
我正在尝试从此网页中抓取每个模型的 href:https://www.aprilia.com/en_EN/index。
显示href数据的html只有点击两个按钮(一个在右上角,一个在左边叫“Models”),一个接着一个才能看到。
Sub get_info()
Dim ie As Object
Dim address, str_chk As String
Dim my_data As Object
Dim oHTML_Element As IHTMLElement
Dim i As Long
address = "https://www.aprilia.com/en_EN/index"
Set ie = CreateObject("InternetExplorer.Application")
ie.navigate address 'the one mentioned above
ie.Visible = False
Do While ie.Busy Or ie.readyState <> 4
DoEvents
Loop
For Each oHTML_Element In ie.document.getElementsByName("button")
If oHTML_Element.className = "header__menu-services__nav button button--icon" Then
oHTML_Element.Click
End If
Next
Application.Wait Now + #12:00:05 AM#
For Each oHTML_Element In ie.document.getElementsByName("Models")
oHTML_Element.Click
Next
Application.Wait Now + #12:00:05 AM#
'==>
Set my_data = html.getElementsByClassName("card-product card-product--family")
For Each elem In my_data
For i = 0 To elem.getElementsByTagName("a").Length - 1
str_chk = elem.getElementsByTagName("a")(i).href
ws.Cells(9 + j, 7).Value = str_chk
j = j + 1
Next i
Next elem
ie.Quit
Set ie = Nothing
End Sub
我得到了
Error '424' - Object Required
我设置的地方 my_data
.
我想这意味着我无法点击这两个按钮,因此 html 代码不可用。
***************** 修改代码:
Sub get_info22()
Dim address As String
Dim ie, ELE, nodes As Object
Dim i As Long
Dim t As Date
Const MAX_WAIT_SEC As Long = 10
address = "https://www.aprilia.com/en_EN/index"
Set ie = CreateObject("InternetExplorer.Application")
ie.navigate address 'the one mentioned above
ie.Visible = False
Do While ie.Busy Or ie.readyState <> 4
DoEvents
Loop
'************** click on first button
t = Timer
Do
On Error Resume Next
Set ELE = ie.document.querySelector(".header__menu-services__nav")
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ELE Is Nothing
If Not ELE Is Nothing Then
ELE.Click
End If
While ie.Busy Or ie.readyState <> 4: DoEvents: Wend
'************** click on second button
Do
On Error Resume Next
Set ELE = ie.document.querySelector("li > button")
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ELE Is Nothing
If Not ELE Is Nothing Then
ELE.Click
End If
While ie.Busy Or ie.readyState <> 4: DoEvents: Wend
'************** get href for each model
Set nodes = ie.document.querySelectorAll(".card-product--family")
ReDim hrefs(nodes.Length - 1)
For i = 0 To nodes.Length - 1
hrefs(i) = nodes.Item(i).href
ActiveSheet.Cells(9 + i, 7).Value = hrefs(i)
Next
首先尝试使用更精确的选择器。对于第一个按钮使用:
ie.document.querySelector(".header__menu-services__nav").click
通过其 classes 之一定位元素。然后暂停一下,例如
While ie.Busy Or ie.ReadyState<>4:DoEvents:Wend
或者,使用明确的等待时间或循环直到出现下一个所需元素。
然后使用 type selectors and child combinator 定位下一个元素,因为您希望 li
元素中的第一个子元素 button
:
ie.document.querySelector("li > button").click
那你还需要等一下。
最后,您可以使用目标元素中的单个 class 和链接,并提取 href
属性并存储在数组中(例如)
Dim nodes As Object, hrefs(), i As Long
Set nodes = ie.Document.querySelectorAll(".card-product")
ReDim hrefs(nodes.Length - 1)
For i = 0 To nodes.Length - 1
hrefs(i) = nodes.Item(i).href
Next
编辑:
似乎页面使用 ajax 来检索列表,这使得这更容易。我显示版本。第一个是点击两次按钮后我只抓取你描述的那些链接;第二个,我也在这里获取模型子类型链接。
在两者中,我都模仿页面为获取该信息而发出的请求。首先,我用 json 解析器解析返回的 json 并提取模型链接。第二个,我正则表达式出所有 href 信息即。所有子模型。
Json图书馆:
我用jsonconverter.bas。从 here 下载原始代码并添加到名为 JsonConverter 的标准模块。然后您需要转到 VBE > 工具 > 参考 > 添加对 Microsoft 脚本运行时的参考。从复制的代码中删除顶部属性行。
1)
Option Explicit
Public Sub ScrapeModelLinks1()
Dim data As Object, links() As Variant, s As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.aprilia.com/en_EN/aprilia/en/index?ajax=true", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
End With
Set data = JsonConverter.ParseJson(s)("pageData")("main")("component-06")("items")
ReDim links(data.Count)
Dim item As Long, base As String
base = "https://www.aprilia.com"
For item = 1 To data.Count
links(item) = base & data(item)("href")
Next
Stop
End Sub
Public Sub ScrapeModelLinks2()
'grab all href which will include model subtypes
Dim s As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.aprilia.com/en_EN/aprilia/en/index?ajax=true", False
.setRequestHeader "User-Agent", "Mozilla/5.0"
.send
s = .responseText
End With
Dim re As Object, matches As Object, links() As Variant
Set re = CreateObject("VBScript.RegExp")
re.Pattern = """href"":""(.*?)"""
re.Global = True
Set matches = re.Execute(s)
ReDim links(matches.Count - 1)
Dim item As Long, base As String
base = "https://www.aprilia.com"
For item = 0 To matches.Count - 1
links(item) = base & matches(item).submatches(0)
Next
Stop
End Sub
正则表达式解释: