单击两个按钮后抓取数据

Scrape data after clicking on two buttons

我正在尝试从此网页中抓取每个模型的 href:https://www.aprilia.com/en_EN/index
显示href数据的html只有点击两个按钮(一个在右上角,一个在左边叫“Models”),一个接着一个才能看到。

Sub get_info()
  
Dim ie As Object
Dim address, str_chk As String
Dim my_data As Object
Dim oHTML_Element As IHTMLElement
Dim i As Long
  
address = "https://www.aprilia.com/en_EN/index"

Set ie = CreateObject("InternetExplorer.Application")
ie.navigate address 'the one mentioned above
ie.Visible = False
  
Do While ie.Busy Or ie.readyState <> 4
    DoEvents
Loop
  
For Each oHTML_Element In ie.document.getElementsByName("button")
    If oHTML_Element.className = "header__menu-services__nav button button--icon" Then
        oHTML_Element.Click
    End If
Next
  
Application.Wait Now + #12:00:05 AM#

For Each oHTML_Element In ie.document.getElementsByName("Models")
    oHTML_Element.Click
Next

Application.Wait Now + #12:00:05 AM#

'==>
Set my_data = html.getElementsByClassName("card-product card-product--family")

For Each elem In my_data
    For i = 0 To elem.getElementsByTagName("a").Length - 1
        str_chk = elem.getElementsByTagName("a")(i).href
        ws.Cells(9 + j, 7).Value = str_chk
        j = j + 1
    Next i
Next elem

ie.Quit
Set ie = Nothing
End Sub

我得到了

Error '424' - Object Required

我设置的地方 my_data.

我想这意味着我无法点击这两个按钮,因此 html 代码不可用。

***************** 修改代码:

Sub get_info22()

Dim address As String
Dim ie, ELE, nodes As Object
Dim i As Long
Dim t As Date
Const MAX_WAIT_SEC As Long = 10
  
address = "https://www.aprilia.com/en_EN/index"

Set ie = CreateObject("InternetExplorer.Application")
ie.navigate address 'the one mentioned above
ie.Visible = False
  
Do While ie.Busy Or ie.readyState <> 4
    DoEvents
Loop

'************** click on first button
t = Timer
Do
    On Error Resume Next
    Set ELE = ie.document.querySelector(".header__menu-services__nav")

    If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ELE Is Nothing
If Not ELE Is Nothing Then
    ELE.Click
End If

While ie.Busy Or ie.readyState <> 4: DoEvents: Wend
'************** click on second button
Do
    On Error Resume Next
    Set ELE = ie.document.querySelector("li > button")

    If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ELE Is Nothing

If Not ELE Is Nothing Then
    ELE.Click
End If

While ie.Busy Or ie.readyState <> 4: DoEvents: Wend

'************** get href for each model

Set nodes = ie.document.querySelectorAll(".card-product--family")

ReDim hrefs(nodes.Length - 1)

For i = 0 To nodes.Length - 1
    hrefs(i) = nodes.Item(i).href
    ActiveSheet.Cells(9 + i, 7).Value = hrefs(i)
Next

首先尝试使用更精确的选择器。对于第一个按钮使用:

ie.document.querySelector(".header__menu-services__nav").click

通过其 classes 之一定位元素。然后暂停一下,例如

While ie.Busy Or ie.ReadyState<>4:DoEvents:Wend

或者,使用明确的等待时间或循环直到出现下一个所需元素。

然后使用 type selectors and child combinator 定位下一个元素,因为您希望 li 元素中的第一个子元素 button

ie.document.querySelector("li > button").click

那你还需要等一下。

最后,您可以使用目标元素中的单个 class 和链接,并提取 href 属性并存储在数组中(例如)

Dim nodes As Object, hrefs(), i As Long

Set nodes = ie.Document.querySelectorAll(".card-product")

ReDim hrefs(nodes.Length - 1)

For i = 0 To nodes.Length - 1
    hrefs(i) = nodes.Item(i).href
Next

编辑:

似乎页面使用 ajax 来检索列表,这使得这更容易。我显示版本。第一个是点击两次按钮后我只抓取你描述的那些链接;第二个,我也在这里获取模型子类型链接。

在两者中,我都模仿页面为获取该信息而发出的请求。首先,我用 json 解析器解析返回的 json 并提取模型链接。第二个,我正则表达式出所有 href 信息即。所有子模型。


Json图书馆:

我用jsonconverter.bas。从 here 下载原始代码并添加到名为 JsonConverter 的标准模块。然后您需要转到 VBE > 工具 > 参考 > 添加对 Microsoft 脚本运行时的参考。从复制的代码中删除顶部属性行。

1)

Option Explicit

Public Sub ScrapeModelLinks1()
    
    Dim data As Object, links() As Variant, s As String
    
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.aprilia.com/en_EN/aprilia/en/index?ajax=true", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
    End With

    Set data = JsonConverter.ParseJson(s)("pageData")("main")("component-06")("items")
    
    ReDim links(data.Count)
    Dim item As Long, base As String
    
    base = "https://www.aprilia.com"

    For item = 1 To data.Count
        links(item) = base & data(item)("href")
    Next
    
    Stop
    
End Sub

Public Sub ScrapeModelLinks2()
    'grab all href which will include model subtypes
    Dim s As String
    
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.aprilia.com/en_EN/aprilia/en/index?ajax=true", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
    End With
    
    Dim re As Object, matches As Object, links() As Variant
    
    Set re = CreateObject("VBScript.RegExp")

    re.Pattern = """href"":""(.*?)"""
    re.Global = True
    Set matches = re.Execute(s)
    
    ReDim links(matches.Count - 1)
    Dim item As Long, base As String
    
    base = "https://www.aprilia.com"
 
    For item = 0 To matches.Count - 1
        links(item) = base & matches(item).submatches(0)
    Next
    
    Stop
    
End Sub

正则表达式解释: