Scraping website using getElementsByClassName --> wrong results

Question

I am trying to scrape the innerText of every element with className = "disabled" in the following HTML snippet: HTML code

The code I am trying to implement in MS Access (VBA) is as follows:

Set IE = CreateObject("InternetExplorer.Application")

claimLink = "https://www.XXXXX.com/"


IE.navigate claimLink
    Do
       DoEvents
    Loop Until IE.ReadyState = 4

Set menuEnabled = IE.Document.getElementsByClassName("disabled")(0)


For Each Item In menuEnabled
    MsgBox (Item.innerText & " --> " & Item.className)
Next

IE.Quit

Set menuEnabled = Nothing
Set searchres = Nothing
Set IE = Nothing

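...As a result, I get every item in this list, and MS Access also reports that the class name of every item (Bibliographic data, Description, Claims, etc.) is "disabled".

Can anyone tell me what is wrong with my code? The only items I want returned are "Description", "Claims" and "Cited Documents".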

These grey items are the only items I want returned.

Thanks! Cornelius

Answer 1

It seems a short wait is needed before the elements are updated. I am using a CSS selector combination to target the elements of interest:

.epoContentNav [class=disabled]

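The "." is a class selector. It selects elements whose class name matches what follows the ".", i.e. epoContentNav. The " " is a descendant combinator, meaning that what is on the right is a descendant of what is on the left (not necessarily a direct child). The [] is an attribute selector, which selects an element by the attribute named inside it. In this case I use an attribute=value combination to specify that the class attribute must be exactly disabled. The whole thing reads as: find elements with class disabled that have an ancestor with class epoContentNav. It selects all the navigation bar elements that have the class disabled.

Information about these selectors here.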
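To make the difference between the attribute selector and a plain class selector concrete, here is a minimal sketch. CompareSelectors is a hypothetical helper name, and doc is assumed to be an already loaded document (for example IE.document once the page has settled); neither comes from the original answer.

```vba
' Sketch: compare two ways of matching the class inside the nav bar.
' CompareSelectors is a hypothetical name; doc is assumed to be a loaded document.
Public Sub CompareSelectors(ByVal doc As Object)
    Dim exact As Object, anyClass As Object

    ' Attribute selector: the class attribute must be exactly "disabled"
    Set exact = doc.querySelectorAll(".epoContentNav [class=disabled]")

    ' Class selector: "disabled" only has to be one of the element's classes
    Set anyClass = doc.querySelectorAll(".epoContentNav .disabled")

    Debug.Print "exact attribute match:", exact.Length
    Debug.Print "class list match:", anyClass.Length
End Sub
```

If the two counts differ, some elements carry disabled alongside other classes, and the class selector is probably the safer choice.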

Option Explicit    
Public Sub GetInfo()
    ' Early binding: needs a reference to Microsoft Internet Controls (Tools > References)
    Dim IE As New InternetExplorer, i As Long, nodeList

    With IE
        .Visible = True
        .navigate "https://worldwide.espacenet.com/publicationDetails/claims?DB=&ND=&locale=en_EP&FT=D&CC=DE&NR=1952914A&KC=A&tree=false#"

        While .Busy Or .readyState < 4: DoEvents: Wend

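        ' Application.Wait exists only in Excel; in Access, pause with a DoEvents loop
        Dim waitUntil As Date: waitUntil = Now + TimeSerial(0, 0, 2)
        Do While Now < waitUntil: DoEvents: Loop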

        Set nodeList = .document.querySelectorAll(".epoContentNav [class=disabled]")
        For i = 0 To nodeList.Length - 1
            Debug.Print nodeList.item(i).innerText, nodeList.item(i).getAttribute("class")
        Next
        Stop    ' pause here to inspect the output in the Immediate window
        .Quit   ' close the browser once execution resumes
    End With
End Sub
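For comparison with the question's code, the following is a hedged, late-bound sketch that keeps getElementsByClassName but applies the same wait. ListDisabledItems is a hypothetical name, and the sketch assumes the page renders in a document mode where getElementsByClassName is available. Note that it searches the whole document rather than only the navigation bar, so it may return more elements than the selector above.

```vba
' Sketch only: late-bound variant in the style of the question's code.
' ListDisabledItems is a hypothetical name, not part of the original answer.
Public Sub ListDisabledItems()
    Dim ie As Object, nodes As Object, i As Long, waitUntil As Date

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.navigate "https://worldwide.espacenet.com/publicationDetails/claims?DB=&ND=&locale=en_EP&FT=D&CC=DE&NR=1952914A&KC=A&tree=false#"

    ' Wait for the initial page load to finish
    Do While ie.Busy Or ie.readyState < 4: DoEvents: Loop

    ' Two-second pause so the page can update its classes
    waitUntil = Now + TimeSerial(0, 0, 2)
    Do While Now < waitUntil: DoEvents: Loop

    ' No (0) index here: keep the whole collection, not just its first element
    Set nodes = ie.document.getElementsByClassName("disabled")
    For i = 0 To nodes.Length - 1
        Debug.Print nodes.Item(i).innerText, nodes.Item(i).className
    Next i

    ie.Quit
End Sub
```

Late binding through CreateObject avoids the need for a library reference, at the cost of losing IntelliSense and compile-time checks.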

Tags: html, ms-access, vba, screen-scraping, web-scraping