How to isolate multiple innerText entries when using getElementById

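I am trying to isolate two different innerText strings from a web page, but I cannot single them out. The innerText of all the tags comes back as one block. The date and the season number are the problem.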

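I am using getElementById, which gives me a single element. The div with id "next_episode" appears to hold the two separate innerText entries I am interested in. When I loop through the innerText of its children, those two entries are skipped. I don't know how to isolate the two different innerText entries of the "next_episode" tag. So far I have isolated the text I need by using index numbers into the array my code returns.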

Dim IE_00 As SHDocVw.InternetExplorer
Dim HTMLDoc_00 As MSHTML.HTMLDocument
Set IE_00 = New SHDocVw.InternetExplorer
IE_00.Visible = True

IE_00.navigate "https://next-episode.net/final-space"
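'Wait for the page to finish loading; DoEvents keeps the host application responsive
Do While IE_00.Busy Or IE_00.readyState <> READYSTATE_COMPLETE
    DoEvents
Loop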
Set HTMLDoc_00 = IE_00.document

Dim NETC_05 As MSHTML.IHTMLElementCollection
Dim NET_05 As MSHTML.IHTMLElement

'Can loop through the inner text of the children one by one and find what I need

Set NETC_05 = HTMLDoc_00.getElementById("next_episode").Children

For Each NET_05 In NETC_05
Debug.Print NET_05.innerText
Next NET_05

'This just gives a big block of text that includes the missing inner text I need

Set NET_05 = HTMLDoc_00.getElementById("next_episode")
Debug.Print NET_05.innerText

The data is (mostly) in the NextSiblings:

The Node.nextSibling read-only property returns the node immediately following the specified one in their parent's childNodes, or returns null if the specified node is the last child in the parent element. *1
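
To make the quoted definition concrete, here is a minimal sketch using Python's standard-library xml.dom.minidom, which exposes the same nextSibling and nodeValue properties. The markup and its values are a made-up, simplified stand-in for the page and are not copied from it. It shows why an element-only collection (comparable to .Children) skips the values: they are bare text nodes sitting between the label elements.

```python
from xml.dom import minidom

# Hypothetical, simplified markup with made-up values: the two values are
# bare text nodes placed between the label elements.
SNIPPET = (
    '<div id="next_episode">'
    '<div>Date:</div>Sat Jun 01, 2019'
    '<div>Season:</div>2'
    '</div>'
)

root = minidom.parseString(SNIPPET).documentElement

# Element-only view of the children: the text nodes are not in it.
elements = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
print(len(elements), "elements out of", len(root.childNodes), "child nodes")

# The value for each label is the text node immediately after the label element.
for label in elements:
    print(label.firstChild.nodeValue, label.nextSibling.nodeValue)
```

Walking childNodes, or stepping from a label to its nextSibling, reaches the text that a loop over element children never sees.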


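You can write a function, e.g. GetNextSiblings, that checks the current node for a particular search string and then extracts the required value from the NextSibling. I have re-ordered the output columns to reduce the code, but you could easily loop over another headers array and use that order to read from the dictionary info and write the values out in a different order. The output order is determined by the order in which keys are added to the dictionary: I loop over the headers array to populate the dictionary keys, then update the dictionary with the scraped values.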
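
The idea described above can be sketched in Python with the standard-library xml.dom.minidom. This is an illustration under assumptions rather than a port of the VBA: the function name get_next_siblings, the helper _text, and the sample markup with its values are all hypothetical. A plain dict keeps insertion order, so pre-filling the keys from headers fixes the output order.

```python
from xml.dom import minidom


def _text(node):
    """Concatenate the direct text-node children of an element."""
    return "".join(c.nodeValue for c in node.childNodes if c.nodeType == c.TEXT_NODE)


def get_next_siblings(divs, headers, info):
    """For each div whose text contains a header, store the value held by its next sibling."""
    for div in divs:
        label = _text(div)
        for header in headers:
            sib = div.nextSibling
            if header in label and sib is not None:
                # The value is either a bare text node or wrapped in an element.
                value = sib.nodeValue if sib.nodeType == sib.TEXT_NODE else _text(sib)
                info[header] = value.strip()
                break
    return info


# Hypothetical, simplified markup with made-up values.
SNIPPET = (
    '<div id="next_episode">'
    '<div>Date:</div> Sat Jun 01, 2019 '
    '<div>Season:</div>2'
    '<div>Episode:</div><span>7</span>'
    '</div>'
)

headers = ["Date:", "Season:", "Episode:", "Status:"]
info = dict.fromkeys(headers, "")  # keys in header order, empty defaults
root = minidom.parseString(SNIPPET).documentElement
info = get_next_siblings(root.getElementsByTagName("div"), headers, info)
print(info)
```

Headers that are never found, such as "Status:" in this sample, keep their empty default, so the column layout stays stable.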

There is no need for the overhead of a browser, as the required content is not dynamically loaded. A simple, and faster, XHR request suffices.
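
As a rough Python counterpart to a plain HTTP request without a browser, the sketch below uses only the standard library. The function name fetch_html, the timeout default, and the User-Agent value are assumptions chosen for illustration. Whether the site accepts such a request is not something this sketch verifies.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Return the body of a plain GET request as text, without driving a browser."""
    # The User-Agent value is an arbitrary placeholder (an assumption);
    # some sites reject the default Python one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

It would be called as fetch_html("https://next-episode.net/final-space"), and the returned string can then be handed to whichever HTML parser you prefer.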


Side-note:

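For this type of page I would recommend Python 3 and BeautifulSoup (bs4 4.7.1+), as this gives you access to the pseudo selector :contains. The code can then be more concise and the program faster. I show this at the end.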


VBA:

Option Explicit
Public Sub GetShowInfo()
    Dim html As MSHTML.HTMLDocument, headers(), i As Long, aCollection As Object, info As Object

    headers = Array("Name:", "Countdown:", "Date:", "Season:", "Episode:", "Status:")
    Set html = New HTMLDocument

    With CreateObject("Msxml2.xmlhttp")
        .Open "GET", "https://next-episode.net/final-space", False
        .send
        html.body.innerHTML = .responseText
    End With

    Set info = CreateObject("Scripting.Dictionary")

    For i = LBound(headers) To UBound(headers)
        info(headers(i)) = vbNullString
    Next

    info("Name:") = html.querySelector("#next_episode .sub_main").innerText
    info("Countdown:") = html.querySelector("#next_episode span").innerText
    Set aCollection = html.getElementById("middle_section").getElementsByTagName("div")
    Set info = GetNextSiblings(aCollection, headers, info)
    Set aCollection = html.getElementById("next_episode").getElementsByTagName("div")
    Set info = GetNextSiblings(aCollection, headers, info)

    With ThisWorkbook.Worksheets("Sheet1")
        .Cells(1, 1).Resize(1, info.Count) = info.keys
        .Cells(2, 1).Resize(1, info.Count) = info.items
    End With
End Sub

Public Function GetNextSiblings(ByVal aCollection As Object, ByRef headers(), ByVal info As Object) As Object
    Dim item As Object, i As Long
    For Each item In aCollection
        For i = 2 To UBound(headers)
            If InStr(item.outerHTML, headers(i)) > 0 Then
                If headers(i) = "Episode:" Then
                    info(headers(i)) = item.NextSibling.innerText
                Else
                    info(headers(i)) = item.NextSibling.NodeValue
                End If
                Exit For
            End If
        Next
    Next
    Set GetNextSiblings = info
End Function

Reading:

  1. NextSibling
  2. CSS selectors
  3. querySelector

Python(bs4 4.7.1+):

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://next-episode.net/final-space')
soup = bs(r.content, 'lxml')    
current_nodes = ['Status:','Name:', 'Countdown:','Date:','Season:','Episode:']

for node in current_nodes:
    selector = f'#middle_section div:contains("{node}"), #next_episode div:contains("{node}")'
    if node in ['Episode:','Name:']:
        print(node, soup.select_one(selector).text.replace(node,''))
    elif node == 'Countdown:':
        print(node, soup.select_one(selector).next_sibling.text)
    else:
        print(node, soup.select_one(selector).next_sibling)

VBA (regular expressions on innerText):

'Setting XML 05 as an Object
    Dim XML_05 As New MSXML2.XMLHTTP60
'Setting HTML Document 05 as an Object
    Dim HTML_05 As New MSHTML.HTMLDocument

    XML_05.Open "GET", Cells(Row, NextEpisodeURL).Value, False
    XML_05.send
    HTML_05.body.innerHTML = XML_05.responseText

'Setting Net Element Tag Collection 05 as an Object
    Dim NETC_05 As MSHTML.IHTMLElementCollection
'Setting Net Element Tag 05 as an Object
    Dim NET_05 As MSHTML.IHTMLElement
'Setting Reg EX 05 as an Object
    Dim REO_05 As VBScript_RegExp_55.RegExp
'Setting Match Object 05 as Object
    Dim MO_05 As Object
'Setting Season array as Array
    Dim SN_05() As String
'Setting Episode Name 05 as Array
    Dim ENA_05() As String
'Setting Episode Number 05 as Array
    Dim EN_05() As String

'Getting Episode Name Episode Number and Season Number From Net

'Set NETC_05 = HTML_05.getElementsByClassName("sub_main")
    Set NET_05 = HTML_05.getElementById("previous_episode")
    Set REO_05 = New VBScript_RegExp_55.RegExp
        REO_05.Global = True
        REO_05.IgnoreCase = True

'Getting Episode Name
    REO_05.Pattern = "(Name:(.*))"
        Set MO_05 = REO_05.Execute(NET_05.innerText)
            Debug.Print MO_05.Count
            Debug.Print MO_05(0).Value
                'Limit the split to 2 parts so a colon inside the episode name is kept
                ENA_05 = Split(MO_05(0).Value, ":", 2)
            Debug.Print ENA_05(1)
            Cells(Row, NextEpName).Value = ENA_05(1)

'Getting Episode Number
    REO_05.Pattern = "(Episode:([0-9]*))"
        Set MO_05 = REO_05.Execute(NET_05.innerText)
            Debug.Print MO_05.Count
            Debug.Print MO_05(0).Value
                EN_05 = Split(MO_05(0), ":")
            Debug.Print EN_05(1)
            Cells(Row, EpisodeNet).Value = EN_05(1)

'Getting Season Number
    REO_05.Pattern = "(Season:([0-9]*))"
        Set MO_05 = REO_05.Execute(NET_05.innerText)
            Debug.Print MO_05.Count
            Debug.Print MO_05(0).Value
                SN_05 = Split(MO_05(0), ":")
            Debug.Print SN_05(1)
            Cells(Row, SeasonNet).Value = SN_05(1)

'Getting Countdown From Net
    Set NETC_05 = HTML_05.getElementById("next_episode").Children
        Cells(Row, Countdown).Value = NETC_05(5).innerText
        Debug.Print NETC_05(5).innerText
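
For comparison, the regex idea in the VBA above can be sketched in Python with the standard-library re module. The sample text, the field names, and the function name extract_fields are made up for illustration. The sketch reads each value from a capture group rather than splitting the match on ":", so a colon inside the episode name is preserved.

```python
import re

# Made-up stand-in for the innerText of the element; not copied from the page.
SAMPLE_TEXT = "Name: Example Title: Part 2\nDate: Sat Jun 01, 2019\nSeason: 2\nEpisode: 7\n"

# One pattern per field; group 1 captures the value after the label.
PATTERNS = {
    "Name": re.compile(r"Name:\s*(.*)", re.IGNORECASE),
    "Season": re.compile(r"Season:\s*(\d+)", re.IGNORECASE),
    "Episode": re.compile(r"Episode:\s*(\d+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict of field name to captured value, or None when a label is missing."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[key] = match.group(1).strip() if match else None
    return result


fields = extract_fields(SAMPLE_TEXT)
print(fields)
```

Returning None for a missing label avoids the runtime error that indexing an empty match collection would raise.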