从 Morningstar 中提取特定 table 个单元格,然后循环到下一个 Morningstar 页面

Pulling specific table cells from Morningstar, then looping to next Morningstar page

我目前正在尝试从 Morningstar 的 table 中抓取某些数据,然后让它循环到下一个代码并重复直到没有更多代码。

目前,它将拉动尾随总计 Returns table 上的整个 "rank in category" 行。我只是想提取 3 个月、6 个月、年初至今、1 年、3 年和 5 年的数据。完成拉动后,它将循环到下一个自动收报机,由导航行中的 "Cells(p, 14)" 确定。

即。它检测到 "LINKX" 在单元格 1、14 中,因此它导航到 http://performance.morningstar.com/fund/performance-return.action?t=LINKX&region=usa&culture=en_US 并从 "trailing total returns" table 中拉出所有 "Rank in Category" 行。我只想将指定的放入指定的单元格位置,然后循环到下一个自动收报机。

我浏览了很多这样的主题,使用 excel VBA 我试图从某个代码页面提取关键的特定信息,然后循环到下一个代码并重复。

Declare PtrSafe Function apiShowWindow Lib "user32" Alias "ShowWindow" _
        (ByVal hwnd As LongPtr, ByVal nCmdShow As LongPtr) As LongPtr
    Global Const SW_MAXIMIZE = 3
    Global Const SW_SHOWNORMAL = 1
    Global Const SW_SHOWMINIMIZED = 2

Sub LinkedInWebScrapeScript()

    Dim objIE As InternetExplorer

    Dim html As HTMLDocument

    Set objIE = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}")
    objIE.Visible = 1
Dim p As Integer
p = 3

    objIE.navigate ("http://performance.morningstar.com/fund/performance-return.action?t=" & Cells(p, 14) & "&region=usa&culture=en_US")
    Application.Wait Now + #12:00:02 AM#

    While objIE.Busy
        DoEvents
    Wend
    apiShowWindow objIE.hwnd, SW_MAXIMIZE

    For i = 1 To 2
        objIE.document.parentWindow.scrollBy 0, 100000 & i
        Application.Wait Now + #12:00:01 AM#
    Next i

Dim TDelements As IHTMLElementCollection
Dim htmldoc As MSHTML.IHTMLDocument 'Document object
Dim eleColtr As MSHTML.IHTMLElementCollection 'Element collection for tr tags
Dim eleColtd As MSHTML.IHTMLElementCollection 'Element collection for td tags
Dim eleColtd1 As MSHTML.IHTMLElementCollection
Dim eleRow As MSHTML.IHTMLElement 'Row elements
Dim eleCol As MSHTML.IHTMLElement 'Column elements
Set htmldoc = objIE.document 'Document webpage
Set eleColtr = htmldoc.getElementsByTagName("tr") 'Find all tr tags
Set TDelements = htmldoc.getElementsByTagName("table")
'This section populates Excel
i = 0 'start with first value in tr collection


Set eleColtd = htmldoc.getElementsByClassName("r_table3 width955px print97")(0).getElementsByClassName("last")(0).getElementsByClassName("row_data divide") 'get all the td elements in that specific tr

    For Each eleCol In eleColtd 'for each element in the td collection
        Sheets("Sheet2").Range("A1").Offset(i, j).Value = eleCol.innerText 'paste the inner text of the td element, and offset at the same time
        j = j + 1 'move to next element in td collection
    Next eleCol 'rinse and repeat
i = i + 1

p = p + 1

objIE.navigate ("http://performance.morningstar.com/fund/performance-return.action?t=" & Cells(p, 14) & "&region=usa&culture=en_US")

Set eleColtd = htmldoc.getElementsByClassName("r_table3 width955px print97")(0).getElementsByClassName("last")(0).getElementsByClassName("row_data divide") 'get all the td elements in that specific tr

    For Each eleCol In eleColtd 'for each element in the td collection
        Sheets("Sheet2").Range("A1").Offset(i, j).Value = eleCol.innerText 'paste the inner text of the td element, and offset at the same time
        z = z + 1
        j = j + 1 'move to next element in td collection
    Next eleCol 'rinse and repeat


End Sub

它将拉动尾随总计 Returns table 上的整个 "rank in category" 行。我只是想提取 3 个月、6 个月、年初至今、1 年、3 年和 5 年的数据。完成拉动后,它将循环到下一个自动收报机,由导航行中的 "Cells(p, 14)" 确定。

下面显示了一个循环以及如何 select 适当的 table,tbody 然后 table 使用 css selectors 的单元格。代码从第 1 行开始的第 N 列读入数组。它假设范围内没有空白单元格(尽管您可以添加一个测试来确定)。

数组有一个循环,其中包含每个代码,url 中的 TICKER 占位符被替换为当前代码值。

每月显示选项卡上有一行可以点击

适当的行通过

识别
Set rankings = .querySelectorAll("#tab-month-end-content .last td")

#tab-month-end-content 是一个 id selector 得到正确的选项卡,然后 .last 是 class selector 对于 class 最后一个 tbody(即 last)的名称,然后 td 用于指定该 tbody 中的子 td 单元格。


CSS select或:

现代浏览器针对 css 进行了优化。 Css selector 是匹配 html 文档中元素的快速方法。 Css select 或通过 querySelector 或 querySelectorAll 方法应用;在这种情况下,HTMLDocument (ie.document)。 querySelectorreturns单节点:第一个匹配为cssselector; querySelectorAll returns 所有匹配项目的节点列表 - 然后您索引到该节点列表以获取特定项目,例如第二个 td 单元格位于索引 1。

查看我们指定的模式:

#tab-month-end-content .last td

第一部分是一个 id selector#,其中 select 是一个 id

的元素
#tab-month-end-content

当应用到页面时,这 returns 两个匹配,我们想要第二个

点击图片放大

下一部分

.last 

是一个class selector, ., for class name last. This selects the tbody tag child element shown in the image above. As only the second id matched element has this child we are now working with the right parent element to go on and select the td type elements using type selector

td

上述每个部分之间的空格</code>称为<a href="https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator" rel="nofollow noreferrer">descendant combinators</a>,它们指定与第二个select匹配的元素或select 如果它们有一个与第一个 select 匹配的祖先元素,或者即左边的 selector 是相邻检索到的 selector 匹配元素的父元素,则 selected css select或向右。</p> <p>我们可以在下一张图片中看到这一点:</p> <p><sub><em>点击图片放大</em></sub></p> <p><a href="https://i.stack.imgur.com/mRaOJ.png" rel="nofollow noreferrer"><WBIMG:11861638-2.png></a></p> <hr> <p><strong>VBA:</strong></p> <pre><code>Option Explicit Public Sub GetData() Dim ie As Object, tickers(), ws As Worksheet, lastRow As Long Dim results(), headers(), r As Long, i As Long, url As String headers = Array("ticker", "3m", "6m", "ytd", "1y", "3y", "6y") Set ws = ThisWorkbook.Worksheets("Sheet1") tickers = Application.Transpose(ws.Range("N1:N" & GetLastRow(ws, 14)).Value) ReDim results(1 To UBound(tickers), 1 To UBound(headers) + 1) Set ie = GetObject("new:{D5E8041D-920F-45e9-B8FB-B1DEB82C6E5E}") With ie .Visible = True For i = LBound(tickers) To UBound(tickers) r = r + 1 url = Replace$("http://performance.morningstar.com/fund/performance-return.action?t=TICKER&region=usa&culture=en_US", "TICKER", tickers(i)) .Navigate2 url While .Busy Or .readyState < 4: DoEvents: Wend .document.querySelector("[tabname='#tabmonth']").Click Dim rankings As Object Do Loop While .document.querySelectorAll("#tab-month-end-content .last td").Length = 0 'could add timed loop here With .document Set rankings = .querySelectorAll("#tab-month-end-content .last td") On Error Resume Next results(r, 1) = tickers(i) results(r, 2) = rankings.item(1).innerText results(r, 3) = rankings.item(2).innerText results(r, 4) = rankings.item(3).innerText results(r, 5) = rankings.item(4).innerText results(r, 6) = rankings.item(5).innerText results(r, 7) = rankings.item(6).innerText On Error GoTo 0 End With Set rankings = Nothing Next ws.Cells(1, 15).Resize(UBound(results, 1), UBound(results, 2)) = results .Quit End With End Sub Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long With ws GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row End With End Function


如@SIM 所述,您可以使用 xmlhttp 并避免使用浏览器,但不确定您的安全设置是否需要将网站列入白名单。您将需要在此处 url 中探索占位符是否有效:XNAS:TICKERXNAS 前缀可能因您的代码而异,在这种情况下,您需要适当的字符串,包括 N 列中的前缀,然后用该字符串替换扩展占位符,例如.....=PLACEHOLDER&region.......

Option Explicit
Public Sub GetData()
    Dim tickers(), ws As Worksheet, lastRow As Long
    Dim results(), headers(), r As Long, i As Long, url As String, html As HTMLDocument
    Set html = New HTMLDocument 'vbe > tools > references > Microsoft HTML Object Library

    headers = Array("ticker", "3m", "6m", "ytd", "1y", "3y", "6y")
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    tickers = Application.Transpose(ws.Range("N1:N" & GetLastRow(ws, 14)).Value)
    ReDim results(1 To UBound(tickers), 1 To UBound(headers) + 1)

    With CreateObject("MSXML2.XMLHTTP")

        For i = LBound(tickers) To UBound(tickers)
            r = r + 1
            url = Replace$("http://performance.morningstar.com/perform/Performance/fund/trailing-total-returns.action?&t=XNAS:TICKER&region=usa&culture=en-US&cur=&ops=clear&s=0P0000J533&ndec=2&ep=true&align=m&annlz=true&comparisonRemove=false&loccat=&taxadj=&benchmarkSecId=&benchmarktype=", "TICKER", tickers(i))
           .Open "GET", url, False
           .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
           .setRequestHeader "DNT", "1"
           .send
           html.body.innerHTML = .responseText

            Dim rankings As Object
            With html
                Set rankings = .querySelectorAll(".last td")

                On Error Resume Next
                results(r, 1) = tickers(i)
                results(r, 2) = rankings.item(1).innerText
                results(r, 3) = rankings.item(2).innerText
                results(r, 4) = rankings.item(3).innerText
                results(r, 5) = rankings.item(4).innerText
                results(r, 6) = rankings.item(5).innerText
                results(r, 7) = rankings.item(6).innerText
                On Error GoTo 0
            End With
            Set rankings = Nothing
        Next
        ws.Cells(1, 15).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
    With ws
        GetLastRow = .Cells(.rows.Count, columnNumber).End(xlUp).Row
    End With
End Function