Excel 将网站上的主题评论绘制到单元格中的宏

Excel Macro to draw thread comments from website into cells

我正在尝试将 Reddit 线程评论存储在 excel 电子表格中,但是我在尝试弄清楚如何执行此操作时遇到了麻烦。我没有太多使用宏从网页获取数据的经验,所以我一直很难弄清楚如何从指定的 Reddit 线程中提取每条评论并将其放在一个单元格中,以及它是否是可以做到。

这是我目前拥有的:

Sub getRedditData()

Dim x As Long, y As Long
Dim htm As Object

Set htm = CreateObject("htmlFile")

With CreateObject("msxml2.xmlhttp")
    .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
    .send
    htm.body.innerhtml = .responsetext
End With

With htm.getelementbyid("comments")
    Set cellrangex = .Rows(x).Cells.Length - 1
    Set cellrangey = .Rows(x).Cells.Length - 1
    Set cellrange1 = Sheets(1).Cells(x + 1, y + 1).Value
    Set cellrange2 = .Rows(x).Cells(y).innertext

    For x = 0 To cellrangex
        For y = 0 To cellrangey
            cellrange = cellrange2
        Next y
    Next x
End With


End Sub

您确实需要使用像样的 HTML 编辑器分析您正在抓取的网页内容。我建议导航到 chrome 中的相关页面并使用 F12 打开它的开发者工具。在 "Elements" 选项卡中,您可以快速查看哪个 HTML 正在生成页面的哪个部分(同时打开彼此相邻的页面和开发人员工具)。

当您进入评论时,您会注意到每个评论的文本都在 <p> 标签内,每个 <p> 标签都在 <div> 内。我们正在寻找模式,所以这是一个好的开始。

您还会注意到,每个 <div> 标签的 classmd。 所以...让我们将所有页面 <div> 标签加载到一个对象中,然后查找具有包含 "md":

className 的标签

子 getRedditData()

Dim x As Long, y As Long
Dim htm As Object

Set htm = CreateObject("htmlFile")

With CreateObject("msxml2.xmlhttp")
    .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
    .send
    htm.body.innerhtml = .responsetext
End With

Set Divelements = htm.getElementsByTagName("div")

For Each DivElement In Divelements
    If InStr(1, DivElement.ClassName, "md") Then
        'print contents to the Immediate window for debugging View>>Immediate Window to insure it's up in your VBE
        Debug.Print DivElement.InnerText
    End If
Next

结束子

有了它,您将看到所有评论都卡在立即 window 中(转到查看>>立即 Window),因此您可以看到此调试输出。


跳过节点后,您似乎可以向上导航几个元素并返回树下以获取用户名:

Sub getRedditData()

    Dim x As Long, y As Long
    Dim htm As Object

    Set htm = CreateObject("htmlFile")

    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
        .send
        htm.body.innerhtml = .responsetext
    End With

    Set Divelements = htm.getElementsByTagName("div")


    On Error Resume Next

    For Each divElement In Divelements
        If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
            Set commentEntry = divElement.ParentNode.ParentNode.ParentNode

            'Print the name and the comment
            Debug.Print commentEntry.FirstChild.FirstChild.NextSibling.InnerText & ":", divElement.InnerText

        End If
    Next

End Sub

要将其打印到 sheet,只需指向一个单元格而不是 debug.print 直接 window。类似于:

Sub getRedditData()

    Dim x As Long, y As Long
    Dim htm As Object
    Dim ws As Worksheet, wsCell As Integer

    'set the worksheet to print to and the first row to start printing.
    Set ws = Sheets("Sheet1")
    wsCell = 1

    Set htm = CreateObject("htmlFile")

    With CreateObject("msxml2.xmlhttp")
        .Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
        .send
        htm.body.innerhtml = .responsetext
    End With

    Set Divelements = htm.getElementsByTagName("div")


    On Error Resume Next

    For Each divElement In Divelements
        If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
            Set commentEntry = divElement.ParentNode.ParentNode.ParentNode

            'Print the name and the comment to ws sheet columns 1 and 2
            ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.InnerText
            ws.Cells(wsCell, 2).Value = divElement.InnerText

            'iterate to the next row
            wsCell = wsCell + 1

        End If
    Next
End Sub