Excel 将网站上的主题评论绘制到单元格中的宏
Excel Macro to draw thread comments from website into cells
我正在尝试将 Reddit 线程评论存储在 excel 电子表格中,但是我在尝试弄清楚如何执行此操作时遇到了麻烦。我没有太多使用宏从网页获取数据的经验,所以我一直很难弄清楚如何从指定的 Reddit 线程中提取每条评论并将其放在一个单元格中,以及它是否是可以做到。
这是我目前拥有的:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
With htm.getelementbyid("comments")
Set cellrangex = .Rows(x).Cells.Length - 1
Set cellrangey = .Rows(x).Cells.Length - 1
Set cellrange1 = Sheets(1).Cells(x + 1, y + 1).Value
Set cellrange2 = .Rows(x).Cells(y).innertext
For x = 0 To cellrangex
For y = 0 To cellrangey
cellrange = cellrange2
Next y
Next x
End With
End Sub
您确实需要使用像样的 HTML 编辑器分析您正在抓取的网页内容。我建议导航到 chrome 中的相关页面并使用 F12 打开它的开发者工具。在 "Elements" 选项卡中,您可以快速查看哪个 HTML 正在生成页面的哪个部分(同时打开彼此相邻的页面和开发人员工具)。
当您进入评论时,您会注意到每个评论的文本都在 <p>
标签内,每个 <p>
标签都在 <div>
内。我们正在寻找模式,所以这是一个好的开始。
您还会注意到,每个 <div>
标签的 class
为 md
。
所以...让我们将所有页面 <div>
标签加载到一个对象中,然后查找具有包含 "md"
:
的 className
的标签
子 getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
For Each DivElement In Divelements
If InStr(1, DivElement.ClassName, "md") Then
'print contents to the Immediate window for debugging View>>Immediate Window to insure it's up in your VBE
Debug.Print DivElement.InnerText
End If
Next
结束子
有了它,您将看到所有评论都卡在立即 window 中(转到查看>>立即 Window),因此您可以看到此调试输出。
跳过节点后,您似乎可以向上导航几个元素并返回树下以获取用户名:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
On Error Resume Next
For Each divElement In Divelements
If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
'Print the name and the comment
Debug.Print commentEntry.FirstChild.FirstChild.NextSibling.InnerText & ":", divElement.InnerText
End If
Next
End Sub
要将其打印到 sheet,只需指向一个单元格而不是 debug.print
直接 window。类似于:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Dim ws As Worksheet, wsCell As Integer
'set the worksheet to print to and the first row to start printing.
Set ws = Sheets("Sheet1")
wsCell = 1
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
On Error Resume Next
For Each divElement In Divelements
If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
'Print the name and the comment to ws sheet columns 1 and 2
ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.InnerText
ws.Cells(wsCell, 2).Value = divElement.InnerText
'iterate to the next row
wsCell = wsCell + 1
End If
Next
End Sub
我正在尝试将 Reddit 线程评论存储在 excel 电子表格中,但是我在尝试弄清楚如何执行此操作时遇到了麻烦。我没有太多使用宏从网页获取数据的经验,所以我一直很难弄清楚如何从指定的 Reddit 线程中提取每条评论并将其放在一个单元格中,以及它是否是可以做到。
这是我目前拥有的:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
With htm.getelementbyid("comments")
Set cellrangex = .Rows(x).Cells.Length - 1
Set cellrangey = .Rows(x).Cells.Length - 1
Set cellrange1 = Sheets(1).Cells(x + 1, y + 1).Value
Set cellrange2 = .Rows(x).Cells(y).innertext
For x = 0 To cellrangex
For y = 0 To cellrangey
cellrange = cellrange2
Next y
Next x
End With
End Sub
您确实需要使用像样的 HTML 编辑器分析您正在抓取的网页内容。我建议导航到 chrome 中的相关页面并使用 F12 打开它的开发者工具。在 "Elements" 选项卡中,您可以快速查看哪个 HTML 正在生成页面的哪个部分(同时打开彼此相邻的页面和开发人员工具)。
当您进入评论时,您会注意到每个评论的文本都在 <p>
标签内,每个 <p>
标签都在 <div>
内。我们正在寻找模式,所以这是一个好的开始。
您还会注意到,每个 <div>
标签的 class
为 md
。
所以...让我们将所有页面 <div>
标签加载到一个对象中,然后查找具有包含 "md"
:
className
的标签
子 getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
For Each DivElement In Divelements
If InStr(1, DivElement.ClassName, "md") Then
'print contents to the Immediate window for debugging View>>Immediate Window to insure it's up in your VBE
Debug.Print DivElement.InnerText
End If
Next
结束子
有了它,您将看到所有评论都卡在立即 window 中(转到查看>>立即 Window),因此您可以看到此调试输出。
跳过节点后,您似乎可以向上导航几个元素并返回树下以获取用户名:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
On Error Resume Next
For Each divElement In Divelements
If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
'Print the name and the comment
Debug.Print commentEntry.FirstChild.FirstChild.NextSibling.InnerText & ":", divElement.InnerText
End If
Next
End Sub
要将其打印到 sheet,只需指向一个单元格而不是 debug.print
直接 window。类似于:
Sub getRedditData()
Dim x As Long, y As Long
Dim htm As Object
Dim ws As Worksheet, wsCell As Integer
'set the worksheet to print to and the first row to start printing.
Set ws = Sheets("Sheet1")
wsCell = 1
Set htm = CreateObject("htmlFile")
With CreateObject("msxml2.xmlhttp")
.Open "GET", "https://www.reddit.com/r/AskReddit/comments/4p7qsx/what_are_the_most_common_modern_day_scams/", False
.send
htm.body.innerhtml = .responsetext
End With
Set Divelements = htm.getElementsByTagName("div")
On Error Resume Next
For Each divElement In Divelements
If InStr(1, divElement.className, "md") And Not InStr(1, divElement.className, "md-container") Then
Set commentEntry = divElement.ParentNode.ParentNode.ParentNode
'Print the name and the comment to ws sheet columns 1 and 2
ws.Cells(wsCell, 1).Value = commentEntry.FirstChild.FirstChild.NextSibling.InnerText
ws.Cells(wsCell, 2).Value = divElement.InnerText
'iterate to the next row
wsCell = wsCell + 1
End If
Next
End Sub