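Using MSXML2.XMLHTTP in Access VBA not extracting all page data
Currently we are using the code below for data extraction, but it does not extract the complete data from the web page. It misses the data that is visible in Internet Explorer when JavaScript and DOM storage are enabled.
Until now I have been using the code below, and it extracts everything from the web page except the images.
My code is given below: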
Set http = CreateObject("MSXML2.XMLHTTP")
http.Send
html.body.innerHTML = http.ResponseText
On Error GoTo 0
html1 = html.body.innerHTML
brand5 = html.documentElement.innerHTML
If html1 Like "*media__thumb*" Then
other_img = html.getElementsByClassName("media__thumb")(0).innerText
'other_img = other_img.innerHTML
End If
The HTML of the web page for the multiple images is shown below (please note that my code above does not extract data from this HTML):
<a class="media__thumbnail" data-media_type="IMAGE" data-media_id="orbit-bagged-53017-64" data-target="IMAGE" data-has-index="true">
<img src="https://images.yourweb/_145.jpg">
</a>
<a class="media__thumbnail media__thumbnail--selected" data-media_type="IMAGE" data-media_id="orbit-bagged-53017-e1" data-target="IMAGE" data-has-index="true">
<img src="https://images.yourweb1_145.jpg">
</a>
</span></a>
The http response is given below:
<div id="thumbnails" class="media__thumbnails" data-component="thumbnails"></div>
<script type="text/template" id="media__thumbnails">
{{#thumbnails}}
<a class="media__thumbnail" data-media_type="{{type}}" data-media_id="{{id}}" data-target="{{type}}" data-has-index="true">
<img src="{{{thumb}}}"/>
{{# hasIcon}}
{{# threeSixtyIcon}} <div class="whitespace"><span class="threesixtyIcon"></span></div>{{/ threeSixtyIcon}}
{{^ threeSixtyIcon}} <span class="videoIcon"></span>{{/ threeSixtyIcon}}
{{/ hasIcon}}
</a>
{{/thumbnails}}
{{#additionalThumbnailsThumbnail}}
<a class="media__thumbnail media__thumbnail-additional-count" data-media_type="{{type}}" data-media_id="{{id}}" data-target="{{type}}" data-has-index="true">
<img src="{{{thumb}}}"/>
{{# hasIcon}}
{{# threeSixtyIcon}} <div class="whitespace"><span class="threesixtyIcon"></span></div>{{/ threeSixtyIcon}}
{{^ threeSixtyIcon}} <span class="videoIcon"></span>{{/ threeSixtyIcon}}
{{/ hasIcon}}
{{#additionalImagesCount}}
<div class="media__thumbnail-overlay"></div>
<span class="media__thumbnail-count">+{{additionalImagesCount}}</span>
{{/additionalImagesCount}}
</a>
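The content on that page requires JavaScript to run, so you need to either:
- Search the network traffic for the part of those urls that carries this info and see whether you can get it from elsewhere (you can, as shown below); or
- Automate a browser, e.g. using Microsoft Internet Controls
You can see from the following that the content is dynamically loaded: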
<script type="text/template" id="media__thumbnails">
{{#thumbnails}}
<a class="media__thumbnail" data-media_type="{{type}}" data-media_id="{{id}}" data-target="{{type}}" data-has-index="true">
<img src="{{{thumb}}}"/>
{{# hasIcon}}
{{# threeSixtyIcon}} <div class="whitespace"><span class="threesixtyIcon"></span></div>{{/ threeSixtyIcon}}
{{^ threeSixtyIcon}} <span class="videoIcon"></span>{{/ threeSixtyIcon}}
{{/ hasIcon}}
</a>
{{/thumbnails}}
{{#additionalThumbnailsThumbnail}}
......
{{/additionalThumbnails}}
</script>
<script type="text/template"
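A quick way to confirm this from VBA is to check whether the raw response still holds the template rather than rendered markup. The following is only a minimal sketch: the function name HasUnrenderedTemplate is hypothetical, and it assumes the page uses a text/template script with {{ }} placeholders, as in the response shown above.

```vba
Option Explicit

' Hypothetical helper (not part of the original answer).
' Returns True when the raw response text still contains a client-side
' template (a text/template script with {{ }} placeholders), which
' suggests the final markup is only produced after JavaScript runs.
Public Function HasUnrenderedTemplate(ByVal responseText As String) As Boolean
    Dim hasTemplateScript As Boolean
    Dim hasPlaceholders As Boolean

    ' Case-insensitive search for the template script type attribute
    hasTemplateScript = InStr(1, responseText, "type=""text/template""", vbTextCompare) > 0
    ' Placeholders are literal braces, so a binary compare is enough
    hasPlaceholders = InStr(1, responseText, "{{", vbBinaryCompare) > 0

    HasUnrenderedTemplate = hasTemplateScript And hasPlaceholders
End Function
```

Passing http.ResponseText from the question's code to this function would be expected to return True for the response shown above, which is a hint that XMLHTTP alone will not see the thumbnails.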
1) Network tab - different url
Using a different url, found in the network tab, returns json containing the links. The response is json, so a json parser is required.
'VBE>Tools> References> Add reference to Microsoft Scripting Runtime
'Download and add in jsonconverter.bas from https://github.com/VBA-tools/VBA-JSON/blob/master/JsonConverter.bas
Place the JsonConverter.bas code in Module2 and comment out the line: Attribute VB_Name = "JsonConverter"
Place the GetInfo sub in Module1.
Option Compare Database
Option Explicit
Public Sub GetInfo()
Dim json As Object, url1 As String, url2 As String, url3 As String
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.homedepot.com/p/svcs/frontEndModel/100001020?_=1556447908065", False
.send
Set json = Module2.ParseJson(.responseText)
End With
'Parse json object (see paths shown below for example)
url1 = json("primaryItemData")("media")("mediaList")(2)("location")
url2 = json("primaryItemData")("media")("mediaList")(3)("location")
url3 = json("primaryItemData")("media")("mediaList")(4)("location") 'example
Stop '<==delete me later
End Sub
Paths to the first 3 thumbnails:
json►primaryItemData►media►mediaList►2►location
json►primaryItemData►media►mediaList►3►location
json►primaryItemData►media►mediaList►4►location
Explore the json here.
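Rather than hard-coding indices 2, 3 and 4, the list can be walked in a loop. This is a sketch under assumptions: the function name CollectLocations is hypothetical, and it assumes every mediaList entry is a json object and that the response keeps the primaryItemData, media and mediaList keys shown in the paths above. VBA-JSON parses json objects into Scripting.Dictionary instances and json arrays into 1-based Collections, which is what the loop relies on.

```vba
Option Explicit

' Hypothetical helper (not part of the original answer).
' Walks json("primaryItemData")("media")("mediaList") and returns a
' Collection holding the "location" value of every entry that has one.
' Assumes each entry in mediaList is a Dictionary (a json object).
Public Function CollectLocations(ByVal json As Object) As Collection
    Dim result As New Collection
    Dim entry As Object

    For Each entry In json("primaryItemData")("media")("mediaList")
        ' Skip entries that carry no location key
        If entry.Exists("location") Then result.Add entry("location")
    Next entry

    Set CollectLocations = result
End Function
```

With the GetInfo sub above, this could be called as CollectLocations(json) after the ParseJson line. Note that the result contains every entry's location, not only the thumbnails, so some filtering may still be needed.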
2) Automated browser (IE version):
'VBE > Tools > References:
' Microsoft Internet Controls
Public Sub GetImageLinks()
Dim ie As New InternetExplorer, images As Object, i As Long
With ie
.Visible = True
.Navigate2 "https://www.homedepot.com/p/Orbit-Sandstone-Rock-Valve-Box-Cover-53017/100001020"
While .Busy Or .readyState < 4: DoEvents: Wend
Set images = .document.querySelectorAll(".media__thumbnail img")
For i = 0 To images.Length - 1
Debug.Print images.item(i).src
Next
Stop
.Quit
End With
End Sub
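Because the thumbnails are rendered by script, they may not exist yet when readyState first reaches 4. One hedged option is to poll for the nodes with a timeout instead of reading them immediately. The helper name WaitForNodes and the timeout value are assumptions for illustration, not part of the original answer.

```vba
Option Explicit

' Hypothetical helper (not part of the original answer).
' Polls the document until the CSS selector matches at least one node
' or the timeout elapses, then returns whatever node list was found
' (possibly empty). Uses Timer, which resets at midnight, so a wait
' that spans midnight ends on the first pass after the reset.
Public Function WaitForNodes(ByVal doc As Object, ByVal selector As String, _
                             ByVal timeoutSeconds As Double) As Object
    Dim startTime As Double
    Dim nodes As Object

    startTime = Timer
    Do
        Set nodes = doc.querySelectorAll(selector)
        If nodes.Length > 0 Then Exit Do
        DoEvents    ' keep the host application responsive while waiting
    Loop While Timer - startTime >= 0 And Timer - startTime < timeoutSeconds

    Set WaitForNodes = nodes
End Function
```

In GetImageLinks this could replace the direct querySelectorAll call, for example Set images = WaitForNodes(.document, ".media__thumbnail img", 10), with the loop over images left unchanged.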