Automate picture downloads from website with authentication, part two
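This question is derived from another question:
There I asked how to download pictures from a specific website that requires a login.
The same company runs two websites, cgwallpapers.com and gamewallpapers.com. With the help of the user who answered the other question I finally managed to automate the downloads from one of them, but I'm not able to reproduce the same steps on gamewallpapers.com.
Because of my inexperience with requests, some of what I say here may be wrong, so if a helper/expert has the time, please verify the parameters and the other things I describe rather than taking my word for them.
On cgwallpapers.com, I basically build the query to download a wallpaper like this:
http://www.cgwallpapers.com/members/getwallpaper.php?id=100&res=1920x1080
But I discovered that on gamewallpapers.com I can't use the same post data, because it seems to look like this:
cgwallpapers is easier because I can use an incrementing for loop over the wallpaper id with a specific resolution. For gamewallpapers.com I don't know how to automate the wallpaper downloads; if I'm not wrong, it seems to need a totally different treatment.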
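To make the cgwallpapers.com approach concrete, here is a minimal sketch of the incrementing loop described above. The helper name BuildCgUrl and the id range 100 to 102 are assumptions made for illustration; it only builds and prints the URLs and does not perform the authenticated download.

```vbnet
Imports System

Module CgWallpaperUrls

    ' Hypothetical helper: builds the cgwallpapers.com download URL for one id.
    Function BuildCgUrl(ByVal id As Integer, ByVal res As String) As String
        Return String.Format(
            "http://www.cgwallpapers.com/members/getwallpaper.php?id={0}&res={1}",
            id, res)
    End Function

    Sub Main()
        ' The id range below is arbitrary and only illustrates the loop.
        For id As Integer = 100 To 102
            Console.WriteLine(BuildCgUrl(id, "1920x1080"))
        Next
    End Sub

End Module
```

Each printed URL would then be requested with the login cookies attached, as the function below does for a single id.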
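So I don't know what to try, or even how to go about it.
After logging in to gamewallpapers.com, this is how I'm trying to download a wallpaper. Of course it doesn't work, because I'm not using the right query, but this code does work for cgwallpapers.com, so I'll show it in case it helps:
Note: WallpaperInfo is an unrelated object that I use to return the downloaded wallpaper image stream; it's too much code, so I've skipped it.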
' Required at the top of the file:
Imports System.IO
Imports System.Net

''' <summary>
''' Tries to download the specified wallpaper from GameWallpapers server.
''' </summary>
''' <param name="id">The wallpaper id.</param>
''' <param name="res">The wallpaper resolution.</param>
''' <param name="cookieCollection">The cookie collection.</param>
''' <returns>A <see cref="WallpaperInfo"/> instance containing the wallpaper info and the image stream.</returns>
Private Function GetWallpaperMethod(ByVal id As String,
                                    ByVal res As String,
                                    ByRef cookieCollection As CookieCollection) As WallpaperInfo

    Dim request As HttpWebRequest
    Dim url As String = String.Format("http://www.gamewallpapers.com/members/getwallpaper.php?id={0}&res={1}", id, res)
    Dim contentDisposition As String
    Dim webResponse As WebResponse = Nothing
    Dim responseStream As Stream = Nothing
    Dim imageStream As MemoryStream = Nothing
    Dim wallInfo As WallpaperInfo = Nothing

    Try
        request = DirectCast(WebRequest.Create(url), HttpWebRequest)

        With request
            .Method = "GET"
            .Headers.Add("Accept-Language", "en-US,en;q=0.5")
            .Headers.Add("Accept-Encoding", "gzip, deflate")
            .Headers.Add("Keep-Alive", "300")
            .Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            .AllowAutoRedirect = False
            .UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0"
            .KeepAlive = True
        End With

        If cookieCollection IsNot Nothing Then
            ' Pass cookie info so that we remain logged in.
            request.CookieContainer = Me.SetCookieContainer(url, cookieCollection)
        End If

        webResponse = request.GetResponse

        Using webResponse

            contentDisposition = CType(webResponse, HttpWebResponse).Headers("Content-Disposition")

            If Not String.IsNullOrEmpty(contentDisposition) Then ' There is an image to download.

                Dim filename As String = contentDisposition.Substring(contentDisposition.IndexOf("=") + "=".Length).
                                         TrimStart(" "c).TrimEnd({" "c, ";"c})

                imageStream = New MemoryStream
                responseStream = webResponse.GetResponseStream

                ' Copy the response body into memory in 2 KB chunks.
                Using responseStream
                    Dim buffer(2047) As Byte
                    Dim read As Integer
                    Do
                        read = responseStream.Read(buffer, 0, buffer.Length)
                        imageStream.Write(buffer, 0, read)
                    Loop Until read = 0
                End Using

                ' This is the object that I'll return:
                ' it stores the url, the wallpaper id,
                ' the wallpaper resolution, the wallpaper filename
                ' and finally the downloaded MemoryStream (the wallpaper image stream)
                wallInfo = New WallpaperInfo(url:=url,
                                             id:=id,
                                             resolution:=res,
                                             filename:=filename,
                                             imageStream:=imageStream)

            End If ' String.IsNullOrEmpty(contentDisposition)

        End Using ' webResponse

    Finally
        ' Exceptions propagate to the caller; just make sure everything is closed.
        If webResponse IsNot Nothing Then
            webResponse.Close()
        End If
        If responseStream IsNot Nothing Then
            responseStream.Close()
        End If

    End Try

    Return wallInfo

End Function

Private Function SetCookieContainer(ByVal url As String,
                                    ByVal cookieCollection As CookieCollection) As CookieContainer

    Dim cookieContainer As New CookieContainer
    Dim refDate As Date

    For Each oldCookie As Cookie In cookieCollection

        ' Skip cookies whose value parses as a date; copy the rest onto the target host.
        If Not DateTime.TryParse(oldCookie.Value, refDate) Then

            Dim newCookie As New Cookie

            With newCookie
                .Name = oldCookie.Name
                .Value = oldCookie.Value
                .Domain = New Uri(url).Host
                .Secure = False
            End With

            cookieContainer.Add(newCookie)

        End If

    Next oldCookie

    Return cookieContainer

End Function
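This is the full source code of what I'm trying to achieve, with a usage example showing how I expect it to work (a for loop that increments the wallpaper id to automate the downloads). It works once the base url name is changed from gamewallpapers.com to cgwallpapers.com, because this source only works with cgwallpapers.com; I was just trying it with the gamewallpapers.com url: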
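Normally the WGET tool would do the trick and let you download every file in a website directory. Unfortunately I tried it and it didn't work; I'm not sure whether that's because I'm not a member of the site or because the pictures are stored in a database.
Looking at the query string, I believe they deliberately avoid numeric IDs (for security reasons, so that people can't easily grab a dump of their site) and instead key on an alphanumeric wallpaper name plus a required key string:
wallpaper=wallpaper_ancient_space_01_1920x1080.jpg&keystr=1423106012&retry=
If Wget fails, you will need to write a screen scraper that downloads the links on each page, for example:
Using client As New System.Net.WebClient()
    client.DownloadFile("http://www.gamewallpapers.com/toplist.php", "C:\temp\page with links.txt")
End Using
You can easily page through and download all the pages by incrementing the &start query string parameter:
http://www.gamewallpapers.com/toplist.php?start=24&action=go&title=&maxage=0&latestnr=0&platform=&resolution=&cyberbabes=&membersonly2=&rating=0&minimumvotes2=0&sort=date
Once you have all the links to the images, you can download them with WebClient or HttpWebRequest.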
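As a rough illustration of the paging idea, the sketch below generates the list-page URLs by stepping the start value. The helper name BuildListPageUrl is hypothetical, and the step of 24 is an assumption taken from the start=24 example above; it only prints the URLs and fetches nothing.

```vbnet
Imports System

Module ListPageUrls

    ' Base URL and query string copied from the toplist.php example, with start moved to the end.
    Const ListUrl As String = "http://www.gamewallpapers.com/toplist.php"
    Const Query As String = "?action=go&title=&maxage=0&latestnr=0&platform=&resolution=&cyberbabes=&membersonly2=&rating=0&minimumvotes2=0&sort=date&start="

    ' Hypothetical helper: URL of the list page that begins at the given index.
    Function BuildListPageUrl(ByVal startIndex As Integer) As String
        Return ListUrl & Query & startIndex.ToString()
    End Function

    Sub Main()
        ' Assumption: 24 images per page, so start takes the values 0, 24, 48, ...
        For startIndex As Integer = 0 To 48 Step 24
            Console.WriteLine(BuildListPageUrl(startIndex))
        Next
    End Sub

End Module
```

A real scraper would keep stepping until a page comes back with no image links rather than stopping at a fixed upper bound.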
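Update:
As promised, I have come up with a "proper" solution to your gamewallpapers.com problem using the Telerik Testing Framework.
You must change the sUsername and sPassword variables to your own username/password to log in to the site successfully.
Optional variables you may want to change:
sResolutionString: Defaults to 1920x1080, which is what you specified in your original question. Change this to any resolution value the website accepts. One warning: I am not 100% sure that every image is available in every resolution, so changing this value may cause some images to be skipped when they don't exist in the requested resolution.
sDownloadPath: Currently set to the same folder as the application exe. Change this to the path where you want the pictures downloaded.
sUserAgent: Defaults to an Internet Explorer 11 user agent for Windows. Since the Telerik Testing Framework drives a real browser (in this case whatever version of IE is installed on your PC), it sends the "real" user agent with its requests. This user agent string is only used when downloading wallpapers with HttpWebRequest, and the default value is most likely unnecessary, because the included code captures the user agent that Telerik uses and saves it for later.
nMaxSkippedFilesInSuccession: Defaults to 10. When trying to download a wallpaper image, the application checks whether the filename already exists in your download directory. If it does, the file is not downloaded and a skip counter is incremented. If the skip counter reaches nMaxSkippedFilesInSuccession, the application stops, because it assumes you already downloaded the remaining files in a previous session. Note: In theory this value could even be set to 1 or 2, since the filenames are unique and will never overlap. The problem is that the toplist.php page is sorted by date: if x new images are added while the application is running, the images shift by x when you go to the next page. If x is greater than nMaxSkippedFilesInSuccession, the application will most likely end prematurely, because the offset makes you try to download many of the same images again.
nCurrentPageID: Defaults to 0. The list page toplist.php accepts a query string parameter named start that tells the page which index to start from, given the search parameters you selected. The list shows 24 images per page, so nCurrentPageID must be a multiple of 24 or you may end up skipping images. Depending on time and circumstances, you may not be able to download all the images in one session. If so, you can note which nCurrentPageID you left off at and update this variable to start from that value next time (keep in mind that images may shift as new wallpapers are added, because the list page is sorted by wallpaper date).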
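The stop rule behind nMaxSkippedFilesInSuccession can be sketched in isolation. The following is a simplified model, not the scraper itself: the module, the enum and the FindStopIndex helper are hypothetical names, and a plain array of results stands in for real downloads. A success resets the counter, a skip increments it, and processing stops once the counter reaches the limit.

```vbnet
Imports System

Module SkipCounterSketch

    Enum DownloadResult
        Failed
        Success
        Skipped
    End Enum

    ' Returns the index at which processing would stop, or -1 if the limit is never reached.
    Function FindStopIndex(ByVal results As DownloadResult(), ByVal maxSkipped As Integer) As Integer
        Dim skipped As Integer = 0
        For i As Integer = 0 To results.Length - 1
            Select Case results(i)
                Case DownloadResult.Success
                    skipped = 0          ' A new file was saved, so the run of skips is broken
                Case DownloadResult.Skipped
                    skipped += 1         ' File already on disk
            End Select
            If skipped >= maxSkipped Then Return i
        Next
        Return -1
    End Function

    Sub Main()
        Dim results As DownloadResult() = {
            DownloadResult.Skipped, DownloadResult.Success,
            DownloadResult.Skipped, DownloadResult.Skipped, DownloadResult.Skipped}
        ' With a limit of 3, the third consecutive skip is at index 4.
        Console.WriteLine(FindStopIndex(results, 3)) ' prints 4
    End Sub

End Module
```

A failed download leaves the counter untouched in this sketch, which is also how the Select Case in the full code treats it.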
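To use the Telerik Testing Framework, you just need to run the installer and then add a reference to ArtOfTest.WebAii.dll.
One quirk of the testing framework (at least with Internet Explorer) is that it doesn't let you start the browser as a hidden process. I have discussed this with Telerik support and they claim it isn't possible, even though other web scraping frameworks such as Watin do support it (for this and other reasons I personally still prefer Watin, but it is quite old now and hasn't been updated since 2011). Since it's nice to run web scraping tasks in the background without disturbing your use of the computer, this example starts the browser minimized (which Telerik supports) and then uses a Windows API call to hide the browser window. It's a bit of a hack, but in my experience it is useful and works well.
In my original answer I mentioned that you would most likely have to scrape the toplist.php page by clicking links and building up the download url, but I was able to get it working without clicking through to any page other than toplist.php. This is only possible because the wallpaper filename (essentially the ID you need for the download) is partially contained in the preview image. I also originally thought the keystr query string parameter was some kind of id that "protected" the download, but it turns out it isn't needed at all to fetch the wallpaper.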
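As a small sketch of how the download url can be derived from a preview image, the helper below joins the preview filename, the resolution and the extension. BuildDownloadUrl is a hypothetical name, and the preview path in Main is made up for illustration; only the getwallpaper.php base url and the naming pattern come from this answer.

```vbnet
Imports System
Imports System.IO

Module DownloadUrlSketch

    Const DownloadUrl As String = "http://www.gamewallpapers.com/members/getwallpaper.php?wallpaper="

    ' Hypothetical helper: derives the full-size download URL from a preview image URL.
    Function BuildDownloadUrl(ByVal previewSrc As String, ByVal resolution As String) As String
        Dim name As String = Path.GetFileNameWithoutExtension(previewSrc) ' e.g. wallpaper_ancient_space_01
        Dim ext As String = Path.GetExtension(previewSrc)                 ' e.g. .jpg
        Return DownloadUrl & name & "_" & resolution & ext
    End Function

    Sub Main()
        ' The preview path below is an invented example, not a real URL from the site.
        Console.WriteLine(BuildDownloadUrl("http://www.gamewallpapers.com/previews/wallpaper_ancient_space_01.jpg", "1920x1080"))
    End Sub

End Module
```

The resulting wallpaper value has the same shape as the wallpaper_ancient_space_01_1920x1080.jpg name seen in the site's query string, with no keystr attached.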
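One last thing to mention is that the toplist.php page can be sorted by rating or by date. Rating is very volatile and can change at any time as people vote on images, so it's not a good sort order for this kind of work. We use date here because it sorts well and should always list the images in the same order as before, with one small catch: it doesn't appear to let you sort in reverse order. The newest images therefore always appear at the top of the first page. That shifts images through the list and will most likely cause you to re-test the same images when it happens. For cgwallpapers.com this isn't a problem, because new images receive a new (higher) id value, so we can just remember the last id we left off at and test the following ids to see whether there are new images. For gamewallpapers.com we always re-run from page id 0 and keep going until we hit a certain number of skipped files, which tells us we have reached the end of the images added since the last download.
Here is the code. Let me know if you have any questions: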
Imports ArtOfTest.WebAii.Core
Imports System.Runtime.InteropServices

Public Class Form1

    Const sUsername As String = "USERNAMEHERE"
    Const sPassword As String = "PASSWORDHERE"
    Const sMainURL As String = "http://www.gamewallpapers.com"
    Const sListURL As String = "http://www.gamewallpapers.com/members/toplist.php"
    Const sListQueryString As String = "?action=go&title=&maxage=0&latestnr=0&platform=&resolution=&cyberbabes=&membersonly2=&rating=0&minimumvotes2=0&sort=date&start="
    Const sDownloadURL As String = "http://www.gamewallpapers.com/members/getwallpaper.php?wallpaper="
    Const sResolutionString As String = "1920x1080"

    Private sDownloadPath As String = Application.StartupPath
    Private sUserAgent As String = "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko" ' Default to ie11 user agent
    Private oCookieContainerObject As New System.Net.CookieContainer
    Private nMaxSkippedFilesInSuccession As Int32 = 10
    Private nCurrentPageID As Int32 = 0 ' Only increment this value in steps of 24 or else you may miss some images

    Private Enum oDownloadResult
        Failed = 0
        Success = 1
        Skipped = 2
    End Enum

    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        StartScrape()
    End Sub
    Private Sub StartScrape()
        Dim oBrowser As Manager = Nothing

        Try
            ' Start Internet Explorer
            Dim oSettings As New Settings

            oSettings.Web.DefaultBrowser = BrowserType.InternetExplorer
            oSettings.DisableDialogMonitoring = False
            oSettings.UnexpectedDialogAction = UnexpectedDialogAction.DoNotHandle
            oSettings.Web.UseHttpProxy = True ' This must be enabled for us to get the headers being sent and know what the user agent is dynamically

            oBrowser = New Manager(oSettings)
            oBrowser.Start()
            oBrowser.LaunchNewBrowser(oSettings.Web.DefaultBrowser, True, ProcessWindowStyle.Minimized) ' Start minimized

            ' Set up a proxy so that we can capture the request headers
            Dim li As New ArtOfTest.WebAii.Messaging.Http.RequestListenerInfo(AddressOf RequestHandler)
            oBrowser.Http.AddBeforeRequestListener(li) ' Add proxy listener

            ' Hide the browser window
            HideBrowser(oBrowser)

            ' Load the main url
            oBrowser.ActiveBrowser.NavigateTo(sMainURL)
            oBrowser.ActiveBrowser.WaitUntilReady()
            oBrowser.Http.RemoveBeforeRequestListener(li) ' Remove proxy listener
            oBrowser.ActiveBrowser.RefreshDomTree()

            Dim bLoggedIn As Boolean = False

            ' Wait for the main logo image to show so that we know we have the right page
            oBrowser.ActiveBrowser.WaitForElement(New HtmlFindExpression("Tagname=div", "Id=clickable_logo"), 30000, False)
            Threading.Thread.Sleep(3000) ' Wait 3 seconds to prevent loading pages too quickly
            oBrowser.ActiveBrowser.RefreshDomTree()

            ' Check if we are logged in already or if we need to log in
            If oBrowser.ActiveBrowser.Find.ByExpression("Tagname=div", "Id=logout", "InnerText=Logout") IsNot Nothing Then
                ' Found the logout button, therefore we are already logged in
                bLoggedIn = True
            ElseIf oBrowser.ActiveBrowser.Find.ByExpression("Tagname=input", "Name=email") IsNot Nothing AndAlso oBrowser.ActiveBrowser.Find.ByExpression("Tagname=input", "Name=wachtwoord") IsNot Nothing Then
                ' Log in
                oBrowser.ActiveBrowser.RefreshDomTree()
                oBrowser.ActiveBrowser.Actions.SetText(oBrowser.ActiveBrowser.Find.ByExpression("Tagname=input", "Name=email"), sUsername)
                oBrowser.ActiveBrowser.Actions.SetText(oBrowser.ActiveBrowser.Find.ByExpression("Tagname=input", "Name=wachtwoord"), sPassword)
                oBrowser.ActiveBrowser.Actions.Click(oBrowser.ActiveBrowser.Find.ByExpression("Tagname=div", "Id=login", "InnerText=Login"))

                ' Wait for page to load
                oBrowser.ActiveBrowser.WaitUntilReady()
                oBrowser.ActiveBrowser.WaitForElement(New HtmlFindExpression("Tagname=div", "Id=logout", "InnerText=Logout"), 30000, False) ' Wait until Logout button is loaded

                bLoggedIn = True
            Else
                ' Didn't find any controls that we were looking for. Maybe the page was updated recently?
                MessageBox.Show("Error loading page. Maybe the html changed?")
            End If

            If bLoggedIn = True Then
                Dim bStop As Boolean = False
                Dim sPreviewImageFilename As String
                Dim sPreviewImageFileExtension As String
                Dim oURI As Uri = New Uri(sMainURL)
                Dim oCookie As System.Net.Cookie
                Dim nSkippedFiles As Int32 = 0

                ' Save cookies from browser to use with HttpWebRequest later
                For c As Int32 = 0 To oBrowser.ActiveBrowser.Cookies.GetCookies(oURI.Scheme & Uri.SchemeDelimiter & oURI.Host).Count - 1
                    oCookie = New System.Net.Cookie
                    oCookie.Name = oBrowser.ActiveBrowser.Cookies.GetCookies(oURI.Scheme & Uri.SchemeDelimiter & oURI.Host)(c).Name
                    oCookie.Value = oBrowser.ActiveBrowser.Cookies.GetCookies(oURI.Scheme & Uri.SchemeDelimiter & oURI.Host)(c).Value
                    oCookie.Domain = oURI.Host
                    oCookie.Secure = False
                    oCookieContainerObject.Add(oCookie)
                Next

                Threading.Thread.Sleep(3000) ' Wait 3 seconds to prevent loading pages too quickly

                Do Until bStop = True
                    ' Browse to the list url
                    oBrowser.ActiveBrowser.NavigateTo(sListURL & sListQueryString & nCurrentPageID)
                    oBrowser.ActiveBrowser.WaitUntilReady()

                    If oBrowser.ActiveBrowser.Find.AllByExpression("Tagname=img", "Class=toggleTooltip").Count > 0 Then
                        ' Get all preview images on the page
                        For i As Int32 = 0 To oBrowser.ActiveBrowser.Find.AllByExpression("Tagname=img", "Class=toggleTooltip").Count - 1
                            ' Convert the preview image browser element into an HtmlImage
                            Dim oHtmlImage As ArtOfTest.WebAii.Controls.HtmlControls.HtmlImage = oBrowser.ActiveBrowser.Find.AllByExpression("Tagname=img", "Class=toggleTooltip")(i).[As](Of ArtOfTest.WebAii.Controls.HtmlControls.HtmlImage)()

                            ' Extract the filename and extension from the preview image
                            sPreviewImageFilename = System.IO.Path.GetFileNameWithoutExtension(oHtmlImage.Src)
                            sPreviewImageFileExtension = System.IO.Path.GetExtension(oHtmlImage.Src)

                            ' Create a proper download url using the preview image filename and download the file in the resolution that we want using HttpWebRequest
                            Select Case DownloadImage(sDownloadURL & sPreviewImageFilename & "_" & sResolutionString & sPreviewImageFileExtension, sListURL & sListQueryString & nCurrentPageID)
                                Case Is = oDownloadResult.Success
                                    nSkippedFiles = 0 ' Reset skipped files back to zero
                                Case Is = oDownloadResult.Skipped
                                    nSkippedFiles += 1 ' Increment skipped files by one since we have already downloaded this file previously
                                Case Is = oDownloadResult.Failed
                                    ' The image didn't download properly.
                                    ' Do whatever error handling in here that you want to
                                    ' Maybe save the filename to a log file so you know which file(s) failed and download them again later?
                            End Select

                            If nSkippedFiles >= nMaxSkippedFilesInSuccession Then
                                ' We have skipped the maximum amount of files in a row so we must have downloaded them all (This should only ever happen on the 2nd+ run)
                                bStop = True
                                Exit For
                            Else
                                Threading.Thread.Sleep(3000) ' Wait 3 seconds to prevent loading pages too quickly
                            End If
                        Next

                        ' Increment the 'Start' querystring value by 24 to simulate clicking the 'Next' button and load the next 24 images
                        nCurrentPageID += 24
                    Else
                        ' No more images were found so we stop the application
                        bStop = True
                    End If
                Loop
            End If
        Catch ex As Exception
            MessageBox.Show(ex.Message)
        Finally
            ' Ensure browser is closed when we exit
            CleanupBrowser(oBrowser)
        End Try
    End Sub

    Private Sub RequestHandler(sender As Object, e As ArtOfTest.WebAii.Messaging.Http.HttpRequestEventArgs)
        ' Save the exact user agent we are using so that we can use it with HTTPWebRequest later
        sUserAgent = e.Request.Headers("User-Agent")
    End Sub

    Private Function DownloadImage(ByVal sPage As String, ByVal sReferer As String) As oDownloadResult
        Dim req As System.Net.HttpWebRequest
        Dim oReturn As oDownloadResult = oDownloadResult.Failed ' Stays Failed unless a file is saved or skipped

        Try
            req = DirectCast(System.Net.WebRequest.Create(sPage), System.Net.HttpWebRequest)
            req.Method = "GET"
            req.AllowAutoRedirect = False
            req.UserAgent = sUserAgent
            req.Referer = sReferer ' Send the list page as the referer (the parameter was previously unused)
            req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
            req.Headers.Add("Accept-Language", "en-US,en;q=0.5")
            req.Headers.Add("Accept-Encoding", "gzip, deflate")
            req.Headers.Add("Keep-Alive", "300")
            req.KeepAlive = True

            If oCookieContainerObject IsNot Nothing Then
                ' Set cookie info so that we continue to be logged in
                req.CookieContainer = oCookieContainerObject
            End If

            ' Save file to disk
            Using oResponse As System.Net.WebResponse = req.GetResponse()
                Dim sContentDisposition As String = CType(oResponse, System.Net.HttpWebResponse).Headers("Content-Disposition")

                If sContentDisposition IsNot Nothing Then
                    Dim sFilename As String = sContentDisposition.Substring(sContentDisposition.IndexOf("filename="), sContentDisposition.Length - sContentDisposition.IndexOf("filename=")).Replace("filename=", "").Replace("""", "").Replace(";", "").Trim
                    Dim sFullPath As String = System.IO.Path.Combine(sDownloadPath, sFilename)

                    If System.IO.File.Exists(sFullPath) = False Then
                        ' The Using blocks close both streams, so no explicit Close calls are needed
                        Using responseStream As IO.Stream = oResponse.GetResponseStream
                            Using fs As New IO.FileStream(sFullPath, System.IO.FileMode.Create, System.IO.FileAccess.Write)
                                Dim buffer(2047) As Byte
                                Dim read As Integer
                                Do
                                    read = responseStream.Read(buffer, 0, buffer.Length)
                                    fs.Write(buffer, 0, read)
                                Loop Until read = 0
                            End Using
                        End Using

                        oReturn = oDownloadResult.Success
                    Else
                        oReturn = oDownloadResult.Skipped ' We have downloaded this file before so skip it
                    End If
                End If
            End Using
        Catch exc As System.Net.WebException
            MessageBox.Show("Network Error: " & exc.Message.ToString & " Status Code: " & exc.Status.ToString & " from " & sPage, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error)
            oReturn = oDownloadResult.Failed
        End Try

        Return oReturn
    End Function

    Private Sub HideBrowser(ByRef oBrowser As Manager)
        Dim tmp_hWnd As IntPtr

        ' Poll for the browser window handle for up to about one second
        For w As Integer = 1 To 10
            tmp_hWnd = oBrowser.ActiveBrowser.Window.Handle
            If Not tmp_hWnd.Equals(IntPtr.Zero) Then Exit For
            Threading.Thread.Sleep(100)
        Next

        If Not tmp_hWnd.Equals(IntPtr.Zero) Then
            ' use ShowWindowAsync to change app window state (minimize and hide it).
            ShowWindowAsync(tmp_hWnd, ShowWindowCommands.Minimize)
            ShowWindowAsync(tmp_hWnd, ShowWindowCommands.Hide)
        Else
            ' no window handle?
            MessageBox.Show("Error - Unable to get a window handle")
        End If
    End Sub

    Private Sub CleanupBrowser(ByRef oBrowser As Manager)
        If oBrowser IsNot Nothing AndAlso oBrowser.ActiveBrowser IsNot Nothing Then
            oBrowser.ActiveBrowser.Close()
        End If

        If oBrowser IsNot Nothing Then
            oBrowser.Dispose()
        End If

        oBrowser = Nothing
    End Sub

End Class

Module Module1

    Public Enum ShowWindowCommands As Integer
        Hide = 0
        Normal = 1
        ShowMinimized = 2
        Maximize = 3
        ShowMaximized = 3
        ShowNoActivate = 4
        Show = 5
        Minimize = 6
        ShowMinNoActive = 7
        ShowNA = 8
        Restore = 9
        ShowDefault = 10
        ForceMinimize = 11
    End Enum

    <DllImport("user32.dll", SetLastError:=True)> _
    Public Function ShowWindowAsync(hWnd As IntPtr, <MarshalAs(UnmanagedType.I4)> nCmdShow As ShowWindowCommands) As <MarshalAs(UnmanagedType.Bool)> Boolean
    End Function

End Module