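This question is a spin-off of another question, "Automate picture downloads from website with authentication", where I asked how to download pictures from a specific website that requires a login.
The same company runs two sites, cgwallpapers.com and gamewallpapers.com. With the help of the user who answered the other question I finally worked out how to automate downloads from one of them (cgwallpapers.com), but I am unable to reproduce the same steps on gamewallpapers.com.
I have little experience with web requests, so some of what I say below may be wrong. If a helper or expert has the time, please verify the parameters and the other things I describe.
On cgwallpapers.com I basically build a query like this to download a wallpaper: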
http://www.cgwallpapers.com/members/getwallpaper.php?id=100&res=1920x1080
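However, I found that on gamewallpapers.com I cannot use the same query data, because the download query there takes a different form.
It is easier on cgwallpapers.com, because I can run an incrementing For loop over the id with a specific wallpaper resolution. On gamewallpapers.com I don't know how to automate the wallpaper downloads; unless I'm mistaken, it needs a completely different approach.
So I don't know what to try, or even how to start.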
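For reference, the cgwallpapers.com approach described above amounts to generating one URL per id. The following is only a minimal sketch, assuming that ids are consecutive integers; the helper name BuildCgWallpaperUrls is made up for illustration, and it only builds the URLs without sending any request:

```vbnet
Imports System
Imports System.Collections.Generic

Module CgUrlSketch
    ' Hypothetical helper (not part of the original source): builds the
    ' getwallpaper.php URL for every id in the range firstId..lastId.
    Function BuildCgWallpaperUrls(firstId As Integer, lastId As Integer, res As String) As List(Of String)
        Dim urls As New List(Of String)
        For id As Integer = firstId To lastId
            urls.Add(String.Format("http://www.cgwallpapers.com/members/getwallpaper.php?id={0}&res={1}", id, res))
        Next
        Return urls
    End Function

    Sub Main()
        ' Print the URLs for ids 100 to 102 at 1920x1080.
        For Each url As String In BuildCgWallpaperUrls(100, 102, "1920x1080")
            Console.WriteLine(url)
        Next
    End Sub
End Module
```

Each generated URL would then be passed to a download routine that reuses the login cookies.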
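After logging in to gamewallpapers.com, this is how I tried to download a wallpaper. Of course it does not work, because I am not using the right query, but this code does work for cgwallpapers.com, so I am showing it in case it helps: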
Note: WallpaperInfo is an unrelated object that I use to return the downloaded wallpaper image stream. It involves a lot of code, so I have left it out.
''' <summary>
''' Tries to download the specified wallpaper from GameWallpapers server.
''' </summary>
''' <param name="id">The wallpaper id.</param>
''' <param name="res">The wallpaper resolution.</param>
''' <param name="cookieCollection">The cookie collection.</param>
''' <returns>A <see cref="WallpaperInfo"/> instance containing the wallpaper info and the image stream.</returns>
Private Function GetWallpaperMethod(ByVal id As String,
ByVal res As String,
ByRef cookieCollection As CookieCollection) As WallpaperInfo
Dim request As HttpWebRequest
Dim url As String = String.Format("http://www.gamewallpapers.com/members/getwallpaper.php?id={0}&res={1}", id, res)
Dim contentDisposition As String
Dim webResponse As WebResponse = Nothing
Dim responseStream As Stream = Nothing
Dim imageStream As MemoryStream = Nothing
Dim wallInfo As WallpaperInfo = Nothing
Try
request = DirectCast(HttpWebRequest.Create(url), HttpWebRequest)
With request
.Method = "GET"
.Headers.Add("Accept-Language", "en-US,en;q=0.5")
.Headers.Add("Accept-Encoding", "gzip, deflate")
.Headers.Add("Keep-Alive", "300")
.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
.AllowAutoRedirect = False
.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0"
.KeepAlive = True
End With
If cookieCollection IsNot Nothing Then
' Pass cookie info so that we remain logged in.
request.CookieContainer = Me.SetCookieContainer(url, cookieCollection)
End If
webResponse = request.GetResponse
Using webResponse
contentDisposition = CType(webResponse, HttpWebResponse).Headers("Content-Disposition")
If Not String.IsNullOrEmpty(contentDisposition) Then ' There is an image to download.
Dim filename As String = contentDisposition.Substring(contentDisposition.IndexOf("=") + "=".Length).
TrimStart(" "c).TrimEnd({" "c, ";"c})
Try
imageStream = New MemoryStream
responseStream = webResponse.GetResponseStream
Using responseStream
Dim buffer(2047) As Byte
Dim read As Integer
Do
read = responseStream.Read(buffer, 0, buffer.Length)
imageStream.Write(buffer, 0, read)
Loop Until read = 0
responseStream.Close()
End Using
Catch ex As Exception
Throw
End Try
' This is the object that I'll return
' that I'm storing the url, the wallpaper id,
' the wallpaper resolution, the wallpaper filename
' and finally the downloaded MemoryStream (the wallpaper image stream)
wallInfo = New WallpaperInfo(url:=url,
id:=id,
resolution:=res,
filename:=filename,
imageStream:=imageStream)
End If ' String.IsNullOrEmpty(contentDisposition)
End Using ' webResponse
Catch ex As Exception
Throw
Finally
If webResponse IsNot Nothing Then
webResponse.Close()
End If
If responseStream IsNot Nothing Then
responseStream.Close()
End If
End Try
Return wallInfo
End Function
Private Function SetCookieContainer(ByVal url As String,
ByVal cookieCollection As CookieCollection) As CookieContainer
Dim cookieContainer As New CookieContainer
Dim refDate As Date
For Each oldCookie As Cookie In cookieCollection
If Not DateTime.TryParse(oldCookie.Value, refDate) Then
Dim newCookie As New Cookie
With newCookie
.Name = oldCookie.Name
.Value = oldCookie.Value
.Domain = New Uri(url).Host
.Secure = False
End With
cookieContainer.Add(newCookie)
End If
Next oldCookie
Return cookieContainer
End Function
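This is the full source code that I am trying to understand, with a usage example of how I expect it to work (a For loop that increments the wallpaper id to automate the downloads). It works perfectly when the base URL is changed from gamewallpapers.com to cgwallpapers.com, because this source only works for cgwallpapers.com; here I am just trying it with the gamewallpapers.com URL: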
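UPDATE:
As promised, I have put together a "proper" solution to your gamewallpapers.com problem using the Telerik Testing Framework.
You must change the sUsername and sPassword variables to your own username and password to log in to the site successfully.
Optional variables that you may want to change:
sResolutionString: Defaults to 1920x1080, which is what you specified in your original question. Change this value to any resolution accepted by the website. Just a warning: I am not 100% sure that every image is available in the same resolutions, so changing this value may cause images that lack the desired resolution to be skipped.
sDownloadPath: Currently set to the same folder as the application exe. Change this to the path where you want the images downloaded.
sUserAgent: Defaults to an Internet Explorer 11 user agent (the default string in the code reports Windows NT 6.3, which corresponds to Windows 8.1). Because the Telerik Testing Framework drives a real browser (in this case, whatever version of IE is installed on your computer), it sends the "real" user agent with its requests. This variable is only used when downloading wallpapers with HttpWebRequest, and the default value is most likely unnecessary, since the included code captures the user agent that Telerik uses and saves it for later.
nMaxSkippedFilesInSuccession: Defaults to 10. When trying to download a wallpaper image, the app checks whether the filename already exists in the download directory. If it does, the file is not downloaded and a skip counter is incremented. If the skip counter reaches the value of nMaxSkippedFilesInSuccession, the app stops, because it assumes you downloaded the remaining files in a previous session. Note: in theory this value could even be set to 1 or 2, since the filenames are unique and will never overlap. The problem is that the toplist.php page is sorted by date: if x new images are added while the app is running, the images are shifted by x when you move to the next page. If x is greater than nMaxSkippedFilesInSuccession, the app will probably terminate prematurely, because the shift makes it try to re-download many of the same images.
nCurrentPageID: Defaults to 0. The list page toplist.php accepts a query string parameter named Start that tells the page which index to start from, given the specified search parameters. The list shows 24 images per page, so nCurrentPageID must be divisible by 24, otherwise you may end up skipping images. Depending on time and circumstances, you may not be able to download all images in one session. If so, note the nCurrentPageID value you left off at and update this variable so that the next run starts from there (bear in mind that, because the list page is sorted by wallpaper date, the images may have shifted when new wallpapers were added to the site).
To use the Telerik Testing Framework, you just need to run the installer and then add a reference to ArtOfTest.WebAii.dll.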
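The divisible-by-24 rule for nCurrentPageID can be sketched as follows. This is only an illustration under stated assumptions: AlignToPage and BuildListUrl are hypothetical helpers that do not appear in the code below, and the query string is shortened here for readability, so the site may not accept it in this form.

```vbnet
Imports System

Module PagingSketch
    ' Images per toplist.php page, as described above.
    Const PageSize As Integer = 24

    ' Hypothetical helper: rounds a start index down to the nearest page
    ' boundary so that no images are skipped.
    Function AlignToPage(start As Integer) As Integer
        If start < 0 Then Return 0
        Return start - (start Mod PageSize)
    End Function

    ' Hypothetical helper: appends the aligned start index to the list URL.
    Function BuildListUrl(listUrl As String, queryString As String, start As Integer) As String
        Return listUrl & queryString & AlignToPage(start).ToString()
    End Function

    Sub Main()
        ' 50 is not a multiple of 24, so it is aligned down to 48.
        Console.WriteLine(AlignToPage(50))
        ' The query string below is abbreviated (an assumption for this sketch).
        Console.WriteLine(BuildListUrl("http://www.gamewallpapers.com/members/toplist.php", "?sort=date&start=", 50))
    End Sub
End Module
```

Aligning downward means a resumed session may revisit a few images, which the existing-file check then skips.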
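One quirk of the testing framework (at least with Internet Explorer) is that it does not let you launch the browser as a hidden process. I have spoken with Telerik support, and they claim it is not possible, even though other web scraping frameworks such as Watin do support it (I personally still prefer Watin for this and other reasons, but it is very old and has not been updated since 2011). Since it is nice to be able to run web scraping tasks in the background without disturbing your use of the computer, this example launches the browser minimized (which Telerik does support) and then uses a Windows API call to hide the browser window. It is a bit of a hack, but in my experience it is useful and works well.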
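In my original answer I mentioned that you would most likely have to crawl the toplist.php page by clicking links and building the download URLs, but I was able to get this working without visiting any page other than toplist.php. This is only possible because the wallpaper filename (which is basically the id you need in order to download it) is partially contained in the preview image. I also originally thought that the keystr query string parameter was some sort of id that "protects" the download, but it is actually not needed at all to fetch a wallpaper.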
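How a download URL can be derived from a preview image is sketched below. It mirrors the string concatenation used in the full code, but the preview image path is a made-up example and BuildDownloadUrl is a hypothetical helper name, so treat it as an illustration rather than the site's documented scheme.

```vbnet
Imports System
Imports System.IO

Module DownloadUrlSketch
    Const DownloadUrl As String = "http://www.gamewallpapers.com/members/getwallpaper.php?wallpaper="

    ' Hypothetical helper: takes the preview image source and the wanted
    ' resolution and returns name_resolution.ext appended to the download URL.
    Function BuildDownloadUrl(previewSrc As String, resolution As String) As String
        Dim name As String = Path.GetFileNameWithoutExtension(previewSrc)
        Dim ext As String = Path.GetExtension(previewSrc)
        Return DownloadUrl & name & "_" & resolution & ext
    End Function

    Sub Main()
        ' The preview path below is invented for this example.
        Console.WriteLine(BuildDownloadUrl("http://www.gamewallpapers.com/previews/wallpaper_example_01.jpg", "1920x1080"))
    End Sub
End Module
```

Note that no keystr parameter is appended, in line with the observation above that it is not needed.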
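One last thing to mention is that the toplist.php page can be sorted by rating or by date. Ratings are very volatile and change whenever people vote on images, so they are not a good sort order for this kind of work. We use date here because it sorts reliably and should always list the images in the same order as before, with one small catch: it does not seem to let you sort in reverse order. The newest images therefore always appear at the top of the first page, which shifts the images in the list and will most likely make you re-test the same images again. This is a non-issue for cgwallpapers.com, since new images receive a new (higher) id value: we can just remember the last id we left off at and test the next ids in succession to see whether there is a new image. That shortcut is not available for gamewallpapers.com.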
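The way re-tested images feed the skip counter can be sketched in isolation. This is a simplified model under the assumption that a success resets the counter, a skip increments it and a failure leaves it unchanged; FindStopIndex is a hypothetical helper used only for this illustration.

```vbnet
Imports System

Module SkipCounterSketch
    Enum DownloadResult
        Failed = 0
        Success = 1
        Skipped = 2
    End Enum

    ' Hypothetical helper: returns the index of the result at which scraping
    ' would stop, or -1 if the consecutive-skip limit is never reached.
    Function FindStopIndex(results As DownloadResult(), maxSkippedInSuccession As Integer) As Integer
        Dim skipped As Integer = 0
        For i As Integer = 0 To results.Length - 1
            Select Case results(i)
                Case DownloadResult.Success
                    skipped = 0 ' A new file was downloaded, so start counting again
                Case DownloadResult.Skipped
                    skipped += 1 ' The file already exists on disk
            End Select
            If skipped >= maxSkippedInSuccession Then Return i
        Next
        Return -1
    End Function

    Sub Main()
        ' Two images shifted onto this page are skipped, one new image resets
        ' the counter, then three skips in a row reach the limit of 3.
        Dim run As DownloadResult() = {DownloadResult.Skipped, DownloadResult.Skipped,
                                       DownloadResult.Success, DownloadResult.Skipped,
                                       DownloadResult.Skipped, DownloadResult.Skipped}
        Console.WriteLine(FindStopIndex(run, 3))
    End Sub
End Module
```

With a limit smaller than the number of shifted images, the same run would stop before reaching the new image, which is the premature termination described for nMaxSkippedFilesInSuccession.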
Here is the code. Let me know if you have any questions:
Imports ArtOfTest.WebAii.Core
Imports System.Runtime.InteropServices
Public Class Form1
Const sUsername As String = "USERNAMEHERE"
Const sPassword As String = "PASSWORDHERE"
Const sMainURL As String = "http://www.gamewallpapers.com"
Const sListURL As String = "http://www.gamewallpapers.com/members/toplist.php"
Const sListQueryString As String = "?action=go&title=&maxage=0&latestnr=0&platform=&resolution=&cyberbabes=&membersonly2=&rating=0&minimumvotes2=0&sort=date&start="
Const sDownloadURL As String = "http://www.gamewallpapers.com/members/getwallpaper.php?wallpaper="
Const sResolutionString As String = "1920x1080"
Private sDownloadPath As String = Application.StartupPath
Private sUserAgent As String = "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko" ' Default to ie11 user agent
Private oCookieContainerObject As New System.Net.CookieContainer
Private nMaxSkippedFilesInSuccession As Int32 = 10
Private nCurrentPageID As Int32 = 0 ' Only increment this value in multiples of 24 or else you may miss some images
Private Enum oDownloadResult
Failed = 0
Success = 1
Skipped = 2
End Enum
Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
StartScrape()
End Sub
Private Sub StartScrape()
Dim oBrowser As Manager = Nothing
Try
' Start Internet Explorer
Dim oSettings As New Settings
oSettings.Web.DefaultBrowser = BrowserType.InternetExplorer
oSettings.DisableDialogMonitoring = False
oSettings.UnexpectedDialogAction = UnexpectedDialogAction.DoNotHandle
oSettings.Web.UseHttpProxy = True ' This must be enabled for us to get the headers being sent and know what the user agent is dynamically
oBrowser = New Manager(oSettings)
oBrowser.Start()
oBrowser.LaunchNewBrowser(oSettings.Web.DefaultBrowser, True, ProcessWindowStyle.Minimized) ' Start minimized
' Set up a proxy so that we can capture the request headers
Dim li As New ArtOfTest.WebAii.Messaging.Http.RequestListenerInfo(AddressOf RequestHandler)
oBrowser.Http.AddBeforeRequestListener(li) ' Add proxy listener
' Hide the browser window
HideBrowser(oBrowser)
' Load the main url
oBrowser.ActiveBrowser.NavigateTo(sMainURL)
oBrowser.ActiveBrowser.WaitUntilReady()
oBrowser.Http.RemoveBeforeRequestListener(li) ' Remove proxy listener
oBrowser.ActiveBrowser.RefreshDomTree()
Dim bLoggedIn As Boolean = False
' Wait for the main logo image to show so that we know we have the right page
oBrowser.ActiveBrowser.WaitForElement(New HtmlFindExpression("Tagname=div", "Id=clickable_logo"), 30000, False)
Threading.Thread.Sleep(3000) ' Wait 3 seconds to prevent loading pages too quickly
oBrowser.ActiveBrowser.RefreshDomTree()
' Check if we are logged in already or if we need to log in
If oBrowser.ActiveBrowser.Find.ByExpression("Tagname=div", "Id=logout", "InnerText=Logout") IsNot Nothing Then
' Found the logout button, therefore we are already logged in
bLoggedIn = True
ElseIf oBrowser.ActiveBrowser.Find.ByExpression("Tagname=input", "Name=email") IsNot Nothing AndAlso oBrowser.ActiveBrowser.Find.ByExpression("Tagname=input", "Name=wachtwoord") IsNot Nothing Then
' Log in
oBrowser.ActiveBrowser.RefreshDomTree()
oBrowser.ActiveBrowser.Actions.SetText(oBrowser.ActiveBrowser.Find.ByExpression("Tagname=input", "Name=email"), sUsername)
oBrowser.ActiveBrowser.Actions.SetText(oBrowser.ActiveBrowser.Find.ByExpression("Tagname=input", "Name=wachtwoord"), sPassword)
oBrowser.ActiveBrowser.Actions.Click(oBrowser.ActiveBrowser.Find.ByExpression("Tagname=div", "Id=login", "InnerText=Login"))
' Wait for page to load
oBrowser.ActiveBrowser.WaitUntilReady()
oBrowser.ActiveBrowser.WaitForElement(New HtmlFindExpression("Tagname=div", "Id=logout", "InnerText=Logout"), 30000, False) ' Wait until Logout button is loaded
bLoggedIn = True
Else
' Didn't find any controls that we were looking for. Maybe the page was updated recently?
MessageBox.Show("Error loading page. Maybe the html changed?")
End If
If bLoggedIn = True Then
Dim bStop As Boolean = False
Dim sPreviewImageFilename As String
Dim sPreviewImageFileExtension As String
Dim oURI As Uri = New Uri(sMainURL)
Dim oCookie As System.Net.Cookie
Dim nSkippedFiles As Int32 = 0
' Save cookies from browser to use with HttpWebRequest later
For c As Int32 = 0 To oBrowser.ActiveBrowser.Cookies.GetCookies(oURI.Scheme & Uri.SchemeDelimiter & oURI.Host).Count - 1
oCookie = New System.Net.Cookie
oCookie.Name = oBrowser.ActiveBrowser.Cookies.GetCookies(oURI.Scheme & Uri.SchemeDelimiter & oURI.Host)(c).Name
oCookie.Value = oBrowser.ActiveBrowser.Cookies.GetCookies(oURI.Scheme & Uri.SchemeDelimiter & oURI.Host)(c).Value
oCookie.Domain = oURI.Host
oCookie.Secure = False
oCookieContainerObject.Add(oCookie)
Next
Threading.Thread.Sleep(3000) ' Wait 3 seconds to prevent loading pages too quickly
Do Until bStop = True
' Browse to the list url
oBrowser.ActiveBrowser.NavigateTo(sListURL & sListQueryString & nCurrentPageID)
oBrowser.ActiveBrowser.WaitUntilReady()
If oBrowser.ActiveBrowser.Find.AllByExpression("Tagname=img", "Class=toggleTooltip").Count > 0 Then
' Get all preview images on the page
For i As Int32 = 0 To oBrowser.ActiveBrowser.Find.AllByExpression("Tagname=img", "Class=toggleTooltip").Count - 1
' Convert the preview image browser element into an HtmlImage
Dim oHtmlImage As ArtOfTest.WebAii.Controls.HtmlControls.HtmlImage = oBrowser.ActiveBrowser.Find.AllByExpression("Tagname=img", "Class=toggleTooltip")(i).[As](Of ArtOfTest.WebAii.Controls.HtmlControls.HtmlImage)()
' Extract the filename and extension from the preview image
sPreviewImageFilename = System.IO.Path.GetFileNameWithoutExtension(oHtmlImage.Src)
sPreviewImageFileExtension = System.IO.Path.GetExtension(oHtmlImage.Src)
' Create a proper download url using the preview image filename and download the file in the resolution that we want using HttpWebRequest
Select Case DownloadImage(sDownloadURL & sPreviewImageFilename & "_" & sResolutionString & sPreviewImageFileExtension, sListURL & sListQueryString & nCurrentPageID)
Case Is = oDownloadResult.Success
nSkippedFiles = 0 ' Reset skipped files back to zero
Case Is = oDownloadResult.Skipped
nSkippedFiles += 1 ' Increment skipped files by one since we have already downloaded this file previously
Case Is = oDownloadResult.Failed
' The image didn't download properly.
' Do whatever error handling in here that you want to
' Maybe save the filename to a log file so you know which file(s) failed and download them again later?
End Select
If nSkippedFiles >= nMaxSkippedFilesInSuccession Then
' We have skipped the maximum amount of files in a row so we must have downloaded them all (This should only ever happen on the 2nd+ run)
bStop = True
Exit For
Else
Threading.Thread.Sleep(3000) ' Wait 3 seconds to prevent loading pages too quickly
End If
Next
' Increment the 'Start' querystring value by 24 to simulate clicking the 'Next' button and load the next 24 images
nCurrentPageID += 24
Else
' No more images were found so we stop the application
bStop = True
End If
Loop
End If
Catch ex As Exception
MessageBox.Show(ex.Message)
Finally
' Ensure browser is closed when we exit
CleanupBrowser(oBrowser)
End Try
End Sub
Private Sub RequestHandler(sender As Object, e As ArtOfTest.WebAii.Messaging.Http.HttpRequestEventArgs)
' Save the exact user agent we are using so that we can use it with HTTPWebRequest later
sUserAgent = e.Request.Headers("User-Agent")
End Sub
Private Function DownloadImage(ByVal sPage As String, sReferer As String) As oDownloadResult
Dim req As System.Net.HttpWebRequest
Dim oReturn As oDownloadResult
Try
req = DirectCast(System.Net.WebRequest.Create(sPage), System.Net.HttpWebRequest) ' Explicit cast so this also compiles with Option Strict On
req.Method = "GET"
req.AllowAutoRedirect = False
req.UserAgent = sUserAgent
req.Referer = sReferer ' Send the list page as the referer (the parameter was otherwise unused)
req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
req.Headers.Add("Accept-Language", "en-US,en;q=0.5")
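req.AutomaticDecompression = System.Net.DecompressionMethods.GZip Or System.Net.DecompressionMethods.Deflate ' Sends Accept-Encoding and decompresses the response, so a compressed body is never written to disk as-is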
req.Headers.Add("Keep-Alive", "300")
req.KeepAlive = True
If oCookieContainerObject IsNot Nothing Then
' Set cookie info so that we continue to be logged in
req.CookieContainer = oCookieContainerObject
End If
' Save file to disk
Using oResponse As System.Net.WebResponse = CType(req.GetResponse, System.Net.WebResponse)
Dim sContentDisposition As String = CType(oResponse, System.Net.HttpWebResponse).Headers("Content-Disposition")
If sContentDisposition IsNot Nothing Then
Dim sFilename As String = sContentDisposition.Substring(sContentDisposition.IndexOf("filename="), sContentDisposition.Length - sContentDisposition.IndexOf("filename=")).Replace("filename=", "").Replace("""", "").Replace(";", "").Trim
Dim sFullPath As String = System.IO.Path.Combine(sDownloadPath, sFilename)
If System.IO.File.Exists(sFullPath) = False Then
Using responseStream As IO.Stream = oResponse.GetResponseStream
Using fs As New IO.FileStream(sFullPath, System.IO.FileMode.Create, System.IO.FileAccess.Write)
Dim buffer(2047) As Byte
Dim read As Integer
Do
read = responseStream.Read(buffer, 0, buffer.Length)
fs.Write(buffer, 0, read)
Loop Until read = 0
End Using ' fs: disposing flushes and closes the file
End Using ' responseStream: disposing closes the stream
oReturn = oDownloadResult.Success
Else
oReturn = oDownloadResult.Skipped ' We have downloaded this file before so skip it
End If
End If
oResponse.Close()
End Using
Catch exc As System.Net.WebException
MessageBox.Show("Network Error: " & exc.Message.ToString & " Status Code: " & exc.Status.ToString & " from " & sPage, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error)
oReturn = oDownloadResult.Failed
End Try
Return oReturn
End Function
Private Sub HideBrowser(ByRef oBrowser As Manager)
Dim tmp_hWnd As IntPtr
For w As Integer = 1 To 10
tmp_hWnd = oBrowser.ActiveBrowser.Window.Handle
If Not tmp_hWnd.Equals(IntPtr.Zero) Then Exit For
Threading.Thread.Sleep(100)
Next
If Not tmp_hWnd.Equals(IntPtr.Zero) Then
' use ShowWindowAsync to change app window state (minimize and hide it).
ShowWindowAsync(tmp_hWnd, ShowWindowCommands.Minimize)
ShowWindowAsync(tmp_hWnd, ShowWindowCommands.Hide)
Else
' no window handle?
MessageBox.Show("Error - Unable to get a window handle")
End If
End Sub
Private Sub CleanupBrowser(ByRef oBrowser As Manager)
If oBrowser IsNot Nothing AndAlso oBrowser.ActiveBrowser IsNot Nothing Then
oBrowser.ActiveBrowser.Close()
End If
If oBrowser IsNot Nothing Then
oBrowser.Dispose()
End If
oBrowser = Nothing
End Sub
End Class
Module Module1
Public Enum ShowWindowCommands As Integer
Hide = 0
Normal = 1
ShowMinimized = 2
Maximize = 3
ShowMaximized = 3
ShowNoActivate = 4
Show = 5
Minimize = 6
ShowMinNoActive = 7
ShowNA = 8
Restore = 9
ShowDefault = 10
ForceMinimize = 11
End Enum
<DllImport("user32.dll", SetLastError:=True)> _
Public Function ShowWindowAsync(hWnd As IntPtr, <MarshalAs(UnmanagedType.I4)> nCmdShow As ShowWindowCommands) As <MarshalAs(UnmanagedType.Bool)> Boolean
End Function
End Module