Mechanize not dealing with cookies like a browser does
I have the following code:
use WWW::Mechanize;
$url = "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E";
$mech = WWW::Mechanize->new();
$mech->get($url);
$content = $mech->content();
while ($content =~ m/<META HTTP-EQUIV="refresh" CONTENT="(\d+); URL=(.+?)">/) {
    $refresh = $1;
    $link = $2;
    sleep $refresh;
    $mech->get($link);
    $content = $mech->content();
}
$mech->save_content("output.txt");
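When I open the URL assigned to $url in a browser, the end result is a PDF download, but when I run the code above I end up with a different file. I think Mechanize may not be handling cookies correctly. How can I get this to work?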
You can try adding a cookie jar to the constructor, something like this:
use HTTP::Cookies;
my $cookie_jar = HTTP::Cookies->new(file => $cookie_file, autosave => 1, ignore_discard => 1);
my $mech = WWW::Mechanize->new('ssl_opts'=> {'SSL_verify_mode'=>'SSL_VERIFY_NONE'}, cookie_jar => $cookie_jar, autocheck => 0);
If you want to save the cookies and load them later to preserve your session, do the following:
$cookie_jar->save;
#after the content call
To load the cookies:
$mech->cookie_jar->load($cookie_file);
#before the get function (but you may want a conditional statement to check whether the cookie file even exists)
Hope this helps.
I get a 404 when I enter the URL in a browser, but try this code to get more detailed debugging output.
use strict;
use warnings;
use LWP::ConsoleLogger::Easy qw( debug_ua );
use WWW::Mechanize;
my $url
    = "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E";
my $mech = WWW::Mechanize->new();
debug_ua( $mech );
$mech->get( $url );
my $content = $mech->content();
while (
    $content =~ m/<META HTTP-EQUIV="refresh" CONTENT="(\d+); URL=(.+?)">/ )
{
    my $refresh = $1;
    my $link    = $2;
    sleep $refresh;
    $mech->get( $link );
    $content = $mech->content();
}
$mech->save_content( "output.txt" );
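The refresh loop depends on the two regex captures: the first group is the delay in seconds and the second is the URL to fetch next. As a rough offline sketch of just that step, the snippet below runs a case-insensitive variant of the pattern against a made-up sample tag. The helper name parse_meta_refresh and the sample string are illustrative assumptions, not part of WWW::Mechanize.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize): pull the delay and the
# target URL out of a META refresh tag, or return an empty list if none.
sub parse_meta_refresh {
    my ($html) = @_;
    if ( $html =~ m/<meta\s+http-equiv="refresh"\s+content="(\d+);\s*url=([^"]+)"/i ) {
        return ( $1, $2 );    # $1 = seconds to wait, $2 = URL to fetch next
    }
    return;
}

# Made-up sample page, modelled on the tag the regex in the loop looks for.
my $sample = '<META HTTP-EQUIV="refresh" CONTENT="1; URL=/TMP/example.html">';
my ( $delay, $next ) = parse_meta_refresh($sample);
print "wait $delay second(s), then fetch $next\n";
```

If the captures are never assigned, sleep and get receive undefined values, so the loop cannot follow the refresh.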
This is how I automated it in VBA:
Private Declare Function FindWindow Lib "user32" Alias "FindWindowA" _
(ByVal lpClassName As String, ByVal lpWindowName As String) As Long
Private Declare Function FindWindowEx Lib "user32" Alias "FindWindowExA" _
(ByVal hWnd1 As Long, ByVal hWnd2 As Long, ByVal lpsz1 As String, _
ByVal lpsz2 As String) As Long
Private Declare Function SetCursorPos Lib "user32" _
(ByVal X As Integer, ByVal Y As Integer) As Long
Private Declare Function GetWindowRect Lib "user32" _
(ByVal hwnd As Long, lpRect As RECT) As Long
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Private Declare Sub mouse_event Lib "user32.dll" (ByVal dwFlags As Long, _
ByVal dx As Long, ByVal dy As Long, ByVal cButtons As Long, ByVal dwExtraInfo As Long)
Private Declare Sub SetWindowPos Lib "user32" (ByVal hwnd As Integer, ByVal _
hWndInsertAfter As Integer, ByVal X As Integer, ByVal Y As Integer, ByVal cx As _
Integer, ByVal cy As Integer, ByVal wFlags As Integer)
'~~> Constants for pressing left button of the mouse
Private Const MOUSEEVENTF_LEFTDOWN As Long = &H2
'~~> Constants for Releasing left button of the mouse
Private Const MOUSEEVENTF_LEFTUP As Long = &H4
Private Type RECT
Left As Long
Top As Long
Right As Long
Bottom As Long
End Type
Const HWND_TOPMOST = -1
Const HWND_NOTOPMOST = -2
Const SWP_NOSIZE = &H1
Const SWP_NOMOVE = &H2
Const SWP_NOACTIVATE = &H10
Const SWP_SHOWWINDOW = &H40
Dim ie As InternetExplorer
Sub GetFiles()
Set ie = New InternetExplorer
GetFileFromUrl "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E"
GetFileFromUrl "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/31&Lang=F"
End Sub
Sub GetFileFromUrl(url As String)
Dim pos As RECT
ie.Navigate url
ie.Visible = True
While ie.ReadyState <> 4
DoEvents
Wend
Sleep 7000
ie.ExecWB 4, 1, "c:\test.pdf"
Sleep 5000
SaveAsHwnd = FindWindow(vbNullString, "Save As")
If SaveAsHwnd <> 0 Then
Debug.Print "Found Save As window"
Else
Debug.Print "Did not find Save As window"
End If
SaveButtonHwnd = FindWindowEx(SaveAsHwnd, ByVal 0&, "Button", "&Save")
If SaveButtonHwnd <> 0 Then
Debug.Print "Found Save button"
' click button
'res = SendMessage(SaveButtonHwnd, TCM_SETCURFOCUS, 1, ByVal 0&)
'res = PostMessage(SaveButtonHwnd, BM_CLICK, ByVal 0&, ByVal 0&)
'res = SendMessage(SaveButtonHwnd, WM_COMMAND, 0&, 0&)
GetWindowRect SaveButtonHwnd, pos
'~~> Move the cursor to the specified screen coordinates.
SetCursorPos (pos.Left - 10), (pos.Top - 10)
'~~> Suspends the execution of the current thread for a specified interval.
'~~> This give ample amount time for the API to position the cursor
Sleep 100
SetCursorPos pos.Left, pos.Top
Sleep 100
SetCursorPos (pos.Left + pos.Right) / 2, (pos.Top + pos.Bottom) / 2
'~~> Set the size, position, and Z order of "File Download" Window
SetWindowPos Ret, HWND_TOPMOST, 0, 0, 0, 0, SWP_NOACTIVATE Or SWP_SHOWWINDOW Or SWP_NOMOVE Or SWP_NOSIZE
Sleep 100
'~~> Simulate mouse motion and click the button
'~~> Simulate LEFT CLICK
mouse_event MOUSEEVENTF_LEFTDOWN, (pos.Left + pos.Right) / 2, (pos.Top + pos.Bottom) / 2, 0, 0
Sleep 700
'~~> Simulate Release of LEFT CLICK
mouse_event MOUSEEVENTF_LEFTUP, (pos.Left + pos.Right) / 2, (pos.Top + pos.Bottom) / 2, 0, 0
Else
Debug.Print "Did not find Save button"
End If
Sleep 5000
End Sub
Alternatively, you can use the UIAutomation COM object:
Sub GetFilesAutomation()
Dim o As IUIAutomation
Dim e As IUIAutomationElement
Dim SaveAsHwnd As LongPtr
Dim ie As New InternetExplorer
Set o = New CUIAutomation
ie.Navigate "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E"
ie.Visible = True
Sleep 10000
ie.ExecWB 4, 1
Sleep 5000
SaveAsHwnd = FindWindow(vbNullString, "Save As")
Set e = o.ElementFromHandle(ByVal SaveAsHwnd)
Dim iCnd As IUIAutomationCondition
Set iCnd = o.CreatePropertyCondition(UIA_NamePropertyId, "Save")
Dim Button As IUIAutomationElement
Set Button = e.FindFirst(TreeScope_Subtree, iCnd)
Dim InvokePattern As IUIAutomationInvokePattern
Set InvokePattern = Button.GetCurrentPattern(UIA_InvokePatternId)
InvokePattern.Invoke
End Sub
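When you request http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E, you are first redirected to https.
You then get a page with a META REFRESH, which points you to a file in /TMP.
After getting https://daccess-ods.un.org/TMP/xxx.xxx.html and following the META REFRESH to https://documents-dds-ny.un.org/doc/UNDOC/GEN/G15/263/87/PDF/G1526387.pdf?OpenElement, the file still is not downloaded; you get an error message instead.
Looking at the headers, the reason it works from the browser is that the browser has three cookies set, whereas WWW::Mechanize has only one: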
- citrix_ns_id=xxx
- citrix_ns_id_.un.org_%2F_wat=xxx
- LtpaToken=xxx
So where do these cookies come from? It turns out the TMP HTML is more than just a META REFRESH. It also contains this HTML:
<frameset ROWS="0,100%" framespacing="0" FrameBorder="0" Border="0">
<frame name="footer" scrolling="no" noresize target="main" src="https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234" marginwidth="0" marginheight="0">
<frame name="main" src="" scrolling="auto" target="_top">
<noframes>
<body>
<p>This page uses frames, but your browser doesn't support them.</p>
</body>
</noframes>
</frameset>
The URL https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234 does indeed set these cookies:
Set-Cookie: LtpaToken=xxx; domain=.un.org; path=/
Set-Cookie: citrix_ns_id=xxx; Domain=.un.org; Path=/; HttpOnly
Set-Cookie: citrix_ns_id_.un.org_%2F_wat=xxx; Domain=.un.org; Path=/
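Note that all three cookies are scoped to the domain .un.org, so a client that fetched the login frame will also send them to the other un.org hosts involved. A minimal sketch, using only core Perl, that splits the header lines shown above into name, domain and path; parse_set_cookie is a hypothetical helper written for illustration, not an HTTP::Cookies or LWP function.

```perl
use strict;
use warnings;

# Hypothetical helper: split one Set-Cookie header line into the cookie
# name/value and its attributes (attribute names are lower-cased).
sub parse_set_cookie {
    my ($header) = @_;
    $header =~ s/^Set-Cookie:\s*//i;
    my ( $pair, @attrs ) = split /\s*;\s*/, $header;
    my ( $name, $value ) = split /=/, $pair, 2;
    my %cookie = ( name => $name, value => $value );
    for my $attr (@attrs) {
        my ( $k, $v ) = split /=/, $attr, 2;
        # Flag attributes without a value, such as HttpOnly, are stored as 1.
        $cookie{ lc $k } = defined $v ? $v : 1;
    }
    return \%cookie;
}

# The three header lines quoted above (values already masked as xxx).
my @headers = (
    'Set-Cookie: LtpaToken=xxx; domain=.un.org; path=/',
    'Set-Cookie: citrix_ns_id=xxx; Domain=.un.org; Path=/; HttpOnly',
    'Set-Cookie: citrix_ns_id_.un.org_%2F_wat=xxx; Domain=.un.org; Path=/',
);

for my $h (@headers) {
    my $c = parse_set_cookie($h);
    print "$c->{name} applies to domain $c->{domain}, path $c->{path}\n";
}
```

In real code the cookie jar attached to the user agent does this parsing for you; the sketch only makes the domain scoping visible.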
So, changing your code to take this into account:
use strict;
use WWW::Mechanize;
my $url = "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E";
my $mech = WWW::Mechanize->new();
$mech->get($url);
my $more = 1;
while ($more) {
    $more = 0;
    my $follow_link;
    my @links = $mech->links;
    foreach my $link (@links) {
        if ($link->tag eq 'meta') {
            $follow_link = $link;
        }
        if (($link->tag eq 'frame') && ($link->url)) {
            $mech->follow_link( url => $link->url );
            $mech->back;
        }
    }
    if ($follow_link) {
        $more = 1;
        $mech->follow_link( url => $follow_link->url );
    }
}
$mech->save_content("output.txt");
output.txt now contains the PDF:
$ file output.txt
output.txt: PDF document, version 1.5
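If the file utility is not available, the same check can be approximated from Perl by testing for the "%PDF-" signature that PDF files start with. This is only a sketch: looks_like_pdf is a hypothetical helper, and it inspects just the first five bytes rather than validating the document.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file starts with the PDF signature "%PDF-".
sub looks_like_pdf {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return 0;    # unreadable files count as "not a PDF"
    my $read = read $fh, my $head, 5;
    close $fh;
    return ( defined $read && $read == 5 && $head eq '%PDF-' ) ? 1 : 0;
}

# Check the file named on the command line, defaulting to output.txt.
my $file = @ARGV ? $ARGV[0] : 'output.txt';
print "$file: ",
    ( looks_like_pdf($file) ? "looks like a PDF" : "not a PDF (or unreadable)" ),
    "\n";
```

An HTML error page saved under the same name would fail this check, which makes it a quick way to tell the two outcomes apart.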