Downloading file with urllib2 vs requests: Why are these outputs different?
This is a follow-up to a question I saw earlier today. In that question, a user asked about downloading a PDF from this URL:
http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009
I thought the two download functions below would give the same result, but the urllib2 version downloads some HTML with a script tag referencing a PDF loader, while the requests version downloads the real PDF. Can someone explain the difference in behavior?
import urllib2
import requests

def get_pdf_urllib2(url, outfile='ex.pdf'):
    resp = urllib2.urlopen(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Is requests somehow smart enough to wait for a dynamic website to render before downloading?
Edit
Following @cwallenpoole's idea, I compared the headers and tried swapping headers from the requests request into the urllib2 request. The magic header was Cookie; the functions below write identical files for the example URL.
def get_pdf_urllib2(url, outfile='ex.pdf'):
    req = urllib2.Request(url, headers={'Cookie': 'I2KBRCK=1'})
    resp = urllib2.urlopen(req)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Next question: where does requests get that cookie? Does requests make multiple trips to the server?
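(For anyone following along: requests keeps a cookie jar and applies any Set-Cookie header it sees, including those on intermediate redirect responses. The mechanics can be reproduced offline with the standard library's cookie jar. This is an illustrative sketch, not requests' actual internals: it uses the Python 3 module names http.cookiejar and urllib.request, a hand-built stand-in response object, and a Set-Cookie value copied from the server's 302, with the expiry date pushed out so the cookie isn't discarded as stale.)

```python
import http.cookiejar   # cookielib in Python 2
import urllib.request   # urllib2 in Python 2
from email.message import Message

class FakeResponse:
    """Minimal stand-in for an HTTP response: a CookieJar only
    needs .info() returning the response headers."""
    def __init__(self, headers):
        self._msg = Message()
        for name, value in headers:
            self._msg[name] = value

    def info(self):
        return self._msg

jar = http.cookiejar.CookieJar()
req = urllib.request.Request(
    "http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009")
resp = FakeResponse([
    ("Set-Cookie", "I2KBRCK=1; path=/; expires=Thu, 14-Dec-2027 17:28:28 GMT"),
])

# This is the step a cookie-aware client performs on every response,
# including each intermediate 302 of a redirect chain.
jar.extract_cookies(resp, req)
print({c.name: c.value for c in jar})  # {'I2KBRCK': '1'}
```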
Edit 2
The cookie comes from a redirect header:
>>> handler=urllib2.HTTPHandler(debuglevel=1)
>>> opener=urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
>>> respurl=urllib2.urlopen(req1)
send: 'GET /doi/pdf/10.1177/0956797614553009 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009?cookieSet=1
header: Set-Cookie: I2KBRCK=1; path=/; expires=Thu, 14-Dec-2017 17:28:28 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 110
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /doi/pdf/10.1177/0956797614553009?cookieSet=1 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: Location: http://journals.sagepub.com/action/cookieAbsent
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 85
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /action/cookieAbsent HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: AtyponWS/7.1
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 8344872279f77f45555d5f9aeb97985b
header: Set-Cookie: JSESSIONID=aaavQMGH8mvlh_-5Ct7Jv; path=/
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
header: Vary: Accept-Encoding
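(So that log answers it: the first 302 sets I2KBRCK=1; requests stores the cookie and sends it on the follow-up request, while bare urllib2 drops it and ends up redirected to /action/cookieAbsent. urllib2 can be given the same behavior by installing a cookie-aware opener. A sketch, written with the Python 3 names urllib.request and http.cookiejar; in Python 2 the same pattern uses urllib2 and cookielib:)

```python
import http.cookiejar   # cookielib in Python 2
import urllib.request   # urllib2 in Python 2

def make_cookie_opener():
    """Build an opener that, like requests, remembers Set-Cookie
    headers across the redirects of a single logical request."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))

def get_pdf(url, outfile='ex.pdf'):
    opener = make_cookie_opener()
    resp = opener.open(url)  # 302s are followed with cookies attached
    with open(outfile, 'wb') as f:
        f.write(resp.read())
```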
I'd wager this is a User-Agent header issue (I just used curl http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009 and got the same result you report for urllib2). This is the part of the request headers that lets the site know what type of program/user/whatever is accessing it (not the library, the HTTP request).
By default, it looks like urllib2 uses: Python-urllib/2.1
And requests uses: python-requests/{package version} {runtime}/{runtime version} {uname}/{uname -r}
If you're working on a Mac, I'd bet the site is reading Darwin/13.1.0 or similar and then serving you Mac-appropriate content. Otherwise, it's probably trying to direct you to some default alternative content (or to prevent you from scraping that URL).
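(Whichever header turns out to be decisive, overriding one is a one-liner in either library. A sketch in Python 3 syntax, with a made-up browser-style UA string standing in for whatever value you want to send:)

```python
import urllib.request   # urllib2 in Python 2

URL = "http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009"
UA = "Mozilla/5.0 (X11; Linux x86_64)"  # hypothetical browser-like string

# urllib: attach the header to the Request object before opening it
req = urllib.request.Request(URL, headers={"User-Agent": UA})
print(req.get_header("User-agent"))  # urllib capitalizes stored header names

# With requests it would be:
#   requests.get(URL, headers={"User-Agent": UA})
```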