Downloading file with urllib2 vs requests: Why are these outputs different?
This is a follow-up to a question I saw earlier today. In that question, a user asked about downloading a PDF from this URL:
http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009
I thought the two download functions below would give the same result, but the urllib2 version downloads some HTML with a script tag referencing a PDF loader, while the requests version downloads the real PDF. Can someone explain the difference in behavior?
import urllib2
import requests

def get_pdf_urllib2(url, outfile='ex.pdf'):
    resp = urllib2.urlopen(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Is requests somehow smart enough to wait for a dynamic website to render before downloading?
Edit
Following @cwallenpoole's idea, I compared the headers and tried swapping headers from the requests request into the urllib2 request. The magic header was Cookie; the functions below write identical files for the example URL.
def get_pdf_urllib2(url, outfile='ex.pdf'):
    req = urllib2.Request(url, headers={'Cookie': 'I2KBRCK=1'})
    resp = urllib2.urlopen(req)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Next question: where does requests get that cookie? Does requests make multiple trips to the server?
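(For anyone following along: requests keeps a cookie jar and applies any Set-Cookie header it sees, including those on intermediate redirect responses. The mechanics can be reproduced offline with the standard library's cookie jar. This is an illustrative sketch, not requests' actual internals: it uses the Python 3 module names http.cookiejar and urllib.request, a hand-built stand-in response object, and a Set-Cookie value copied from the server's 302, with the expiry date pushed out so the cookie isn't discarded as stale.)

```python
import http.cookiejar   # cookielib in Python 2
import urllib.request   # urllib2 in Python 2
from email.message import Message

class FakeResponse:
    """Minimal stand-in for an HTTP response: a CookieJar only
    needs .info() returning the response headers."""
    def __init__(self, headers):
        self._msg = Message()
        for name, value in headers:
            self._msg[name] = value

    def info(self):
        return self._msg

jar = http.cookiejar.CookieJar()
req = urllib.request.Request(
    "http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009")
resp = FakeResponse([
    ("Set-Cookie", "I2KBRCK=1; path=/; expires=Thu, 14-Dec-2027 17:28:28 GMT"),
])

# This is the step a cookie-aware client performs on every response,
# including each intermediate 302 of a redirect chain.
jar.extract_cookies(resp, req)
print({c.name: c.value for c in jar})  # {'I2KBRCK': '1'}
```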
Edit 2
The cookie comes from a redirect header:
>>> handler=urllib2.HTTPHandler(debuglevel=1)
>>> opener=urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
>>> respurl=urllib2.urlopen(req1)
send: 'GET /doi/pdf/10.1177/0956797614553009 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009?cookieSet=1
header: Set-Cookie: I2KBRCK=1; path=/; expires=Thu, 14-Dec-2017 17:28:28 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 110
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /doi/pdf/10.1177/0956797614553009?cookieSet=1 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: Location: http://journals.sagepub.com/action/cookieAbsent
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 85
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /action/cookieAbsent HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: AtyponWS/7.1
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 8344872279f77f45555d5f9aeb97985b
header: Set-Cookie: JSESSIONID=aaavQMGH8mvlh_-5Ct7Jv; path=/
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
header: Vary: Accept-Encoding
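(So that log answers it: the first 302 sets I2KBRCK=1; requests stores the cookie and sends it on the follow-up request, while bare urllib2 drops it and ends up redirected to /action/cookieAbsent. urllib2 can be given the same behavior by installing a cookie-aware opener. A sketch, written with the Python 3 names urllib.request and http.cookiejar; in Python 2 the same pattern uses urllib2 and cookielib:)

```python
import http.cookiejar   # cookielib in Python 2
import urllib.request   # urllib2 in Python 2

def make_cookie_opener():
    """Build an opener that, like requests, remembers Set-Cookie
    headers across the redirects of a single logical request."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))

def get_pdf(url, outfile='ex.pdf'):
    opener = make_cookie_opener()
    resp = opener.open(url)  # 302s are followed with cookies attached
    with open(outfile, 'wb') as f:
        f.write(resp.read())
```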
I'd wager this is a User-Agent header issue (I just used curl http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009 and got the same result you report for urllib2). This is the part of the request headers that lets the site know what type of program/user/whatever is accessing it (not the library, the HTTP request).
By default, it looks like urllib2 uses: Python-urllib/2.1
And requests uses: python-requests/{package version} {runtime}/{runtime version} {uname}/{uname -r}
If you're working on a Mac, I'd bet the site is reading Darwin/13.1.0 or similar and then serving you Mac-appropriate content. Otherwise, it's probably trying to direct you to some default alternative content (or to prevent you from scraping that URL).
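(Whichever header turns out to be decisive, overriding one is a one-liner in either library. A sketch in Python 3 syntax, with a made-up browser-style UA string standing in for whatever value you want to send:)

```python
import urllib.request   # urllib2 in Python 2

URL = "http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009"
UA = "Mozilla/5.0 (X11; Linux x86_64)"  # hypothetical browser-like string

# urllib: attach the header to the Request object before opening it
req = urllib.request.Request(URL, headers={"User-Agent": UA})
print(req.get_header("User-agent"))  # urllib capitalizes stored header names

# With requests it would be:
#   requests.get(URL, headers={"User-Agent": UA})
```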