Trying to download a page in Python with urllib2 and requests, but I keep getting redirected

Question

I am trying to download a page using Python.

http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770

If I check the response code from the server, I get 200:

import urllib2

url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
file_pointer = urllib2.urlopen(url)
print file_pointer.getcode()

However, if I check the final URL, I get the page I was redirected to:

file_pointer.geturl()

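I have tried urllib, urllib2, requests, and mechanize separately, and none of them work. I am obviously missing something, because other people in the office have working code. Please help.

Here is some more information, from requests: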

import requests

url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
proxy = { 'https': '200.35.152.93:1212'}
response = requests.get(url, proxies=proxy) 

send: 'GET /CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770 HTTP/1.1\r\nHost: webapps.rrc.state.tx.us\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.7.0 CPython/2.7.10 Windows/7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date: Wed, 26 Aug 2015 19:33:12 GMT
header: Server: Apache/2.2.15 (Red Hat)
header: Location: http://www.rrc.state.tx.us/site-policies/railroad-commission-of-texas-site-policies/?method=cmplP4FormPdf&packetSummaryId=97770
header: Content-Length: 405
header: Connection: close
header: Content-Type: text/html; charset=iso-8859-1
send: 'GET /site-policies/railroad-commission-of-texas-site-policies/?method=cmplP4FormPdf&packetSummaryId=97770 HTTP/1.1\r\nHost: www.rrc.state.tx.us\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.7.0 CPython/2.7.10 Windows/7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control: private
header: Content-Type: text/html; charset=utf-8
header: server: one
header: Date: Wed, 26 Aug 2015 19:33:11 GMT
header: Content-Length: 41216

Answer 1

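The problem is that this particular site checks your User-Agent header. Because you are a Python client, it does not let you fetch the PDF and redirects you instead.

So you need to mask your user agent.

See the example below: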

import urllib2

url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'

# Send a browser-like User-Agent so the server does not redirect the request.
req = urllib2.Request(url)
req.add_unredirected_header('User-Agent', 'Mozilla/5.0')

file_pointer = urllib2.urlopen(req)
print file_pointer.getcode()
print file_pointer.geturl()
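
For readers on Python 3, where urllib2 has been merged into urllib.request, the same idea might look roughly like the sketch below. The helper names build_request and fetch are made up for illustration, and 'Mozilla/5.0' is only an assumed value; the site may check for something more specific.

```python
import urllib.request

# Assumed browser-like value; the site may check for something more specific.
BROWSER_USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    # Hypothetical helper: only builds the request, nothing is sent yet.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=30):
    """Open the URL and return (status code, final URL, body bytes)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.status, resp.geturl(), resp.read()
```

Calling fetch(url) with the URL from the question should then return the status code, the final URL, and the body. If the final URL still points at the site-policies page, the server is probably checking something other than the User-Agent.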

Answer 2

OK, so with the requests module, all it took was disabling redirection. Here is my working code, which also uses a proxy server.

import requests

url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
proxy = {'https': '200.35.152.93:1212'}
r = requests.get(url, proxies=proxy, allow_redirects=False)
print r.url
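
To see what the server actually returns when redirects are not followed, it can help to look at the status code and the Location header rather than only r.url. The sketch below is one way to do that. The helper names redirect_target and check_url are hypothetical, and the proxy from the code above is left out for brevity.

```python
import requests


def redirect_target(response):
    """Return the Location header if the response is a redirect, else None."""
    if response.is_redirect:
        return response.headers["Location"]
    return None


def check_url(url, headers=None, timeout=30):
    """Request the URL without following redirects and report what came back."""
    response = requests.get(
        url,
        headers=headers,
        allow_redirects=False,  # keep the first response instead of following it
        timeout=timeout,
    )
    return response.status_code, redirect_target(response)
```

A result such as (302, 'http://www.rrc.state.tx.us/site-policies/...') would mean the server is still redirecting, as in the trace in the question, while (200, None) would mean the requested page itself came back.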

