Trying to download a page in Python with urllib2 and requests, but I keep getting redirected

I am trying to download a page using Python.

http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770

If I check the response code from the server, I get 200:

import urllib2

url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
file_pointer = urllib2.urlopen(url)
print file_pointer.getcode()

However, if I check the URL that was actually fetched, I get the page I was redirected to:

file_pointer.geturl()

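I have tried urllib, urllib2, requests, and mechanize separately, and none of them work. I am clearly missing something, because other people in my office have working code. Please help.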

Here is some more information from requests:

import requests

url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
proxy = { 'https': '200.35.152.93:1212'}
response = requests.get(url, proxies=proxy) 

send: 'GET /CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770 HTTP/1.1\r\nHost: webapps.rrc.state.tx.us\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.7.0 CPython/2.7.10 Windows/7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date: Wed, 26 Aug 2015 19:33:12 GMT
header: Server: Apache/2.2.15 (Red Hat)
header: Location: http://www.rrc.state.tx.us/site-policies/railroad-commission-of-texas-site-policies/?method=cmplP4FormPdf&packetSummaryId=97770
header: Content-Length: 405
header: Connection: close
header: Content-Type: text/html; charset=iso-8859-1
send: 'GET /site-policies/railroad-commission-of-texas-site-policies/?method=cmplP4FormPdf&packetSummaryId=97770 HTTP/1.1\r\nHost: www.rrc.state.tx.us\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nUser-Agent: python-requests/2.7.0 CPython/2.7.10 Windows/7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Cache-Control: private
header: Content-Type: text/html; charset=utf-8
header: server: one
header: Date: Wed, 26 Aug 2015 19:33:11 GMT
header: Content-Length: 41216
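
A trace like the one above is what Python's HTTP layer prints when connection debugging is turned on. A minimal sketch of enabling it on Python 3, assuming requests is the client (on Python 2 the module is `httplib` rather than `http.client`; the helper name `enable_http_trace` is made up for illustration):

```python
import http.client


def enable_http_trace(level=1):
    """Turn on the send/reply/header trace printed by http.client connections."""
    # debuglevel is a class attribute, so connections created afterwards
    # that do not override it (such as those requests opens through urllib3)
    # inherit the new value.
    http.client.HTTPConnection.debuglevel = level
    return http.client.HTTPConnection.debuglevel


enable_http_trace()
```

Once this has run, a later `requests.get(url)` prints the `send:`, `reply:` and `header:` lines to stdout, which makes a 302 hop easy to spot.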

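The problem is that this particular site checks your User-Agent header, and because you are a Python client it does not let you fetch the PDF and redirects you instead.

So you need to mask your user agent.

See the example below: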

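import urllib2

url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'

# Send a browser-like User-Agent so the server does not redirect the Python client.
req = urllib2.Request(url)
req.add_unredirected_header('User-Agent', 'Mozilla/5.0')

file_pointer = urllib2.urlopen(req)
print file_pointer.getcode()  # HTTP status of the final response
print file_pointer.geturl()   # final URL; equals `url` when no redirect happened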
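
On Python 3, `urllib2` no longer exists; its functionality lives in `urllib.request`. A minimal sketch of the same User-Agent approach there, assuming the server only looks at that header (the helper names `build_request` and `download` are made up for illustration):

```python
import urllib.request

URL = ('http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do'
       '?method=cmplP4FormPdf&packetSummaryId=97770')


def build_request(url, user_agent='Mozilla/5.0'):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def download(url, path):
    """Fetch url, write the body to path, and return (status, final_url)."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        body = response.read()
        status, final_url = response.getcode(), response.geturl()
    with open(path, 'wb') as handle:
        handle.write(body)
    return status, final_url
```

Calling `download(URL, 'report.pdf')` (the filename is just an example) returns the status code together with the final URL, so comparing that URL with `URL` shows whether a redirect still happened.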

OK, so with the requests module all I had to do was disable redirection. Here is my working code, which also uses a proxy server.

import requests

url = 'http://webapps.rrc.state.tx.us/CMPL/viewPdfReportFormAction.do?method=cmplP4FormPdf&packetSummaryId=97770'
proxy = { 'https': '200.35.152.93:1212'}
r = requests.get(url, proxies=proxy, allow_redirects=False)
print r.url
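
When redirects are disabled, the informative parts of the response are its status code and its `Location` header rather than the URL. A rough sketch of checking them on Python 3, with `urllib.request` swapped in for requests so that it needs only the standard library (`NoRedirect` and `probe_redirect` are illustrative names, not part of any library):

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any 3xx response."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is built, so the
        # 3xx response surfaces as an HTTPError instead of being followed.
        return None


def probe_redirect(url, user_agent=None):
    """Return (status_code, location) for url without following redirects."""
    headers = {'User-Agent': user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(request, timeout=30) as response:
            return response.getcode(), None
    except urllib.error.HTTPError as err:
        # Any 3xx (or other error status) ends up here.
        return err.code, err.headers.get('Location')
```

With requests, the same two values are available as `r.status_code` and `r.headers.get('Location')` on the response returned with `allow_redirects=False`.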