Python urllib2 throws "unknown url type" error on certain URLs when detecting redirects

When I load the following URL with urllib2, everything works:

# -*- coding: utf-8 -*-

import traceback
import urllib2
import httplib

url = 'http://www.marchofdimes.com/pregnancy/preterm-labor-and-birth.aspx'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    #'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
   'Accept-Encoding': 'gzip, deflate',
   'Accept-Language': 'en-US,en;q=0.8',
   'Connection': 'keep-alive'}

request = urllib2.Request(url, headers=HEADERS)
try:

    response = urllib2.urlopen(request)
    response_header = response.info()
    print "Success: %s - %s"%(response.code, response_header)

except urllib2.HTTPError, e:
    print 'urllib2.HTTPError %s - %s'%(e.code, e.headers)
except urllib2.URLError, e:
    print "Unknown URLError: %s"%(e.reason)
except httplib.BadStatusLine as e:
    print "Bad Status Error. (Presumably, the server closed the connection before sending a valid response)"
except Exception:
    print "Unkown Exception: %s"%(traceback.format_exc())

This outputs:

Success: 200 - Cache-Control: private
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/7.5
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
X-UA-Compatible: IE=edge
Date: Fri, 23 Jan 2015 23:34:11 GMT
Connection: close
Content-Length: 129514

Now I load the same URL with an opener that does not follow redirects, so that I can stop at each hop and detect the redirect myself:

... Same as above ...

class NoRedirection(urllib2.HTTPErrorProcessor):
    def http_response(self, request, response):
        return response
    https_response = http_response   

def load_url(url):
    print "====== Loading %s ======"%url
    request = urllib2.Request(url, headers=HEADERS)
    try:          
        opener = urllib2.build_opener(NoRedirection)    
        request = urllib2.Request(url, headers=HEADERS)
        response = opener.open(request)
        response_header = response.info()
        ending_url = response_header.getheader('Location') or url
        print "Success: %s - %s"%(response.code, response_header)
        has_redirect = url != ending_url
        if has_redirect:
            load_url(ending_url)
    except urllib2.HTTPError, e:
        print 'urllib2.HTTPError %s - %s'%(e.code, e.headers)
    except urllib2.URLError, e:
        print "Unknown URLError: %s"%(e.reason)
    except httplib.BadStatusLine as e:
        print "Bad Status Error. (Presumably, the server closed the connection before sending a valid response)"
    except Exception:            
        print "Unkown Exception: %s"%(traceback.format_exc())

load_url(url)

When run, this outputs:

====== Loading http://www.marchofdimes.com/pregnancy/preterm-labor-and-birth.aspx ======
Success: 301 - Content-Type: text/html; charset=UTF-8
Location: http://www.marchofdimes.org/pregnancy/preterm-labor-and-birth.aspx
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Fri, 23 Jan 2015 23:36:58 GMT
Connection: close
Content-Length: 189

====== Loading http://www.marchofdimes.org/pregnancy/preterm-labor-and-birth.aspx ======
Success: 302 - Location: /404.aspx?aspxerrorpath=/pregnancy/preterm-labor-and-birth.aspx
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
X-UA-Compatible: IE=edge
Date: Fri, 23 Jan 2015 23:36:59 GMT
Connection: close
Content-Length: 180

====== Loading /404.aspx?aspxerrorpath=/pregnancy/preterm-labor-and-birth.aspx ======
Unkown Exception: Traceback (most recent call last):
  File "urltest.py", line 32, in load_url
    response = opener.open(request)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 396, in open
    protocol = req.get_type()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 258, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: /404.aspx?aspxerrorpath=/pregnancy/preterm-labor-and-birth.aspx

This redirect detection has worked with every other URL I have tried, so I am confused about why it fails for this one.
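Looking at the output again, the URL that fails is the Location value from the 302 response, `/404.aspx?aspxerrorpath=...`, which is a relative reference with no scheme or host. That would explain the `unknown url type` error, since urllib2 needs a scheme to pick a handler. A minimal sketch of one possible fix, assuming the relative value should be resolved against the URL that was just requested; `resolve_location` is a hypothetical helper name, not part of urllib2:

```python
# urljoin lives in urlparse on Python 2 and in urllib.parse on Python 3.
try:
    from urlparse import urljoin
except ImportError:
    from urllib.parse import urljoin


def resolve_location(current_url, location):
    """Return an absolute URL for a Location header value.

    Hypothetical helper: falls back to current_url when there is no
    Location header, and leaves absolute Location values unchanged.
    """
    if not location:
        return current_url
    return urljoin(current_url, location)


# Values taken from the second hop in the output above.
current = 'http://www.marchofdimes.org/pregnancy/preterm-labor-and-birth.aspx'
relative = '/404.aspx?aspxerrorpath=/pregnancy/preterm-labor-and-birth.aspx'
print(resolve_location(current, relative))
```

In `load_url`, this would presumably replace the `ending_url = response_header.getheader('Location') or url` line with `ending_url = resolve_location(url, response_header.getheader('Location'))`. I have not verified that this is the only cause, only that it matches the traceback.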

Switching from urllib2 to the requests library has worked around the problem:

import requests
session = requests.Session()
response = session.get(url, headers=HEADERS, allow_redirects=False)