Fuzzy Logic: How to detect when a 404 is not actually an error page?

I've run into the strangest situation: a site (http://seventhgeneration.com/mission) incorrectly returns a 404 response code.

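I'm writing an automated test suite that checks every link within a site and tests whether it is broken. In this case, I'm testing a site that links to http://seventhgeneration.com/mission, but I have no control over the Seventh Generation mission page. The page works in the browser, even though it returns a 404 in the network monitor.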

Is there any technical way to verify that this page is not an error page, while still correctly detecting other pages (such as https://github.com/thisShouldNotExist) as 404s? As someone mentioned in the comments, the Seventh Generation site does have a 404 page that appears for other broken URLs: http://seventhgeneration.com/shouldNotExist

# -*- coding: utf-8 -*-

import traceback
import urllib2
import httplib

url = 'http://seventhgeneration.com/mission'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    #'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
   'Accept-Encoding': 'gzip, deflate',
   'Accept-Language': 'en-US,en;q=0.8',
   'Connection': 'keep-alive'}

request = urllib2.Request(url, headers=HEADERS)
try:

    response = urllib2.urlopen(request)
    response_header = response.info()

    print "Success: %s - %s"%(response.code, response_header)

except urllib2.HTTPError as e:

    print 'urllib2.HTTPError %s - %s'%(e.code, e.headers)

except urllib2.URLError as e:

    print "Unknown URLError: %s"%(e.reason)

except httplib.BadStatusLine as e:
    print "Bad Status Error. (Presumably, the server closed the connection before sending a valid response)"

except Exception:

    print "Unknown Exception: %s"%(traceback.format_exc())

When run, this script returns:

urllib2.HTTPError 404 - Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: HIT
Etag: "1422054308-1"
Content-Language: en
Link: </node/1523879>; rel="shortlink",</404>; rel="canonical",</node/1523879>; rel="shortlink",</404>; rel="canonical"
X-Generator: Drupal 7 (http://drupal.org)
Cache-Control: public, max-age=21600
Last-Modified: Fri, 23 Jan 2015 23:05:08 +0000
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: Cookie,Accept-Encoding
Content-Encoding: gzip
X-Request-ID: v-82b55230-a357-11e4-94fe-1231380988d9
X-AH-Environment: prod
Content-Length: 11441
Accept-Ranges: bytes
Date: Fri, 23 Jan 2015 23:28:17 GMT
X-Varnish: 2729940224
Age: 0
Via: 1.1 varnish
Connection: close
X-Cache: MISS
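
One observable detail in the headers above is that the Link header names </404> as the rel="canonical" URL for this page. Whether that is a reliable signal for this site is an assumption, not something the output confirms. As a minimal sketch, the hypothetical helper `canonical_target` below pulls the canonical target out of a Link header value so a link checker could log it alongside the status code:

```python
import re

def canonical_target(link_header):
    """Return the URL marked rel="canonical" in a Link header value, or None.

    Hypothetical helper for illustration; it handles only the simple
    <url>; rel="name" form seen in the response above.
    """
    pattern = r'<([^>]*)>\s*;\s*rel="?([^",;]+)"?'
    for url, rel in re.findall(pattern, link_header):
        if rel.strip().lower() == "canonical":
            return url
    return None

# Link header value copied from the response above (duplicates removed).
link = '</node/1523879>; rel="shortlink",</404>; rel="canonical"'
print(canonical_target(link))  # prints /404
```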

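That server is clearly not compliant with the HTTP spec. It returns an entire web page in the HTML body, which is supposed to be a description of why the 404 error occurred. You need to get this fixed rather than trying to figure out a way to work around it.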
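
When the server cannot be fixed, as with a third-party link, one possible heuristic is to compare the suspect body with the body of a URL on the same host that is known not to exist (the question cites http://seventhgeneration.com/shouldNotExist). The sketch below assumes both bodies have already been fetched and decoded to text. The name `looks_like_error_page` and the 0.9 threshold are hypothetical and would need tuning per site:

```python
import difflib

def looks_like_error_page(body, known_404_body, threshold=0.9):
    """Return True when body is nearly identical to the site's known 404 page.

    threshold is an assumed cut-off on difflib's similarity ratio (0.0 to 1.0).
    """
    ratio = difflib.SequenceMatcher(None, body, known_404_body).ratio()
    return ratio >= threshold

# Made-up sample bodies standing in for real responses.
known_404 = "<html><title>Page not found</title><body>Sorry, nothing here.</body></html>"
suspect = "<html><title>Our Mission</title><body>We make household products.</body></html>"

print(looks_like_error_page(known_404, known_404))
print(looks_like_error_page(suspect, known_404))
```

A 404 whose body differs substantially from the site's real error page would then be flagged for manual review rather than reported as broken. SequenceMatcher can be slow on large pages, so comparing only the title or the first few kilobytes may be more practical.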