Fuzzy Logic: How to detect when a 404 is not actually an error page?
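I ran into the strangest situation, where a site (http://seventhgeneration.com/mission) incorrectly returns a 404 response code.
I am writing an automated test suite that checks every link within a site and tests whether it is broken. In this case, I am testing a website that links to http://seventhgeneration.com/mission, but I have no control over the Seventh Generation mission page. The page works in a browser, even though the network monitor shows it returning a 404.
Is there any technical way to verify that this page is not an error page, while still correctly detecting other pages (such as https://github.com/thisShouldNotExist) as 404s? As someone mentioned in the comments, the Seventh Generation site does have a 404 page that appears for other broken URLs: http://seventhgeneration.com/shouldNotExist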
# -*- coding: utf-8 -*-
# Python 2 script: request the page with browser-like headers and report the status.
import traceback
import urllib2
import httplib

url = 'http://seventhgeneration.com/mission'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    # Alternative User-Agent:
    # 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}
request = urllib2.Request(url, headers=HEADERS)
try:
    response = urllib2.urlopen(request)
    response_header = response.info()
    print "Success: %s - %s" % (response.code, response_header)
except urllib2.HTTPError as e:
    # Raised for 4xx/5xx responses; the exception still carries the status code and headers.
    print 'urllib2.HTTPError %s - %s' % (e.code, e.headers)
except urllib2.URLError as e:
    print "Unknown URLError: %s" % (e.reason)
except httplib.BadStatusLine as e:
    print "Bad Status Error. (Presumably, the server closed the connection before sending a valid response)"
except Exception:
    print "Unknown Exception: %s" % (traceback.format_exc())
When run, this script returns:
urllib2.HTTPError 404 - Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: HIT
Etag: "1422054308-1"
Content-Language: en
Link: </node/1523879>; rel="shortlink",</404>; rel="canonical",</node/1523879>; rel="shortlink",</404>; rel="canonical"
X-Generator: Drupal 7 (http://drupal.org)
Cache-Control: public, max-age=21600
Last-Modified: Fri, 23 Jan 2015 23:05:08 +0000
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: Cookie,Accept-Encoding
Content-Encoding: gzip
X-Request-ID: v-82b55230-a357-11e4-94fe-1231380988d9
X-AH-Environment: prod
Content-Length: 11441
Accept-Ranges: bytes
Date: Fri, 23 Jan 2015 23:28:17 GMT
X-Varnish: 2729940224
Age: 0
Via: 1.1 varnish
Connection: close
X-Cache: MISS
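One detail in the output above is the Link header, which lists </404> as rel="canonical". That may mean the CMS routed this request to its own 404 handler, though that is an inference from the header and not something confirmed here. A minimal Python 3 sketch for pulling the canonical target out of such a header follows. The function names parse_link_header and canonical_targets are made up for illustration, and the regex only handles the simple form shown above, where rel is the first parameter after the target.

```python
import re

# Matches entries of the simple form: <target>; rel="value"
# Assumption: rel is the first parameter after the target, as in the output above.
LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel\s*=\s*"?([^",;]*)"?')

# Link header value copied from the script output above.
EXAMPLE_LINK = ('</node/1523879>; rel="shortlink",</404>; rel="canonical",'
                '</node/1523879>; rel="shortlink",</404>; rel="canonical"')


def parse_link_header(value):
    """Return {rel: [targets]} for a simple Link header value, without duplicates."""
    rels = {}
    for target, rel_value in LINK_RE.findall(value):
        # A rel attribute may hold several space-separated relation types.
        for rel in rel_value.split():
            targets = rels.setdefault(rel.lower(), [])
            if target not in targets:
                targets.append(target)
    return rels


def canonical_targets(value):
    """Return the list of targets marked rel="canonical" (empty if none)."""
    return parse_link_header(value).get("canonical", [])


if __name__ == "__main__":
    print(canonical_targets(EXAMPLE_LINK))
```

A link checker could treat a canonical target such as /404 as one more hint alongside the status code, not as proof either way.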
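This server is clearly not following the HTTP specification. It returns an entire web page as the HTML body, which is supposed to be a description of why the 404 error occurred. You need to get the server fixed rather than look for a way to work around it.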
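When the server is outside your control, as it is for the asker, one possible fallback is a comparison heuristic. Fetch a URL on the same site that is known not to exist (for example http://seventhgeneration.com/shouldNotExist from the question), then check how closely the body of the suspicious 404 matches that baseline. The Python 3 sketch below covers only the comparison step. The function names, the sample bodies, and the 0.8 threshold are all assumptions chosen for illustration, and the tag stripping is deliberately crude.

```python
import difflib
import re

# Made-up sample bodies, standing in for pages fetched from a real site.
KNOWN_404 = ("<html><head><title>Page not found</title></head><body>"
             "<h1>Page not found</h1>"
             "<p>The requested page /shouldNotExist could not be found.</p>"
             "</body></html>")
OTHER_404 = KNOWN_404.replace("/shouldNotExist", "/alsoMissing")
REAL_PAGE = ("<html><body><h1>Our Mission</h1>"
             "<p>An ordinary content page with several sentences of real text.</p>"
             "</body></html>")


def visible_words(html):
    """Crudely reduce HTML to a list of lowercase words (scripts, styles, tags dropped)."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return text.lower().split()


def word_similarity(body_a, body_b):
    """Return a 0..1 similarity ratio between the visible words of two HTML bodies."""
    matcher = difflib.SequenceMatcher(
        None, visible_words(body_a), visible_words(body_b), autojunk=False)
    return matcher.ratio()


def is_probably_real_404(status, body, known_404_body, threshold=0.8):
    """True when the status is 404 AND the body resembles the site's known 404 page.

    The threshold is an arbitrary starting point, not a tuned value.
    """
    if status != 404:
        return False
    return word_similarity(body, known_404_body) >= threshold


if __name__ == "__main__":
    print(is_probably_real_404(404, OTHER_404, KNOWN_404))
    print(is_probably_real_404(404, REAL_PAGE, KNOWN_404))
```

A 404 whose body looks nothing like the baseline can then be flagged for manual review instead of being reported as a broken link. The threshold needs tuning per site, and a site that serves different error pages for different paths will defeat this check.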