scrapy-splash "masks" 404
I'm having some trouble handling 404 responses with my spider. scrapy-splash seems to mask 404 responses with a 200.
Here is my code:
def buildRequest(self, url, dbid):
    request = Request(url, self.parse, meta={
        'splash': {
            'args': {
                'html': 1,
                'wait': 5
            },
            'magic_response': True,
        },
        'dbId': dbid
    }, errback=self.errback_httpbin, dont_filter=True)
    return request
A simple print response.status always shows 200, while testing the same URL in scrapy shell shows response <404 http://www.foo.com/>.
When I use a plain Request object, my spider goes to the self.errback_httpbin method, but with SplashRequest it does not. SplashRequest handles 502 correctly, just not 404.
谢谢
It seems you can only achieve this through the /execute endpoint, combined with "magic responses" (on by default):
meta['splash']['magic_response'] - when set to True and a JSON response is received from Splash, several attributes of the response (headers, body, url, status code) are filled using data returned in JSON:
- response.headers are filled from 'headers' keys;
- response.url is set to the value of 'url' key;
- response.body is set to the value of 'html' key, or to base64-decoded value of 'body' key;
- response.status is set to the value of 'http_status' key. (...)
This option is set to True by default if you use SplashRequest.
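The key-to-attribute mapping quoted above can be sketched in plain Python. This is only an illustration of the behaviour the docs describe, not the actual scrapy-splash implementation; the function name apply_magic_response is made up:

```python
import base64
import json

def apply_magic_response(splash_json):
    """Sketch: pull url, status, headers and body out of the
    JSON document that Splash returns (per the quoted docs)."""
    data = json.loads(splash_json)
    # body comes from 'html', or from base64-decoded 'body'
    body = data.get('html')
    if body is None and 'body' in data:
        body = base64.b64decode(data['body']).decode('utf-8')
    return {
        'url': data.get('url'),
        'status': data.get('http_status', 200),
        'headers': data.get('headers', {}),
        'body': body,
    }

resp = apply_magic_response(
    '{"url": "http://www.foo.com/", "http_status": 404, "html": "<html>gone</html>"}'
)
print(resp['status'])  # the 404 comes straight from the 'http_status' key
```

So as long as the Lua script returns an 'http_status' key, response.status on the Scrapy side reflects the remote server's real status instead of Splash's own 200.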
Other endpoints like /render.html and /render.json return 502 Bad Gateway for 4xx and 5xx responses from the remote server (to be checked).
Building on this example Lua script from the README:
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
(Note the table at the end, returning url, headers, http_status, html and cookies.)
...and when you use this script with /execute, SplashRequest and errbacks, you can reproduce the errback example from the Scrapy docs:
import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield SplashRequest(u, callback=self.parse_httpbin,
                                errback=self.errback_httpbin,
                                endpoint='execute',
                                args={'lua_source': script})

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
Running this with scrapy 1.3, here is what you get:
$ scrapy crawl errback_example
2017-01-11 18:07:20 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: test404)
(...)
2017-01-11 18:07:20 [scrapy.core.engine] INFO: Spider opened
(...)
2017-01-11 18:07:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500 via http://localhost:8050/execute> (failed 1 times): 500 Internal Server Error
2017-01-11 18:07:21 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404 via http://localhost:8050/execute> (referer: None)
2017-01-11 18:07:21 [errback_example] ERROR: <twisted.python.failure.Failure scrapy.spidermiddlewares.httperror.HttpError: Ignoring non-200 response>
2017-01-11 18:07:21 [errback_example] ERROR: HttpError on http://www.httpbin.org/status/404
2017-01-11 18:07:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500 via http://localhost:8050/execute> (failed 2 times): 500 Internal Server Error
2017-01-11 18:07:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/ via http://localhost:8050/execute> (referer: None)
2017-01-11 18:07:21 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500 via http://localhost:8050/execute> (failed 3 times): 500 Internal Server Error
2017-01-11 18:07:21 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500 via http://localhost:8050/execute> (referer: None)
2017-01-11 18:07:21 [errback_example] INFO: Got successful response from http://www.httpbin.org/
2017-01-11 18:07:21 [errback_example] ERROR: <twisted.python.failure.Failure scrapy.spidermiddlewares.httperror.HttpError: Ignoring non-200 response>
2017-01-11 18:07:21 [errback_example] ERROR: HttpError on http://www.httpbin.org/status/500
2017-01-11 18:07:21 [scrapy.core.engine] INFO: Closing spider (finished)
2017-01-11 18:07:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 5365,
'downloader/request_count': 5,
'downloader/request_method_count/POST': 5,
'downloader/response_bytes': 17332,
'downloader/response_count': 5,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/400': 4,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 1, 11, 17, 7, 21, 715440),
'log_count/DEBUG': 7,
'log_count/ERROR': 4,
'log_count/INFO': 8,
'response_received_count': 3,
'scheduler/dequeued': 8,
'scheduler/dequeued/memory': 8,
'scheduler/enqueued': 8,
'scheduler/enqueued/memory': 8,
'splash/execute/request_count': 3,
'splash/execute/response_count/200': 1,
'splash/execute/response_count/400': 4,
'start_time': datetime.datetime(2017, 1, 11, 17, 7, 20, 683232)}
2017-01-11 18:07:21 [scrapy.core.engine] INFO: Spider closed (finished)
The [errback_example] ERROR lines show when the errback is called, i.e. here you do get the 404 and 500 through the errback method.