How to get a status code other than 200 from scrapy-splash
I am trying to get the request status code using scrapy and scrapy-splash. Below is the spider code.
import scrapy
from scrapy_splash import SplashRequest


class Exp10itSpider(scrapy.Spider):
    name = "exp10it"

    def start_requests(self):
        urls = [
            'http://192.168.8.240:8000/xxxx'
        ]
        for url in urls:
            # yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True}, meta={'handle_httpstatus_all': True})
            # yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})
            yield scrapy.Request(url, self.parse, meta={
                'handle_httpstatus_all': True,
                'splash': {
                    'args': {
                        'html': 1,
                        'png': 1,
                    }
                }
            })

    def parse(self, response):
        input("start .........")
        print("status code is:\n")
        input(response.status)
My start URL http://192.168.8.240:8000/xxxx returns a 404 status code. There are three ways to make the request.

The first is:

yield SplashRequest(url, self.parse, args={'wait': 0.5, 'dont_redirect': True}, meta={'handle_httpstatus_all': True})

The second is:

yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True})

The third is:

yield scrapy.Request(url, self.parse, meta={'handle_httpstatus_all': True, 'splash': {
    'args': {
        'html': 1,
        'png': 1,
    }
}})

Only the second way gets the correct status code 404; the first and third both get status code 200. In other words, once I use scrapy-splash, I cannot get the correct 404 status code. Can you help me?
As the scrapy-splash documentation suggests, you have to pass magic_response=True to SplashRequest to achieve this:
meta['splash']['http_status_from_error_code'] - set response.status to HTTP error code when assert(splash:go(..)) fails; it requires meta['splash']['magic_response']=True. http_status_from_error_code option is False by default if you use raw meta API; SplashRequest sets it to True by default.
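To illustrate the raw meta API the quote refers to, here is a minimal sketch of how the question's third approach might set those flags explicitly. The key layout follows the documentation quoted above; whether this alone is sufficient for your Splash setup is not guaranteed, so treat it as an assumption to verify:

```python
# Hypothetical raw-meta dict for scrapy.Request (not SplashRequest):
# with the raw meta API, magic_response and http_status_from_error_code
# are NOT enabled by default, so we set them ourselves.
meta = {
    'handle_httpstatus_all': True,   # let Scrapy deliver non-2xx responses to parse()
    'splash': {
        'magic_response': True,               # required for the next flag to work
        'http_status_from_error_code': True,  # propagate HTTP error codes to response.status
        'args': {
            'html': 1,
            'png': 1,
        },
    },
}

# This dict would then be passed as: scrapy.Request(url, callback, meta=meta)
```

The point of the sketch is only that both flags live directly under meta['splash'], alongside (not inside) the 'args' dict.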
Edit:

However, I was only able to get it working with the execute endpoint. Here is a sample spider that tests HTTP status codes using httpbin.org:
# -*- coding: utf-8 -*-
import scrapy
import scrapy_splash


class HttpStatusSpider(scrapy.Spider):
    name = 'httpstatus'

    lua_script = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {
        html = splash:html(),
        png = splash:png(),
      }
    end
    """

    def start_requests(self):
        yield scrapy_splash.SplashRequest(
            'https://httpbin.org/status/402', self.parse,
            endpoint='execute',
            magic_response=True,
            meta={'handle_httpstatus_all': True},
            args={'lua_source': self.lua_script})

    def parse(self, response):
        pass
It passes the HTTP 402 status code on to Scrapy, as can be seen from the output:
...
2017-10-23 08:41:31 [scrapy.core.engine] DEBUG: Crawled (402) <GET https://httpbin.org/status/402 via http://localhost:8050/execute> (referer: None)
...
You can experiment with other HTTP status codes as well.