Scrapy Error Handling for 404, 403, 301, 302
I am trying to scrape data from a list of URLs. For the URLs that do not respond with status code 200, I still need to record some alternative data.
settings.py
HTTPERROR_ALLOWED_CODES = [404, 403, 301, 302]
Code
def parse(self, response):
    item = ScrapedataItem()
    if response.status != 404 or response.status != 403 or response.status != 301 or response.status != 302:
        item["url"] = response.url
        item["status"] = response.status
        item["html_data"] = response.text
    else:
        item["url"] = response.url
        item["status"] = response.status
        item["html_data"] = "Site Error"
Log
'downloader/response_status_count/200': 231,
'downloader/response_status_count/301': 12,
'downloader/response_status_count/302': 38,
'downloader/response_status_count/404': 4,
The else branch here is never executed. I want the else branch to run whenever an error code comes back.
Answer:
There is no need to set anything in settings.py. We can handle the error codes by adding the list handle_httpstatus_list to the spider class. Note that the original condition also has a logic bug: response.status != 404 or response.status != 403 or ... is true for every status, since no status can equal all four codes at once, so the else branch could never run. Testing membership with in fixes this.
import scrapy

from ..items import ScrapedataItem  # the project's item definition


class CrawldataSpider(scrapy.Spider):
    name = 'crawldata'
    # These statuses are passed to the callback instead of being
    # filtered out by HttpErrorMiddleware.
    handle_httpstatus_list = [404, 403, 301, 302]

    def parse(self, response):
        item = ScrapedataItem()
        if response.status in (404, 403, 301, 302):
            item["url"] = response.url
            item["status"] = response.status
            item["html_data"] = "Site Error"
        else:
            item["url"] = response.url
            item["status"] = response.status
            item["html_data"] = response.text
        yield item