Scrapy Error Handling for 404, 403, 301, 302

I am trying to scrape data from a list of URLs. For any URL that does not return status code 200, I need to store some alternative data instead.

settings.py

HTTPERROR_ALLOWED_CODES = [404,403,301,302]

Code

def parse(self, response):
    item = ScrapedataItem()
    
    if response.status != 404 or response.status != 403 or response.status != 301 or response.status != 302:
        item["url"] = response.url
        item["status"] = response.status
        item["html_data"] = response.text
    else:
        item["url"] = response.url
        item["status"] = response.status
        item["html_data"] = "Site Error"

Log

 'downloader/response_status_count/200': 231,
 'downloader/response_status_count/301': 12,
 'downloader/response_status_count/302': 38,
 'downloader/response_status_count/404': 4,

Here the else branch never executes. I want the else branch to run whenever an error code comes back.
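Part of the problem is the `if` condition itself: chaining `!=` comparisons with `or` makes the test true for every status, because any single status can fail at most one of the comparisons. A minimal sketch in plain Python (the helper names `is_ok` and `is_ok_fixed` are illustrative, not from Scrapy):

```python
# Broken version, mirroring the question's condition.
# For status 404: "404 != 404" is False, but "404 != 403" is True,
# so the whole "or" chain is True and the if-branch runs anyway.
def is_ok(status):
    return status != 404 or status != 403 or status != 301 or status != 302

print(is_ok(404))  # True -- error codes still take the if-branch
print(is_ok(200))  # True

# Intended logic: a membership test (or chaining "!=" with "and").
def is_ok_fixed(status):
    return status not in (404, 403, 301, 302)

print(is_ok_fixed(404))  # False -- error codes now take the else-branch
print(is_ok_fixed(200))  # True
```

Even with the condition corrected, the else branch still cannot run unless Scrapy delivers the error responses to `parse` in the first place, which is what the answer below addresses.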

Answer

There is no need to set anything in settings.py. We can handle the error codes by adding the list handle_httpstatus_list to the spider class:

class CrawldataSpider(scrapy.Spider):
    name = 'crawldata'
    # Statuses listed here are passed to the callback instead of
    # being filtered out by the HttpError middleware.
    handle_httpstatus_list = [404, 403, 301, 302]

    def parse(self, response):
        item = ScrapedataItem()
        item["url"] = response.url
        item["status"] = response.status
        if response.status in self.handle_httpstatus_list:
            item["html_data"] = "Site Error"
        else:
            item["html_data"] = response.text
        yield item
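The branching in `parse` can be factored into a small pure function that is easy to check without running a crawl. This is only a sketch; `ERROR_CODES` and `build_item` are hypothetical names introduced here, not part of Scrapy:

```python
# Statuses we treat as errors, matching handle_httpstatus_list.
ERROR_CODES = {404, 403, 301, 302}

def build_item(url, status, text):
    """Mirror the parse() branching: keep the HTML for good
    responses, store a marker string for error statuses."""
    if status in ERROR_CODES:
        return {"url": url, "status": status, "html_data": "Site Error"}
    return {"url": url, "status": status, "html_data": text}
```

One caveat worth knowing: listing 301 and 302 in handle_httpstatus_list means Scrapy's redirect middleware no longer follows those redirects for this spider, so you record the redirect response itself rather than the final page. That is exactly what the question asks for, but keep it in mind if you later want redirects followed again.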