Scrapy Contracts - Unhandled error in Deferred
I'm writing a spider with Scrapy and am currently adding contracts to it. The spider itself still runs fine, but after adding a @returns contract, I get strange results when running the check:
@returns response 1
I suddenly get "Unhandled error in Deferred" when running scrapy check:
$ scrapy check regjeringen_no
----------------------------------------------------------------------
Ran 0 contracts in 0.000s
OK
Unhandled error in Deferred:
The spider code:
# -*- coding: utf-8 -*-
import scrapy


class RegjeringenNoSpider(scrapy.Spider):
    '''A spider to crawl the Norwegian Government's pages containing news, speeches and opinions'''
    name = "regjeringen_no"
    start_urls = [
        'https://www.regjeringen.no/no/aktuelt/taler_artikler/',
        'https://www.regjeringen.no/no/aktuelt/nyheter/',
    ]

    def parse(self, response):
        '''Parses the response downloaded for each of the requests made. Some
        contracts are mingled with this docstring.

        @url https://www.regjeringen.no/no/aktuelt/taler_artikler/
        @url https://www.regjeringen.no/no/aktuelt/nyheter/
        @returns response 1
        '''
        self.logger.info('Parse function called on %s', response.url)
        # Follow each article link on the listing page.
        for href in response.css('li.listItem h2.title a::attr(href)'):
            yield response.follow(href, callback=self.parse_article)
        # Follow pagination.
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, callback=self.parse)

    def parse_article(self, response):
        '''Parse response for pages with a single article'''
        self.logger.info('Parse article function called on %s', response.url)
        yield {
            'article_title': self._extract_with_css("header.article-header h1::text", response),
            'article_date': self._extract_with_css("div.article-info span.date::text", response),
            'article_type': self._extract_with_css("div.article-info span.type::text", response),
            'article_lead': self._extract_with_css("div.article-ingress p::text", response),
            'article_text': self._extract_with_css("div.article-body::text", response),
        }

    def _extract_with_css(self, query, response):
        return response.css(query).extract_first().strip()
Two things are strange here. First, the feedback from scrapy check says 0 contracts, even though there are three (in fact, contracts only seem to be counted when they fail). Second, the error message doesn't make much sense (incidentally, the error doesn't interrupt the execution of the check). Is this a Scrapy bug?
Note: running
$ scrapy shell "https://www.regjeringen.no/no/aktuelt/taler_artikler/"
gives me:
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fbf214b6dd0>
[s] item {}
[s] request <GET https://www.regjeringen.no/no/aktuelt/taler_artikler/>
[s] response <200 https://www.regjeringen.no/no/aktuelt/taler_artikler/id1334/>
[s] settings <scrapy.settings.Settings object at 0x7fbf214b6d50>
[s] spider <DefaultSpider 'default' at 0x7fbf20e1e1d0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
I expect the poor exception reporting here is a Scrapy bug; contracts are still considered a new feature, and a pretty limited one. As for what is happening: you should specify @returns requests 1 instead of @returns response 1 ("response" is not one of the object types the @returns contract accepts, which is presumably why contract parsing itself fails before anything runs, the check reports 0 contracts, and the failure only surfaces as an unhandled error in a Deferred). Specifying multiple @url directives also won't work for you: only the first url is checked, and frankly I'm not sure how to get around that without actually extending the contracts functionality.
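In other words, a minimal corrected docstring for parse might look like the sketch below, keeping a single @url line since only the first one is honored (it reuses the question's own identifiers):

    def parse(self, response):
        '''Parses the response downloaded for each of the requests made.

        @url https://www.regjeringen.no/no/aktuelt/taler_artikler/
        @returns requests 1
        '''

With this change, scrapy check fetches the given url, runs parse on the downloaded response, and asserts that the callback yields at least one request.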