Scrapy recursive callback

Question

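I'm working on a Scrapy project that sends a request to a website to start generating a report, which takes about a minute to complete. It then downloads the generated file and parses it. (I know Scrapy isn't the best tool for this, but I have to use it.)

I have a function that checks the report generation status, and it has to be called several times before the report is ready.

I've read several posts about using the same function as a callback, but they all only mention that callback=None automatically returns to the main parse function.

Here is my code. If the status is not completed, I want the spider to return to the same function: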

    def parse_report_status(self, response):
        try:
            status = json.loads(response.text)
        except Exception as err:
            self.logger.error(err)
            return
        report_config = response.meta['report_config']
        trials = response.meta['trials']
        report_status = status['status']
        if report_status == 'COMPLETED':
            yield Request(
                url=DOWNLOAD_REPORT_URL.format(self.user_data['siteListEID'], report_config['id']),
                method='POST',
                callback=self.parse_data,
                meta={'report_status': report_status}
            )
        elif report_status in ["COMPLETED_WITH_ERRORS", "ERROR", "NOTFOUND"]:
            self.logger.error(f'Could not download report {report_config["id"]}')
            return
        else:
            if trials > 0:
                sleep_time = int(self.wait_time - ((self.max_trials - trials) * (self.wait_time / self.max_trials)))
                self.logger.info(f'Going to sleep for another {sleep_time} seconds')
                time.sleep(sleep_time)
                yield Request(
                    url=REPORT_STATUS_URL.format(self.user_data['siteListEID'], report_config['id']),
                    method='GET',
                    callback=self.parse_report_status,
                    meta={'report_config': report_config, 'trials': trials - 1}
                )
            else:
                self.logger.error(
                    f'Report {report_config["id"]} could not be downloaded after {self.max_trials} trial(s)')
                return

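The problem is that the program terminates after the first call, meaning it does not come back to the same function after the first attempt.

Is there some configuration I'm missing that needs to be added to the spider to allow a callback to the same function? Is that even possible?

I should mention that I'm using Scrapy 1.8.0 and Python 3.6.11.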

Answer 1

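It looks like your requests are being dropped by Scrapy's duplicate filter (which is designed to filter out requests for pages that have already been crawled/scraped). Add dont_filter=True to your report status request: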

    # ...
    yield Request(
        url=REPORT_STATUS_URL.format(self.user_data['siteListEID'], report_config['id']),
        dont_filter=True,  # <- add this
        method='GET',
        callback=self.parse_report_status,
        meta={'report_config': report_config, 'trials': trials - 1}
    )
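To illustrate why the repeated status request never reaches your callback, here is a minimal standard-library sketch of how a duplicate filter of this kind behaves. It is a simplified model, not Scrapy's actual implementation: the `Request`, `DupeFilter`, and `schedule` names below are stand-ins written for this example, the fingerprint is assumed to depend only on method and URL, and the status URL is hypothetical.

```python
import hashlib


class Request:
    """Stand-in for a crawler request (not scrapy.Request)."""

    def __init__(self, url, method="GET", dont_filter=False):
        self.url = url
        self.method = method
        self.dont_filter = dont_filter


class DupeFilter:
    """Remembers a fingerprint for every request it has been shown."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, request):
        # Assumption: the fingerprint is derived from method and URL only.
        raw = f"{request.method} {request.url}".encode()
        fingerprint = hashlib.sha1(raw).hexdigest()
        if fingerprint in self.seen:
            return True
        self.seen.add(fingerprint)
        return False


def schedule(request, dupefilter, queue):
    """Enqueue the request unless the filter has already seen it."""
    if not request.dont_filter and dupefilter.request_seen(request):
        return False  # dropped, so its callback is never invoked
    queue.append(request)
    return True


STATUS_URL = "https://example.com/report/42/status"  # hypothetical URL

dupefilter = DupeFilter()
queue = []

# First status check: new fingerprint, so it is scheduled.
first = schedule(Request(STATUS_URL), dupefilter, queue)
# Second check of the same URL: treated as a duplicate and dropped.
second = schedule(Request(STATUS_URL), dupefilter, queue)
# Same URL again with dont_filter=True: the filter is bypassed.
third = schedule(Request(STATUS_URL, dont_filter=True), dupefilter, queue)

print(first, second, third, len(queue))
```

In this model the second poll is dropped because its URL and method match the first one, which mirrors the symptom in the question: the spider has nothing left to schedule and finishes after the first attempt.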

