Scrapy: Can't restart start_requests() properly

I have a spider that starts on two pages - one is the main page, and the other is a .js file containing the longitude and latitude coordinates I need to extract, because I need them later in the parsing process. I want to process the .js file first, extract the coordinates, and then parse the main page and start crawling its links / parsing its items. To do this I use the priority parameter of Request, saying that I want my .js page to be processed first. This works, but only about 70% of the time (it must be due to Scrapy's asynchronous requests). The remaining 30% of the time I end up in my parse method trying to parse the .js long/lat coordinates, but it has already been passed the main website page instead, so there is nothing to parse them from.

For this reason I tried to fix it this way: in the parse() method, check which URL the n-th one is, and if it is the first one and not the .js one, restart the spider. However, when the spider is restarted this time it correctly processes the .js file first, but after processing it the spider finishes its work and exits the script without errors, as if it were done. Why does this happen, how does page processing differ when the spider is restarted compared to when it has just started, and how can I fix it?

Here is sample output for both cases, from when I tried to debug what gets executed and why it stops after restarting.

import re
from scrapy import Request, Spider


class QuotesSpider(Spider):
    name = "bot"
    url_id = 0
    home_url = 'https://website.com'
    longitude = None
    latitude = None

    def __init__(self, cat=None):
        self.cat = cat.replace("-", " ") if cat else None

    def start_requests(self):
        print ("Starting spider")
        self.start_urls = [
             self.home_url,
             self.home_url+'js-file-with-long-lat.js'
        ]
        for priority, url in enumerate(self.start_urls):
            print ("Processing", url)
            yield Request(url=url, priority=priority, callback=self.parse)


    def parse(self, response):
        print ("Inside parse")
        if self.url_id == 0 and response.url == self.home_url:
            self.alert("Loaded main page before long/lat page, restarting", False)
            for _ in self.start_requests():
                yield _
        else:
            print ("Everything is good, url id is", str(self.url_id))
            self.url_id += 1
            if self.longitude is None:
                for _ in self.parse_long_lat(response):
                    yield _
            else:
                print ("Calling parse cats")
                for cat in self.parse_cats(response):
                    yield cat

    def parse_long_lat(self, response):
        print ("called long lat")
        try:
            self.latitude = re.search(r'latitude:(\-?[0-9]{1,2}\.?[0-9]*)',
                                      response.text).group(1)
            self.longitude = re.search(r'longitude:(\-?[0-9]{1,3}\.?[0-9]*)',
                                       response.text).group(1)
            print ("Extracted coords")
            yield None
        except AttributeError:
            self.alert("\nCan't extract lat/long coordinates, store availability will not be parsed. ", False)
            yield None

    def parse_cats(self, response):
        """ Parsing links code goes here """
        pass

Output when the spider starts correctly, gets the .js page first and then starts parsing cats:

Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
Inside parse
Everything is good, url id is 1
Calling parse cats

Then the script keeps running and parses everything fine. Output when the spider starts wrong, gets the main page first and restarts start_requests():

Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Loaded main page before long/lat page, restarting
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords

And the script stops executing without errors, as if it had finished.

P.S. In case it matters, I did notice that the URLs in start_requests() are processed in reverse order, but I found that natural given the loop sequence, and I expected the priority parameter to do its job (as it does most of the time, and as it should according to Scrapy's docs).

Regarding why your spider does not continue in the "restarting" case: you are most likely running into your duplicate requests being filtered/dropped. Since the pages have already been visited, Scrapy considers itself done with them.
So you have to re-send those requests with the dont_filter=True argument:

for priority, url in enumerate(self.start_urls):
    print ("Processing", url)
    yield Request(url=url, dont_filter=True, priority=priority, callback=self.parse)
    #                      ^^^^^^^^^^^^^^^^  notice us forcing the Dupefilter to
    #                                        ignore duplicate requests to these pages

As for a better solution than this hacky approach, consider using InitSpider (for example; other approaches exist as well). This guarantees that your "initial" work got done and can be relied upon.
(For some reason the class was never documented in the Scrapy docs, but it is a relatively simple Spider subclass: do some initial work before starting the actual run.)
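
For reference, a rough paraphrase of what the class does internally (not verbatim from the Scrapy source, and details may differ between versions): start_requests() is held back until the init callback calls self.initialized().

from scrapy.spiders import Spider
from scrapy.utils.spider import iterate_spider_output

class InitSpider(Spider):
    def start_requests(self):
        # Stash the regular start_urls requests and run init_request() first.
        self._postinit_reqs = super().start_requests()
        return iterate_spider_output(self.init_request())

    def initialized(self, response=None):
        # Release the stashed start_urls requests; use this as the callback
        # of the last initialization request.
        return self.__dict__.pop('_postinit_reqs')

    def init_request(self):
        # Default: no initialization work, start crawling immediately.
        return self.initialized()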

Here is a code example:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider

class QuotesSpider(InitSpider):
    name = 'quotes'
    allowed_domains = ['website.com']
    start_urls = ['https://website.com']

    # Without this method override, InitSpider behaves like Spider.
    # This is used _instead of_ start_requests. (Do not override start_requests.)
    def init_request(self):
        # The last request that finishes the initialization needs
        # to have the `self.initialized()` method as callback.
        url = self.start_urls[0] + '/js-file-with-long-lat.js'
        yield scrapy.Request(url, callback=self.parse_long_lat, dont_filter=True)

    def parse_long_lat(self, response):
        """ The callback for our init request. """
        print ("called long lat")

        # do some work and maybe return stuff
        self.latitude = None
        self.longitude = None
        #yield stuff_here

        # Finally, start our run.
        return self.initialized()
        # Now we are "initialized", will process `start_urls`
        # and continue from there.

    def parse(self, response):
        print ("Inside parse")
        print ("Everything is good, do parse_cats stuff here")

This will result in output like this:

2019-01-10 20:36:20 [scrapy.core.engine] INFO: Spider opened
2019-01-10 20:36:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1/js-file-with-long-lat.js> (referer: None)
called long lat
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1> (referer: http://127.0.0.1/js-file-with-long-lat.js/)
Inside parse
Everything is good, do parse_cats stuff here
2019-01-10 20:36:21 [scrapy.core.engine] INFO: Closing spider (finished)

So I finally handled it with a workaround: I check what response.url was received in parse() and, based on that, send further parsing to the corresponding method:

def start_requests(self):
    self.start_urls = [
        self.home_url,
        self.home_url + 'js-file-with-long-lat.js'
    ]
    for priority, url in enumerate(self.start_urls):
        yield Request(url=url, priority=priority, callback=self.parse)

def parse(self, response):
    if response.url != self.home_url:
        for _ in self.parse_long_lat(response):
            yield _
    else:
        for cat in self.parse_cats(response):
            yield cat
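
For completeness, another common way to guarantee the order, without relying on priority or dont_filter, is to chain the requests: only request the home page from the callback of the .js request. A minimal sketch, assuming the same URLs and the parse_long_lat/parse_cats methods shown above:

def start_requests(self):
    # Request the .js file first; the home page is only requested
    # from its callback, so the coordinates are guaranteed to exist.
    yield Request(url=self.home_url + 'js-file-with-long-lat.js',
                  callback=self.parse_long_lat)

def parse_long_lat(self, response):
    # ... extract self.latitude / self.longitude as in the original ...
    yield Request(url=self.home_url, callback=self.parse_cats)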