How to pass a looped list of urls to Scrapy (url="")
I have a loop that creates the links I want to scrape:
from datetime import date, timedelta

start_date = date(2020, 1, 1)
end_date = date.today()
crawl_date = start_date
base_url = "https://www.racingpost.com/results/"
links = []
# Generate the links
while crawl_date <= end_date:
    links.append(base_url + str(crawl_date))
    crawl_date += timedelta(days=1)
If I print links this works fine and I get the urls I want.
I then have a spider which scrapes the site fine if I enter a url manually.
Now I am trying to pass the links variable containing the urls, which I want to scrape as shown below, but I get "variable not defined".
import scrapy
from scrapy_splash import SplashRequest


class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']

    script = '''
        function main(splash, args)
            url = args.url
            assert(splash:go(url))
            return splash:html()
        end
    '''

    def start_requests(self):
        yield SplashRequest(url=links, callback=self.parse, endpoint='execute',
                            args={'lua_source': self.script})

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
How do I pass the generated links into SplashRequest(url=links)?
Thank you very much for helping me - I am quite new to this and taking small steps, most of them backwards...
As per my comment above (I'm not too sure whether this works, as I'm not familiar with scrapy), the obvious problem is that there is no reference to the links variable inside the RpresultSpider class. Moving the loop that generates the urls into the method fixes this. Note also that SplashRequest expects a single url, not a list, so you need to yield one request per link:
import scrapy
from datetime import date, timedelta
from scrapy_splash import SplashRequest


class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']

    script = '''
        function main(splash, args)
            url = args.url
            assert(splash:go(url))
            return splash:html()
        end
    '''

    def start_requests(self):
        start_date = date(2020, 1, 1)
        end_date = date.today()
        crawl_date = start_date
        base_url = "https://www.racingpost.com/results/"
        links = []
        # Generate the links
        while crawl_date <= end_date:
            links.append(base_url + str(crawl_date))
            crawl_date += timedelta(days=1)
        # Yield one request per generated url
        for link in links:
            yield SplashRequest(url=link, callback=self.parse, endpoint='execute',
                                args={'lua_source': self.script})

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
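For what it's worth, the date-to-url generation can also be pulled out into a plain helper that is easy to check on its own, without running the spider at all. This is just a sketch; the name `generate_links` is my own, not part of scrapy:

```python
from datetime import date, timedelta


def generate_links(start_date, end_date,
                   base_url="https://www.racingpost.com/results/"):
    """Return one results url per day from start_date to end_date inclusive."""
    links = []
    crawl_date = start_date
    while crawl_date <= end_date:
        # str() on a date gives ISO format, e.g. '2020-01-01'
        links.append(base_url + str(crawl_date))
        crawl_date += timedelta(days=1)
    return links


# Quick check with a three-day range
print(generate_links(date(2020, 1, 1), date(2020, 1, 3)))
# → ['https://www.racingpost.com/results/2020-01-01',
#    'https://www.racingpost.com/results/2020-01-02',
#    'https://www.racingpost.com/results/2020-01-03']
```

The spider's start_requests can then simply do `for link in generate_links(...): yield SplashRequest(url=link, ...)`, which keeps the url logic testable separately from the crawling.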