How to pass a looped list of urls to Scrapy (url="")
I have a loop that creates the links I want to scrape:
from datetime import date, timedelta

start_date = date(2020, 1, 1)
end_date = date.today()
crawl_date = start_date
base_url = "https://www.racingpost.com/results/"
links = []
# Generate the links
while crawl_date <= end_date:
    links.append(base_url + str(crawl_date))
    crawl_date += timedelta(days=1)
If I print links this works fine and I get the urls I want.
I then have a spider which scrapes the site fine if I enter a url manually.
Now I am trying to pass the links variable containing the urls, which I want to scrape as shown below, but I get "variable not defined".
import scrapy
from scrapy_splash import SplashRequest


class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']

    script = '''
        function main(splash, args)
            url = args.url
            assert(splash:go(url))
            return splash:html()
        end
    '''

    def start_requests(self):
        yield SplashRequest(url=links, callback=self.parse, endpoint='execute',
                            args={'lua_source': self.script})

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
How do I pass the generated links into SplashRequest(url=links)?
Thank you very much for helping me - I am quite new to this and taking small steps, most of them backwards...
As per my comment above (I'm not too sure whether this works, as I'm not familiar with scrapy), the obvious problem is that there is no reference to the links variable inside the RpresultSpider class. Moving the loop that generates the urls into the method fixes this. Note also that SplashRequest expects a single url, not a list, so you need to yield one request per link:
import scrapy
from datetime import date, timedelta
from scrapy_splash import SplashRequest


class RpresultSpider(scrapy.Spider):
    name = 'rpresult'
    allowed_domains = ['www.racingpost.com']

    script = '''
        function main(splash, args)
            url = args.url
            assert(splash:go(url))
            return splash:html()
        end
    '''

    def start_requests(self):
        start_date = date(2020, 1, 1)
        end_date = date.today()
        crawl_date = start_date
        base_url = "https://www.racingpost.com/results/"
        links = []
        # Generate the links
        while crawl_date <= end_date:
            links.append(base_url + str(crawl_date))
            crawl_date += timedelta(days=1)
        # Yield one request per generated url
        for link in links:
            yield SplashRequest(url=link, callback=self.parse, endpoint='execute',
                                args={'lua_source': self.script})

    def parse(self, response):
        for result in response.xpath("//div[@class='rp-resultsWrapper__content']"):
            yield {
                'result': result.xpath('.//div[@class="rpraceCourse__panel__race__info"]//a[@data-test-selector="link-listCourseNameLink"]/@href').getall()
            }
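For what it's worth, the date-to-url generation can also be pulled out into a plain helper that is easy to check on its own, without running the spider at all. This is just a sketch; the name `generate_links` is my own, not part of scrapy:

```python
from datetime import date, timedelta


def generate_links(start_date, end_date,
                   base_url="https://www.racingpost.com/results/"):
    """Return one results url per day from start_date to end_date inclusive."""
    links = []
    crawl_date = start_date
    while crawl_date <= end_date:
        # str() on a date gives ISO format, e.g. '2020-01-01'
        links.append(base_url + str(crawl_date))
        crawl_date += timedelta(days=1)
    return links


# Quick check with a three-day range
print(generate_links(date(2020, 1, 1), date(2020, 1, 3)))
# → ['https://www.racingpost.com/results/2020-01-01',
#    'https://www.racingpost.com/results/2020-01-02',
#    'https://www.racingpost.com/results/2020-01-03']
```

The spider's start_requests can then simply do `for link in generate_links(...): yield SplashRequest(url=link, ...)`, which keeps the url logic testable separately from the crawling.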