Scrapy: Can't restart start_requests() properly
I have a scraper that starts from two pages: one is the main page of the website, the other is a .js file containing the longitude and latitude coordinates I need to extract, because I need them later in the parsing process. I want to process the .js file first, extract the coordinates, and only then parse the main page and start crawling its links / parsing its items.
To do that I use the priority argument of Request, telling Scrapy that I want my .js page to be processed first. That works, but only about 70% of the time (it must be down to Scrapy's asynchronous requests). The remaining 30% of the time I end up in my parse method trying to extract the .js long/lat coordinates after the main website page has already come through, so it is impossible to parse them.
For that reason I tried to fix it this way:
In the parse() method, I check which URL this is (by its index), and if it is the first one rather than the .js one, I restart the spider. However, when I restart the spider this way it now correctly handles the .js file first, but after processing it the spider finishes its work and exits the script without any errors, as if it were done.
Why does this happen? What is different about how the pages are processed when the spider is restarted compared to when it has just started, and how can I fix it?
Below is sample output for both cases, from when I tried to debug what gets executed and why it stops after the restart.
import re

from scrapy import Spider, Request


class QuotesSpider(Spider):
    name = "bot"
    url_id = 0
    home_url = 'https://website.com'
    longitude = None
    latitude = None

    def __init__(self, cat=None):
        self.cat = cat.replace("-", " ")

    def start_requests(self):
        print("Starting spider")
        self.start_urls = [
            self.home_url,
            self.home_url + 'js-file-with-long-lat.js'
        ]
        for priority, url in enumerate(self.start_urls):
            print("Processing", url)
            yield Request(url=url, priority=priority, callback=self.parse)

    def parse(self, response):
        print("Inside parse")
        if self.url_id == 0 and response.url == self.home_url:
            # self.alert() is a notification helper defined elsewhere in the project
            self.alert("Loaded main page before long/lat page, restarting", False)
            for _ in self.start_requests():
                yield _
        else:
            print("Everything is good, url id is", str(self.url_id))
            self.url_id += 1
            if self.longitude is None:
                for _ in self.parse_long_lat(response):
                    yield _
            else:
                print("Calling parse cats")
                for cat in self.parse_cats(response):
                    yield cat

    def parse_long_lat(self, response):
        print("called long lat")
        try:
            self.latitude = re.search(r'latitude:(\-?[0-9]{1,2}\.?[0-9]*)',
                                      response.text).group(1)
            self.longitude = re.search(r'longitude:(\-?[0-9]{1,3}\.?[0-9]*)',
                                       response.text).group(1)
            print("Extracted coords")
            yield None
        except AttributeError:
            self.alert("\nCan't extract lat/long coordinates, store availability will not be parsed. ", False)
            yield None

    def parse_cats(self, response):
        """ Parsing links code goes here """
        pass
Output when the spider starts correctly, i.e. fetches the .js page first and then starts parsing the cats:
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
Inside parse
Everything is good, url id is 1
Calling parse cats
Then the script keeps running and parses everything just fine.
Output when the spider starts incorrectly, i.e. fetches the main page first and then restarts start_requests():
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Loaded main page before long/lat page, restarting
Starting spider
https://website.com
https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
And then the script stops executing, without any errors, as if it had finished.
P.S. In case it matters: I did notice that the URLs in start_requests() are processed in reverse order, but given the loop I find that natural, and I expect the priority argument to do its job (which it does most of the time, and which it should do according to Scrapy's documentation).
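For reference, this is roughly what the enumerate() loop amounts to: in Scrapy, requests with a higher priority value are scheduled earlier, so the .js URL ends up with the larger value. A minimal sketch with the priorities written out explicitly (the class name here is made up, just for illustration):
from scrapy import Spider, Request


class CoordsFirstSpider(Spider):
    name = "coords_first_sketch"  # hypothetical name, for illustration only
    home_url = 'https://website.com'

    def start_requests(self):
        # Higher priority value = scheduled earlier by Scrapy's scheduler,
        # so the .js request (priority 1) is queued ahead of the home page (priority 0).
        yield Request(self.home_url, priority=0, callback=self.parse)
        yield Request(self.home_url + 'js-file-with-long-lat.js',
                      priority=1, callback=self.parse)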
As to why your spider does not continue in the "restarting" case: you are most likely running into your duplicate requests being filtered/dropped. Since those pages have already been visited, Scrapy considers them done.
So you have to re-send those requests with the dont_filter=True argument:
for priority, url in enumerate(self.start_urls):
print ("Processing", url)
yield Request(url=url, dont_filter=True, priority=priority, callback=self.parse)
# ^^^^^^^^^^^^^^^^ notice us forcing the Dupefilter to
# ignore duplicate requests to these pages
As for a better solution than this hacky approach, consider using InitSpider (for example; other approaches exist as well). This guarantees that your "initial" work has been done and can be relied on.
(For some reason the class was never documented in the Scrapy docs, but it is a relatively simple Spider subclass: do some initial work before starting the actual crawl.)
Here is a code example:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider


class QuotesSpider(InitSpider):
    name = 'quotes'
    allowed_domains = ['website.com']
    start_urls = ['https://website.com']

    # Without this method override, InitSpider behaves like Spider.
    # This is used _instead of_ start_requests. (Do not override start_requests.)
    def init_request(self):
        # The last request that finishes the initialization needs
        # to have the `self.initialized()` method as callback.
        url = self.start_urls[0] + '/js-file-with-long-lat.js'
        yield scrapy.Request(url, callback=self.parse_long_lat, dont_filter=True)

    def parse_long_lat(self, response):
        """ The callback for our init request. """
        print("called long lat")
        # do some work and maybe return stuff
        self.latitude = None
        self.longitude = None
        #yield stuff_here

        # Finally, start our run.
        return self.initialized()
        # Now we are "initialized", will process `start_urls`
        # and continue from there.

    def parse(self, response):
        print("Inside parse")
        print("Everything is good, do parse_cats stuff here")
This will result in output like this:
2019-01-10 20:36:20 [scrapy.core.engine] INFO: Spider opened
2019-01-10 20:36:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1/js-file-with-long-lat.js> (referer: None)
called long lat
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1> (referer: http://127.0.0.1/js-file-with-long-lat.js/)
Inside parse
Everything is good, do parse_cats stuff here
2019-01-10 20:36:21 [scrapy.core.engine] INFO: Closing spider (finished)
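If you want the init step to actually pull the coordinates, the body of parse_long_lat could look roughly like the sketch below; it just reuses the regexes from your spider and logs a warning instead of calling your alert() helper, so adjust it to your needs:
import re

def parse_long_lat(self, response):
    """ The callback for our init request: extract the coords, then start the run. """
    lat = re.search(r'latitude:(\-?[0-9]{1,2}\.?[0-9]*)', response.text)
    lon = re.search(r'longitude:(\-?[0-9]{1,3}\.?[0-9]*)', response.text)
    if lat and lon:
        self.latitude = lat.group(1)
        self.longitude = lon.group(1)
    else:
        self.logger.warning("Can't extract lat/long coordinates")
    # Hand control back to InitSpider so the `start_urls` get crawled next.
    return self.initialized()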
So I finally solved this with a workaround:
I check which response.url was received in parse() and, based on that, send further parsing to the corresponding method:
def start_requests(self):
    self.start_urls = [
        self.home_url,
        self.home_url + 'js-file-with-long-lat.js'
    ]
    for priority, url in enumerate(self.start_urls):
        yield Request(url=url, priority=priority, callback=self.parse)

def parse(self, response):
    if response.url != self.home_url:
        for _ in self.parse_long_lat(response):
            yield _
    else:
        for cat in self.parse_cats(response):
            yield cat
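A chained variant of the same idea (just a sketch, not what I ended up using): request the .js file first and only yield the home-page request from its callback, so the coordinates are guaranteed to be set before parse_cats runs:
def start_requests(self):
    # Only the .js file is requested up front; the home page is
    # requested from parse_long_lat once the coordinates are known.
    yield Request(url=self.home_url + 'js-file-with-long-lat.js',
                  callback=self.parse_long_lat)

def parse_long_lat(self, response):
    # ... extract self.latitude / self.longitude from response.text here ...
    yield Request(url=self.home_url, callback=self.parse_cats)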