Adding a pause in a Scrapy spider
Hello, I want to create a spider that crawls a website once a day. I have a spider that scrapes everything I need, but I need to add a pause after scraping each article. I have tried both the threading module and the time module, but using them does not seem to work, because I get this response (only for some of the requests):
DEBUG: Retrying <GET https://www.example.com/.../> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
My code looks like this:
import threading

import scrapy


class AutomatedSpider(scrapy.Spider):
    name = 'automated'
    allowed_domains = ['example-domain.com']
    start_urls = [
        'https://example.com/page/1/...'
    ]
    pause = threading.Event()
    article_num = 1

    def parse(self, response):
        for page_num in range(1, 26):
            for href in set(response.css(".h-100 a::attr(href)").extract()):
                # extract data from all the articles on the current page
                self.pause.wait(5.0)  # this causes the response mentioned above
                yield scrapy.Request(href, callback=self.parse_article)
                self.article_num += 1
            # move to next page
            next_page = 'https://www.information-age.com/page/' + str(page_num) + '/...'
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_article(self, response):
        # function to extract desired data from the website that is being scraped
        pass
I don't think time.sleep or waiting on a threading.Event works well inside Scrapy, because of the asynchronous way it operates.
What you can do instead (a minimal settings sketch follows the list):
- You can set DOWNLOAD_DELAY = 5 in settings.py, which delays between 2.5 and 7.5 seconds between requests
- With RANDOMIZE_DOWNLOAD_DELAY = False it waits exactly 5 seconds between them.
- Setting CONCURRENT_REQUESTS = 1 ensures that no more than one request is in flight at a time
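A minimal sketch of how those three settings fit together, assuming the 5-second delay from the question; they can go in the project's settings.py or, equivalently, per spider via the custom_settings dict:

# settings.py -- applies to every spider in the project
DOWNLOAD_DELAY = 5                 # base delay between requests
RANDOMIZE_DOWNLOAD_DELAY = False   # wait exactly 5 s instead of 2.5-7.5 s
CONCURRENT_REQUESTS = 1            # only one request in flight at a time


# or, as a per-spider override inside the spider class from the question
import scrapy


class AutomatedSpider(scrapy.Spider):
    name = 'automated'
    start_urls = ['https://example.com/page/1/...']

    custom_settings = {
        'DOWNLOAD_DELAY': 5,
        'RANDOMIZE_DOWNLOAD_DELAY': False,
        'CONCURRENT_REQUESTS': 1,
    }

    def parse(self, response):
        ...

With these settings the self.pause.wait(5.0) call can be dropped from parse(): the downloader enforces the delay itself, so the Twisted reactor is never blocked.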