Scraper: Adding Delay between Iterations
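For a scraping project, I would like to add a 1-second delay between each iteration of my script. In other threads I have read that a delay can be added with "time".
However, the code below still processes several requests per second even though it uses "time", which is too fast for a scraper. Does anyone know how to get the 1-second delay to work?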
import scrapy
import time

custom_settings = {
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
}

class QuotesSpider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
                  'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2005%2Ccd_max%3A12%2F31%2F2005&tbm=nws',
                  'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2006%2Ccd_max%3A12%2F31%2F2006&tbm=nws',
                  ]

    def parse(self, response):
        item = {
            'results': response.css('#resultStats::text')[0].extract(),
            'url': response.url,
        }
        yield item
        time.sleep(1)
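There is a dedicated setting for this, the download delay (DOWNLOAD_DELAY). Scrapy schedules and downloads requests asynchronously, so calling time.sleep() inside parse() only pauses the callback after a response has already arrived; it does not space out the requests themselves.
You can read more in the Scrapy documentation: https://doc.scrapy.org/en/latest/topics/settings.html#download-delay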
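As a rough illustration of that setting, here is a minimal sketch of what the project's settings.py could contain. The 1-second value comes from the question. Turning off randomization and limiting concurrency to one request per domain are my own assumptions about what you want, not something Scrapy requires.

```python
# settings.py (the same keys can instead go in a `custom_settings` dict,
# as long as that dict is a class attribute of the spider)

# Wait about 1 second between consecutive requests to the same site.
DOWNLOAD_DELAY = 1

# By default Scrapy randomizes the wait to between 0.5x and 1.5x of
# DOWNLOAD_DELAY; disable that to get a fixed 1-second gap.
RANDOMIZE_DOWNLOAD_DELAY = False

# Assumption: one request at a time per domain is acceptable for this crawl.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

Note that in the code from the question, custom_settings is defined at module level, outside the spider class, so Scrapy does not read it. If you prefer per-spider settings, move the dict inside QuotesSpider and add 'DOWNLOAD_DELAY': 1 there; the time.sleep(1) call can then be removed.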