Scrapy: can you limit crawl time at the domain level?
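I'm having a problem with my spider getting stuck in forums, where it can crawl for days without finding anything. Is there a way to limit how long a given site (based on its start_url) is crawled? Are there any other solutions to this problem?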
What I ended up doing was using process_links and writing a method that checks the elapsed time:
    # At module level:
    import datetime
    from urllib.parse import urlparse

    # Inside the CrawlSpider subclass:
    rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True, process_links='check_for_semi_dupe'),)

    # Method to try avoiding spider traps and endless loops.
    # Assumes the spider defines self.processed_dupes (dict), self.blocked (list)
    # and self.time_threshold (minutes).
    def check_for_semi_dupe(self, links):
        for link in links:
            just_domain = urlparse(link.url).netloc
            if just_domain.startswith("www."):
                just_domain = just_domain[4:]
            if just_domain in self.blocked:
                self.logger.info("*** Domain '%s' was blocked! ***", just_domain)
                self.logger.info("*** Link was: %s", link.url)
                continue
            if just_domain not in self.processed_dupes:
                # First link seen for this domain: start its clock.
                self.processed_dupes[just_domain] = datetime.datetime.now()
            elapsed = (datetime.datetime.now() - self.processed_dupes[just_domain]).total_seconds()
            if elapsed > self.time_threshold * 60:
                # Time budget used up: block this domain from now on.
                self.blocked.append(just_domain)
                continue
            yield link
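The method records the datetime at which a domain first shows up for crawling. The class variable "time_threshold" defines the allowed crawl time in minutes. Whenever the spider is fed links to crawl, the method decides whether each link should be passed on for crawling or blocked.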
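For anyone who wants to try the time-budget logic without running a crawl, here is a minimal sketch of the same idea pulled out into a plain Python class. The class name DomainTimeBudget, its allow() method and the injectable clock argument are made up for illustration and are not part of Scrapy; it only assumes that each link has a URL string.

```python
import time
from urllib.parse import urlparse


class DomainTimeBudget:
    """Hypothetical helper: blocks a domain once its crawl-time budget is spent."""

    def __init__(self, budget_minutes, clock=time.monotonic):
        self.budget_seconds = budget_minutes * 60
        self.clock = clock      # injectable so the logic can be tested without waiting
        self.first_seen = {}    # domain -> clock value when it was first seen
        self.blocked = set()    # domains whose budget has run out

    @staticmethod
    def domain_of(url):
        # Lower-case the host and drop a leading "www." only.
        host = urlparse(url).netloc.lower()
        return host[4:] if host.startswith("www.") else host

    def allow(self, url):
        domain = self.domain_of(url)
        if domain in self.blocked:
            return False
        # Start the clock on the first link seen for this domain.
        started = self.first_seen.setdefault(domain, self.clock())
        if self.clock() - started > self.budget_seconds:
            self.blocked.add(domain)
            return False
        return True
```

In a spider, the process_links method could then probably be reduced to something like `return [link for link in links if self.budget.allow(link.url)]`, with `self.budget` created once when the spider starts. Note that the budget is measured from the first link seen for a domain, not from actual time spent downloading from it.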