Scrapy: can you limit crawl time at the domain level?

Question

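I'm having trouble with my spider getting stuck in forums, where it can crawl for days without finding anything useful. Is there a way to limit the time spent crawling a given website (based on its start_url)? Are there other solutions to this problem?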

Answer 1

What I ended up doing was using process_links and writing a method that checks the elapsed time:

 rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True, process_links='check_for_semi_dupe'),)


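# Imports needed at the top of the spider module
import datetime
from urllib.parse import urlparse

# Method to try avoiding spider traps and endless loops.
# Assumes the spider defines self.processed_dupes (dict), self.blocked (list)
# and self.time_threshold (crawl budget per domain, in minutes).
def check_for_semi_dupe(self, links):
  for link in links:
    # Network location of the link, with any "www." removed
    just_domain = urlparse(link.url).netloc.replace("www.", "")
    url_indexed = 0
    timediff_in_sec = 0
    if just_domain not in self.processed_dupes:
      # First sighting: start the clock for this domain
      self.processed_dupes[just_domain] = datetime.datetime.now()
    else:
      url_indexed = 1
      timediff_in_sec = int((datetime.datetime.now() - self.processed_dupes[just_domain]).total_seconds())
    if just_domain in self.blocked:
      print("*** Domain '%s' was blocked! ***" % just_domain)
      print("*** Link was: %s" % link.url)
      continue
    elif url_indexed == 1 and timediff_in_sec > (self.time_threshold * 60):
      # Time budget exhausted: block the domain from now on
      self.blocked.append(just_domain)
      continue
    else:
      yield link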

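The method records the datetime at which each domain was first seen for crawling. The class variable "time_threshold" defines the allowed crawl time in minutes. Whenever the spider is fed links to crawl, the method decides whether each link should be passed on for crawling or blocked.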
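
For anyone who wants to try the idea outside a running crawl, here is a minimal sketch of the same per-domain time budget as a standalone helper. The class name DomainTimeLimiter, the method filter_urls and the injectable clock parameter are hypothetical names made up for this illustration; they are not part of Scrapy. It works on plain URL strings rather than Scrapy Link objects.

```python
import time
from urllib.parse import urlparse


class DomainTimeLimiter:
    """Hypothetical helper: drops URLs for a domain once its time budget is spent."""

    def __init__(self, threshold_minutes, clock=time.monotonic):
        self.threshold_sec = threshold_minutes * 60
        self.clock = clock      # injectable so the logic can be tested without waiting
        self.first_seen = {}    # domain -> clock value when the domain was first seen
        self.blocked = set()    # domains whose budget is exhausted

    def filter_urls(self, urls):
        """Yield only the URLs whose domain is still within its time budget."""
        for url in urls:
            domain = urlparse(url).netloc
            if domain.startswith("www."):
                domain = domain[4:]
            if domain in self.blocked:
                continue
            now = self.clock()
            # Record the first sighting, or fetch the earlier one
            started = self.first_seen.setdefault(domain, now)
            if now - started > self.threshold_sec:
                self.blocked.add(domain)
                continue
            yield url


if __name__ == "__main__":
    fake_now = [0.0]  # fake clock, in seconds, so the demo runs instantly
    limiter = DomainTimeLimiter(threshold_minutes=1, clock=lambda: fake_now[0])
    # First sighting of example.com: the URL is kept
    print(list(limiter.filter_urls(["http://www.example.com/a"])))
    fake_now[0] = 61.0
    # example.com is now past its 60-second budget and is dropped;
    # other.org is seen for the first time and is kept
    print(list(limiter.filter_urls(["http://example.com/b", "http://other.org/x"])))
```

In a spider, a process_links method could presumably delegate to something like this by filtering on link.url. Using a set for the blocked domains and a monotonic clock are choices of this sketch, not of the code above.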

python

web-crawler

scrapy