Scrapy: crawl an entire website before following links to other sites
I use a Scrapy crawler to crawl web pages indefinitely; my script runs with DEPTH_LIMIT = 0.
I have two main problems:
1. My crawler starts following other websites before it has completely crawled the first site in start_urls.
2. The crawler gets stuck on huge sites like tumblr or youtube and keeps crawling billions of their pages. How can I avoid this? I can't list every large website in the deny variable.
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError


class MyItem(Item):
    url = Field()


class HttpbinSpider(CrawlSpider):
    name = "expired"
    start_urls = ['http://www.siteA.com']

    rules = (
        Rule(LinkExtractor(
                 allow=('.com', '.fr', '.net', '.org', '.info', '.casino'),
                 deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free',
                       'reddit', 'videos', 'youtube', 'google', 'doubleclick',
                       'microsoft', 'yahoo', 'bing', 'znet', 'stackexchang',
                       'twitter', 'wikipedia', 'creativecommons', 'mediawiki',
                       'wikidata')),
             process_request='add_errback',
             follow=True),
    )

    custom_settings = {
        'RETRY_ENABLED': True,
        'DEPTH_LIMIT': 0,
        'LOG_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
        'CONCURRENT_REQUESTS': 64,
    }

    def add_errback(self, request):
        # Attach the error callback to every request generated by the rule.
        self.logger.debug("add_errback: patching %r" % request)
        return request.replace(errback=self.errback_httpbin)

    def errback_httpbin(self, failure):
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            request = failure.request
            self.logger.info('Domain expired : %s', request.url)

        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
From the fine manual:
Does Scrapy crawl in breadth-first or depth-first order?
By default, Scrapy uses a LIFO queue for storing pending requests,
which basically means that it crawls in DFO order. This order is more
convenient in most cases. If you do want to crawl in true BFO order,
you can do it by setting the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
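Since those are ordinary Scrapy settings, they can be merged straight into the spider's custom_settings from the question. Below is a minimal sketch reusing the class and values shown above; whether breadth-first order alone keeps the crawl on the first site long enough for your case is an assumption, not something the FAQ guarantees:

class HttpbinSpider(CrawlSpider):
    name = "expired"
    start_urls = ['http://www.siteA.com']

    custom_settings = {
        'RETRY_ENABLED': True,
        'DEPTH_LIMIT': 0,
        'LOG_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
        'CONCURRENT_REQUESTS': 64,
        # Switch the scheduler to FIFO queues so pending requests are
        # processed breadth-first instead of the default depth-first order.
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

With breadth-first order the scheduler drains shallow pages from the start site before descending into the link graphs of the large external sites, which is the behaviour the question asks for.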