Scrapy: crawl an entire website before following links to other sites
I use a Scrapy crawler to crawl web pages indefinitely; my script runs with DEPTH_LIMIT = 0.
I have two main problems:
1. My crawler starts following other websites before it has completely crawled the first site in start_urls.
2. The crawler gets stuck on huge sites like tumblr or youtube and keeps crawling billions of their pages. How can I avoid this? I can't list every large website in the deny variable.
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError


class MyItem(Item):
    url = Field()


class HttpbinSpider(CrawlSpider):
    name = "expired"
    start_urls = ['http://www.siteA.com']

    rules = (
        Rule(LinkExtractor(
                 allow=('.com', '.fr', '.net', '.org', '.info', '.casino'),
                 deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free',
                       'reddit', 'videos', 'youtube', 'google', 'doubleclick',
                       'microsoft', 'yahoo', 'bing', 'znet', 'stackexchang',
                       'twitter', 'wikipedia', 'creativecommons', 'mediawiki',
                       'wikidata')),
             process_request='add_errback',
             follow=True),
    )

    custom_settings = {
        'RETRY_ENABLED': True,
        'DEPTH_LIMIT': 0,
        'LOG_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
        'CONCURRENT_REQUESTS': 64,
    }

    def add_errback(self, request):
        # Attach the error callback to every request generated by the rule.
        self.logger.debug("add_errback: patching %r" % request)
        return request.replace(errback=self.errback_httpbin)

    def errback_httpbin(self, failure):
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            request = failure.request
            self.logger.info('Domain expired : %s', request.url)

        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
From the fine manual:
Does Scrapy crawl in breadth-first or depth-first order?
By default, Scrapy uses a LIFO queue for storing pending requests,
which basically means that it crawls in DFO order. This order is more
convenient in most cases. If you do want to crawl in true BFO order,
you can do it by setting the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
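Since those are ordinary Scrapy settings, they can be merged straight into the spider's custom_settings from the question. Below is a minimal sketch reusing the class and values shown above; whether breadth-first order alone keeps the crawl on the first site long enough for your case is an assumption, not something the FAQ guarantees:

class HttpbinSpider(CrawlSpider):
    name = "expired"
    start_urls = ['http://www.siteA.com']

    custom_settings = {
        'RETRY_ENABLED': True,
        'DEPTH_LIMIT': 0,
        'LOG_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
        'CONCURRENT_REQUESTS': 64,
        # Switch the scheduler to FIFO queues so pending requests are
        # processed breadth-first instead of the default depth-first order.
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

With breadth-first order the scheduler drains shallow pages from the start site before descending into the link graphs of the large external sites, which is the behaviour the question asks for.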