What's a better way to limit URLs crawled for each starting URL in scrapy?
I have a list of roughly 250 website URLs, and for each site I need to collect the URLs of all of its pages. One problem is that some sites are so large that my program keeps crawling indefinitely. I tried to put a limit on this with the following code, but it doesn't work:
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import IgnoreRequest
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy import Spider


class MySpider(Spider):
    name = "spider"
    allowed_domains = [
        MY_250_DOMAINS_GO_HERE
    ]
    start_urls = []
    for domain in allowed_domains:
        start_urls.append('http://%s' % domain)
    output_file = open("iterable_links.txt", "w+")
    LIMIT = 10
    count = 0

    def parse(self, response):
        if self.count >= self.LIMIT:
            raise IgnoreRequest()
            #raise CloseSpider(f"Scraped {self.LIMIT} items. Eject!")
        self.count += 1
        le = LinkExtractor()
        domain = response.url.replace("http://", "").replace("https://", "").split("/")[0]
        links = le.extract_links(response)
        links = [k for k in links if domain in k.url]
        output_file = open("iterable_links.txt", "a+")
        for link in links:
            output_file.write("'" + link.url + "',\n")
            yield Request(link.url, callback=self.parse)
"""
REFERENCE:
"""
Use 'DEPTH_LIMIT' (the default is 0, which means unlimited depth):
class MySpider(Spider):
    name = "spider"
    allowed_domains = [
        MY_250_DOMAINS_GO_HERE
    ]
    custom_settings = {
        'DEPTH_LIMIT': 3,
    }
.................
.................
.................
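If what you actually want is a hard cap on the number of pages crawled per starting URL (DEPTH_LIMIT bounds how many links deep the crawl goes, not how many pages it fetches), one option is to keep a per-domain counter in the spider and stop scheduling new requests for a domain once its cap is reached. The sketch below is not from the original answer; the LIMIT value, the urlparse-based domain key, and the LimitedSpider name are all illustrative assumptions.

from urllib.parse import urlparse

from scrapy import Spider
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor


class LimitedSpider(Spider):
    name = "limited_spider"
    allowed_domains = [
        # MY_250_DOMAINS_GO_HERE
    ]
    start_urls = ['http://%s' % d for d in allowed_domains]
    LIMIT = 10  # hypothetical per-domain page cap

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_counts = {}  # pages parsed so far, keyed by domain

    def parse(self, response):
        domain = urlparse(response.url).netloc
        self.page_counts[domain] = self.page_counts.get(domain, 0) + 1

        # Record every link found on this page, but only follow links
        # while this domain is still under its cap.
        le = LinkExtractor(allow_domains=[domain])
        with open("iterable_links.txt", "a") as output_file:
            for link in le.extract_links(response):
                output_file.write("'" + link.url + "',\n")
                if self.page_counts[domain] < self.LIMIT:
                    yield Request(link.url, callback=self.parse)

Note that requests already queued when the cap is hit will still be downloaded, so the per-domain total can overshoot slightly. Scrapy's CLOSESPIDER_PAGECOUNT setting is another safeguard, but it closes the whole spider rather than limiting a single site.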