What's a better way to limit URLs crawled for each starting URL in scrapy?
I have a list of roughly 250 website URLs, and for each site I need to collect the URLs of all of its pages. One problem is that some sites are so large that my program keeps crawling indefinitely. I tried to put a limit on this with the following code, but it doesn't work:
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import IgnoreRequest
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy import Spider


class MySpider(Spider):
    name = "spider"
    allowed_domains = [
        MY_250_DOMAINS_GO_HERE
    ]
    start_urls = []
    for domain in allowed_domains:
        start_urls.append('http://%s' % domain)
    output_file = open("iterable_links.txt", "w+")
    LIMIT = 10
    count = 0

    def parse(self, response):
        if self.count >= self.LIMIT:
            raise IgnoreRequest()
            #raise CloseSpider(f"Scraped {self.LIMIT} items. Eject!")
        self.count += 1
        le = LinkExtractor()
        domain = response.url.replace("http://", "").replace("https://", "").split("/")[0]
        links = le.extract_links(response)
        links = [k for k in links if domain in k.url]
        output_file = open("iterable_links.txt", "a+")
        for link in links:
            output_file.write("'" + link.url + "',\n")
            yield Request(link.url, callback=self.parse)
"""
REFERENCE:
"""
Use 'DEPTH_LIMIT' (the default is 0, which means unlimited depth):
class MySpider(Spider):
    name = "spider"
    allowed_domains = [
        MY_250_DOMAINS_GO_HERE
    ]
    custom_settings = {
        'DEPTH_LIMIT': 3,
    }
.................
.................
.................
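If what you actually want is a hard cap on the number of pages crawled per starting URL (DEPTH_LIMIT bounds how many links deep the crawl goes, not how many pages it fetches), one option is to keep a per-domain counter in the spider and stop scheduling new requests for a domain once its cap is reached. The sketch below is not from the original answer; the LIMIT value, the urlparse-based domain key, and the LimitedSpider name are all illustrative assumptions.

from urllib.parse import urlparse

from scrapy import Spider
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor


class LimitedSpider(Spider):
    name = "limited_spider"
    allowed_domains = [
        # MY_250_DOMAINS_GO_HERE
    ]
    start_urls = ['http://%s' % d for d in allowed_domains]
    LIMIT = 10  # hypothetical per-domain page cap

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_counts = {}  # pages parsed so far, keyed by domain

    def parse(self, response):
        domain = urlparse(response.url).netloc
        self.page_counts[domain] = self.page_counts.get(domain, 0) + 1

        # Record every link found on this page, but only follow links
        # while this domain is still under its cap.
        le = LinkExtractor(allow_domains=[domain])
        with open("iterable_links.txt", "a") as output_file:
            for link in le.extract_links(response):
                output_file.write("'" + link.url + "',\n")
                if self.page_counts[domain] < self.LIMIT:
                    yield Request(link.url, callback=self.parse)

Note that requests already queued when the cap is hit will still be downloaded, so the per-domain total can overshoot slightly. Scrapy's CLOSESPIDER_PAGECOUNT setting is another safeguard, but it closes the whole spider rather than limiting a single site.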