Scrapy: keep all unique pages based on a list of start URLs

I want to give Scrapy a list of start URLs and have it visit every link on each start page. For each link, if it has not visited that page before, I want it to download the page and save it locally. How can I do this?

Set the default parse callback to extract all of the links. By default, Scrapy does not visit the same page twice (duplicate requests are filtered out).

from hashlib import sha1

from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Extract every link on the page and schedule a request for each one.
    links = LinkExtractor().extract_links(response)
    return (Request(url=link.url, callback=self.parse_page) for link in links)

def parse_page(self, response):
    # Derive a unique local file name from the URL, e.g. by hashing it.
    name = sha1(response.url.encode('utf-8')).hexdigest() + '.html'
    with open(name, 'wb') as f:
        f.write(response.body)
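
The same idea can also be packaged into a complete spider using Scrapy's CrawlSpider and a Rule, so the link extraction does not have to be wired up by hand. This is only a minimal sketch: the spider name, the start URL, and the hash-based file naming are placeholder assumptions, not part of the answer above.

import hashlib

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SavePagesSpider(CrawlSpider):
    # Hypothetical name and start URLs; replace with your own list.
    name = 'save_pages'
    start_urls = ['https://example.com']

    # Extract every link on the start pages and hand each response to parse_page.
    # Scrapy's built-in duplicate filter drops URLs that were already requested,
    # so each unique page is fetched only once. Set follow=True to keep crawling
    # links found on the downloaded pages as well.
    rules = (Rule(LinkExtractor(), callback='parse_page', follow=False),)

    def parse_page(self, response):
        # Hash the URL to get a unique, filesystem-safe file name (an assumption;
        # any scheme that maps URLs to unique names works).
        filename = hashlib.sha1(response.url.encode('utf-8')).hexdigest() + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Saved to a file such as save_pages.py, this can be run with scrapy runspider save_pages.py without creating a full Scrapy project.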