Scrapy: keep all unique pages based on a list of start URLs
I want to give Scrapy a list of start URLs and have it visit every link on each of those start pages. For every link, if that page has not been visited before, I want to download it and save it locally. How can I do this?
Set the default parse callback to extract all the links. By default, Scrapy does not visit the same page twice.
from scrapy import Request
from scrapy.linkextractors import LinkExtractor

# These callbacks belong on a scrapy.Spider subclass:
def parse(self, response):
    links = LinkExtractor().extract_links(response)
    return (Request(url=link.url, callback=self.parse_page) for link in links)

def parse_page(self, response):
    # name = manipulate response.url to be a unique file name
    with open(name, 'wb') as f:
        f.write(response.body)
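For reference, here is a minimal self-contained sketch of the whole spider. The spider name, the placeholder start URL, and the SHA-1-based file-naming scheme are illustrative assumptions, not something the answer above prescribes; deduplication itself comes from Scrapy's built-in request fingerprinting, so no extra bookkeeping is needed.

import hashlib

import scrapy
from scrapy.linkextractors import LinkExtractor


class PageSaverSpider(scrapy.Spider):
    # "page_saver" and the start URL below are placeholders; supply your own list.
    name = "page_saver"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Follow every link on the start page; Scrapy's duplicate filter
        # silently drops requests whose fingerprint has already been seen.
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(url=link.url, callback=self.parse_page)

    def parse_page(self, response):
        # One illustrative way to turn the URL into a unique, filesystem-safe
        # file name: hash it (any other unique naming scheme works as well).
        name = hashlib.sha1(response.url.encode("utf-8")).hexdigest() + ".html"
        with open(name, "wb") as f:
            f.write(response.body)

Running it with scrapy runspider page_saver.py writes one file per unique page into the working directory. If you ever do need to revisit a page, passing dont_filter=True to a Request bypasses the duplicate filter for that request.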