Scrapy | How to make a request and get all links
I have a function that gets all the links from the first page.
How can I create another function that requests each link in the list and gets all the links from the second page's response?
import scrapy

list = []

class SuperSpider(scrapy.Spider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            link = str(link).strip()
            if link not in list:
                list.append(link)
Your use case is a perfect fit for a Scrapy CrawlSpider. Note that the allowed_domains setting is very important here, because it defines which domains will be crawled. If you remove it, your spider will crawl off-site without limit, following every link it finds on every page. See the example below.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NytimesSpider(CrawlSpider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
    }

    rules = (
        Rule(LinkExtractor(allow=r''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            "url": response.url
        }
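If you would rather keep a plain scrapy.Spider, here is a minimal sketch of the two-callback approach the question asks about: parse follows each link from the first page and hands the response to a second callback, which then collects the links on that page. The parse_second_page name is just illustrative, not a Scrapy API:

import scrapy

class SuperSpider(scrapy.Spider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    def parse(self, response):
        # First page: follow every link and send the response
        # to the second callback.
        for link in response.xpath('//a/@href').getall():
            yield response.follow(link.strip(), callback=self.parse_second_page)

    def parse_second_page(self, response):
        # Second page: yield every link found in this response.
        for link in response.xpath('//a/@href').getall():
            yield {"url": response.urljoin(link.strip())}

response.follow resolves relative URLs for you, and Scrapy's built-in duplicate filter drops requests for URLs it has already seen, so the manual list from the question is not needed. If you only want to go one level deep, setting DEPTH_LIMIT in custom_settings is an easy way to stop the crawl after the second page.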