Scrapy - Scrape multiple URLs using results from the first URL

  1. I am using Scrapy to scrape data, starting from a first URL.
  2. The response from that first URL contains a list of further URLs.

So far so good. My question is: how do I go on to scrape that list of URLs? After searching, I know I can return a Request from parse, but it seems that only handles a single URL.

Here is my parse method:

def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(list[0])
    # It works, but how can I continue b.com and c.com?

Can I do something like this instead?

def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]

    for link in list:
        scrapy.Request(link)
        # This is wrong, though I need something like this

Full version:

import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]

        for link in list:
            scrapy.Request(link)
            # This is wrong, though I need something like this

For this, you can subclass scrapy.Spider and define a list of start URLs. Scrapy will then request each of those URLs and pass every response to parse.

Like this:

import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass

You can find more information in Scrapy's official documentation.

I think what you are looking for is the yield statement:

def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]

    for link in list:
        request = scrapy.Request(link)  # callback defaults to self.parse
        yield request
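The difference matters because return exits parse after building the first request, while yield turns parse into a generator that hands every request back to Scrapy's scheduler. A minimal plain-Python sketch of that difference (no Scrapy required; the Request(...) strings here merely stand in for real scrapy.Request objects):

```python
def parse_with_return(links):
    # Mirrors the question's code: returns on the first iteration,
    # so only one "request" is ever produced.
    for link in links:
        return f"Request({link})"

def parse_with_yield(links):
    # Mirrors the answer's pattern: a generator that produces
    # one "request" per link.
    for link in links:
        yield f"Request({link})"

links = ["http://a.com", "http://b.com", "http://c.com"]
print(parse_with_return(links))       # Request(http://a.com) -- stops at the first link
print(list(parse_with_yield(links)))  # all three Request strings
```

Scrapy iterates over whatever parse yields, so the generator version schedules all three URLs.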
# within your parse method:

urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    yield scrapy.Request(url, callback=self.parse)

This should work.