Scrapy multiple spiders

Question

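I have defined two spiders that do the following:

Spider A:

Visits the home page.
Extracts all the links from the page and stores them in a text file.

This is necessary because the home page has a More Results button that generates further links to different products.

Spider B:

Opens the text file.
Crawls the individual pages and saves the information.

I am trying to combine the two into a single crawl-spider.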

The URL structure of the home page is similar to:

http://www.example.com

The URL structure of the individual pages is similar to:

http://www.example.com/Home/Detail?id=some-random-number

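The text file contains the list of URLs to be crawled by the second spider.

My question:

How can I combine the two spiders so that a single spider does the complete crawl?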

Answer 1

From the Scrapy documentation:

In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

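So all you really need to do is, in the parse method (where you extract the links), yield a new request for each link, like:

yield self.make_requests_from_url("http://www.example.com/Home/Detail?id=some-random-number")

self.make_requests_from_url is already implemented in Spider.

An example: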

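from scrapy import Spider

from myproject.items import MyItem  # hypothetical Item with a user_name field


class MySpider(Spider):

    name = "my_spider"

    def parse(self, response):
        try:
            # First matching href; extract()[0] raises IndexError when nothing matches.
            user_name = response.xpath('//*[@id="ft"]/a/@href').extract()[0]
            # Follow the link; by default its response comes back to parse().
            yield self.make_requests_from_url("https://example.com/" + user_name)
            yield MyItem(user_name=user_name)
        except IndexError:
            # No matching link on this page, so there is nothing to yield.
            pass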

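You can handle the other requests with a different parse function by returning a Request object and specifying the callback explicitly (self.make_requests_from_url calls the parse function by default):

Request(url=url, callback=self.parse_user_page)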
