Scrapy: Spider optimization

Question

I'm trying to scrape an e-commerce website, and I'm doing it in two steps.

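The website is structured like this:

The homepage has links to the family and subfamily item pages
Each family and subfamily page has a paginated list of products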

Right now I have 2 spiders:

GeneralSpider gets the homepage links and stores them
ItemSpider gets the elements from each page

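I'm completely new to Scrapy and I'm following some tutorials to achieve this. I'm wondering how complex the parse functions can be and how rules work. My spiders currently look like this: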

GeneralSpider:

class GeneralSpider(CrawlSpider):

    name = 'domain'
    allowed_domains = ['domain.org']
    start_urls = ['http://www.domain.org/home']

    def parse(self, response):
        links = LinksItem()
        links['content'] = response.xpath("//div[@id='h45F23']").extract()
        return links

ItemSpider:

class ItemSpider(CrawlSpider):

    name = 'domain'
    allowed_domains = ['domain.org']
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    # Each URL in the file has pagination if it has more than 30 elements
    # I don't know how to paginate over each URL
    f.close()

    def parse(self, response):
        item = ShopItem()
        item['name'] = response.xpath("//h1[@id='u_name']").extract()
        item['description'] = response.xpath("//h3[@id='desc_item']").extract()
        item['prize'] = response.xpath("//div[@id='price_eur']").extract()
        return item

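Which is the best way to make the spider follow the pagination of a URL?
If the pagination is JQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have the spiders specialized, each spider focused on one thing?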

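I've also googled for any book related to Scrapy, but it seems there isn't any finished book yet, or at least I couldn't find one.

Does anyone know of a Scrapy book that will be released soon?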

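Edit:

These 2 URLs fit this example. On the Eroski Home page you can get the URLs of the product pages.

On a product page you have a paginated list of items (Eroski items):

URL to get links: Eroski Home
URL to get items: Eroski Fruits

On the Eroski Fruits page, the pagination of the items seems to be JQuery/AJAX, because more items are shown when you scroll down. Is there a way to get all of these items with Scrapy?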

Answer 1

Which is the best way to make the spider follow the pagination of a URL?

This is very site-specific and depends on how the pagination is implemented.

If the pagination is JQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?

This is exactly your use case: the pagination is done through additional AJAX calls, which you can simulate inside your Scrapy spider.

Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have the spiders specialized, each spider focused on one thing?

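Yes, the "rules" mechanism provided by CrawlSpider is a very powerful piece of technology. It is highly configurable: you can have multiple rules, some of which follow only links that match specific criteria or that are located in a specific section of a page. Having a single spider with multiple rules should be preferred over having multiple spiders.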

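Speaking about your specific use case, here is the idea:

Create a rule to follow the categories and subcategories in the navigation menu of the home page; this is where restrict_xpaths helps
In the callback, for every category or subcategory, yield a Request that mimics the AJAX request your browser sends when you open a category page
In the AJAX response handler (callback), parse the available items and yield another Request for the same category/subcategory, but with the page GET parameter incremented (to get the next page)

Example working implementation: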

import re
from urllib.parse import urlencode  # Python 3 replacement for Python 2's urllib.urlencode

import scrapy
from scrapy.spiders import CrawlSpider, Rule  # scrapy.contrib.* import paths were deprecated in Scrapy 1.0
from scrapy.linkextractors import LinkExtractor


class ProductItem(scrapy.Item):
    description = scrapy.Field()
    price = scrapy.Field()


class GrupoeroskiSpider(CrawlSpider):
    name = 'grupoeroski'
    allowed_domains = ['compraonline.grupoeroski.com']
    start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']

    rules = [
        # Follow only the links inside the navigation menu
        Rule(LinkExtractor(restrict_xpaths='//div[@class="navmenu"]'), callback='parse_categories')
    ]

    def parse_categories(self, response):
        # The numeric prefixes of the URL path segments are the category, group and family IDs
        pattern = re.compile(r'/(\d+)\-\w+')
        groups = pattern.findall(response.url)
        if not groups:
            return  # not a category URL, nothing to request

        params = {'page': 1, 'categoria': groups.pop(0)}

        if groups:
            params['grupo'] = groups.pop(0)
        if groups:
            params['familia'] = groups.pop(0)

        # Mimic the AJAX request the browser sends when a category page is opened
        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})

    def parse_products(self, response):
        products = response.xpath('//div[@class="product_element"]')
        if not products:
            # Assumption: an empty page means the last page was passed, so stop paginating
            return

        for product in products:
            item = ProductItem()
            item['description'] = product.xpath('.//span[@class="description_1"]/text()').get()
            item['price'] = product.xpath('.//div[@class="precio_line"]/p/text()').get()
            yield item

        # Request the next page of the same category/subcategory
        params = dict(response.meta['params'])  # copy, so the previous request's meta is not mutated
        params['page'] += 1

        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})
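
To make the URL handling in parse_categories easier to follow, here is that step in isolation, using only the standard library. The helper names category_params and next_page are invented for this sketch, and the category URL is a hypothetical example of the path shape the regular expression expects, not a verified address.

```python
import re
from urllib.parse import urlencode

AJAX_URL = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?'


def category_params(url):
    """Map the numeric IDs in a category URL to the AJAX query parameters."""
    ids = re.findall(r'/(\d+)-\w+', url)
    params = {'page': 1}
    # First ID is the category, then the group, then the family (if present)
    params.update(zip(('categoria', 'grupo', 'familia'), ids))
    return params


def next_page(params):
    """Return a copy of params that points at the following page."""
    return dict(params, page=params['page'] + 1)


# Hypothetical category URL, used only to illustrate the mapping
url = 'http://www.compraonline.grupoeroski.com/supermercado/2059698-Frescos/2059699-Frutas/'
params = category_params(url)
print(AJAX_URL + urlencode(params))
print(AJAX_URL + urlencode(next_page(params)))
```

The two printed URLs differ only in the page value, which is all the spider changes between consecutive requests for the same category.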

Hope this is a good starting point for you.

Does anyone know of a Scrapy book that will be released soon?

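Nothing specific that I can recall.

Though I've heard that some publisher has plans to maybe release a book about web scraping, but I'm not supposed to tell you that.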
