Crawling through web pages that have categories

Question

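I'm trying to scrape a website with an unusual page structure, going page after page until I reach the item I want to extract data from.

Edit: thanks to the answers, I have been able to extract most of the data I need, but I still need the path links that lead to each product.

This is my code so far: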

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):

    name = 'drapertools.com'
    start_urls = ['https://www.drapertools.com/category/0/Product%20Range']

    rules = (
        Rule(LinkExtractor(allow=['/category-?.*?/'])),
        Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
    )

    def parse_product(self, response):

        yield {
            'product_name': response.xpath('//div[@id="product-title"]//h1[@class="text-primary"]/text()').extract_first(),
            'product_number': response.xpath('//div[@id="product-title"]//h1[@style="margin-bottom: 20px; color:#000000; font-size: 23px;"]/text()').extract_first(),
            'product_price': response.xpath('//div[@id="product-title"]//p/text()').extract_first(),
            'product_desc': response.xpath('//div[@class="col-md-6 col-sm-6 col-xs-12 pull-left"]//div[@class="col-md-11 col-sm-11 col-xs-11"]//p/text()').extract_first(),
            'product_path': response.xpath('//div[@class="nav-container"]//ol[@class="breadcrumb"]//li//a/text()').extract(),
            # @href selects the attribute value of each breadcrumb link
            'product_path_links': response.xpath('//div[@class="nav-container"]//ol[@class="breadcrumb"]//li//a/@href').extract(),
        }

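I don't know whether this works. Can anyone help me? I would really appreciate it.

More info: I'm trying to visit every category and every item inside it, but some categories contain further subcategories, sometimes several levels deep, before I reach the items.

I'm thinking of using Guillaume's LinkExtractor code, but I'm not sure whether it is the right tool for the result I want...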

rules = (
    Rule(LinkExtractor(allow=['/category-?.*?/'])),
    Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
)

Answer 1

All of your pages share the same structure, so maybe you can shorten it?

import scrapy

class DraperToolsSpider(scrapy.Spider):
    name = 'drapertools_spider'
    start_urls = ["https://www.drapertools.com/category/0/Product%20Range"]


    def parse(self, response):
        # this will call self.parse by default for all your categories
        for url in response.css('.category p a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(url))

        # here you can add an "if" to extract details only on certain pages
        yield from self.parse_details(response)

    def parse_details(self, response):
        # placeholder: build and yield the item fields for this page here
        yield {}
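
As a rough illustration of the two decisions `parse` makes above (turning each extracted `href` into an absolute URL, and the optional "if" that limits detail extraction to certain pages), here is a sketch that uses only the standard library, so it can be tried without running a crawl. `response.urljoin(url)` resolves a link against the page URL much like `urllib.parse.urljoin` does here. The sample hrefs and the `/product/` check in the hypothetical helper `is_product_page` are assumptions for illustration, not values taken from the site.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://www.drapertools.com/category/0/Product%20Range"

# Hypothetical href values, as they might come back from
# response.css('.category p a::attr(href)').extract()
sample_hrefs = [
    "/category/1/Hand%20Tools",                           # root-relative link
    "https://www.drapertools.com/product/12345/Example",  # already absolute
]


def absolute_links(page_url, hrefs):
    """Resolve every href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


def is_product_page(url):
    """Assumed check: product pages have a path starting with /product/."""
    return urlparse(url).path.startswith("/product/")


links = absolute_links(PAGE_URL, sample_hrefs)
for link in links:
    label = "product page" if is_product_page(link) else "keep crawling"
    print(link, "->", label)
```

Inside the spider, a check like this could guard the call to `parse_details`, for example `if is_product_page(response.url):`.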

Answer 2

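Why not use a CrawlSpider instead? It is a perfect fit for this use case!

It recursively follows the links on every page that match its rules, and it invokes a callback only for the links you care about (I'm assuming that means the products).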

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):

    name = 'drapertools.com'
    start_urls = ['https://www.drapertools.com/category/0/Product%20Range']

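    rules = (
        # no callback: matching category links are only followed (follow defaults to True)
        Rule(LinkExtractor(allow=['/category-?.*?/'])),
        # with a callback: each product page is passed to parse_product
        Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
    )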

    def parse_product(self, response):

        yield {
            'product_name': response.xpath('//div[@id="product-title"]//h1[@class="text-primary"]/text()').extract_first(),
        }
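
To get a feel for which links the two rules would pick up, the `allow` patterns can be tried on their own. `LinkExtractor` treats each `allow` entry as a regular expression and keeps a link when the pattern is found in its absolute URL; the sketch below imitates that with `re.search` and checks the patterns in the same order as the rules. The helper `classify` is hypothetical, and every URL other than the start URL is made up for illustration.

```python
import re

# The two patterns from the rules above, in the same order
CATEGORY_PATTERN = re.compile(r'/category-?.*?/')
PRODUCT_PATTERN = re.compile(r'/product/')


def classify(url):
    """Return which rule (if any) a URL would fall under in this sketch."""
    if CATEGORY_PATTERN.search(url):
        return 'category'   # would only be followed
    if PRODUCT_PATTERN.search(url):
        return 'product'    # would be sent to parse_product
    return 'ignored'


urls = [
    'https://www.drapertools.com/category/0/Product%20Range',  # start URL
    'https://www.drapertools.com/product/12345/Example',       # hypothetical
    'https://www.drapertools.com/contact',                     # hypothetical
]

for url in urls:
    print(classify(url), url)
```

Note that the category pattern is quite loose: it matches any URL containing `/category` followed later by a `/`, so it may be worth tightening once the real URL layout is known.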
