How can I get all of the img src values using Scrapy?

Question

I tried this in the Scrapy shell:

$ scrapy shell 'https://www.trendyol.com/trendyolmilla/cok-renkli-desenli-elbise-twoss20el0573-p-36294862'
>>> response.css("div.slick-slide img").xpath("@src").getall()

The output is:

['/Content/images/defaultThumb.jpg', '/Content/images/defaultThumb.jpg', '/Content/images/defaultThumb.jpg', '/Content/images/defaultThumb.jpg', '/Content/images/defaultThumb.jpg', 'https://cdn.dsmcdn.com/mnresize/415/622/ty124/product/media/images/20210602/12/94964657/64589619/1/1_org_zoom.jpg', 'https://cdn.dsmcdn.com/mnresize/415/622/ty124/product/media/images/20210602/12/94964657/64589619/1/1_org_zoom.jpg']

It collects only one image, but the page at the given link has 5 images. Please help me solve this. How can I find every image src?

Answer 1

Explanation

You are actually trying to get the data from HTML tags that contain only one link. To get all the links, you have to read them from the script tag instead.
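The output in the question already shows this. Once the placeholder thumbnails and the repeated entry are filtered out, a single real URL remains. A quick check in plain Python, where `srcs` is copied from the output above and the names `real` and `unique_real` are only for illustration:

```python
# The src values returned by the CSS selector in the question.
srcs = [
    '/Content/images/defaultThumb.jpg',
    '/Content/images/defaultThumb.jpg',
    '/Content/images/defaultThumb.jpg',
    '/Content/images/defaultThumb.jpg',
    '/Content/images/defaultThumb.jpg',
    'https://cdn.dsmcdn.com/mnresize/415/622/ty124/product/media/images/20210602/12/94964657/64589619/1/1_org_zoom.jpg',
    'https://cdn.dsmcdn.com/mnresize/415/622/ty124/product/media/images/20210602/12/94964657/64589619/1/1_org_zoom.jpg',
]

# Drop the placeholder thumbnails, then de-duplicate while keeping order.
real = [s for s in srcs if not s.endswith('defaultThumb.jpg')]
unique_real = list(dict.fromkeys(real))

print(len(unique_real))  # 1
```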

This returns the JSON string and stores it in the text variable:

text = response.xpath("//p/script[contains(@type,'application/ld+json')]/text()").extract_first()

Load it to convert it into a Python dictionary:

json_text = json.loads(text)

Now get the images through the 'image' key with json_text.get('image').
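To try these steps without hitting the site, here is a minimal offline sketch. It swaps Scrapy's XPath for the standard library's `html.parser`, and the sample HTML, the `example.com` URLs, and the names `JsonLdExtractor` and `image_urls` are all made up for illustration. It also assumes the 'image' value may be either a single URL string or a list of URLs.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self._in_json_ld = True
            self.blocks.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_json_ld = False

    def handle_data(self, data):
        # Script content can arrive in several chunks, so append to the last block.
        if self._in_json_ld:
            self.blocks[-1] += data


def image_urls(json_ld_text):
    """Return the 'image' value of a JSON-LD string as a list of URLs."""
    data = json.loads(json_ld_text)
    image = data.get('image', [])
    # Assumption: 'image' is a single URL string or a list of URL strings.
    return [image] if isinstance(image, str) else list(image)


# Made-up sample page; the example.com URLs are placeholders, not real data.
SAMPLE_HTML = """
<html><body>
<p><script type="application/ld+json">
{"@type": "Product", "image": ["https://example.com/1.jpg", "https://example.com/2.jpg"]}
</script></p>
</body></html>
"""

parser = JsonLdExtractor()
parser.feed(SAMPLE_HTML)
urls = image_urls(parser.blocks[0])
print(urls)
```

Wrapping a lone string in a list keeps the calling code the same whether a page lists one image or several.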

Code

Run this code with Scrapy. The output will give you all 5 links.

import json

import scrapy
from scrapy import Request


class Trendyol(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        url = 'https://www.trendyol.com/trendyolmilla/cok-renkli-desenli-elbise-twoss20el0573-p-36294862'
        yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # The image links are listed in the JSON-LD <script> tag.
        text = response.xpath("//p/script[contains(@type,'application/ld+json')]/text()").extract_first()
        # Convert the JSON string into a Python dictionary.
        json_text = json.loads(text)

        # The 'image' key holds the image links.
        print(json_text.get('image'))
