Saving (.svg) images using Scrapy

Question

I'm using Scrapy and I want to save some .svg images from a web page locally on my computer. The image URLs have the structure '__.com/svg/4/8/3/1425.svg' (and each is a full working URL, including https).

I have defined the item in my items.py file:

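import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()  # URLs for the pipeline to download
    images = scrapy.Field()      # filled in by the pipeline with download results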

I added the following to my settings:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = '../Data/Silks'
MEDIA_ALLOW_REDIRECTS = True

In my main parse function I call:

imageItem = ImageItem()
imageItem['image_urls'] = [url]

yield imageItem

But the images are not saved. I have followed the documentation and tried many things, but I keep getting the following error:

StopIteration: <200 https://www.________.com/svg/4/8/3/1425.svg>

During handling of the above exception, another exception occurred:
......
......
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x1139233b0>

Am I missing something? Can anyone help? I'm completely stumped.

Answer 1

Gallaecio was right! Scrapy was having trouble with the .svg file type. I switched from ImagesPipeline to FilesPipeline and it worked!

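For anyone else who gets stuck, this is covered in the Scrapy documentation on downloading and processing files and images.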
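For illustration, the switch described above might look roughly like this. It is a minimal sketch, not the asker's actual code: it assumes a reasonably recent Scrapy version in which plain dict items are accepted, reuses the storage path from the question, and `build_file_item` is a hypothetical helper name. FilesPipeline reads URLs from `file_urls` (not `image_urls`) and stores downloads under `FILES_STORE` (not `IMAGES_STORE`).

```python
# settings.py (sketch): use FilesPipeline, which saves the response bytes
# as-is and never tries to open them with PIL.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "../Data/Silks"   # same directory as in the question
MEDIA_ALLOW_REDIRECTS = True


# Spider side (sketch): FilesPipeline looks for the "file_urls" field.
# build_file_item is a hypothetical helper, not part of Scrapy.
def build_file_item(url):
    """Return a dict item that FilesPipeline can pick up."""
    return {"file_urls": [url]}
```

In the parse function, `yield build_file_item(url)` would then take the place of yielding the ImageItem. With a `scrapy.Item` subclass instead of a dict, the fields would be declared as `file_urls` and `files`.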

Answer 2

The Python imaging library (PIL) used by ImagesPipeline does not support vector images.

If you still want to benefit from the ImagesPipeline features instead of switching to the more general FilesPipeline, you can do something along these lines:

from io import BytesIO

import scrapy
from reportlab.graphics import renderPM
from scrapy.pipelines.images import ImagesPipeline
from svglib.svglib import svg2rlg

class SvgCompatibleImagesPipeline(ImagesPipeline):

    def get_images(self, response, request, info, *, item=None):
        """
        Add processing of SVG images to the standard images pipeline
        """
        if isinstance(response, scrapy.http.TextResponse) and response.text.startswith('<svg'):
            b = BytesIO()
            renderPM.drawToFile(svg2rlg(BytesIO(response.body)), b, fmt='PNG')
            res = response.replace(body=b.getvalue())           
        else:
            res = response

        return super().get_images(res, request, info, item=item)

This replaces the SVG image in the response body with a PNG version, which the regular ImagesPipeline can then process further.
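One caveat with the `response.text.startswith('<svg')` check: many SVG files begin with an XML declaration, a DOCTYPE, or a comment, so they would be skipped. A possible alternative, sketched below using only the standard library, is to look at the root element instead. `looks_like_svg` is a hypothetical helper name, and this is only one way to do the detection, under the assumption that the body is well-formed XML.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

SVG_NAMESPACE = "{http://www.w3.org/2000/svg}"


def looks_like_svg(body: bytes) -> bool:
    """Return True if the root element of `body` is an <svg> element."""
    try:
        # The first "start" event belongs to the root element, so we can
        # decide right away without walking the rest of the document.
        for _event, element in ET.iterparse(BytesIO(body), events=("start",)):
            return element.tag in (SVG_NAMESPACE + "svg", "svg")
    except ET.ParseError:
        # Not well-formed XML (for example PNG or JPEG bytes).
        return False
    return False
```

In the pipeline above, the condition could then become `looks_like_svg(response.body)`, which also avoids depending on the response being a TextResponse.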
