Custom Files Pipeline in Scrapy never downloads files even though logs show all functions being accessed
I have the following custom pipeline for downloading JSON files. It worked fine until I needed to add an __init__ function, in which I add a few new attributes to my subclass of the FilesPipeline class. The pipeline takes URLs that point to API endpoints and downloads their responses. When running the spider via scrapy crawl myspider, the folders get created correctly and the two print statements in the file_path function show the correct values (the filename and the file path). However, the files are never actually downloaded.
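For reference, the item my spider yields looks roughly like this (the field names match the pipeline's FILES_URLS_FIELD / FILES_RESULT_FIELD attributes, and [testsite] is the same placeholder used in the log further down):

# roughly the item the spider yields; [testsite] stands in for the real site
item = {
    'url': 'https://[testsite].com/store/c/nature-made-super-b-complex,-tablets/ID=prod6149174-product',
    'json_url': 'https://[testsite].com/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails',
    # 'local_json' is expected to be populated by the pipeline after the download
}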
I did find some similar questions about custom file pipelines and files not downloading, but those solutions didn't work for me.

What am I doing wrong (or forgetting to do) when subclassing FilesPipeline? I've been banging my head against this for 3 hours and my google-fu hasn't turned up a solution for my case.
class LocalJsonFilesPipeline(FilesPipeline):

    FILES_STORE = "json_src"
    FILES_URLS_FIELD = "json_url"
    FILES_RESULT_FIELD = "local_json"

    def __init__(self, store_uri, use_response_url=False, filename_regex=None, settings=None):
        # super(LocalJsonFilesPipeline, self).__init__(store_uri)
        self.store_uri = store_uri
        self.use_response_url = use_response_url
        if filename_regex:
            self.filename_regex = re.compile(filename_regex)
        else:
            self.filename_regex = filename_regex
        super(LocalJsonFilesPipeline, self).__init__(store_uri, settings=settings)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.spider:
            return BasePipeline()
        store_uri = f'{cls.FILES_STORE}/{crawler.spider.name}'
        settings = crawler.spider.settings
        use_response_url = settings.get('JSON_FILENAME_USE_RESPONSE_URL', False)
        filename_regex = settings.get('JSON_FILENAME_REGEX')
        return cls(store_uri, use_response_url, filename_regex, settings)

    def parse_path(self, value):
        if self.filename_regex:
            try:
                return self.filename_regex.findall(value)[0]
            except IndexError:
                pass
        # fallback method in the event no regex is provided by the spider
        # example: /p/russet-potatoes-5lb-bag-good-38-gather-8482/-/A-77775602
        link_path = os.path.splitext(urlparse(value).path)[0]  # omit extension if there is one
        link_params = link_path.rsplit('/', 1)[1]  # preserve the last portion separated by forward-slash (A-77775602)
        return link_params if '=' not in link_params else link_params.split('=', 1)[1]

    def get_media_requests(self, item, info):
        json_url = item.get(self.FILES_URLS_FIELD)
        if json_url:
            filename_url = json_url if not self.use_response_url else item.get('url', '')
            return [Request(json_url, meta={'filename': self.parse_path(filename_url), 'spider': info.spider.name})]

    def file_path(self, request, response=None, info=None):
        final_path = f'{self.FILES_STORE}/{request.meta["spider"]}/{request.meta["filename"]}.json'
        print('url', request.url)
        print('downloading to', final_path)
        return final_path
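To show that the filename logic itself behaves as intended, here's the fallback branch of parse_path pulled out as a quick standalone check (example.com stands in for a real host; the path comes from the comment in the code above):

from urllib.parse import urlparse
import os

# same fallback logic as parse_path, lifted out for a quick sanity check
url = 'https://example.com/p/russet-potatoes-5lb-bag-good-38-gather-8482/-/A-77775602'
link_path = os.path.splitext(urlparse(url).path)[0]
link_params = link_path.rsplit('/', 1)[1]
print(link_params if '=' not in link_params else link_params.split('=', 1)[1])
# prints: A-77775602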
And the custom settings of my spider:
class MockSpider(scrapy.Spider):

    name = 'mock'

    custom_settings = {
        'ITEM_PIPELINES': {
            'mock.pipelines.LocalJsonFilesPipeline': 200
        },
        'JSON_FILENAME_REGEX': r'products\/(.+?)\/ProductInfo\+ProductDetails'
    }
The log, with the level set to DEBUG:
C:\Users\Mike\Desktop\scrapy_test\pipeline_test>scrapy crawl testsite
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: pipeline_test)
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Windows-7-6.1.7601-SP1
2020-07-19 11:23:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-19 11:23:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pipeline_test',
'LOG_STDOUT': True,
'NEWSPIDER_MODULE': 'pipeline_test.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['pipeline_test.spiders']}
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet Password: 0454b083dfd2028a
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled item pipelines:
['pipeline_test.pipelines.LocalJsonFilesPipeline']
2020-07-19 11:23:08 [scrapy.core.engine] INFO: Spider opened
2020-07-19 11:23:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.[testsite].com/robots.txt> (referer: None)
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails> (referer: None)
2020-07-19 11:23:08 [stdout] INFO: url
2020-07-19 11:23:08 [stdout] INFO: https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails
2020-07-19 11:23:08 [stdout] INFO: downloading to
2020-07-19 11:23:08 [stdout] INFO: json_src/[testsite]/prod6149174-product.json
2020-07-19 11:23:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails>
{'json_url': 'https://[testsite].com/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails',
'local_json': [],
 'url': 'https://[testsite].com/store/c/nature-made-super-b-complex,-tablets/ID=prod6149174-product'}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-19 11:23:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 506,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5515,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.468001,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 19, 15, 23, 9, 96399),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 14,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 7, 19, 15, 23, 8, 628398)}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Spider closed (finished)
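For what it's worth, the filename extraction is not the problem; the regex and URL from the log reproduce the printed filename on their own:

import re

# JSON_FILENAME_REGEX from the spider settings, URL from the crawl log above
regex = re.compile(r'products\/(.+?)\/ProductInfo\+ProductDetails')
url = 'https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails'
print(regex.findall(url)[0])  # prints: prod6149174-product

So file_path receives the right values; the download step just never happens.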
I finally figured it out. It turns out the FilesPipeline class doesn't have a from_crawler method of its own; instead it expects a from_settings method when you want to pass extra arguments into a subclassed/custom FilesPipeline. Below is my working version of the custom FilesPipeline.
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
import os
import re


class LocalFilesPipeline(FilesPipeline):

    FILES_STORE = "data_src"
    FILES_URLS_FIELD = "data_url"
    FILES_RESULT_FIELD = "local_file"

    def __init__(self, settings=None):
        """
        Attributes:
            use_response_url    indicates we want to grab the filename from the response url instead of json_url
            filename_regex      regexes to use for grabbing filenames out of urls
            filename_suffixes   suffixes to append to filenames when there are multiple files to download per item
            filename_extension  the file extension to append to each filename in the file_path function
        """
        self.use_response_url = settings.get('FILENAME_USE_RESPONSE_URL', False)
        self.filename_regex = settings.get('FILENAME_REGEX', [])
        self.filename_suffixes = settings.get('FILENAME_SUFFIXES', [])
        self.filename_extension = settings.get('FILENAME_EXTENSION', 'json')

        if isinstance(self.filename_regex, str):
            self.filename_regex = [self.filename_regex]
        if isinstance(self.filename_suffixes, str):
            self.filename_suffixes = [self.filename_suffixes]
        if self.filename_regex and self.filename_suffixes and len(self.filename_regex) != len(self.filename_suffixes):
            raise ValueError('FILENAME_REGEX and FILENAME_SUFFIXES settings must contain the same number of elements')
        if self.filename_regex:
            for i, f_regex in enumerate(self.filename_regex):
                self.filename_regex[i] = re.compile(f_regex)

        super(LocalFilesPipeline, self).__init__(self.FILES_STORE, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        return cls(settings=settings)

    def parse_path(self, value, index):
        if self.filename_regex:
            try:
                return self.filename_regex[index-1].findall(value)[0]
            except IndexError:
                pass
        # fallback method in the event no regex is provided by the spider
        link_path = os.path.splitext(urlparse(value).path)[0]
        # preserve the last portion separated by forward-slash
        try:
            return link_path.rsplit('/', 1)[1]
        except IndexError:
            return link_path

    def get_media_requests(self, item, info):
        file_urls = item.get(self.FILES_URLS_FIELD)
        requests = []
        if file_urls:
            total_urls = len(file_urls)
            for i, file_url in enumerate(file_urls, 1):
                filename_url = file_url if not self.use_response_url else item.get('url', '')
                filename = self.parse_path(filename_url, i)
                if self.filename_suffixes:
                    current_suffix = self.filename_suffixes[i-1]
                    if current_suffix.startswith('/'):
                        # this will end up creating a separate folder for the different types of files
                        filename += current_suffix
                    else:
                        # this will keep all files in a single folder while still making it easy to differentiate each
                        # type of file. this comes in handy when searching for a file by the base name.
                        filename += f'_{current_suffix}'
                elif total_urls > 1:
                    # default to numbering files sequentially in the order they were added to the item
                    filename += f'_file{i}'
                requests.append(Request(file_url, meta={'spider': info.spider.name, 'filename': filename}))
        return requests

    def file_path(self, request, response=None, info=None):
        return f'{request.meta["spider"]}/{request.meta["filename"]}.{self.filename_extension}'
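The key change from my first attempt is the construction hook: as far as I can tell, in the Scrapy version from the log (2.2.1) FilesPipeline's default from_crawler delegates to from_settings, so that is the classmethod to override when you want to hand settings-derived arguments to your subclass. A minimal sketch of just that part, with everything else stripped out (FILENAME_EXTENSION is the custom setting used above):

from scrapy.pipelines.files import FilesPipeline


class MinimalLocalFilesPipeline(FilesPipeline):
    FILES_STORE = 'data_src'

    def __init__(self, settings=None):
        # pull whatever custom settings you need before handing off to FilesPipeline
        self.filename_extension = settings.get('FILENAME_EXTENSION', 'json')
        super().__init__(self.FILES_STORE, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        # overriding from_crawler instead of this hook is what broke my first version
        return cls(settings=settings)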
Then, to make use of the pipeline, you set the applicable values in your spider's custom_settings attribute:
custom_settings = {
    'ITEM_PIPELINES': {
        'spins.pipelines.LocalFilesPipeline': 200
    },
    'FILENAME_REGEX': [r'products\/(.+?)\/ProductInfo\+ProductDetails']
}
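For completeness, here's a hypothetical spider showing how the pieces fit together (the 'spins.pipelines' path comes from the settings above; the spider name, URLs, and item values are made up purely for illustration):

import scrapy


class ExampleSpider(scrapy.Spider):
    # hypothetical spider, only to show how the item fields and settings fit together
    name = 'example'
    start_urls = ['https://example.com/store/some-product']

    custom_settings = {
        'ITEM_PIPELINES': {
            'spins.pipelines.LocalFilesPipeline': 200
        },
        # one regex and one suffix per URL in data_url (the pipeline enforces equal lengths)
        'FILENAME_REGEX': [
            r'products\/(.+?)\/ProductInfo',
            r'products\/(.+?)\/ProductDetails',
        ],
        'FILENAME_SUFFIXES': ['info', 'details'],
    }

    def parse(self, response):
        yield {
            'url': response.url,
            # FILES_URLS_FIELD: the API endpoints to download
            'data_url': [
                'https://example.com/vpd/v1/products/prod123/ProductInfo',
                'https://example.com/vpd/v1/products/prod123/ProductDetails',
            ],
            # FILES_RESULT_FIELD 'local_file' gets filled in by the pipeline; with these
            # settings the files land at data_src/example/prod123_info.json and
            # data_src/example/prod123_details.json
        }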