Custom Files Pipeline in Scrapy never downloads files even though logs show all functions being accessed
I have the following custom pipeline for downloading JSON files. It worked fine until I needed to add an __init__ function, in which I add a few new attributes to my subclass of the FilesPipeline class. The pipeline takes URLs that point to API endpoints and downloads their responses. When running the spider via scrapy crawl myspider, the folders get created correctly and the two print statements in the file_path function show the correct values (the filename and the file path). However, the files are never actually downloaded.
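For reference, the item my spider yields looks roughly like this (the field names match the pipeline's FILES_URLS_FIELD / FILES_RESULT_FIELD attributes, and [testsite] is the same placeholder used in the log further down):

# roughly the item the spider yields; [testsite] stands in for the real site
item = {
    'url': 'https://[testsite].com/store/c/nature-made-super-b-complex,-tablets/ID=prod6149174-product',
    'json_url': 'https://[testsite].com/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails',
    # 'local_json' is expected to be populated by the pipeline after the download
}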
I did find some similar questions about custom file pipelines and files not downloading, but those solutions didn't work for me.

What am I doing wrong (or forgetting to do) when subclassing FilesPipeline? I've been banging my head against this for 3 hours and my google-fu hasn't turned up a solution for my case.
class LocalJsonFilesPipeline(FilesPipeline):

    FILES_STORE = "json_src"
    FILES_URLS_FIELD = "json_url"
    FILES_RESULT_FIELD = "local_json"

    def __init__(self, store_uri, use_response_url=False, filename_regex=None, settings=None):
        # super(LocalJsonFilesPipeline, self).__init__(store_uri)
        self.store_uri = store_uri
        self.use_response_url = use_response_url
        if filename_regex:
            self.filename_regex = re.compile(filename_regex)
        else:
            self.filename_regex = filename_regex
        super(LocalJsonFilesPipeline, self).__init__(store_uri, settings=settings)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.spider:
            return BasePipeline()
        store_uri = f'{cls.FILES_STORE}/{crawler.spider.name}'
        settings = crawler.spider.settings
        use_response_url = settings.get('JSON_FILENAME_USE_RESPONSE_URL', False)
        filename_regex = settings.get('JSON_FILENAME_REGEX')
        return cls(store_uri, use_response_url, filename_regex, settings)

    def parse_path(self, value):
        if self.filename_regex:
            try:
                return self.filename_regex.findall(value)[0]
            except IndexError:
                pass
        # fallback method in the event no regex is provided by the spider
        # example: /p/russet-potatoes-5lb-bag-good-38-gather-8482/-/A-77775602
        link_path = os.path.splitext(urlparse(value).path)[0]  # omit extension if there is one
        link_params = link_path.rsplit('/', 1)[1]  # preserve the last portion separated by forward-slash (A-77775602)
        return link_params if '=' not in link_params else link_params.split('=', 1)[1]

    def get_media_requests(self, item, info):
        json_url = item.get(self.FILES_URLS_FIELD)
        if json_url:
            filename_url = json_url if not self.use_response_url else item.get('url', '')
            return [Request(json_url, meta={'filename': self.parse_path(filename_url), 'spider': info.spider.name})]

    def file_path(self, request, response=None, info=None):
        final_path = f'{self.FILES_STORE}/{request.meta["spider"]}/{request.meta["filename"]}.json'
        print('url', request.url)
        print('downloading to', final_path)
        return final_path
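To show that the filename logic itself behaves as intended, here's the fallback branch of parse_path pulled out as a quick standalone check (example.com stands in for a real host; the path comes from the comment in the code above):

from urllib.parse import urlparse
import os

# same fallback logic as parse_path, lifted out for a quick sanity check
url = 'https://example.com/p/russet-potatoes-5lb-bag-good-38-gather-8482/-/A-77775602'
link_path = os.path.splitext(urlparse(url).path)[0]
link_params = link_path.rsplit('/', 1)[1]
print(link_params if '=' not in link_params else link_params.split('=', 1)[1])
# prints: A-77775602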
And the custom settings of my spider:
class MockSpider(scrapy.Spider):

    name = 'mock'

    custom_settings = {
        'ITEM_PIPELINES': {
            'mock.pipelines.LocalJsonFilesPipeline': 200
        },
        'JSON_FILENAME_REGEX': r'products\/(.+?)\/ProductInfo\+ProductDetails'
    }
The log, with the level set to DEBUG:
C:\Users\Mike\Desktop\scrapy_test\pipeline_test>scrapy crawl testsite
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Scrapy 2.2.1 started (bot: pipeline_test)
2020-07-19 11:23:08 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Windows-7-6.1.7601-SP1
2020-07-19 11:23:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-19 11:23:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pipeline_test',
'LOG_STDOUT': True,
'NEWSPIDER_MODULE': 'pipeline_test.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['pipeline_test.spiders']}
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet Password: 0454b083dfd2028a
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-19 11:23:08 [scrapy.middleware] INFO: Enabled item pipelines:
['pipeline_test.pipelines.LocalJsonFilesPipeline']
2020-07-19 11:23:08 [scrapy.core.engine] INFO: Spider opened
2020-07-19 11:23:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-19 11:23:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.[testsite].com/robots.txt> (referer: None)
2020-07-19 11:23:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails> (referer: None)
2020-07-19 11:23:08 [stdout] INFO: url
2020-07-19 11:23:08 [stdout] INFO: https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails
2020-07-19 11:23:08 [stdout] INFO: downloading to
2020-07-19 11:23:08 [stdout] INFO: json_src/[testsite]/prod6149174-product.json
2020-07-19 11:23:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails>
{'json_url': 'https://[testsite].com/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails',
'local_json': [],
 'url': 'https://[testsite].com/store/c/nature-made-super-b-complex,-tablets/ID=prod6149174-product'}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-19 11:23:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 506,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5515,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.468001,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 19, 15, 23, 9, 96399),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 14,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 7, 19, 15, 23, 8, 628398)}
2020-07-19 11:23:09 [scrapy.core.engine] INFO: Spider closed (finished)
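For what it's worth, the filename extraction is not the problem; the regex and URL from the log reproduce the printed filename on their own:

import re

# JSON_FILENAME_REGEX from the spider settings, URL from the crawl log above
regex = re.compile(r'products\/(.+?)\/ProductInfo\+ProductDetails')
url = 'https://[testsite]/vpd/v1/products/prod6149174-product/ProductInfo+ProductDetails'
print(regex.findall(url)[0])  # prints: prod6149174-product

So file_path receives the right values; the download step just never happens.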
I finally figured it out. It turns out the FilesPipeline class doesn't have a from_crawler method of its own; instead it expects a from_settings method when you want to pass extra arguments into a subclassed/custom FilesPipeline. Below is my working version of the custom FilesPipeline.
from scrapy import Request
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
import os
import re


class LocalFilesPipeline(FilesPipeline):

    FILES_STORE = "data_src"
    FILES_URLS_FIELD = "data_url"
    FILES_RESULT_FIELD = "local_file"

    def __init__(self, settings=None):
        """
        Attributes:
            use_response_url    indicates we want to grab the filename from the response url instead of json_url
            filename_regex      regexes to use for grabbing filenames out of urls
            filename_suffixes   suffixes to append to filenames when there are multiple files to download per item
            filename_extension  the file extension to append to each filename in the file_path function
        """
        self.use_response_url = settings.get('FILENAME_USE_RESPONSE_URL', False)
        self.filename_regex = settings.get('FILENAME_REGEX', [])
        self.filename_suffixes = settings.get('FILENAME_SUFFIXES', [])
        self.filename_extension = settings.get('FILENAME_EXTENSION', 'json')

        if isinstance(self.filename_regex, str):
            self.filename_regex = [self.filename_regex]
        if isinstance(self.filename_suffixes, str):
            self.filename_suffixes = [self.filename_suffixes]
        if self.filename_regex and self.filename_suffixes and len(self.filename_regex) != len(self.filename_suffixes):
            raise ValueError('FILENAME_REGEX and FILENAME_SUFFIXES settings must contain the same number of elements')
        if self.filename_regex:
            for i, f_regex in enumerate(self.filename_regex):
                self.filename_regex[i] = re.compile(f_regex)

        super(LocalFilesPipeline, self).__init__(self.FILES_STORE, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        return cls(settings=settings)

    def parse_path(self, value, index):
        if self.filename_regex:
            try:
                return self.filename_regex[index-1].findall(value)[0]
            except IndexError:
                pass
        # fallback method in the event no regex is provided by the spider
        link_path = os.path.splitext(urlparse(value).path)[0]
        # preserve the last portion separated by forward-slash
        try:
            return link_path.rsplit('/', 1)[1]
        except IndexError:
            return link_path

    def get_media_requests(self, item, info):
        file_urls = item.get(self.FILES_URLS_FIELD)
        requests = []
        if file_urls:
            total_urls = len(file_urls)
            for i, file_url in enumerate(file_urls, 1):
                filename_url = file_url if not self.use_response_url else item.get('url', '')
                filename = self.parse_path(filename_url, i)
                if self.filename_suffixes:
                    current_suffix = self.filename_suffixes[i-1]
                    if current_suffix.startswith('/'):
                        # this will end up creating a separate folder for the different types of files
                        filename += current_suffix
                    else:
                        # this will keep all files in a single folder while still making it easy to differentiate each
                        # type of file. this comes in handy when searching for a file by the base name.
                        filename += f'_{current_suffix}'
                elif total_urls > 1:
                    # default to numbering files sequentially in the order they were added to the item
                    filename += f'_file{i}'
                requests.append(Request(file_url, meta={'spider': info.spider.name, 'filename': filename}))
        return requests

    def file_path(self, request, response=None, info=None):
        return f'{request.meta["spider"]}/{request.meta["filename"]}.{self.filename_extension}'
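The key change from my first attempt is the construction hook: as far as I can tell, in the Scrapy version from the log (2.2.1) FilesPipeline's default from_crawler delegates to from_settings, so that is the classmethod to override when you want to hand settings-derived arguments to your subclass. A minimal sketch of just that part, with everything else stripped out (FILENAME_EXTENSION is the custom setting used above):

from scrapy.pipelines.files import FilesPipeline


class MinimalLocalFilesPipeline(FilesPipeline):
    FILES_STORE = 'data_src'

    def __init__(self, settings=None):
        # pull whatever custom settings you need before handing off to FilesPipeline
        self.filename_extension = settings.get('FILENAME_EXTENSION', 'json')
        super().__init__(self.FILES_STORE, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        # overriding from_crawler instead of this hook is what broke my first version
        return cls(settings=settings)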
Then, to make use of the pipeline, you set the applicable values in your spider's custom_settings attribute:
custom_settings = {
    'ITEM_PIPELINES': {
        'spins.pipelines.LocalFilesPipeline': 200
    },
    'FILENAME_REGEX': [r'products\/(.+?)\/ProductInfo\+ProductDetails']
}
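For completeness, here's a hypothetical spider showing how the pieces fit together (the 'spins.pipelines' path comes from the settings above; the spider name, URLs, and item values are made up purely for illustration):

import scrapy


class ExampleSpider(scrapy.Spider):
    # hypothetical spider, only to show how the item fields and settings fit together
    name = 'example'
    start_urls = ['https://example.com/store/some-product']

    custom_settings = {
        'ITEM_PIPELINES': {
            'spins.pipelines.LocalFilesPipeline': 200
        },
        # one regex and one suffix per URL in data_url (the pipeline enforces equal lengths)
        'FILENAME_REGEX': [
            r'products\/(.+?)\/ProductInfo',
            r'products\/(.+?)\/ProductDetails',
        ],
        'FILENAME_SUFFIXES': ['info', 'details'],
    }

    def parse(self, response):
        yield {
            'url': response.url,
            # FILES_URLS_FIELD: the API endpoints to download
            'data_url': [
                'https://example.com/vpd/v1/products/prod123/ProductInfo',
                'https://example.com/vpd/v1/products/prod123/ProductDetails',
            ],
            # FILES_RESULT_FIELD 'local_file' gets filled in by the pipeline; with these
            # settings the files land at data_src/example/prod123_info.json and
            # data_src/example/prod123_details.json
        }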