How to download scrapy images into a dynamic folder
I'm trying to override the default path full/hash.jpg to <dynamic>/hash.jpg. Following How to download scrapy images in a dynamic folder, I tried the following code:
def item_completed(self, results, item, info):
    for result in [x for ok, x in results if ok]:
        path = result['path']
        # here we create the session-path where the files should be in the end
        # you'll have to change this path creation depending on your needs
        slug = slugify(item['category'])
        target_path = os.path.join(slug, os.path.basename(path))
        # try to move the file and raise exception if not possible
        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")
    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
But I get:
Traceback (most recent call last):
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 839, in _cbDeferred
    self.callback(self.resultList)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 382, in callback
    self._startRunCallbacks(result)
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/home/user/.venv/sepid/lib/python2.7/site-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/user/Projects/sepid/scraper/scraper/pipelines.py", line 44, in item_completed
    if not os.rename(path, target_path):
exceptions.OSError: [Errno 2] No such file or directory
I don't know what's going wrong. Is there another way to change the path? Thanks.
The problem arises because the destination folder doesn't exist. A quick solution is:
def item_completed(self, results, item, info):
    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])
        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')
        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)
        # If the target directory doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))
        if not os.rename(path, target_path):
            raise DropItem("Could not move image to target folder")
    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
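As a side note: on Python 3, the existence check and os.makedirs pair above can be collapsed into a single call with exist_ok=True, which also avoids the race between checking and creating. A minimal sketch, with a throwaway temp directory and slug standing in for IMAGES_STORE and the item field:

```python
import os
import tempfile

# hypothetical values standing in for IMAGES_STORE and slugify(item['designer'])
storage = tempfile.mkdtemp()
slug = 'some-designer'

# create the target directory in one call; no error if it already exists
os.makedirs(os.path.join(storage, slug), exist_ok=True)
os.makedirs(os.path.join(storage, slug), exist_ok=True)  # second call is a no-op

print(os.path.isdir(os.path.join(storage, slug)))
```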
The solution provided by @neelix is the best one, but when I tried to use it I got some strange results: some documents were moved, but not all of them. So I replaced:

    if not os.rename(path, target_path):
        raise DropItem("Could not move image to target folder")

with shutil.move, after importing the shutil library. My code is then:
import os
import shutil

from scrapy.utils.project import get_project_settings


def item_completed(self, results, item, info):
    for result in [x for ok, x in results if ok]:
        path = result['path']
        slug = slugify(item['designer'])
        settings = get_project_settings()
        storage = settings.get('IMAGES_STORE')
        target_path = os.path.join(storage, slug, os.path.basename(path))
        path = os.path.join(storage, path)
        # If the target directory doesn't exist, it will be created
        if not os.path.exists(os.path.join(storage, slug)):
            os.makedirs(os.path.join(storage, slug))
        shutil.move(path, target_path)
    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item
I hope it works for you too :)
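A likely explanation for the partial moves: os.rename returns None on success, so `if not os.rename(...)` is always true and the earlier code raised DropItem even after a successful move. shutil.move also handles moves across filesystems, where a bare os.rename fails. A small self-contained demonstration with temp files:

```python
import os
import shutil
import tempfile

# set up a fake downloaded image in one directory
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, 'hash.jpg')
with open(src, 'wb') as f:
    f.write(b'fake image bytes')

# os.rename returns None on success...
result = os.rename(src, os.path.join(dst_dir, 'hash.jpg'))
print(result)      # None, even though the file was moved
print(not result)  # True -> `if not os.rename(...)` would raise DropItem here

# shutil.move returns the destination path instead
moved = shutil.move(os.path.join(dst_dir, 'hash.jpg'), os.path.join(src_dir, 'hash.jpg'))
print(moved == os.path.join(src_dir, 'hash.jpg'))  # True
```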
I created a pipeline that inherits from ImagesPipeline and overrides the file_path method, and used it instead of the standard ImagesPipeline:
import hashlib

from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


class StoreImgPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        # YEAR is defined elsewhere in the project
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)
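To actually take effect, the custom pipeline must be enabled in settings.py in place of the standard one. A sketch, where the module path 'myproject.pipelines' and the storage path are placeholders for your own project:

```python
# settings.py -- 'myproject.pipelines' is a placeholder for your project's module path
ITEM_PIPELINES = {
    'myproject.pipelines.StoreImgPipeline': 300,
}
# paths returned by file_path are joined under this root
IMAGES_STORE = '/path/to/images'
```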
To dynamically set the path of images downloaded by a scrapy spider before they are downloaded, rather than moving them afterwards, I created a custom pipeline that overrides the get_media_requests and file_path methods:
import hashlib

from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [Request(url, meta={'f1': item.get('field1'), 'f2': item.get('field2'),
                                   'f3': item.get('field3'), 'f4': item.get('field4')})
                for url in item.get(self.images_urls_field, [])]

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                          'please use file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from image_key or file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() or image_key() methods have been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        elif not hasattr(self.image_key, '_base'):
            _warn()
            return self.image_key(url)
        ## end of deprecation warning block

        image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
        return '%s/%s/%s/%s/%s.jpg' % (request.meta['f1'], request.meta['f2'],
                                       request.meta['f3'], request.meta['f4'], image_guid)
This method assumes you have a scrapy.Item defined in your spider; substitute your specific field names for e.g. "field1". Setting Request.meta in get_media_requests allows item field values to be used to set the download directory for each item, as shown in the return statement of file_path. Scrapy creates the directories automatically if they don't exist.
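The path-building logic at the end of file_path can be checked outside scrapy. Here is a standalone version of the hashing and formatting, with hypothetical field values in place of the Request.meta entries:

```python
import hashlib

def build_image_path(url, f1, f2, f3, f4):
    # sha1 of the URL gives a stable, unique file name, as in ImagesPipeline
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return '%s/%s/%s/%s/%s.jpg' % (f1, f2, f3, f4, image_guid)

# hypothetical item field values
path = build_image_path('http://example.com/img/1.jpg', 'houses', '2018', 'NY', 'listing42')
print(path)  # houses/2018/NY/listing42/<40-char sha1>.jpg
```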
The custom pipeline class definition lives in my project's pipelines.py. The methods here were adapted directly from the default scrapy pipeline images.py, which on my Mac is stored under ~/anaconda3/pkgs/scrapy-1.5.0-py36_0/lib/python3.6/site-packages/scrapy/pipelines/. Imports and other methods can be copied from that file as needed.