
How to zip files (on Azure Blob Storage) with shutil in Databricks

My trained deep learning model is stored as several files in one folder, so this is not about zipping a dataframe.

I want to zip this folder (which lives on Azure Blob Storage). But when I use shutil, it doesn't seem to work:

import shutil
modelPath = "/dbfs/mnt/databricks/Models/predictBaseTerm/noNormalizationCode/2020-01-10-13-43/9_0.8147903598547376"
zipPath = "/mnt/databricks/Deploy/"  # no /dbfs here or it will error
shutil.make_archive(base_dir=modelPath, format='zip', base_name=zipPath)

Does anyone know how to do this and get the zip file onto Azure Blob Storage (where I read the model from)?

In the end I figured it out myself.

You cannot write directly to dbfs (Azure Blob Storage) with shutil.

You first need to put the file on the local driver node of Databricks, like this (I read somewhere in the docs that you cannot write directly to Blob Storage):

import shutil
modelPath = "/dbfs/mnt/databricks/Models/predictBaseTerm/noNormalizationCode/2020-01-10-13-43/9_0.8147903598547376"
zipPath = "/tmp/model"  # write to local storage on the driver node first
shutil.make_archive(base_dir=modelPath, format='zip', base_name=zipPath)

Then you can copy the file from the local driver node to blob storage. Note the "file:" prefix, which makes dbutils pick up the file from local storage!

blobStoragePath = "dbfs:/mnt/databricks/Models"
dbutils.fs.cp("file:" + zipPath + ".zip", blobStoragePath)
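
To confirm the copy worked, you can list the target directory on the mounted storage (a quick check I'm adding here; it assumes the same blobStoragePath as above):

# List the mounted blob storage path to confirm the zip arrived
display(dbutils.fs.ls(blobStoragePath))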

I wasted a couple of hours on this, so please upvote this answer if it helped you!

Actually, without using shutil, I can compress files in Databricks dbfs to a zip file as a blob of the Azure Blob Storage container which had been mounted to dbfs.

Here is my sample code using the Python standard libraries os and zipfile.

# Mount a container of Azure Blob Storage to dbfs
storage_account_name='<your storage account name>'
storage_account_access_key='<your storage account key>'
container_name = '<your container name>'

dbutils.fs.mount(
  source = "wasbs://"+container_name+"@"+storage_account_name+".blob.core.windows.net",
  mount_point = "/mnt/<a mount directory name under /mnt, such as `test`>",
  extra_configs = {"fs.azure.account.key."+storage_account_name+".blob.core.windows.net":storage_account_access_key})
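
As a side note, dbutils.fs.mount raises an error if the mount point is already in use. A small guard like the following sketch can avoid that (the /mnt/test mount point is just the example placeholder from above):

# Only mount if this mount point is not already in use (sketch; adjust the mount point name)
mount_point = "/mnt/test"
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = "wasbs://"+container_name+"@"+storage_account_name+".blob.core.windows.net",
    mount_point = mount_point,
    extra_configs = {"fs.azure.account.key."+storage_account_name+".blob.core.windows.net":storage_account_access_key})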

# List all files which need to be compressed
import os
modelPath  = '/dbfs/mnt/databricks/Models/predictBaseTerm/noNormalizationCode/2020-01-10-13-43/9_0.8147903598547376'
filenames = [os.path.join(root, name) for root, dirs, files in os.walk(top=modelPath , topdown=False) for name in files]
# print(filenames)

# Directly zip files to Azure Blob Storage as a blob
# zipPath is the absolute path of the compressed file on the mount point, such as `/dbfs/mnt/test/demo.zip`
zipPath = '/dbfs/mnt/<a mount directory name under /mnt, such as `test`>/demo.zip'
import zipfile
with zipfile.ZipFile(zipPath, 'w') as myzip:
  for filename in filenames:
#    print(filename)
    myzip.write(filename)
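
Note that myzip.write(filename) stores each entry under its full /dbfs/... path. If you prefer the archive to contain paths relative to the model folder, a small variation (not part of the original answer) is to pass arcname:

# Variation: store entries relative to modelPath instead of the full /dbfs/... path
with zipfile.ZipFile(zipPath, 'w') as myzip:
  for filename in filenames:
    myzip.write(filename, arcname=os.path.relpath(filename, modelPath))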

I tried mounting my test container to dbfs and running my sample code above, and I got the demo.zip file in my test container, as shown in the figure below.
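
Since the mount point behaves like a regular path under /dbfs, one way to double-check the result (a sketch, reusing the same zipPath) is to list the archive's entries:

# Sanity check: list the entries of the zip that was just written to the mounted storage
import zipfile
with zipfile.ZipFile(zipPath, 'r') as myzip:
  print(myzip.namelist())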