Load file from Azure Files to Azure Databricks

Question

I am looking for a way to use the Azure Files SDK to upload files to my Azure Databricks blob storage.

I tried many things using the functions from this page, but nothing worked, and I don't understand why.

Example:

from azure.storage.file import FileService

file_service = FileService(account_name='MYSECRETNAME', account_key='mySECRETkey')

generator = file_service.list_directories_and_files('MYSECRETNAME/test') #listing file in folder /test, working well
for file_or_dir in generator:
    print(file_or_dir.name)

file_service.get_file_to_path('MYSECRETNAME','test/tables/input/referentials/','test.xlsx','/dbfs/FileStore/test6.xlsx')

where test.xlsx is the name of the file in my Azure Files share, and

/dbfs/FileStore/test6.xlsx is the path in my DBFS where the file should be written.

I get this error message:

Exception=The specified resource name contains invalid characters

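I tried changing the name, but that does not seem to help.

Edit: I'm not even sure this function does what I want. What is the best way to load a file from Azure Files?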

Answer 1

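In my experience, the best way to load a file from Azure Files is to read it directly through its URL with a SAS token.

For example, as the figures below show, I have a file named test.xlsx in my test file share. I view it with Azure Storage Explorer and then generate its URL with a SAS token.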

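Figure 1. Right-click the file, then click Get Shared Access Signature...

Figure 2. To read the file content directly, the Read permission must be selected

Figure 3. Copy the URL with the SAS token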

Here is my sample code. You can run it in Azure Databricks with the SAS token URL of your file.

import pandas as pd

url_sas_token = 'https://<my account name>.file.core.windows.net/test/test.xlsx?st=2020-01-27T10%3A16%3A12Z&se=2020-01-28T10%3A16%3A12Z&sp=rl&sv=2018-03-28&sr=f&sig=XXXXXXXXXXXXXXXXX'
# Directly read the file content from its url with sas token to get a pandas dataframe
pdf = pd.read_excel(url_sas_token)
# Then, to convert the pandas dataframe to a PySpark dataframe in Azure Databricks
df = spark.createDataFrame(pdf)
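
Since the read only works while the SAS token is valid and carries the Read permission, it can help to inspect the URL before passing it to pandas. The following is a minimal sketch, not part of any Azure SDK: `describe_sas_url` and `is_usable` are hypothetical helper names, and the code assumes the expiry (`se`) uses the second-precision UTC format seen in the example URL above. Other expiry formats would raise a `ValueError` here.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs


def describe_sas_url(url):
    """Return the read permission flag and expiry found in a SAS URL's query string."""
    query = parse_qs(urlparse(url).query)  # parse_qs also decodes %3A back to ':'
    permissions = query.get("sp", [""])[0]
    expiry_text = query.get("se", [""])[0]
    expiry = None
    if expiry_text:
        # Assumed format: 2020-01-28T10:16:12Z (UTC, second precision)
        expiry = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ")
        expiry = expiry.replace(tzinfo=timezone.utc)
    return {"can_read": "r" in permissions, "expiry": expiry}


def is_usable(url, now=None):
    """True if the URL grants read access and has not expired yet."""
    info = describe_sas_url(url)
    now = now or datetime.now(timezone.utc)
    return info["can_read"] and info["expiry"] is not None and now < info["expiry"]
```

Calling `is_usable(url_sas_token)` before `pd.read_excel` gives a clearer hint than a failed HTTP request when the token has simply expired.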

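Alternatively, to use the Azure File Storage SDK to generate a URL with a SAS token for your file, or to get the file's bytes for reading, refer to the official document Develop for Azure Files with Python and my sample code below.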

# Create an Azure File Service client, the same way as in the question
from azure.storage.file import FileService

account_name = '<your account name>'
account_key = '<your account key>'
share_name = 'test'
directory_name = None
file_name = 'test.xlsx'

file_service = FileService(account_name=account_name, account_key=account_key)
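
The SDK calls take the share name, the directory name, and the file name as three separate arguments. A likely but unconfirmed cause of the "invalid characters" error in the question is that the share argument was not the file share itself, or that the directory was passed with a trailing slash. The sketch below uses a hypothetical helper, `split_share_path`, which is not part of the SDK, to derive the three values from one path string.

```python
def split_share_path(path):
    """Split 'share/dir/sub/file.ext' into (share_name, directory_name, file_name).

    directory_name is None when the file sits at the root of the share,
    and it never carries a leading or trailing slash.
    """
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) < 2:
        raise ValueError("expected at least '<share>/<file>'")
    share_name, file_name = parts[0], parts[-1]
    directory_name = "/".join(parts[1:-1]) or None
    return share_name, directory_name, file_name


# Hypothetical path, modeled on the folder layout mentioned in the question
share_name, directory_name, file_name = split_share_path(
    "test/tables/input/referentials/test.xlsx"
)
```

With these values, the share is `test`, the directory is `tables/input/referentials`, and the file is `test.xlsx`.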

Generate the URL of the file with a SAS token:

from azure.storage.file import FilePermissions
from datetime import datetime, timedelta
sas_token = file_service.generate_file_shared_access_signature(share_name, directory_name, file_name, permission=FilePermissions.READ, expiry=datetime.utcnow() + timedelta(hours=1))

url_sas_token = f"https://{account_name}.file.core.windows.net/{share_name}/{file_name}?{sas_token}"
import pandas as pd
pdf = pd.read_excel(url_sas_token)
df = spark.createDataFrame(pdf)
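
The f-string above covers a file at the root of the share. For a file inside a directory, or a name containing spaces, the URL path needs the directory segments and URL encoding. This is a minimal sketch under those assumptions; `build_file_url` is a hypothetical helper, not an SDK function.

```python
from urllib.parse import quote


def build_file_url(account_name, share_name, directory_name, file_name, sas_token):
    """Build an Azure Files URL with a SAS token, encoding each path segment."""
    segments = [share_name]
    if directory_name:
        segments.extend(directory_name.strip("/").split("/"))
    segments.append(file_name)
    # Encode every segment separately so '/' stays a path separator
    path = "/".join(quote(segment, safe="") for segment in segments)
    return (
        f"https://{account_name}.file.core.windows.net/"
        f"{path}?{sas_token.lstrip('?')}"
    )
```

The result can be passed to `pd.read_excel` in the same way as `url_sas_token`.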

Or use the get_file_to_stream function to read the file content:

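from io import BytesIO
import pandas as pd

# Download the file content into an in-memory buffer
stream = BytesIO()
file_service.get_file_to_stream(share_name, directory_name, file_name, stream)
# Rewind the buffer so the reader starts at the first byte
stream.seek(0)
pdf = pd.read_excel(stream)
df = spark.createDataFrame(pdf)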
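
One detail worth knowing about the stream approach: any call that writes into a `BytesIO` leaves the position at the end of the buffer. Some readers rewind on their own, but a plain `read()` returns nothing until the stream is rewound. The sketch below shows this with a hypothetical stand-in, `fill_stream`, in place of the real download call, so it runs without a storage account.

```python
from io import BytesIO


def fill_stream(stream, payload):
    """Stand-in for a download call that writes bytes into an open stream."""
    stream.write(payload)


stream = BytesIO()
fill_stream(stream, b"example bytes")

position_after_write = stream.tell()  # position sits at the end of the data
read_without_rewind = stream.read()   # nothing left to read from here

stream.seek(0)                        # rewind to the first byte
read_after_rewind = stream.read()     # now the full payload comes back
```

Rewinding with `stream.seek(0)` before handing the buffer to a parser is a cheap safeguard.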

