Write/save DataFrame to Azure file share from Azure Databricks
How do I write to an Azure file share from an Azure Databricks Spark job?
I have configured the Hadoop storage key and value:
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.STORAGEKEY.file.core.windows.net",
  "SECRETVALUE"
)

val wasbFileShare =
  s"wasbs://testfileshare@STORAGEKEY.file.core.windows.net/testPath"

df.coalesce(1).write.mode("overwrite").csv(wasbFileShare)
When I try to save the DataFrame to the Azure file share, I get the following "resource not found" error even though the URI exists.
Exception in thread "main" org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: The requested URI does not represent any resource on the server.
Unfortunately, Azure Databricks does not support reading from or writing to Azure file shares.
Data sources supported by Azure Databricks: https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/
I would suggest you provide feedback on this:
https://feedback.azure.com/forums/909463-azure-databricks
All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.
You can also check SO threads that address similar issues.
Below is a snippet for writing CSV data directly to an Azure Blob storage container from an Azure Databricks notebook.
# Configure blob storage account access key globally
spark.conf.set("fs.azure.account.key.chepra.blob.core.windows.net", "gv7nVIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXdlOiA==")
output_container_path = "wasbs://sampledata@chepra.blob.core.windows.net"
output_blob_folder = "%s/wrangled_data_folder" % output_container_path
# write the dataframe as a single file to blob storage
(dataframe
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .format("com.databricks.spark.csv")
    .save(output_blob_folder))
# Get the name of the wrangled-data CSV file that was just saved to Azure blob storage (it starts with 'part-')
files = dbutils.fs.ls(output_blob_folder)
output_file = [x for x in files if x.name.startswith("part-")]
# Move the wrangled-data CSV file from a sub-folder (wrangled_data_folder) to the root of the blob container
# While simultaneously changing the file name
dbutils.fs.mv(output_file[0].path, "%s/predict-transform-output.csv" % output_container_path)
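Optionally, you can remove the staging folder afterwards, since the part file has been moved out and only Spark's marker files remain. A minimal cleanup sketch (the folder name reuses the variable from the snippet above):
# Remove the now-redundant staging folder and Spark's _SUCCESS/_committed marker files
dbutils.fs.rm(output_blob_folder, True)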
Steps to connect to an Azure file share from Databricks
First, install the Microsoft Azure Storage File Share client library for Python in Databricks using pip install: https://pypi.org/project/azure-storage-file-share/
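For example, in a notebook cell (assuming a Databricks Runtime that supports the %pip magic; alternatively, attach the package to the cluster as a library):
%pip install azure-storage-file-share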
Once it is installed, create a storage account. You can then create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
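If the share may already exist, create_share() raises a ResourceExistsError. A minimal sketch of handling that case (azure-core is installed as a dependency of the client library):
from azure.core.exceptions import ResourceExistsError

try:
    share.create_share()
except ResourceExistsError:
    # The share already exists; reuse it
    pass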
This code uploads a file to the file share from Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
file_client.upload_file(source_file)
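Putting the two pieces together, a sketch of saving a DataFrame to the file share: write it as a single CSV to a staging folder on DBFS, then upload the part file with ShareFileClient. This assumes df is a PySpark DataFrame, the /dbfs FUSE mount is available on the cluster, and the staging path, file name, and connection string are placeholders:
from azure.storage.fileshare import ShareFileClient

# 1. Write the DataFrame as a single CSV file to a staging folder on DBFS
staging_folder = "dbfs:/tmp/fileshare_export"
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(staging_folder)

# 2. Locate the part file Spark produced
part_file = [f for f in dbutils.fs.ls(staging_folder) if f.name.startswith("part-")][0]

# 3. Upload it to the file share via the local /dbfs mount
file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string>",
    share_name="<your_fileshare_name>",
    file_path="output.csv")
with open(part_file.path.replace("dbfs:", "/dbfs"), "rb") as source_file:
    file_client.upload_file(source_file)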
Refer to this link for more information: https://pypi.org/project/azure-storage-file-share/