Databricks dbutils not displaying folder list under specific folder
I have three folders under one container.
Folder structure:
folder1
  |_ file1.json
  |_ file2.json
folder2
  |_ sub-folder1
      |_ file1.json
  |_ sub_folder2
      |_ sub-folder01
          |_ file2.json
folder3
  |_ sub-folder1
      |_ file1.json
Note: folder2 contains only folders (which may in turn contain files), and I am trying to iterate through them and find a specific file name in Python code.
from pyspark.sql.functions import col, lit
from datetime import datetime

app_storage_acct_name = 'mystorageaccnt1'
app_storage_acct_scope = "{}-scope".format(app_storage_acct_name)
app_storage_acct_key = '<secret-key-name>'  # name of the key stored in the secret scope (placeholder)
config_secret_set_url = "fs.azure.account.key.{}.blob.core.windows.net".format(app_storage_acct_name)

# Fetch the storage account key from the secret scope and mount the container.
secret = dbutils.secrets.get(scope=app_storage_acct_scope, key=app_storage_acct_key)

dbutils.fs.mount(
    source="wasbs://mycontainer1@{}.blob.core.windows.net".format(app_storage_acct_name),
    mount_point="/mnt/my-data-src",
    extra_configs={config_secret_set_url: secret})

dbutils.fs.ls('/mnt/my-data-src/')
The code above prints the three folders, and I can see the same folders in the blob Storage Explorer.
Out[29]: [FileInfo(path='dbfs:/mnt/my-data-src/folder1/', name='folder1/', size=0),
FileInfo(path='dbfs:/mnt/my-data-src/folder2/', name='folder2/', size=0),
FileInfo(path='dbfs:/mnt/my-data-src/folder3/', name='folder3/', size=0)]
When I use the following, the files are listed:
dbutils.fs.ls('/mnt/my-data-src/folder1/')
Output:
Out[30]: [FileInfo(path='dbfs:/mnt/my-data-src/folder1/file1.json', name='file1.json', size=1011),
FileInfo(path='dbfs:/mnt/my-data-src....,
When I try to list the folders under folder2 with
dbutils.fs.ls('/mnt/my-data-src/folder2/')
the output is:
java.io.FileNotFoundException: File /folder2 does not exist.
ExecutionError Traceback (most recent call last)
<command-2660727172978602> in <module>
----> 1 dbutils.fs.ls('/mnt/my-data-src/folder2/')
/databricks/python_shell/dbruntime/dbutils.py in f_with_exception_handling(*args, **kwargs)
317 exc.__context__ = None
318 exc.__cause__ = None
--> 319 raise exc
320
321 return f_with_exception_handling
ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: java.io.FileNotFoundException: File /folder2 does not exist.
at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem.listStatus(NativeAzureFileSystem.java:2468)
at com.databricks.backend.daemon.data.client.DBFSV2.$anonfun$listStatus(DatabricksFileSystemV2.scala:95)
at com.databricks.s3a.S3AExceptionUtils$.convertAWSExceptionToJavaIOException(DatabricksStreamUtils.scala:66)
at com.databricks.backend.daemon.data.client.DBFSV2.$anonfun$listStatus(DatabricksFileSystemV2.scala:92)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation(UsageLogging.scala:395)
at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags(UsageLogging.scala:484)
at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags(UsageLogging.scala:504)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext(UsageLogging.scala:266)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionContext(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionTags(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:479)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:404)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperationWithResultTags(DatabricksFileSystemV2.scala:510)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:395)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:367)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperation(DatabricksFileSystemV2.scala:510)
at com.databricks.backend.daemon.data.client.DBFSV2.listStatus(DatabricksFileSystemV2.scala:92)
at com.databricks.backend.daemon.data.client.DatabricksFileSystem.listStatus(DatabricksFileSystem.scala:150)
at com.databricks.backend.daemon.dbutils.FSUtils$.$anonfun$ls(DBUtilsCore.scala:154)
at com.databricks.backend.daemon.dbutils.FSUtils$.withFsSafetyCheck(DBUtilsCore.scala:91)
at com.databricks.backend.daemon.dbutils.FSUtils$.ls(DBUtilsCore.scala:153)
at com.databricks.backend.daemon.dbutils.FSUtils.ls(DBUtilsCore.scala)
at sun.reflect.GeneratedMethodAccessor223.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Is there any specific reason why dbutils.fs.ls() does not list a folder that, in this case, contains only folders?
Answer:
I tried to access one of the files directly and noticed that its blob type is Append Blob.
dbutils.fs.ls('/mnt/my-data-src/folder2/file.json')
reports the following message:
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.
Is there a way to list append blob types in Databricks?
After a little research I found a link in the documentation: Official Doc. There is a way to list them as files from a Databricks notebook; refer to the git sample link.
Step 1. Install the azure-storage-blob module on the cluster from within the workspace:
%pip install azure-storage-blob
Step 2. Get the connection string of the Azure storage account and list the blobs in the container:

from azure.storage.blob import ContainerClient

CONNECTION_STRING_OF_AZURE_BLOB_STORAGE = '<connection-string-blob-storage-of-Access (IAM)>'

# Connect to the container and enumerate every blob it holds
# ("folders" are just prefixes in the blob names).
container = ContainerClient.from_connection_string(
    CONNECTION_STRING_OF_AZURE_BLOB_STORAGE, container_name="my-app-container")

blob_list = container.list_blobs()
for blob in blob_list:
    print(blob.name + '\n')
With the code above I was able to list all the files in every folder.
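Since the original goal was to find a specific file name, the same ContainerClient can also filter the listing by prefix and name. This is a minimal sketch, assuming the connection string and container from Step 2; the folder2/ prefix and the file1.json target name are illustrative placeholders.

from azure.storage.blob import ContainerClient

# Reuse the connection string from Step 2 (placeholder value).
container = ContainerClient.from_connection_string(
    CONNECTION_STRING_OF_AZURE_BLOB_STORAGE, container_name="my-app-container")

prefix = "folder2/"         # restrict the listing to one "folder" (blob name prefix)
target_name = "file1.json"  # file name we are looking for (example value)

# list_blobs(name_starts_with=...) returns every blob whose name starts with the
# prefix, so nested sub-folders are covered automatically.
matches = [b.name for b in container.list_blobs(name_starts_with=prefix)
           if b.name.split("/")[-1] == target_name]

for path in matches:
    print(path)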
Azure Databricks does support accessing append blobs through the Hadoop API, but only when appending to a file; there is no other workaround for this limitation.
You can use the Azure CLI or the Azure Storage SDK for Python to determine whether a directory contains append blobs or whether an object is an append blob (see the sketch below).
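As a rough illustration of that check, here is a minimal sketch using the same azure-storage-blob package; the connection string, container name, and blob name are placeholder values.

from azure.storage.blob import BlobServiceClient

# Placeholder connection string and names, for illustration only.
service = BlobServiceClient.from_connection_string(CONNECTION_STRING_OF_AZURE_BLOB_STORAGE)
blob_client = service.get_blob_client(container="my-app-container",
                                      blob="folder2/sub-folder1/file1.json")

# blob_type is one of 'BlockBlob', 'PageBlob', or 'AppendBlob'.
props = blob_client.get_blob_properties()
print(props.blob_type)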
You can implement a Spark SQL UDF, or a custom function using the RDD API, to load, read, or transform the blobs with the Azure Storage SDK for Python.
Official documentation has been provided for this issue.