Unable to cat a DBFS file on a Databricks Community Edition cluster: FileNotFoundError: [Errno 2] No such file or directory

I am trying to read the Delta log file on a Databricks Community Edition cluster (Databricks Runtime 7.2).

df=spark.range(100).toDF("id")
df.show()
df.repartition(1).write.mode("append").format("delta").save("/user/delta_test")

with open('/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
  for l in f:
    print(l)

Getting file not found error:

FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<command-1759925981994211> in <module>
----> 1 with open('/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
      2   for l in f:
      3     print(l)

FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'

I have already tried prefixing the path with /dbfs/ and with dbfs:/. Neither worked, and I still get the same error.

with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
  for l in f:
    print(l)

However, I am able to read the file with dbutils.fs.head.

dbutils.fs.head("/user/delta_test/_delta_log/00000000000000000000.json")

'{"commitInfo":{"timestamp":1598224183331,"userId":"284520831744638","userName":"","operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"notebook":{"","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputBytes":"1171","numOutputRows":"100"}}}\n{"protocol":{"minReaderVersi...etc

How can we read/cat a DBFS file in Databricks with the Python open method?

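By default, this data lives on DBFS, and your code needs to know how to access it. Python's built-in open only sees the local file system of the driver node and knows nothing about DBFS, which is why it fails.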

But there is a workaround: DBFS is mounted on the nodes at /dbfs, so you only need to prepend that to your file name. Instead of /user/delta_test/_delta_log/00000000000000000000.json, use /dbfs/user/delta_test/_delta_log/00000000000000000000.json
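
As a rough illustration of that path mapping, here is a minimal sketch. The helper name to_local_dbfs_path is hypothetical (it is not part of dbutils or any Databricks API), and the resulting path is only readable on clusters where the /dbfs mount is actually available.

```python
def to_local_dbfs_path(path: str) -> str:
    """Map a DBFS path to its location under the /dbfs mount point.

    Hypothetical helper for illustration; not a Databricks API.
    """
    # Drop an optional "dbfs:" scheme so "dbfs:/user/x" and "/user/x" behave the same.
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    # Prepend the mount point, avoiding a doubled slash.
    return "/dbfs/" + path.lstrip("/")


local_path = to_local_dbfs_path("/user/delta_test/_delta_log/00000000000000000000.json")
print(local_path)
```

The returned string can then be passed to open, provided the mount exists on the cluster.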

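Update: on Community Edition with DBR 7+, this mount is disabled. The workaround is to use the dbutils.fs.cp command to copy the file from DBFS to a local directory, such as /tmp or /var/tmp, and then read it from there: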

dbutils.fs.cp("/file_on_dbfs", "file:///tmp/local_file")

Note that if you do not specify a URI scheme, the path refers to DBFS by default. To refer to a local file, you need to use the file:// prefix (see docs).
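
Putting the copy and the read together, a minimal sketch could look like the following. The function name read_json_lines_from_dbfs and the local target path are assumptions made for illustration. dbutils is passed in as an argument because it only exists inside a Databricks notebook, and the sketch assumes each line of the log file is one JSON object, as the dbutils.fs.head output in the question suggests.

```python
import json


def read_json_lines_from_dbfs(dbutils, dbfs_path, local_path):
    """Copy a DBFS file to the driver's local disk, then parse it line by line.

    Hypothetical helper: `dbutils` is the object Databricks provides in notebooks.
    """
    # "file://" + "/tmp/x" gives "file:///tmp/x", which targets the local
    # file system instead of DBFS.
    dbutils.fs.cp(dbfs_path, "file://" + local_path)
    # The copy is now an ordinary local file, so the built-in open works.
    with open(local_path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage inside a Databricks notebook (not runnable elsewhere):
# actions = read_json_lines_from_dbfs(
#     dbutils,
#     "/user/delta_test/_delta_log/00000000000000000000.json",
#     "/tmp/00000000000000000000.json",
# )
# for action in actions:
#     print(action)
```

Copying to /tmp keeps the original read loop with open unchanged; only the path it opens differs.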