How to get a list of all leaf folders from an ADLS Gen2 path via Scala code?

Question

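We have folders and subfolders organized into year, month, and day folders. How can we use the dbutils.fs.ls utility to get only the list of last (leaf-level) folders?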

Example paths:

abfss://cont@storage.dfs.core.windows.net/customer/data/V1/2021/
abfss://cont@storage.dfs.core.windows.net/customer/data/V1/2022/
abfss://cont@storage.dfs.core.windows.net/customer/data/V1/2022/03/24/15/a.parquet
abfss://cont@storage.dfs.core.windows.net/customer/data/V1/2022/03/25/15/b.parquet
.
.

The function should return only the list of last leaf-level folders, i.e.:

abfss://cont@storage.dfs.core.windows.net/customer/data/V1/2022/03/24/15
abfss://cont@storage.dfs.core.windows.net/customer/data/V1/2022/03/25/15

Edit:

I tried the function below and it works, but it fails with the error "java.lang.UnsupportedOperationException: empty.reduceLeft" when some folders are empty. Please help.

def listLeafDirectories(path: String): Array[String] =
  dbutils.fs.ls(path).map(file => {
    // Work around double encoding bug
    val path = file.path.replace("%25", "%").replace("%25", "%")
    if (file.isDir) listLeafDirectories(path)
    else Array[String](path.substring(0,path.lastIndexOf("/")+1))
  }).reduce(_ ++ _).distinct

Answer 1

The function should return only last leaf level folder list

You can use Scala's fold function.

def fold[A1 >: A](z: A1)(op: (A1, A1) => A1): A1

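Folds the elements of this list using the specified associative binary operator. The default implementation in IterableOnce is equivalent to foldLeft but may be overridden for more efficient traversal orders.

The order in which operations are performed on elements is unspecified and may be nondeterministic.

Returns the result of applying the fold operator op between all the elements and z, or z if this list is empty.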

For more details, see this link.
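To make the difference concrete, here is a minimal sketch in plain Scala (no dbutils involved) comparing reduce and fold on an empty collection. The object name FoldDemo and the sample values are hypothetical, chosen only for illustration; the sequences stand in for the per-folder results that the question's function combines.

```scala
object FoldDemo {
  // Stand-ins for the arrays produced per folder entry (hypothetical sample data).
  val nonEmpty: Seq[Array[String]] = Seq(Array("a/", "b/"), Array("b/", "c/"))
  val empty: Seq[Array[String]] = Seq.empty

  // fold starts from the seed value, so it also works when there is nothing to combine.
  val merged: Array[String] = nonEmpty.fold(Array.empty[String])(_ ++ _).distinct
  val mergedEmpty: Array[String] = empty.fold(Array.empty[String])(_ ++ _)

  // reduce has no seed and throws on an empty collection.
  val reduceFails: Boolean = scala.util.Try(empty.reduce(_ ++ _)).isFailure
}
```

In the question's function, this would presumably mean replacing `.reduce(_ ++ _)` with `.fold(Array.empty[String])(_ ++ _)`, so that an empty folder contributes an empty array instead of raising an exception.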

Answer 2

The following function worked for me:

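def listDirectories(dir: String, recurse: Boolean): Array[String] = {
  // Note: `recurse` is passed along but never checked; the function always recurses.
  // reduceOption returns None for an empty folder, so getOrElse supplies an empty array.
  dbutils.fs.ls(dir).map(file => {
    // Work around double encoding bug
    val path = file.path.replace("%25", "%").replace("%25", "%")
    if (file.isDir) listDirectories(path, recurse)
    else Array[String](path.substring(0, path.lastIndexOf("/") + 1))
  }).reduceOption(_ union _).getOrElse(Array()).distinct
}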
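As an alternative sketch, the same traversal can be written with flatMap, which needs no special handling for empty folders because an empty listing simply flattens to nothing. This is not the answerer's code: the names LeafDirs, Entry, and leafDirs are hypothetical, and the listing function is passed in as a parameter so the example can run outside Databricks. It also drops the trailing slash to match the expected output in the question, and omits the "%25" workaround.

```scala
object LeafDirs {
  // Hypothetical stand-in for the entries returned by dbutils.fs.ls.
  final case class Entry(path: String, isDir: Boolean)

  // Returns the distinct parent folders of all files found under `root`.
  // `ls` is whatever lists the direct children of a folder.
  def leafDirs(root: String, ls: String => Seq[Entry]): Seq[String] =
    ls(root).flatMap { e =>
      if (e.isDir) leafDirs(e.path, ls)                       // descend into subfolders
      else Seq(e.path.substring(0, e.path.lastIndexOf("/")))  // folder containing the file
    }.distinct

  // In-memory folder tree used only for illustration; "root/2021/" is empty.
  val sampleTree: Map[String, Seq[Entry]] = Map(
    "root/" -> Seq(Entry("root/2021/", true), Entry("root/2022/", true)),
    "root/2021/" -> Seq(),
    "root/2022/" -> Seq(Entry("root/2022/03/", true)),
    "root/2022/03/" -> Seq(
      Entry("root/2022/03/a.parquet", false),
      Entry("root/2022/03/b.parquet", false))
  )

  val sampleResult: Seq[String] = leafDirs("root/", sampleTree)
}
```

On Databricks, the listing function could presumably be supplied as `p => dbutils.fs.ls(p).map(f => LeafDirs.Entry(f.path, f.isDir))`, assuming the notebook's dbutils is in scope.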

scala

apache-spark

databricks

azure-databricks