Pyspark FileSystem fs.listStatus(sc._jvm.org.apache.hadoop.fs.Path(path)) only returns the first sub-directory

hdfs
pyspark

I want to recursively traverse a given HDFS path in PySpark without using hadoop fs -ls [path]. I tried the solution suggested here, but found that listStatus() only returns the status of the first sub-directory in the given path. According to this documentation, listStatus should return "the statuses of the files/directories in the given path if the path is a directory." What am I missing?

I am using Hadoop 2.9.2, Spark 2.3.2, and Python 2.7.

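I could not fully reproduce the scenario, but I think it is related to the following fact: if the path is not a directory, listStatus() on that path returns a list of length 1 that contains only the path itself.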
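For reference, here is a minimal sketch of a recursive traversal built on listStatus(). It is not taken from the linked solution; the helper names walk and list_hdfs_files are hypothetical, and the code assumes sc is an existing SparkContext whose Hadoop configuration points at the right cluster. It only uses FileStatus.isDirectory(), FileStatus.getPath() and Path.toString() from the Hadoop API, and avoids `yield from` so it also runs on Python 2.7.

```python
def walk(fs, path):
    """Yield the path string of every file under `path`.

    `fs` is expected to behave like org.apache.hadoop.fs.FileSystem and
    `path` like org.apache.hadoop.fs.Path (both reached through py4j).
    If `path` is a file rather than a directory, listStatus() returns a
    single status for the path itself, so that one path is yielded.
    """
    for status in fs.listStatus(path):
        if status.isDirectory():
            # Descend into the sub-directory and keep iterating afterwards;
            # a `return` here would stop after the first sub-directory.
            for child in walk(fs, status.getPath()):
                yield child
        else:
            yield status.getPath().toString()


def list_hdfs_files(sc, root):
    """Hypothetical convenience wrapper; `sc` is assumed to be a SparkContext."""
    hadoop_fs = sc._jvm.org.apache.hadoop.fs
    fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    return list(walk(fs, hadoop_fs.Path(root)))
```

Calling list_hdfs_files(sc, path) with the path from the question should then return every file below it. If it still shows only one entry, it may be worth checking whether the path that is passed in is actually a directory.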
