PySpark: how to loop through a directory, fetch files, and count the number of rows

Question

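I am trying to loop through an HDFS directory and its subdirectories to fetch the CSV files and count the number of rows in each file. I am using the code snippet below, but it keeps throwing the error "IllegalArgumentException: 'Pathname /hdfs:/data/msd from /hdfs:/data/msd is not a valid DFS filename.'"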

hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration() 
path = hadoop.fs.Path("/hdfs:///data/msd")

for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())

Answer 1

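Just remove the first slash from the path. It should be hdfs:///data/msd rather than /hdfs:///data/msd. With the leading slash, Hadoop sees no URI scheme and treats the whole string as a plain path, and the colon inside that path is what makes it an invalid DFS filename.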
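Once the path is fixed, the rest of what the question asks for (recursing into subdirectories and counting rows per CSV file) could look roughly like the sketch below. This is not from the original answer. The function names list_csv_files and count_rows are made up for illustration, and the code assumes sc is an active SparkContext, as it is in the pyspark shell. It relies on the Hadoop calls Path.getFileSystem and FileSystem.listFiles(path, recursive), and on sc.textFile(...).count() for the line count.

```python
def list_csv_files(sc, root):
    """Recursively collect the paths of all .csv files under `root`.

    `sc` is assumed to be an active SparkContext; this helper name is
    hypothetical and not part of PySpark.
    """
    hadoop = sc._jvm.org.apache.hadoop
    path = hadoop.fs.Path(root)
    # Resolve the file system from the path's own scheme (e.g. hdfs://).
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    files = []
    # Second argument True asks Hadoop to recurse into subdirectories.
    it = fs.listFiles(path, True)
    while it.hasNext():
        p = it.next().getPath().toString()
        if p.lower().endswith(".csv"):
            files.append(p)
    return files


def count_rows(sc, root):
    """Map each CSV path under `root` to its number of lines."""
    # One Spark job per file; the count includes a header line if present.
    return {p: sc.textFile(p).count() for p in list_csv_files(sc, root)}
```

Calling count_rows(sc, "hdfs:///data/msd") should then return a dictionary keyed by file path. If the files have a header row, subtract one from each count, and for a very large number of small files a single job over all paths may be faster than one job per file.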

hadoop

hdfs

apache-spark

pyspark