Read part files from HDFS into a data frame using PySpark

Question

I have multiple files stored at an HDFS location, as follows:

/user/project/202005/part-01798

/user/project/202005/part-01799

There are 2000 such part files. Each file has the following format:

{'Name':'abc','Age':28,'Marks':[20,25,30]} 
{'Name':...}

and so on. I have 2 questions:

1) How can I check whether these are multiple files or multiple partitions of the same file?
2) How can I read them into a data frame using PySpark?

Answer 1

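Since these files sit in a single directory and are named part-xxxxx, you can safely assume they are multiple part files of the same dataset. If they were partitions, they would be saved under a path like /user/project/date=202005/*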
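
The naming convention described above can be checked mechanically. The following is a minimal sketch, not part of the original answer: the helper names `is_part_file` and `partition_columns` are hypothetical, and the check looks only at path strings (for example, paths copied from the output of `hdfs dfs -ls /user/project/202005`), not at file contents.

```python
import re

# Hypothetical helpers: they inspect path names only, never file contents.
# Matches names such as part-01798, part-r-00000 or part-00000-<uuid>.json
PART_FILE = re.compile(r"^part-(?:[a-z]-)?\d+")


def is_part_file(path):
    """Return True if the last path segment looks like a part file."""
    name = path.rstrip("/").split("/")[-1]
    return bool(PART_FILE.match(name))


def partition_columns(path):
    """Collect key=value directory segments (partition-style layout)."""
    columns = {}
    # Skip the last segment: it is the file name, not a directory.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            columns[key] = value
    return columns
```

With the paths from the question, `is_part_file('/user/project/202005/part-01798')` is true and `partition_columns` returns an empty dict, which is consistent with plain part files of one dataset. A path such as /user/project/date=202005/part-00000 would instead yield a `date` entry.
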
Assuming these are CSV files, you can pass the directory "/user/project/202005" to Spark as input, as follows:

df = spark.read.csv('/user/project/202005/*',header=True, inferSchema=True)
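
The sample records in the question look like one JSON object per line rather than CSV, so reading them as JSON may be a better fit. The following is a minimal sketch under that assumption. The function name `read_part_files` is hypothetical, and `spark` is assumed to be an existing SparkSession. `spark.read.json` expects line-delimited JSON by default, and the `allowSingleQuotes` option (on by default) is set explicitly here only because the sample uses single quotes.

```python
def read_part_files(spark, directory):
    """Read every part file under `directory` as line-delimited JSON.

    `spark` is assumed to be an existing SparkSession.
    """
    # allowSingleQuotes is already the default; it is spelled out because
    # the sample records use single-quoted keys and strings.
    return spark.read.option("allowSingleQuotes", "true").json(directory)


# Usage, assuming a working Spark installation:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("read-part-files").getOrCreate()
#   df = read_part_files(spark, "/user/project/202005")
#   df.printSchema()
```

Passing the directory itself lets Spark pick up all part files in it, so the 2000 files end up in a single data frame. Checking `df.printSchema()` afterwards is a quick way to confirm that Name, Age and Marks were parsed as expected.
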
