{DataFrameWriter CSV to HDFS file system} Write data without partitioning

Question

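Here, df is our output DataFrame. I am using DataFrameWriter to write the entire output to a directory, but the data ends up split across several part files, as shown below: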

$ hdfs dfs -ls /path to hdfs directory..

Found 4 items

-rw-r--r--   3 xxxxxx xxxxxxx          0 2022-04-28 23:19 path to hdfs directory../_SUCCESS

-rw-r--r--   3 xxxxxx xxxxxx        238 2022-04-28 23:19 path to hdfs directory../part-00000-4bc48c17-5c85-44be-bf34-3645d2b2e085-c000.csv

-rw-r--r--   3 xxxxxxx xxxxxxx    6204498 2022-04-28 23:19 path to hdfs directory../part-00043-4bc48c17-5c85-44be-bf34-3645d2b2e085-c000.csv

-rw-r--r--   3 xxxxxxx xxxxxxx    5875627 2022-04-28 23:19 path to hdfs directory../part-00191-4bc48c17-5c85-44be-bf34-3645d2b2e085-c000.csv

I want all the data in a single CSV file. Is there another option I can add to the code below?

df.write.mode("overwrite").csv('path to hdfs directory', header = True, sep = ',')

df contains about 55k rows.

Answer 1

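You can use coalesce(1) to produce a single CSV file. Spark writes a separate part file for each partition that holds data, so reducing the DataFrame to one partition before writing leaves one part file: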

df.coalesce(1).write.mode("overwrite").csv('path to hdfs directory', header = True, sep = ',')
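Note that even with coalesce(1) the target path is still a directory: it will contain _SUCCESS plus one part-*.csv file, not a file with a name of your choosing. Below is a minimal sketch of how that could be wrapped up. The function names write_single_csv and pick_part_file are hypothetical, chosen here for illustration, and the sketch assumes df is a pyspark.sql.DataFrame and that you obtain the directory listing yourself (for example from hdfs dfs -ls).

```python
from typing import Iterable


def write_single_csv(df, out_dir: str) -> None:
    """Write df (assumed to be a pyspark.sql.DataFrame) as one part file in out_dir."""
    # coalesce(1) merges all partitions into one, so a single task
    # writes a single part-*.csv file inside out_dir.
    (df.coalesce(1)
       .write.mode("overwrite")
       .csv(out_dir, header=True, sep=","))


def pick_part_file(names: Iterable[str]) -> str:
    """Hypothetical helper: return the one part-*.csv entry from a directory listing."""
    parts = [
        n for n in names
        if n.rsplit("/", 1)[-1].startswith("part-") and n.endswith(".csv")
    ]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0]
```

For about 55k rows a single partition should be fine. For much larger data, keep in mind that coalesce(1) funnels the whole write through one task, so it can become a bottleneck.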

csv

hdfs

dataframe

apache-spark-sql

pyspark