Spark write to parquet on hdfs

Question

I have a 3-node Hadoop and Spark installation. I want to load data from an RDBMS into a DataFrame and write that data to HDFS as Parquet. The "dfs.replication" value is 1.

When I try this with the following command, I see that all the HDFS blocks are located on the node where I ran spark-shell:

scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")

Is this the expected behaviour, or should the blocks be distributed across the whole cluster?

Thanks

Answer 1

Since you are writing the data to HDFS, this depends on HDFS rather than on Spark. From Hadoop: The Definitive Guide:

Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).

So yes, this is the expected behaviour.
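To make the quoted rule concrete, here is a toy model in plain Python. It is only a sketch of the first-replica rule as stated in the quote, not HDFS's actual placement code, and it ignores the "too full or too busy" check. The node names sparknode02 and sparknode03 and both function names are made up for illustration. With a replication factor of 1, the first replica is the only replica, so each block lands wherever its writing client runs.

```python
import random
from collections import Counter


def place_first_replica(client_node, cluster_nodes, rng):
    """Toy version of the quoted rule (not the real HDFS implementation).

    A client running on a cluster node gets the replica locally;
    a client outside the cluster gets a randomly chosen node.
    """
    if client_node in cluster_nodes:
        return client_node
    return rng.choice(cluster_nodes)


def write_blocks(writer_nodes, cluster_nodes, seed=0):
    """Count blocks per node when each entry of writer_nodes writes one block.

    Assumes dfs.replication = 1, so only the first replica exists.
    """
    rng = random.Random(seed)
    return Counter(
        place_first_replica(writer, cluster_nodes, rng) for writer in writer_nodes
    )


# sparknode02 and sparknode03 are hypothetical names for the other two nodes.
nodes = ["sparknode01", "sparknode02", "sparknode03"]

# Every block is written by a client on sparknode01.
single = write_blocks(["sparknode01"] * 6, nodes)

# The same six blocks, but written by clients spread over all three nodes.
spread = write_blocks(nodes * 2, nodes)

print(single)
print(spread)
```

In this model the first case puts all six blocks on sparknode01, which matches what the question describes, while the second case ends up with two blocks per node. Blocks only spread out when the writing clients themselves run on different nodes.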

Answer 2

As @nik said. I did the write with multiple clients and that did it for me:

Here is the Python snippet:

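# Rebuild the DataFrame from its underlying RDD, keeping the original column names
columns = xfact.columns
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a), columns)
# Write to HDFS as Parquet, replacing any existing output at that path
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')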

hadoop

scala

hdfs

apache-spark

parquet