Cannot save collect-ed RDD to local file system of Driver

Question

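I am trying to save an RDD after calling collect() on it. I invoke spark-submit on Host-1 (I am assuming the Driver is the host from which I invoke the spark-submit script, so in this case Host-1 is the Driver), fetch some data from HBase, run some operations on it, then call collect() on the RDD, iterate over the collected list, and save it to a file on the local file system. Essentially: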

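from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="HBaseInputFormat")
    # read the data from hbase
    # ...
    # ...
    # collect() returns the RDD's contents as a list in the driver process
    output = new_rdd.collect()

    # the with block closes the file on exit, so no explicit close() is needed
    with open("/var/tmp/tmpfile.csv", "w") as tmpf:
        for o in output:
            print(o)
            tmpf.write("%s\n" % str(o))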

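This actually works, with the data saved in /var/tmp/tmpfile.csv, except that the data is saved on a different host than the Driver, say Host-3. I was under the impression that collect always gathers the distributed data set on the Driver host, and therefore the file should be created on the Driver as well. Where am I wrong?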

Answer 1

I am assuming the Driver is the host from which I invoke the spark-submit script so in this case Host-1 is the Driver

That is incorrect! See the documentation on running Spark on YARN:

In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

You are probably running Spark in yarn-cluster mode, in which case the driver was placed on one of the nodes in the cluster.
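One way to confirm where the driver actually ended up is to log the hostname from the driver code itself. The following is a minimal sketch, not part of the original answer: `describe_driver` is a hypothetical helper, and the `spark.submit.deployMode` lookup assumes a Spark release that sets that property.

```python
import socket


def describe_driver(conf_get=None):
    """Return (hostname, deploy mode) for the process running this code.

    conf_get is an optional callable taking (key, default); inside a Spark
    job it could be sc.getConf().get (an assumption, see the note above).
    """
    host = socket.gethostname()
    if conf_get is None:
        mode = "unknown"
    else:
        mode = conf_get("spark.submit.deployMode", "unknown")
    return host, mode


host, mode = describe_driver()
print("driver code is running on %s (deploy mode: %s)" % (host, mode))
```

Called right after the SparkContext is created, this prints the name of the machine that will also receive the collect() result. If that name is not Host-1, the driver is not running where spark-submit was invoked.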

Change it to yarn-client, and the driver will run on the node from which you submitted the job.
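As an illustration of that change, the sketch below builds a spark-submit command line that requests client deploy mode. It is an assumption-laden example rather than the asker's actual invocation: `build_submit_command` and the script name `my_job.py` are hypothetical, and it uses the `--master yarn --deploy-mode client` spelling, whereas older releases also accepted `--master yarn-client` as written in the answer.

```python
def build_submit_command(script, deploy_mode="client"):
    """Build a spark-submit argument list for YARN.

    script is the path of the PySpark job (hypothetical in this example);
    deploy_mode must be "client" or "cluster".
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        script,
    ]


cmd = build_submit_command("my_job.py")
print(" ".join(cmd))
# prints: spark-submit --master yarn --deploy-mode client my_job.py
```

In client mode the driver lives in the spark-submit process, so a file written after collect() lands on the host where the command was run.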


python

hadoop

hbase

apache-spark

pyspark