Spark with custom Hadoop FileSystem

Question

I already have a YARN cluster that is configured to use a custom Hadoop FileSystem in core-site.xml:

<property>
    <name>fs.custom.impl</name>
    <value>package.of.custom.class.CustomFileSystem</value>
</property>

I want to run a Spark job on this YARN cluster that reads an input RDD from this CustomFileSystem:

final JavaPairRDD<String, String> files = 
        sparkContext.wholeTextFiles("custom://path/to/directory");

Is there any way to do this without reconfiguring Spark? That is, can I point Spark at the existing core-site.xml, and what would be the best way to do so?

Answer 1

Set HADOOP_CONF_DIR to the directory that contains core-site.xml. (This is documented in Running Spark on YARN.)

You will still need to make sure that package.of.custom.class.CustomFileSystem is on the classpath.
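
Both conditions can be checked before submitting the job. The following is only a minimal sketch using the Java standard library, not part of Spark or Hadoop: the class name CustomFsPreflight and its helper methods are hypothetical. It assumes a plain core-site.xml and ignores features that Hadoop's own Configuration loader supports, such as XInclude and variable substitution.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: checks the two prerequisites named in the answer.
final class CustomFsPreflight {

    // Returns the <value> of the named <property> in a Hadoop-style XML
    // stream, or null if no such property is present.
    static String readProperty(InputStream xml, String name) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    // True if the class can be located on this JVM's classpath.
    // The class is looked up without being initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, CustomFsPreflight.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        String confDir = System.getenv("HADOOP_CONF_DIR");
        if (confDir == null) {
            System.out.println("HADOOP_CONF_DIR is not set");
            return;
        }
        Path coreSite = Paths.get(confDir, "core-site.xml");
        if (!Files.isRegularFile(coreSite)) {
            System.out.println("No core-site.xml found in " + confDir);
            return;
        }
        String impl;
        try (InputStream in = Files.newInputStream(coreSite)) {
            impl = readProperty(in, "fs.custom.impl");
        }
        if (impl == null) {
            System.out.println("fs.custom.impl is not defined in " + coreSite);
        } else {
            System.out.println("fs.custom.impl = " + impl
                    + ", on classpath: " + isOnClasspath(impl));
        }
    }
}
```

Note that this only inspects the classpath of the JVM it runs in. On YARN the executors need the class as well, so the jar containing the FileSystem implementation typically has to be shipped with the job, for example via the --jars option of spark-submit.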
