
Does Apache Spark 3 support GPU usage for Spark RDDs?

I am currently trying to run a genome analysis pipeline using Hail (a genomic analysis library written in Python and Scala). Apache Spark 3 was recently released with support for GPU usage.

I tried the spark-rapids library and started a local Slurm cluster with GPU nodes. I was able to initialize the cluster. However, when I try to run Hail tasks, the executors keep getting killed.

When I asked on the Hail forum, the reply I received was:

That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any Spark-SQL interfaces, only the RDD interfaces.

So does Spark 3 not support GPU usage for the RDD interface?

As of now, spark-rapids does not support using GPUs through the RDD interface.

Source: Link

Apache Spark 3.0+ lets users provide a plugin that can replace the backend for SQL and DataFrame operations. This requires no API changes from the user. The plugin will replace SQL operations it supports with GPU accelerated versions. If an operation is not supported it will fall back to using the Spark CPU version. Note that the plugin cannot accelerate operations that manipulate RDDs directly.
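For context, the plugin is enabled purely through Spark configuration, with no application code changes. A minimal sketch of how it is typically enabled via spark-submit follows; the jar filename, version, application script, and GPU resource amounts are illustrative placeholders:

```shell
# Sketch: enabling the spark-rapids SQL plugin at submit time.
# Jar name/version, app script, and resource amounts are placeholders.
spark-submit \
  --jars rapids-4-spark_2.12-<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  your_app.py
```

Because the plugin hooks into the SQL/DataFrame planner, a job like Hail's that goes through the RDD API will still run on the CPU even with this configuration in place.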

Here is the answer from the spark-rapids team:

Source: Link

We do not support running the RDD API on GPUs at this time. We only support the SQL/Dataframe API, and even then only a subset of the operators. This is because we are translating individual Catalyst operators into GPU enabled equivalent operators. I would love to be able to support the RDD API, but that would require us to be able to take arbitrary java, scala, and python code and run it on the GPU. We are investigating ways to try to accomplish some of this, but right now it is very difficult to do. That is especially true for libraries like Hail, which use python as an API, but the data analysis is done in C/C++.