Is there an Apache Spark counterpart to Hadoop Streaming?

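I want to implement some highly customized processing logic in C++. Hadoop Streaming lets me integrate logic written in C++ into a MapReduce processing pipeline. I would like to know whether I can do the same with Apache Spark.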

The closest (but not exactly equivalent) solution is the RDD.pipe method:

Return an RDD created by piping elements to a forked external process. The resulting RDD is computed by executing the given process once per partition. All elements of each input partition are written to a process's stdin as lines of input separated by a newline. The resulting partition consists of the process's stdout output, with each line of stdout resulting in one element of the output partition. A process is invoked even for empty partitions.

The print behavior can be customized by providing two functions.

The Spark test suite provides a number of usage examples.