Is it possible to get and use a JavaSparkContext from within a task?

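I'm in a situation where I'd like to perform a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately I have some existing Spark pipelines (potentially DataFrames) that I could reuse.

For every incoming record, I'd like to potentially launch a Spark job from the task to get the necessary information to decorate it with.

Considering the performance implications, would this ever be a good idea?

Not considering the performance implications, is this even possible?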

Is it possible to get and use a JavaSparkContext from within a task?

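No. The Spark context is only valid on the driver, and Spark will prevent it from being serialized. It is therefore not possible to use the Spark context from within a task.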

For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with. Considering the performance implications, would this ever be a good idea?

Without more details, my umbrella answer would be: probably not a good idea.

Not considering the performance implications, is this even possible?

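Yes, probably by bringing the base collection to the driver (collect) and iterating over it there. If that collection does not fit in the driver's memory, refer back to the previous point.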
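A minimal sketch of that collect-and-iterate approach, assuming both datasets are DataFrames sharing a `key` column. The class name, the sample data and the `label` column are hypothetical, chosen only for illustration. The loop runs on the driver, so each iteration is free to launch its own Spark job:

```java
import static org.apache.spark.sql.functions.col;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DriverSideLookup {

    // Hypothetical incoming records, identified by a "key" column.
    static Dataset<Row> sampleRecords(SparkSession spark) {
        StructType schema = new StructType().add("key", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("a"), RowFactory.create("b"), RowFactory.create("c"));
        return spark.createDataFrame(rows, schema);
    }

    // Hypothetical "decorating" dataset: key -> label.
    static Dataset<Row> sampleLookup(SparkSession spark) {
        StructType schema = new StructType()
                .add("key", DataTypes.StringType)
                .add("label", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("a", "alpha"), RowFactory.create("b", "beta"));
        return spark.createDataFrame(rows, schema);
    }

    static List<String> decorateOnDriver(Dataset<Row> records, Dataset<Row> lookup) {
        List<String> decorated = new ArrayList<>();
        // collectAsList() pulls every record into driver memory.
        for (Row record : records.collectAsList()) {
            String key = record.getAs("key");
            // One Spark job per record, launched from the driver (not from a task).
            List<Row> matches = lookup.filter(col("key").equalTo(key))
                    .select("label")
                    .collectAsList();
            String label = matches.isEmpty() ? null : matches.get(0).getString(0);
            decorated.add(key + " -> " + label);
        }
        return decorated;
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("driver-side-lookup")
                .getOrCreate();
        try {
            decorateOnDriver(sampleRecords(spark), sampleLookup(spark))
                    .forEach(System.out::println);
        } finally {
            spark.stop();
        }
    }
}
```

This launches as many jobs as there are records, and everything passes through the driver, which is why it is only workable for small inputs.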

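If we need to process every record, consider performing some form of join with the 'decorating' dataset. That will be a single large job instead of a large number of small ones.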
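As a rough sketch of the join approach, assuming the incoming records and the decorating data are both available as DataFrames with a shared `key` column. The class name, sample data and `label` column are hypothetical. A left join is used here so that records without a match are kept rather than dropped; an inner join would work equally well if unmatched records are not needed:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DecorateWithJoin {

    // Hypothetical incoming records, identified by a "key" column.
    static Dataset<Row> sampleRecords(SparkSession spark) {
        StructType schema = new StructType().add("key", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("a"), RowFactory.create("b"), RowFactory.create("c"));
        return spark.createDataFrame(rows, schema);
    }

    // Hypothetical "decorating" dataset: key -> label.
    static Dataset<Row> sampleLookup(SparkSession spark) {
        StructType schema = new StructType()
                .add("key", DataTypes.StringType)
                .add("label", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("a", "alpha"), RowFactory.create("b", "beta"));
        return spark.createDataFrame(rows, schema);
    }

    // Decorates all records in one job; unmatched records get a null label.
    static Dataset<Row> decorate(Dataset<Row> records, Dataset<Row> lookup) {
        return records
                .join(lookup, records.col("key").equalTo(lookup.col("key")), "left")
                // Drop the duplicate key column contributed by the lookup side.
                .drop(lookup.col("key"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("decorate-with-join")
                .getOrCreate();
        try {
            decorate(sampleRecords(spark), sampleLookup(spark)).show();
        } finally {
            spark.stop();
        }
    }
}
```

Spark plans the whole decoration as one distributed job, so no task ever needs access to the Spark context.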

apache-spark

spark-streaming

apache-spark-sql