Mock a Spark RDD in the unit tests

Question

Is it possible to mock an RDD without using a SparkContext?

I want to unit test the following utility function:

 def myUtilityFunction(data1: org.apache.spark.rdd.RDD[myClass1], data2: org.apache.spark.rdd.RDD[myClass2]): org.apache.spark.rdd.RDD[myClass1] = {...}

So I need to pass data1 and data2 to myUtilityFunction. How can I create data1 from a mocked org.apache.spark.rdd.RDD[myClass1], instead of creating a real RDD from a SparkContext? Thank you!

Answer 1

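RDDs are pretty complex, and mocking them is probably not the best way to go about creating test data. Instead, I'd recommend using sc.parallelize with your data. I also (somewhat biased) think that https://github.com/holdenk/spark-testing-base can help by providing a trait to set up and tear down the Spark context for your tests.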
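As a rough sketch of the sc.parallelize approach: the case classes MyClass1 and MyClass2 and the body of myUtilityFunction below are hypothetical stand-ins (the question elides the real ones), and the example assumes Spark is on the test classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins for the question's myClass1 / myClass2.
case class MyClass1(id: Int, label: String)
case class MyClass2(id: Int)

object ParallelizeExample {
  // Hypothetical body (the real one is elided in the question):
  // keep the MyClass1 records whose id also appears in data2.
  def myUtilityFunction(data1: RDD[MyClass1], data2: RDD[MyClass2]): RDD[MyClass1] =
    data1.keyBy(_.id).join(data2.keyBy(_.id)).map { case (_, (left, _)) => left }

  def run(): Seq[MyClass1] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("parallelize-example")
    val sc = new SparkContext(conf)
    try {
      // Real RDDs built from small in-memory collections; no mocking needed.
      val data1 = sc.parallelize(Seq(MyClass1(1, "a"), MyClass1(2, "b")))
      val data2 = sc.parallelize(Seq(MyClass2(2)))
      myUtilityFunction(data1, data2).collect().toSeq
    } finally {
      sc.stop() // always release the context, even if the test body fails
    }
  }
}
```

In a real test you would compare the collected result against the expected records with your test framework's assertions.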

Answer 2

I totally agree with @Holden on that!

Mocking RDDs is difficult; executing your unit tests in a local Spark context is preferred, as recommended in the programming guide.

I know this may not technically be a unit test, but it is hopefully close enough.

Unit Testing

Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.
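One way to follow the guide's advice about the finally block is a small loan-pattern helper. This is only a sketch: LocalSpark and withLocalSparkContext are names I made up for illustration, not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalSpark {
  // Hypothetical helper: lends a local SparkContext to `body`
  // and stops it afterwards, whether or not `body` throws.
  def withLocalSparkContext[A](body: SparkContext => A): A = {
    val conf = new SparkConf().setMaster("local").setAppName("unit-test")
    val sc = new SparkContext(conf)
    try body(sc)
    finally sc.stop()
  }

  // Example use: each call gets a fresh context that is torn down on exit.
  def doubledSum(): Int =
    withLocalSparkContext { sc =>
      sc.parallelize(1 to 10).map(_ * 2).reduce(_ + _)
    }
}
```

Because the context is stopped before the helper returns, consecutive tests never end up with two contexts alive at once.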

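But if you are really interested and still want to try mocking RDDs, I'd suggest reading the ImplicitSuite test code.

The only reason they pseudo-mock the RDD is to test whether the implicits work well with the compiler; they don't actually need a real RDD.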

def mockRDD[T]: org.apache.spark.rdd.RDD[T] = null

And it's not even a real mock. It just creates a null object of type RDD[T].
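To make the limitation concrete, here is a small sketch (the object name NullRddExample and its members are mine, for illustration only). The null value is enough for code that only has to compile, but any method actually called on it fails at runtime, so it cannot stand in for data1 or data2.

```scala
import org.apache.spark.rdd.RDD
import scala.util.Try

object NullRddExample {
  def mockRDD[T]: RDD[T] = null

  // Never invoked; it only has to compile. The compiler needs just the
  // static type RDD[(Int, Int)] to resolve the implicit pair-RDD methods.
  def compileOnly(): Unit = {
    val rdd: RDD[(Int, Int)] = mockRDD
    rdd.groupByKey()
  }

  // At runtime the "mock" is simply null...
  val isNull: Boolean = mockRDD[Int] == null

  // ...so calling any RDD method on it throws (a NullPointerException).
  val countFails: Boolean = Try(mockRDD[Int].count()).isFailure
}
```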

