Does Spark load the entire table into memory if the select condition is based on an RDD transformation?

Question

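// Read table a through the MemSQL connector; load() is what returns the Dataset.
Dataset<Row> a = spark.read().format("com.memsql.spark.connector").option("query", "select * from a").load();
// Keep only the rows where column x equals column y.
a = a.filter("x = y");
// Placeholder from the question: the x values of a, as a comma-separated string.
String xstring = "...select all values of x from a and make comma separated string";
Dataset<Row> b = spark.read().format("com.memsql.spark.connector").option("query", "select * from b where x in " + xstring).load();
b.show();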

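In this case, will Spark load the entire table b into memory and then filter out the rows that are not in xstring, or will it actually build xstring first and then load only a subset of table b into memory when we call show()?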

Answer 1

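When you query MemSQL with option("query", "select * from ......."), the entire result of that query (not the whole table) is read from MemSQL into the executors. MemSQL Spark Connector 2.0 supports column and filter pushdown, but for that the filter and join conditions need to be in the SQL itself rather than applied as filters and joins on the DataFrame. In your example the predicate is part of the SQL, so it is pushed down: the whole of table 'a' is read because its query has no filter condition, xstring is built, and then only the rows of table 'b' that match the x in (...) condition are read.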

The MemSQL documentation explains this.
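The "xstring will be built" step can be sketched in plain Java. This is only a minimal illustration, not something from the connector: the class `InClause` and its methods `build` and `literal` are hypothetical names, and it assumes the x values have already been collected on the driver as a list of numbers or strings.

```java
import java.util.List;
import java.util.stream.Collectors;

public class InClause {
    // Hypothetical helper: renders values as a SQL list such as (1, 2, 3).
    static String build(List<?> values) {
        if (values.isEmpty()) {
            // "x in ()" is not valid SQL, so refuse to build an empty list.
            throw new IllegalArgumentException("IN list must not be empty");
        }
        return values.stream()
                .map(InClause::literal)
                .collect(Collectors.joining(", ", "(", ")"));
    }

    // Numbers are emitted as-is; anything else is single-quoted,
    // with embedded single quotes doubled.
    static String literal(Object v) {
        if (v instanceof Number) {
            return v.toString();
        }
        return "'" + v.toString().replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        String xstring = build(List.of(1, 2, 3));
        // Prints: select * from b where x in (1, 2, 3)
        System.out.println("select * from b where x in " + xstring);
    }
}
```

In the question's setup, the values would presumably come from collecting column x of `a` on the driver (for example with `collectAsList()`), and the resulting string would be passed in the "query" option for table b. Since the condition is then part of the SQL text, MemSQL evaluates it and only the matching rows of b reach Spark. For a very large set of x values, a join written in the SQL itself may be a better fit than a long IN list.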

