Spark persist function when reusing a dataset

Question

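Suppose I have created a dataset through various transformations (joins, maps, etc.) and saved it to table A in HBase. Now I want to save the same dataset to other HBase tables, selecting specific columns for each. In that case, should I call persist after saving to table A? Or is it fine if I just use select?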

For example:

Dataset<Row> ds = //computing dataset by different transformations
//save ds to table A in hbase

ds.persist();

Dataset<Row> ds2 = ds.select(col("X"));
//save ds2 to table B in hbase

Dataset<Row> ds3 = ds.select(col("Y"),col("Z"));
//save ds3 to table C in hbase

ds.unpersist();

Answer 1

You can do it like this:

Dataset<Row> ds = //computing dataset by different transformations
ds.persist();    
//save ds to table A in hbase

Dataset<Row> ds2 = ds.select(col("X"));
//save ds2 to table B in hbase

Dataset<Row> ds3 = ds.select(col("Y"),col("Z"));
//save ds3 to table C in hbase

ds.unpersist();

This way you persist the whole dataset once, then save different sets of columns to different tables.

Answer 2

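Spark is lazy. In this case, that means that if you do not persist the data, all transformations will be redone for each action. So if computing the dataset ds

Dataset<Row> ds = //computing dataset by different transformations

takes a long time, persisting the data is definitely advantageous. For best effect, I recommend calling persist before the first save (the save to table A). If you persist only after that save, all the data reading and transformations will be done twice: once for the save to table A and once more for the first action that runs after persist.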
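To make the counting argument concrete, here is a minimal plain-Java analogy. It does not use Spark at all: LazyPipeline, its persist() and collect() methods, and the runs() helper are hypothetical names invented for this sketch, and it assumes a simplified model in which persist only marks the data for caching and the first action afterwards fills the cache. Real Spark caches per partition and may evict blocks, so treat the numbers as an illustration of the idea rather than a guarantee.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Plain-Java analogy (not Spark API): a lazy pipeline that recomputes on every action unless cached. */
final class LazyPipeline {
    private final Supplier<List<Integer>> compute; // stands in for the expensive transformations
    private boolean persistRequested = false;      // set by persist(); only a marker
    private List<Integer> cached = null;           // filled by the first action after persist()

    LazyPipeline(Supplier<List<Integer>> compute) {
        this.compute = compute;
    }

    /** Marks the result for caching. Nothing is computed here. */
    LazyPipeline persist() {
        persistRequested = true;
        return this;
    }

    /** An "action": runs the pipeline, or reuses the cached result when one exists. */
    List<Integer> collect() {
        if (cached != null) {
            return cached;
        }
        List<Integer> result = compute.get();
        if (persistRequested) {
            cached = result;
        }
        return result;
    }

    /** Counts how often the expensive step runs across three actions (the saves to A, B and C). */
    static int runs(boolean persistBeforeFirstSave, boolean persistAfterFirstSave) {
        AtomicInteger computations = new AtomicInteger();
        LazyPipeline ds = new LazyPipeline(() -> {
            computations.incrementAndGet();
            return Arrays.asList(1, 2, 3);
        });
        if (persistBeforeFirstSave) {
            ds.persist();
        }
        ds.collect(); // stands in for the save to table A
        if (persistAfterFirstSave) {
            ds.persist();
        }
        ds.collect(); // stands in for the save to table B
        ds.collect(); // stands in for the save to table C
        return computations.get();
    }

    public static void main(String[] args) {
        System.out.println("no persist:            " + runs(false, false)); // prints 3
        System.out.println("persist before save A: " + runs(true, false));  // prints 1
        System.out.println("persist after save A:  " + runs(false, true));  // prints 2
    }
}
```

In this model, persisting before the first save runs the expensive step once, persisting after it runs the step twice, and not persisting at all runs it once per save.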

Note that you should not call unpersist() until all actions on the dataset and on the datasets derived from it have completed.

Tags: java, persistence, caching, dataset, apache-spark