从另一个数据源读取路径时，如何避免使用 collectAsList()？

Question

我将 parquet 路径显示为 table 列，然后需要将此列列表作为输入传递给 readFrom parquet。

List<Row> rows = spark.read.<datasource>.select("path").collectAsList();

List<String> paths =  <convert the rows to string>

spark.read.parquet(paths).

collectAsList 是一项开销很大的操作，需要将数据传送给驱动程序。

有没有更好的方法？

Answer 1

Is there a better approach?

不，别无选择。

spark.read.parquet表示的代码将始终在驱动程序上执行。驱动程序将告诉各个执行程序各自的执行程序应该加载 parquet 文件的哪一部分，然后执行程序将加载数据。但是协调哪个执行者应该处理镶木地板文件的哪一部分是驱动程序的任务。因此必须将路径发送给驱动程序。

坏消息之后是好消息：确实 collectAsList 很贵，但 没那么贵 。 collectAsList 在处理 巨大的 数据帧时非常昂贵。 Huge 在这种情况下意味着数以百万计的行。我怀疑您是否打算加载那么多镶木地板文件。只要路径列表“仅”包含几万行，将此列表发送给驱动程序就没有问题。运行驱动程序的标准 JVM 将轻松处理这样的列表。

Answer 2

另一种方法是使用 CollectionAccumulator。

df.javaRDD().forEachPartition{
//executor code
//add values to the accumulator
}

//Note that this is executed in the driver
List<String> paths = accumulator.value

从另一个数据源读取路径时，如何避免使用 collectAsList()？

How can I avoid a collectAsList() when path is read from another data source?

optimization

collect

apache-spark