Apache Arrow - adequacy for parallel processing

I have a huge dataset and am using Apache Spark for data processing.

Using Apache Arrow, we can convert a Spark DataFrame to a Pandas DataFrame and run operations on it.

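After converting the DataFrame, will operations on it still get Spark's parallel processing performance, or will it behave like plain Pandas?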

As you can see in the documentation here:

Note that even with Arrow, toPandas() results in the collection of all records in the DataFrame to the driver program and should be done on a small subset of the data

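When data is moved to a Pandas DataFrame, all of it is sent to the driver. This means you may run into performance (or memory) problems if the driver has too much data to handle. So if you decide to use Pandas, try to group or aggregate the data before calling the toPandas() method.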
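The "aggregate first, then call toPandas()" advice could look roughly like the sketch below. The function name `summarize_then_collect` and its column parameters are hypothetical, chosen only for illustration; `groupBy`, `agg` and `toPandas` are the regular PySpark DataFrame methods.

```python
def summarize_then_collect(sdf, key_col, value_col):
    """Aggregate on the executors, then collect only the summary to the driver.

    `sdf` is assumed to be a pyspark.sql.DataFrame; the name and parameters
    of this helper are illustrative, not part of any library.
    """
    # groupBy/agg are Spark transformations: the heavy work is planned here
    # and executed in parallel on the executors once an action is triggered.
    summary = sdf.groupBy(key_col).agg({value_col: "sum"})

    # toPandas() is the action that pulls everything that is left to the
    # driver, so it is called only after the data has shrunk to one row per key.
    return summary.toPandas()
```

Called as `summarize_then_collect(df, "country", "amount")`, this should return a small Pandas DataFrame with one row per country. Arrow only speeds up the transfer itself; on Spark 3.x it is switched on with the `spark.sql.execution.arrow.pyspark.enabled` setting (older releases used `spark.sql.execution.arrow.enabled`).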

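Once converted to a Pandas DataFrame, the data no longer gets the same parallelization, because the Spark executors are not involved in that scenario: everything runs in the driver process. The beauty of Arrow is that it lets you move efficiently from a Spark DataFrame straight to Pandas, but you still have to consider the size of the data.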

Another possibility is to use a different framework such as Koalas. It offers some of the "goodies" of Pandas but is integrated into Spark.
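As a rough sketch of that option: Koalas was later merged into PySpark itself as the pandas API on Spark, and the example below assumes a recent Spark release (3.2 or later, which is an assumption about your environment) where a Spark DataFrame exposes `pandas_api()`. With the standalone Koalas package the equivalent conversion would be `to_koalas()`. The helper name `mean_per_group` is hypothetical.

```python
def mean_per_group(sdf, key_col, value_col):
    """Use pandas-style syntax while Spark keeps doing the distributed work.

    `sdf` is assumed to be a pyspark.sql.DataFrame on a Spark version that
    provides DataFrame.pandas_api(); this helper is illustrative only.
    """
    # Convert to a pandas-on-Spark DataFrame; the data stays on the cluster.
    psdf = sdf.pandas_api()

    # Pandas-like groupby/mean, executed by Spark rather than on the driver.
    means = psdf.groupby(key_col)[value_col].mean()

    # Only the small per-group result is collected into a real Pandas object.
    return means.to_pandas()
```

The design point is the same as before: the large data stays distributed, and only the final, small result is brought to the driver.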


pandas

apache-spark

apache-arrow