Will calling dataframe.distinct() result in shuffling the contents to the driver for a final distinction?

Question

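I have the following code, which reads some JSON files, deduplicates them, and writes the output to a single JSON file. My question is whether I should call .collect() after .distinct(), or whether that happens behind the scenes.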

val manyJsons = sqlContext.read.json(someJsonDirectory)
val distinctJsons = manyJsons.distinct()
distinctJsons.coalesce(1).write.json(jsonDirectoryWithOneFile)

Answer 1

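If you are writing to a file on disk, you do not need .collect().

.distinct() shuffles the data between executors to find and remove duplicates. The deduplication itself does not pass through the driver.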
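One way to check where the shuffle happens is to print the physical plan. The following is a minimal sketch, assuming a local SparkSession and a small made-up dataset (the object name, column names, and rows are illustrative and not from the question). The shuffle should show up in the plan as an Exchange step between a partial and a final aggregation, both of which run in executor tasks.

```scala
import org.apache.spark.sql.SparkSession

object DistinctPlan {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption); on a cluster the
    // session is normally provided by the environment.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("distinct-plan")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data containing one duplicate row.
    val df = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("key", "value")
    val deduped = df.distinct()

    // Prints the physical plan. distinct() is planned as an aggregation
    // over all columns, with a shuffle (Exchange) between its two phases.
    deduped.explain()

    // count() is an action: only the resulting number is returned to the driver.
    println(deduped.count()) // 2

    spark.stop()
  }
}
```

Nothing in this plan collects rows on the driver; rows reach the driver only if an action such as .collect() or .show() asks for them.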

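In your code, .coalesce(1) moves all partitions to a single node before the file is written. This is similar to .collect(), with one difference: .collect() always moves all partitions to the driver node, whereas .coalesce(1) gathers them on whichever node runs the single remaining task, which may or may not be the driver node. The purpose of .coalesce(1) here is to create a single partition so that there is only one output file.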
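The difference between the two calls can be sketched as follows, again assuming a local SparkSession, made-up rows, and a temporary output path (all hypothetical, not from the question). .coalesce(1) returns another DataFrame with one partition that is still written by a task, while .collect() returns an Array[Row] held in driver memory.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CoalesceVsCollect {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("coalesce-vs-collect")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data containing one duplicate row.
    val deduped = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("key", "value").distinct()

    // coalesce(1) is a transformation: the result is still a distributed
    // DataFrame, now with a single partition.
    val single = deduped.coalesce(1)
    println(single.rdd.getNumPartitions) // 1

    // Hypothetical output location; the path must not exist yet because the
    // default save mode fails on existing paths.
    val out = Files.createTempDirectory("distinct-json").resolve("out").toString
    single.write.json(out) // written by one task, so one part file

    // collect() is an action: every row is copied into the driver's memory.
    val rows = deduped.collect()
    println(rows.length) // 2

    spark.stop()
  }
}
```

Note that the path passed to write.json is a directory; with one partition it should contain a single part file alongside Spark's marker files.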

scala

dataframe

apache-spark

databricks