Use RDD.foreach to Create a Dataframe and execute actions on the Dataframe in Spark scala

Question

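I am trying to read a config file with spark.read.textFile; it basically contains my list of tables. My task is to iterate through the table list and convert each table from Avro to ORC format. Please find my code snippet below, which performs the logic.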

val tableList = spark.read.textFile("tables.txt")
tableList.collect().foreach(tblName => {
  val df = spark.read.format("avro").load(inputPath + "/" + tblName)
  df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
})

Please find my configuration below:

DriverMemory: 4GB

ExecutorMemory: 10GB

NoOfExecutors: 5

Input DataSize: 45GB

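My question is: will this execute on the executors or on the driver? Will this throw an out-of-memory error? Please comment with your suggestions.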

Answer 1

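I would suggest removing the collect, since it is an action, so all the data from your 45 GB of files gets loaded into memory. You could try something like this: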

val tableList = spark.read.textFile("tables.txt")
tableList.foreach(tblName => {
  val df = spark.read.format("avro").load(inputPath + "/" + tblName)
  df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
})
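
A caveat on the snippet above, as far as I understand Spark's execution model: the function passed to Dataset.foreach is serialized and run on the executors, where the SparkSession is not usable, so calling spark.read inside it typically fails at runtime (often with a NullPointerException). If the goal is only to avoid building an array with collect(), a minimal sketch of a driver-side alternative is to walk the list with toLocalIterator(), which fetches one partition of table names at a time. The object name, the cleanName helper, and the parameter names below are hypothetical, and reading "avro" assumes the external spark-avro module is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object ConvertWithLocalIterator {
  // Hypothetical helper: trims a line and drops it if it is blank.
  def cleanName(line: String): Option[String] =
    Option(line.trim).filter(_.nonEmpty)

  def convertAll(spark: SparkSession, listFile: String,
                 inputPath: String, outputPath: String): Unit = {
    // toLocalIterator() returns a java.util.Iterator that is consumed on the driver.
    val it = spark.read.textFile(listFile).toLocalIterator()
    while (it.hasNext) {
      cleanName(it.next()).foreach { tblName =>
        // Still driver-side code, so the SparkSession is available here.
        val df = spark.read.format("avro").load(inputPath + "/" + tblName)
        // The actual read and write work is distributed to the executors.
        df.write.format("orc").mode("overwrite").save(outputPath + "/" + tblName)
      }
    }
  }
}
```

For a tables.txt that holds only a handful of names, plain collect() is just as reasonable; the iterator only matters if the list itself is very large.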

Answer 2

Re:

will this execute in Executor or Driver?

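After tableList.collect() is called, the contents of 'tables.txt' are brought to the driver application. As long as they fit in driver memory, that should be fine. However, the save operation on the Dataframe is executed on the executors.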
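
To make the split concrete, here is a sketch of the same loop as a standalone application, with comments marking where each step runs. The object name, the tablePath helper, and the two directory values are assumptions for illustration, not part of the original question; the "avro" format also assumes the external spark-avro module is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object AvroToOrc {
  // Hypothetical helper: joins a base directory and a table name into one path.
  def tablePath(base: String, tblName: String): String =
    s"${base.stripSuffix("/")}/${tblName.trim}"

  def main(args: Array[String]): Unit = {
    // Hypothetical locations; replace with the real input and output roots.
    val inputPath  = "/data/avro"
    val outputPath = "/data/orc"

    val spark = SparkSession.builder().appName("AvroToOrc").getOrCreate()

    // Driver: collect() brings only the table names into driver memory.
    val tableNames: Array[String] =
      spark.read.textFile("tables.txt").collect().map(_.trim).filter(_.nonEmpty)

    // Driver: an ordinary Scala loop over a local array, one table at a time.
    tableNames.foreach { tblName =>
      // The driver plans the job; executors read the Avro files in partitions.
      val df = spark.read.format("avro").load(tablePath(inputPath, tblName))
      // Executors write the ORC files; rows are not pulled back to the driver.
      df.write.format("orc").mode("overwrite").save(tablePath(outputPath, tblName))
    }

    spark.stop()
  }
}
```

Only the array of names lives in driver memory here, so the 4 GB driver setting is stressed by the size of tables.txt, not by the size of the tables.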

Re:

This will throw Out of Memory Error ?

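Have you actually run into one? IMO, unless your tables.txt is too large, you should be alright. I am assuming the 45 GB input data size refers to the data in the tables listed in tables.txt.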

Hope this helps.

out-of-memory

executor

apache-spark

apache-spark-sql