Writing a Parquet file works in standalone mode, fails in multiple-worker mode
In Spark 1.6.1 (code in Scala 2.10) I am trying to write a DataFrame to a Parquet file:
import sc.implicits._
val triples = file.map(p => _parse(p, " ", true)).toDF()
triples.write.mode(SaveMode.Overwrite).parquet("hdfs://some.external.ip.address:9000/tmp/table.parquet")
When I execute it in development mode everything works fine. It also works if I set up one master and one worker in standalone mode in a Docker environment (separate Docker containers) on the same machine. It fails when I try to execute it on the cluster (1 master, 5 workers). It also works if I set it up locally on the master.
When I try to execute it, I get the following stack trace:
{
"duration": "18.716 secs",
"classPath": "LDFSparkLoaderJobTest2",
"startTime": "2016-07-18T11:41:03.299Z",
"context": "sql-context",
"result": {
"errorClass": "org.apache.spark.SparkException",
"cause": "Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, curry-n3): java.lang.NullPointerException
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.abortTask(WriterContainer.scala:294)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:271)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$$anonfun$apply$mcV$sp.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$$anonfun$apply$mcV$sp.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:",
"stack":[
"org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage.apply(DAGScheduler.scala:1419)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage.apply(DAGScheduler.scala:1418)",
"scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)",
"scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)",
"org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed.apply(DAGScheduler.scala:799)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed.apply(DAGScheduler.scala:799)",
"scala.Option.foreach(Option.scala:236)",
"org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)",
"org.apache.spark.util.EventLoop$$anon.run(EventLoop.scala:48)",
"org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run.apply(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run.apply(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)",
"org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)",
"org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:132)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:130)",
"org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)",
"org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)",
"org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)",
"org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)",
"org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)",
"org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)",
"org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)",
"org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)",
"LDFSparkLoaderJobTest2$.readFile(SparkLoaderJob.scala:55)",
"LDFSparkLoaderJobTest2$.runJob(SparkLoaderJob.scala:48)",
"LDFSparkLoaderJobTest2$.runJob(SparkLoaderJob.scala:18)",
"spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture.apply(JobManagerActor.scala:268)",
"scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1(Future.scala:24)",
"scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)",
"java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)",
"java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)",
"java.lang.Thread.run(Thread.java:745)"
],
"causingClass": "org.apache.spark.SparkException",
"message": "Job aborted."
},
"status": "ERROR",
"jobId": "54ad3056-3aaa-415f-8352-ca8c57e02fe9"
}
Notes:
- The job is submitted via Spark Jobserver.
- The file that needs to be converted to a Parquet file is 15.1 MB in size.
Questions:
- Am I doing something wrong (I followed the docs)?
- Or is there another way to create the Parquet file so that all my workers can access it?
I ran into not exactly the same, but similar issues writing DataFrames to Parquet files in cluster mode. Those problems went away when I deleted the output right before writing, using this convenience function write(..):
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame
..
def main(arg: Array[String]) {
  ..
  val fs = FileSystem.get(sc.hadoopConfiguration)
  ..
  // Delete any existing output at the target path, then write the DataFrame as Parquet.
  def write(df: DataFrame, fn: String) = {
    val op1 = s"hdfs:///user/you/$fn"
    fs.delete(new Path(op1), true) // recursive delete: the Parquet output path is a directory
    df.write.parquet(op1)
  }
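For illustration, here is a minimal sketch of how the helper above could be invoked; the call site and the file name "table.parquet" are hypothetical, and triples is the DataFrame from the question:
// Deletes hdfs:///user/you/table.parquet (if present), then writes the DataFrame there.
write(triples, "table.parquet")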
Give it a try and tell me whether it works for you...
In your standalone setup only one worker is working with the ParquetRecordWriter, so it works fine.
In the real test case, i.e. the cluster (1 master, 5 workers), the ParquetRecordWriter fails because you are writing from multiple workers at the same time...
Please try the following instead.
import sc.implicits._
val triples = file.map(p => _parse(p, " ", true)).toDF()
triples.write.mode(SaveMode.Append).parquet("hdfs://some.external.ip.address:9000/tmp/table.parquet")
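As a quick sanity check, a minimal sketch (assuming a sqlContext is available in the job, as it normally is with Spark Jobserver's sql-context) that reads the written directory back on the driver:
// Read the Parquet output back and count the rows to confirm the write landed on HDFS.
val written = sqlContext.read.parquet("hdfs://some.external.ip.address:9000/tmp/table.parquet")
println(s"rows written: ${written.count()}")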