示例 spark CSV 和 JSON 程序不是 运行 in windows

sample spark CSV and JSON program not running in windows

我在Windows10台机器上运行ning spark程序。

我正在尝试 运行 下面的 spark 程序

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark.sql.SQLImplicits
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import com.databricks.spark.csv

object json1 {

  def main(args : Array[String]){
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new org.apache.spark.SparkContext(conf)
    val sqlc = new org.apache.spark.sql.SQLContext(sc)

    val NyseDF = sqlc.load("com.databricks.spark.csv",Map("path" -> args(0),"header"->"true"))
    NyseDF.registerTempTable("NYSE")
    NyseDF.printSchema()

  }



}

当我 运行 程序通过 运行 eclispse 中的应用程序模式并传递参数时

作为

src/test/resources/demo.text

失败并出现以下错误。

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/10/10 11:02:18 INFO SparkContext: Running Spark version 1.6.0
16/10/10 11:02:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/10 11:02:18 INFO SecurityManager: Changing view acls to: subho
16/10/10 11:02:18 INFO SecurityManager: Changing modify acls to: subho
16/10/10 11:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subho); users with modify permissions: Set(subho)
16/10/10 11:02:19 INFO Utils: Successfully started service 'sparkDriver' on port 61108.
16/10/10 11:02:20 INFO Slf4jLogger: Slf4jLogger started
16/10/10 11:02:20 INFO Remoting: Starting remoting
16/10/10 11:02:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.116:61121]
16/10/10 11:02:20 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 61121.
16/10/10 11:02:20 INFO SparkEnv: Registering MapOutputTracker
16/10/10 11:02:20 INFO SparkEnv: Registering BlockManagerMaster
16/10/10 11:02:21 INFO DiskBlockManager: Created local directory at C:\Users\subho\AppData\Local\Temp\blockmgr-69afda02-ccd1-41d1-aa25-830ba366a75c
16/10/10 11:02:21 INFO MemoryStore: MemoryStore started with capacity 1128.4 MB
16/10/10 11:02:21 INFO SparkEnv: Registering OutputCommitCoordinator
16/10/10 11:02:21 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/10/10 11:02:21 INFO SparkUI: Started SparkUI at http://192.168.1.116:4040
16/10/10 11:02:21 INFO Executor: Starting executor ID driver on host localhost
16/10/10 11:02:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61132.
16/10/10 11:02:21 INFO NettyBlockTransferService: Server created on 61132
16/10/10 11:02:21 INFO BlockManagerMaster: Trying to register BlockManager
16/10/10 11:02:21 INFO BlockManagerMasterEndpoint: Registering block manager localhost:61132 with 1128.4 MB RAM, BlockManagerId(driver, localhost, 61132)
16/10/10 11:02:21 INFO BlockManagerMaster: Registered BlockManager
16/10/10 11:02:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 107.7 KB, free 107.7 KB)
16/10/10 11:02:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.8 KB, free 117.5 KB)
16/10/10 11:02:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61132 (size: 9.8 KB, free: 1128.4 MB)
16/10/10 11:02:23 INFO SparkContext: Created broadcast 0 from textFile at TextFile.scala:30
16/10/10 11:02:23 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$$anonfun.apply(SparkContext.scala:1015)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$$anonfun.apply(SparkContext.scala:1015)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf.apply(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf.apply(HadoopRDD.scala:176)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.RDD$$anonfun$take.apply(RDD.scala:1293)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1288)
    at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:174)
    at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:169)
    at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:147)
    at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:70)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:138)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
    at json1$.main(json1.scala:22)
    at json1.main(json1.scala)
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/demo.text
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.RDD$$anonfun$take.apply(RDD.scala:1293)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1288)
    at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:174)
    at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:169)
    at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:147)
    at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:70)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:138)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
    at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
    at json1$.main(json1.scala:22)
    at json1.main(json1.scala)
16/10/10 11:02:23 INFO SparkContext: Invoking stop() from shutdown hook
16/10/10 11:02:23 INFO SparkUI: Stopped Spark web UI at http://192.168.1.116:4040
16/10/10 11:02:23 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/10 11:02:23 INFO MemoryStore: MemoryStore cleared
16/10/10 11:02:23 INFO BlockManager: BlockManager stopped
16/10/10 11:02:24 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/10 11:02:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/10 11:02:24 INFO SparkContext: Successfully stopped SparkContext
16/10/10 11:02:24 INFO ShutdownHookManager: Shutdown hook called
16/10/10 11:02:24 INFO ShutdownHookManager: Deleting directory C:\Users\subho\AppData\Local\Temp\spark-7f53ea20-a38c-46d5-8476-a1ae040736ac

以下是主要错误消息

输入路径不存在:file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/demo.text

我在下面的位置有文件。

!]1

当我运行下面的程序运行成功时,

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark.sql.SQLImplicits
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import com.databricks.spark.csv

object json1 {

  def main(args : Array[String]){
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new org.apache.spark.SparkContext(conf)
    val sqlc = new org.apache.spark.sql.SQLContext(sc)

   /* val NyseDF = sqlc.load("com.databricks.spark.csv",Map("path" -> args(0),"header"->"true"))
    NyseDF.registerTempTable("NYSE")
    NyseDF.printSchema()

    print(sqlc.sql("select distinct(symbol) from NYSE").collect().toList)*/

    val PersonDF = sqlc.jsonFile("src/test/resources/Person.json")

   // PersonDF.printSchema()

    PersonDF.registerTempTable("Person")

    sqlc.sql("select * from Person where age < 60").collect().foreach(print)

  }

下面是日志文件。

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/10/10 11:54:12 INFO SparkContext: Running Spark version 1.6.0
16/10/10 11:54:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/10 11:54:13 INFO SecurityManager: Changing view acls to: subho
16/10/10 11:54:13 INFO SecurityManager: Changing modify acls to: subho
16/10/10 11:54:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subho); users with modify permissions: Set(subho)
16/10/10 11:54:14 INFO Utils: Successfully started service 'sparkDriver' on port 51113.
16/10/10 11:54:14 INFO Slf4jLogger: Slf4jLogger started
16/10/10 11:54:14 INFO Remoting: Starting remoting
16/10/10 11:54:15 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.116:51126]
16/10/10 11:54:15 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 51126.
16/10/10 11:54:15 INFO SparkEnv: Registering MapOutputTracker
16/10/10 11:54:15 INFO SparkEnv: Registering BlockManagerMaster
16/10/10 11:54:15 INFO DiskBlockManager: Created local directory at C:\Users\subho\AppData\Local\Temp\blockmgr-a52a5d5a-075b-4859-8434-935fdaba8538
16/10/10 11:54:15 INFO MemoryStore: MemoryStore started with capacity 1128.4 MB
16/10/10 11:54:15 INFO SparkEnv: Registering OutputCommitCoordinator
16/10/10 11:54:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/10/10 11:54:15 INFO SparkUI: Started SparkUI at http://192.168.1.116:4040
16/10/10 11:54:15 INFO Executor: Starting executor ID driver on host localhost
16/10/10 11:54:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51137.
16/10/10 11:54:15 INFO NettyBlockTransferService: Server created on 51137
16/10/10 11:54:15 INFO BlockManagerMaster: Trying to register BlockManager
16/10/10 11:54:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:51137 with 1128.4 MB RAM, BlockManagerId(driver, localhost, 51137)
16/10/10 11:54:15 INFO BlockManagerMaster: Registered BlockManager
16/10/10 11:54:17 INFO JSONRelation: Listing file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json on driver
16/10/10 11:54:17 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
    at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:447)
    at org.apache.spark.sql.execution.datasources.json.JSONRelation.org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98)
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$$anonfun$apply.apply(JSONRelation.scala:115)
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$$anonfun$apply.apply(JSONRelation.scala:115)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun.apply(JSONRelation.scala:115)
    at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun.apply(JSONRelation.scala:109)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109)
    at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108)
    at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
    at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
    at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
    at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
    at org.apache.spark.sql.SQLContext.jsonFile(SQLContext.scala:1011)
    at json1$.main(json1.scala:28)
    at json1.main(json1.scala)
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 128.0 KB, free 128.0 KB)
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.1 KB, free 142.1 KB)
16/10/10 11:54:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:51137 (size: 14.1 KB, free: 1128.4 MB)
16/10/10 11:54:18 INFO SparkContext: Created broadcast 0 from jsonFile at json1.scala:28
16/10/10 11:54:18 INFO FileInputFormat: Total input paths to process : 1
16/10/10 11:54:18 INFO SparkContext: Starting job: jsonFile at json1.scala:28
16/10/10 11:54:18 INFO DAGScheduler: Got job 0 (jsonFile at json1.scala:28) with 2 output partitions
16/10/10 11:54:18 INFO DAGScheduler: Final stage: ResultStage 0 (jsonFile at json1.scala:28)
16/10/10 11:54:18 INFO DAGScheduler: Parents of final stage: List()
16/10/10 11:54:18 INFO DAGScheduler: Missing parents: List()
16/10/10 11:54:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at jsonFile at json1.scala:28), which has no missing parents
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KB, free 146.3 KB)
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 148.6 KB)
16/10/10 11:54:18 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:51137 (size: 2.4 KB, free: 1128.4 MB)
16/10/10 11:54:18 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/10/10 11:54:18 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at jsonFile at json1.scala:28)
16/10/10 11:54:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/10/10 11:54:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:18 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:18 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/10/10 11:54:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/10/10 11:54:18 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:0+92
16/10/10 11:54:18 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:92+93
16/10/10 11:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/10/10 11:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/10/10 11:54:18 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/10/10 11:54:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/10/10 11:54:18 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/10/10 11:54:18 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/10/10 11:54:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2886 bytes result sent to driver
16/10/10 11:54:19 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2886 bytes result sent to driver
16/10/10 11:54:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1287 ms on localhost (1/2)
16/10/10 11:54:19 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1264 ms on localhost (2/2)
16/10/10 11:54:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/10/10 11:54:19 INFO DAGScheduler: ResultStage 0 (jsonFile at json1.scala:28) finished in 1.314 s
16/10/10 11:54:19 INFO DAGScheduler: Job 0 finished: jsonFile at json1.scala:28, took 1.413653 s
16/10/10 11:54:20 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:51137 in memory (size: 2.4 KB, free: 1128.4 MB)
16/10/10 11:54:20 INFO ContextCleaner: Cleaned accumulator 1
16/10/10 11:54:20 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:51137 in memory (size: 14.1 KB, free: 1128.4 MB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 59.6 KB, free 59.6 KB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 13.8 KB, free 73.3 KB)
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:51137 (size: 13.8 KB, free: 1128.4 MB)
16/10/10 11:54:21 INFO SparkContext: Created broadcast 2 from collect at json1.scala:34
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 128.0 KB, free 201.3 KB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 14.1 KB, free 215.4 KB)
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:51137 (size: 14.1 KB, free: 1128.3 MB)
16/10/10 11:54:21 INFO SparkContext: Created broadcast 3 from collect at json1.scala:34
16/10/10 11:54:21 INFO FileInputFormat: Total input paths to process : 1
16/10/10 11:54:21 INFO SparkContext: Starting job: collect at json1.scala:34
16/10/10 11:54:21 INFO DAGScheduler: Got job 1 (collect at json1.scala:34) with 2 output partitions
16/10/10 11:54:21 INFO DAGScheduler: Final stage: ResultStage 1 (collect at json1.scala:34)
16/10/10 11:54:21 INFO DAGScheduler: Parents of final stage: List()
16/10/10 11:54:21 INFO DAGScheduler: Missing parents: List()
16/10/10 11:54:21 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[9] at collect at json1.scala:34), which has no missing parents
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 7.6 KB, free 223.0 KB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.1 KB, free 227.1 KB)
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:51137 (size: 4.1 KB, free: 1128.3 MB)
16/10/10 11:54:21 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
16/10/10 11:54:21 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[9] at collect at json1.scala:34)
16/10/10 11:54:21 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/10/10 11:54:21 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:21 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:21 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
16/10/10 11:54:21 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
16/10/10 11:54:21 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:92+93
16/10/10 11:54:21 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:0+92
16/10/10 11:54:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:51137 in memory (size: 13.8 KB, free: 1128.4 MB)
16/10/10 11:54:22 INFO GenerateUnsafeProjection: Code generated in 548.352258 ms
16/10/10 11:54:22 INFO GeneratePredicate: Code generated in 5.245214 ms
16/10/10 11:54:22 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 2283 bytes result sent to driver
16/10/10 11:54:22 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2536 bytes result sent to driver
16/10/10 11:54:22 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 755 ms on localhost (1/2)
16/10/10 11:54:22 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 759 ms on localhost (2/2)
16/10/10 11:54:22 INFO DAGScheduler: ResultStage 1 (collect at json1.scala:34) finished in 0.760 s
16/10/10 11:54:22 INFO DAGScheduler: Job 1 finished: collect at json1.scala:34, took 0.779652 s
16/10/10 11:54:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
[53,Barack,Obama]16/10/10 11:54:22 INFO SparkContext: Invoking stop() from shutdown hook
16/10/10 11:54:22 INFO SparkUI: Stopped Spark web UI at http://192.168.1.116:4040
16/10/10 11:54:22 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/10 11:54:22 INFO MemoryStore: MemoryStore cleared
16/10/10 11:54:22 INFO BlockManager: BlockManager stopped
16/10/10 11:54:22 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/10 11:54:22 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/10 11:54:22 INFO SparkContext: Successfully stopped SparkContext
16/10/10 11:54:22 INFO ShutdownHookManager: Shutdown hook called
16/10/10 11:54:22 INFO ShutdownHookManager: Deleting directory C:\Users\subho\AppData\Local\Temp\spark-6cab6329-83f1-4af4-b64c-c869550405a4
16/10/10 11:54:22 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

感谢和问候,

堆栈跟踪的重要部分在这里:

ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

一种可能是下载 winutils.exe(例如从 here),放入名为 bin 的文件夹(作为子目录或您的主目录,例如 C:\Users\XXX\bin\winutils.exe) 然后在代码的开头添加这一行:

System.setProperty("hadoop.home.dir", raw"C:\Users\XXX\")