Spark: importing an Apache library (math)

I am trying to run a simple Spark application.

This is my Scala file:

/* SimpleApp.scala */
import org.apache.spark.SparkContext

import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.commons.math3.random.RandomDataGenerator


object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/donbeo/Applications/spark/spark-1.1.0/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))


    println("A random number")

    val randomData = new RandomDataGenerator()

    println(randomData.nextLong(0, 100))
  }
}

This is my sbt file:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"

libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"

When I try to run the code, I get this error:

donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Applications/spark/spark-1.1.0$ ./bin/spark-submit  --class "SimpleApp"  --master local[4]  /home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/04 17:42:41 WARN Utils: Your hostname, donbeo-HP-EliteBook-Folio-9470m resolves to a loopback address: 127.0.1.1; using 192.168.1.45 instead (on interface wlan0)
15/02/04 17:42:41 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/02/04 17:42:41 INFO SecurityManager: Changing view acls to: donbeo,
15/02/04 17:42:41 INFO SecurityManager: Changing modify acls to: donbeo,
15/02/04 17:42:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(donbeo, ); users with modify permissions: Set(donbeo, )
15/02/04 17:42:42 INFO Slf4jLogger: Slf4jLogger started
15/02/04 17:42:42 INFO Remoting: Starting remoting
15/02/04 17:42:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.45:45935]
15/02/04 17:42:42 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.1.45:45935]
15/02/04 17:42:42 INFO Utils: Successfully started service 'sparkDriver' on port 45935.
15/02/04 17:42:42 INFO SparkEnv: Registering MapOutputTracker
15/02/04 17:42:42 INFO SparkEnv: Registering BlockManagerMaster
15/02/04 17:42:42 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150204174242-bbb1
15/02/04 17:42:42 INFO Utils: Successfully started service 'Connection manager for block manager' on port 55674.
15/02/04 17:42:42 INFO ConnectionManager: Bound socket to port 55674 with id = ConnectionManagerId(192.168.1.45,55674)
15/02/04 17:42:42 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/02/04 17:42:42 INFO BlockManagerMaster: Trying to register BlockManager
15/02/04 17:42:42 INFO BlockManagerMasterActor: Registering block manager 192.168.1.45:55674 with 265.4 MB RAM
15/02/04 17:42:42 INFO BlockManagerMaster: Registered BlockManager
15/02/04 17:42:42 INFO HttpFileServer: HTTP File server directory is /tmp/spark-49443053-833e-4596-9073-d74075483d35
15/02/04 17:42:42 INFO HttpServer: Starting HTTP Server
15/02/04 17:42:42 INFO Utils: Successfully started service 'HTTP file server' on port 41309.
15/02/04 17:42:42 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/02/04 17:42:42 INFO SparkUI: Started SparkUI at http://192.168.1.45:4040
15/02/04 17:42:42 INFO SparkContext: Added JAR file:/home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar at http://192.168.1.45:41309/jars/simple-project_2.10-1.0.jar with timestamp 1423071762914
15/02/04 17:42:42 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.45:45935/user/HeartbeatReceiver
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(32768) called with curMem=0, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 32.0 KB, free 265.4 MB)
15/02/04 17:42:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/04 17:42:43 WARN LoadSnappy: Snappy native library not loaded
15/02/04 17:42:43 INFO FileInputFormat: Total input paths to process : 1
15/02/04 17:42:43 INFO SparkContext: Starting job: count at SimpleApp.scala:13
15/02/04 17:42:43 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:13) with 2 output partitions (allowLocal=false)
15/02/04 17:42:43 INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:13)
15/02/04 17:42:43 INFO DAGScheduler: Parents of final stage: List()
15/02/04 17:42:43 INFO DAGScheduler: Missing parents: List()
15/02/04 17:42:43 INFO DAGScheduler: Submitting Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:13), which has no missing parents
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(2616) called with curMem=32768, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.6 KB, free 265.4 MB)
15/02/04 17:42:43 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:13)
15/02/04 17:42:43 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/02/04 17:42:43 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1283 bytes)
15/02/04 17:42:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1283 bytes)
15/02/04 17:42:43 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/02/04 17:42:43 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/04 17:42:43 INFO Executor: Fetching http://192.168.1.45:41309/jars/simple-project_2.10-1.0.jar with timestamp 1423071762914
15/02/04 17:42:43 INFO Utils: Fetching http://192.168.1.45:41309/jars/simple-project_2.10-1.0.jar to /tmp/fetchFileTemp3120003338190168194.tmp
15/02/04 17:42:43 INFO Executor: Adding file:/tmp/spark-ec5e14c2-9e58-4132-a4c9-2569d237a407/simple-project_2.10-1.0.jar to class loader
15/02/04 17:42:43 INFO CacheManager: Partition rdd_1_0 not found, computing it
15/02/04 17:42:43 INFO CacheManager: Partition rdd_1_1 not found, computing it
15/02/04 17:42:43 INFO HadoopRDD: Input split: file:/home/donbeo/Applications/spark/spark-1.1.0/README.md:0+2405
15/02/04 17:42:43 INFO HadoopRDD: Input split: file:/home/donbeo/Applications/spark/spark-1.1.0/README.md:2405+2406
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(7512) called with curMem=35384, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 7.3 KB, free 265.4 MB)
15/02/04 17:42:43 INFO BlockManagerInfo: Added rdd_1_1 in memory on 192.168.1.45:55674 (size: 7.3 KB, free: 265.4 MB)
15/02/04 17:42:43 INFO BlockManagerMaster: Updated info of block rdd_1_1
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(8352) called with curMem=42896, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 8.2 KB, free 265.4 MB)
15/02/04 17:42:43 INFO BlockManagerInfo: Added rdd_1_0 in memory on 192.168.1.45:55674 (size: 8.2 KB, free: 265.4 MB)
15/02/04 17:42:43 INFO BlockManagerMaster: Updated info of block rdd_1_0
15/02/04 17:42:43 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2300 bytes result sent to driver
15/02/04 17:42:43 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2300 bytes result sent to driver
15/02/04 17:42:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 179 ms on localhost (1/2)
15/02/04 17:42:43 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 176 ms on localhost (2/2)
15/02/04 17:42:43 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
15/02/04 17:42:43 INFO DAGScheduler: Stage 0 (count at SimpleApp.scala:13) finished in 0.198 s
15/02/04 17:42:43 INFO SparkContext: Job finished: count at SimpleApp.scala:13, took 0.292364402 s
15/02/04 17:42:43 INFO SparkContext: Starting job: count at SimpleApp.scala:14
15/02/04 17:42:43 INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:14) with 2 output partitions (allowLocal=false)
15/02/04 17:42:43 INFO DAGScheduler: Final stage: Stage 1(count at SimpleApp.scala:14)
15/02/04 17:42:43 INFO DAGScheduler: Parents of final stage: List()
15/02/04 17:42:43 INFO DAGScheduler: Missing parents: List()
15/02/04 17:42:43 INFO DAGScheduler: Submitting Stage 1 (FilteredRDD[3] at filter at SimpleApp.scala:14), which has no missing parents
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(2616) called with curMem=51248, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 265.4 MB)
15/02/04 17:42:43 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (FilteredRDD[3] at filter at SimpleApp.scala:14)
15/02/04 17:42:43 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/02/04 17:42:43 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, ANY, 1283 bytes)
15/02/04 17:42:43 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, ANY, 1283 bytes)
15/02/04 17:42:43 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
15/02/04 17:42:43 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
15/02/04 17:42:43 INFO BlockManager: Found block rdd_1_1 locally
15/02/04 17:42:43 INFO BlockManager: Found block rdd_1_0 locally
15/02/04 17:42:43 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1731 bytes result sent to driver
15/02/04 17:42:43 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1731 bytes result sent to driver
15/02/04 17:42:43 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 14 ms on localhost (1/2)
15/02/04 17:42:43 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 17 ms on localhost (2/2)
15/02/04 17:42:43 INFO DAGScheduler: Stage 1 (count at SimpleApp.scala:14) finished in 0.017 s
15/02/04 17:42:43 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
15/02/04 17:42:43 INFO SparkContext: Job finished: count at SimpleApp.scala:14, took 0.034833058 s
Lines with a: 83, Lines with b: 38
A random number
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomDataGenerator
    at SimpleApp$.main(SimpleApp.scala:20)
    at SimpleApp.main(SimpleApp.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.math3.random.RandomDataGenerator
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 9 more
donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Applications/spark/spark-1.1.0$ 

I think I am doing something wrong when importing the math3 library.

How I installed Spark and built the project is explained in detail here.

You need to pass the path of the commons-math3 jar to spark-submit with the --jars option. The jar produced by sbt package contains only your own classes: commons-math3 is on the compile classpath, but it is not bundled into your jar and Spark does not ship it to the executors for you, hence the NoClassDefFoundError at runtime.

./bin/spark-submit --class "SimpleApp"                         \
                   --master local[4]                           \
                   --jars <specify-path-of-commons-math3-jar>  \
                   /home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
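
For example, sbt caches downloaded dependencies under ~/.ivy2/cache by default, so the call would typically look like the sketch below (the jar path is an assumption based on sbt's default ivy cache layout; point it at wherever commons-math3-3.3.jar actually lives on your machine):

# the --jars path below assumes sbt's default ivy cache location
./bin/spark-submit --class "SimpleApp"  \
                   --master local[4]    \
                   --jars ~/.ivy2/cache/org.apache.commons/commons-math3/jars/commons-math3-3.3.jar \
                   /home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar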

Alternatively, you can build an assembly jar that contains all of the dependencies.

EDIT: how to build the assembly jar:

In the file build.sbt:

import sbtassembly.Plugin._

import AssemblyKeys._

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

// spark-core is marked "provided": spark-submit supplies the Spark classes at
// runtime, so they are left out of the assembly jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"

libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"

// This statement includes the assembly plugin capabilities
assemblySettings

// Configure the jar name used by the assembly plug-in
jarName in assembly := "simple-app-assembly.jar"

// A special option to exclude Scala itself from our assembly jar, since Spark
// already bundles Scala.
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

In the file project/assembly.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

Then build the assembly jar as follows:

sbt assembly
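
Once the build finishes, the assembly jar lands under target/scala-2.10/ with the name configured above (this sketch assumes sbt-assembly's default output directory), and you can submit it without the --jars flag, since commons-math3 is already packaged inside:

# simple-app-assembly.jar is the name set via jarName in build.sbt above
./bin/spark-submit --class "SimpleApp" \
                   --master local[4]   \
                   target/scala-2.10/simple-app-assembly.jar

To double-check that the math3 classes were bundled, you can list the jar contents with: jar tf target/scala-2.10/simple-app-assembly.jar | grep math3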