Apache Spark NumberFormatException with content not existing in file
Executing my Spark application in local mode works perfectly fine, but running it on the cluster throws the following exception for the date field "yyyy-MM-dd hh:MM:ss":
15/02/05 16:56:04 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, kmobd-dnode2.qudosoft.de): java.lang.NumberFormatException: For input string: ".1244E.1244E22"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at java.text.DigitList.getDouble(DigitList.java:169)
at java.text.DecimalFormat.parse(DecimalFormat.java:2056)
at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
at java.text.DateFormat.parse(DateFormat.java:364)
at de.qudosoft.bd.econda.userjourneymapper.ClassifingMapper.call(ClassifingMapper.java:24)
at de.qudosoft.bd.econda.userjourneymapper.ClassifingMapper.call(ClassifingMapper.java:10)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun.apply(JavaPairRDD.scala:1002)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun.apply(JavaPairRDD.scala:1002)
at scala.collection.Iterator$$anon.next(Iterator.scala:328)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
What I don't understand is that the value ".1244E.1244E22" does not exist anywhere in my data. I am using Apache Spark 1.2.0 with Cloudera Manager CDH 5.3.0 and Hadoop 2.5.0.
Here is my pom.xml:
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.5.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>6.1.1</version>
<scope>test</scope>
</dependency>
</dependencies>
<properties>
<java.version>1.8</java.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.4.1</version>
<configuration>
<!-- get all project dependencies -->
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<!-- MainClass in manifest makes an executable jar -->
<archive>
<manifest>
<mainClass>de.qudosoft.bd.econda.userjourneymapper.Main</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<!-- bind to the packaging phase -->
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
Has anyone encountered a similar problem?
The problem is most likely that your parser is defined at the static/object instance level. The SimpleDateFormat class is not thread-safe, so its internal state gets corrupted by competing threads.
Try moving the parser construction down to the function level, right before use. It is not elegant or efficient, but it should prove the point.
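A minimal sketch of that fix, assuming the mapper's parsing step looks roughly like this (class and method names are hypothetical, and the pattern is written as "yyyy-MM-dd HH:mm:ss", since "hh:MM:ss" from the question would parse months where minutes are expected):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class PerCallDateParse {
    // Hypothetical stand-in for the mapper's parsing step: constructing
    // a fresh SimpleDateFormat inside the method means no instance is
    // ever shared between Spark's task threads.
    static Date parse(String s) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        return fmt.parse(s);
    }

    public static void main(String[] args) throws ParseException {
        Date d = parse("2015-02-05 16:56:04");
        // Round-trip through a fresh formatter to show the parse succeeded.
        System.out.println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(d));
        // prints 2015-02-05 16:56:04
    }
}
```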
You could also try putting a mutex around the parse call and see if that helps. Profile/test both approaches and see which works better for you.
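A sketch of the mutex variant, again with hypothetical names: one shared SimpleDateFormat, with every parse call serialized through a synchronized block.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LockedDateParse {
    // One shared formatter, guarded by a lock.
    private static final SimpleDateFormat FMT =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    static Date parse(String s) throws ParseException {
        synchronized (FMT) {  // serialize access to the shared instance
            return FMT.parse(s);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hammer the shared formatter from several threads; without the
        // synchronized block, concurrent calls corrupt FMT's internal state.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        AtomicInteger failures = new AtomicInteger();
        for (int i = 0; i < 1000; i++) {
            pool.submit(() -> {
                try {
                    parse("2015-02-05 16:56:04");
                } catch (Exception e) {
                    failures.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println("failures: " + failures.get()); // prints failures: 0
    }
}
```

A common middle ground in real code is a ThreadLocal&lt;SimpleDateFormat&gt;, which gives each task thread its own instance without per-call allocation or lock contention.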
Good luck!