Apache Spark NumberFormatException for a value that does not exist in the file

Running my Spark application in local mode works fine, but running it on the cluster fails while parsing a date field with the format "yyyy-MM-dd hh:MM:ss", throwing the following exception:

15/02/05 16:56:04 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, kmobd-dnode2.qudosoft.de): java.lang.NumberFormatException: For input string: ".1244E.1244E22"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at java.text.DigitList.getDouble(DigitList.java:169)
    at java.text.DecimalFormat.parse(DecimalFormat.java:2056)
    at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2162)
    at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514)
    at java.text.DateFormat.parse(DateFormat.java:364)
    at de.qudosoft.bd.econda.userjourneymapper.ClassifingMapper.call(ClassifingMapper.java:24)
    at de.qudosoft.bd.econda.userjourneymapper.ClassifingMapper.call(ClassifingMapper.java:10)
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun.apply(JavaPairRDD.scala:1002)
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun.apply(JavaPairRDD.scala:1002)
    at scala.collection.Iterator$$anon.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon.hasNext(Iterator.scala:327)
    at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

What I don't understand is that the value ".1244E.1244E22" does not exist anywhere in my data. I am using Apache Spark 1.2.0 with Cloudera Manager CDH 5.3.0 and Hadoop 2.5.0.

Here is my pom.xml:

    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.5.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>6.1.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>

<properties>
    <java.version>1.8</java.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>${java.version}</source>
                <target>${java.version}</target>
            </configuration>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4.1</version>
            <configuration>
                <!-- get all project dependencies -->
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <!-- MainClass in manifest makes an executable jar -->
                <archive>
                    <manifest>
                        <mainClass>de.qudosoft.bd.econda.userjourneymapper.Main</mainClass>
                    </manifest>
                </archive>

            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <!-- bind to the packaging phase -->
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>

    </plugins>
</build>

Has anyone encountered a similar problem?

The problem is most likely that your parser is defined at the static/object instance level. The SimpleDateFormat class is not thread-safe, so its state gets corrupted by competing threads.

Try moving the construction of your parser down to the function level, right before it is used. It is not as elegant or efficient, but it should prove the point.
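
For illustration, a minimal sketch of what that could look like, assuming the mapper is a Spark PairFunction that parses a timestamp out of a delimited line (the field layout, tuple types, and split character here are made up, not your actual ClassifingMapper code):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.spark.api.java.function.PairFunction;

    import scala.Tuple2;

    public class ClassifingMapper implements PairFunction<String, String, Long> {

        @Override
        public Tuple2<String, Long> call(String line) throws Exception {
            // Construct the formatter per invocation so no state is shared
            // between the executor's task threads (SimpleDateFormat is not
            // thread-safe). Format string as quoted in the question.
            SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd hh:MM:ss");

            String[] fields = line.split(";");        // illustrative record layout
            Date timestamp = format.parse(fields[0]); // the call that failed before
            return new Tuple2<>(fields[1], timestamp.getTime());
        }
    }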

You could also try making the parse call mutually exclusive and see whether that helps. Profile/test both approaches and see which works better for you.
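
If you would rather keep a single shared formatter, a sketch of the mutual-exclusion idea could look like this (the TimestampParser helper, the static field, and the locking granularity are just one possible shape, not code from the question):

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public final class TimestampParser {

        // One shared, non-thread-safe formatter...
        private static final SimpleDateFormat FORMAT =
                new SimpleDateFormat("yyyy-MM-dd hh:MM:ss");

        // ...so every parse is serialized through a lock on it.
        public static Date parse(String value) throws ParseException {
            synchronized (FORMAT) {
                return FORMAT.parse(value);
            }
        }
    }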

Good luck!