count throws java.lang.NumberFormatException: null on the file loaded from object store with inferSchema enabled
When inferSchema is enabled, calling count() on a DataFrame loaded from IBM Bluemix Object Storage throws the following exception:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: Task 3 in stage 43.0 failed 10 times, most recent failure: Lost task 3.9 in stage 43.0 (TID 166, yp-spark-dal09-env5-0034): java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:554)
at java.lang.Integer.parseInt(Integer.java:627)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:241)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser.apply(CSVRelation.scala:116)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser.apply(CSVRelation.scala:85)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$$anonfun$apply.apply(CSVFileFormat.scala:128)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$$anonfun$apply.apply(CSVFileFormat.scala:127)
at scala.collection.Iterator$$anon.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon.hasNext(FileScanRDD.scala:91)
If I disable inferSchema, I do not get the above exception.
Why am I getting this exception? By default, how many rows does Databricks read when inferSchema is enabled?
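For context, the bottom of the stack trace shows Integer.parseInt being handed a null token, which is exactly what produces the odd-looking message "NumberFormatException: null". A minimal stand-alone reproduction in plain Java (no Spark required; class name is illustrative):

```java
public class NullParseDemo {
    public static void main(String[] args) {
        try {
            // parseInt(null) throws NumberFormatException whose message
            // is the literal string "null" -- matching the trace above.
            Integer.parseInt(null);
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }
    }
}
```

This is why the exception only surfaces with inferSchema: the inferred schema makes the CSV reader cast tokens to Int, and a null token reaches parseInt.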
This is actually a bug in the spark-csv package (null value still not correctly parsed #192) that was carried into Spark 2.0. It has been fixed and merged into Spark 2.1.
Here is the relevant PR: [SPARK-18269][SQL] CSV datasource should read null properly when schema is lager than parsed tokens.
Since you are already on Spark 2.0, you can easily upgrade to 2.1 and drop the spark-csv package. It is not needed anyway.
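After upgrading, the built-in CSV reader handles null tokens during type casting. A sketch of loading the file with Spark 2.1's native CSV source, without the spark-csv package (the app name and object-store path are placeholders for your own values):

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.1+: CSV support is built in; no spark-csv dependency needed.
val spark = SparkSession.builder()
  .appName("csv-infer-schema-demo")  // placeholder app name
  .getOrCreate()

val df = spark.read
  .option("header", "true")          // first line holds column names
  .option("inferSchema", "true")     // sample the file to infer column types
  .csv("swift://notebooks.spark/your-file.csv")  // placeholder path

// With SPARK-18269 merged, null tokens no longer reach Integer.parseInt,
// so count() succeeds even with inferSchema enabled.
println(df.count())
```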