NullPointerException while reading a column from the row
The following Scala (Spark 1.6) code for reading a value from a row fails with a NullPointerException when the value is null:
val test = row.getAs[Int]("ColumnName").toString
whereas this works fine:
val test1 = row.getAs[Int]("ColumnName") // returns 0 for null
val test2 = test1.toString // converts to String fine
What causes the NullPointerException, and what is the recommended way to handle such cases?
PS: The row is obtained from a DataFrame as follows:
val myRDD = myDF.repartition(partitions)
  .mapPartitions{ rows =>
    rows.flatMap{ row =>
      functionWithRows(row) // has the above logic to read the null column, which fails
    }
  }
functionWithRows then throws the NullPointerException mentioned above.
myDF schema:
root
|-- LDID: string (nullable = true)
|-- KTAG: string (nullable = true)
|-- ColumnName: integer (nullable = true)
To avoid null values, the better practice is to check with isNullAt before reading, as the documentation suggests:
getAs
<T> T getAs(int i)
Returns the value at position i. For primitive types if value is null it returns 'zero value' specific for primitive ie. 0 for Int - use isNullAt to ensure that value is not null
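For example, a minimal sketch of that pattern (the readColumnName helper is mine, not part of the original code, and it assumes the row carries the schema shown above):

import org.apache.spark.sql.Row

// Illustrative helper: read "ColumnName" as an Option[Int],
// guarding against null with isNullAt.
def readColumnName(row: Row): Option[Int] = {
  val i = row.fieldIndex("ColumnName") // requires a row with a schema
  if (row.isNullAt(i)) None else Some(row.getInt(i))
}

// Usage inside functionWithRows:
// readColumnName(row).map(_.toString).getOrElse("")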
That said, I agree this behaviour is confusing. getAs is defined as:
def getAs[T](i: Int): T = get(i).asInstanceOf[T]
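For completeness, the name-based overload used in the question simply resolves the field name to an index and delegates to the same method, so the same erased cast applies (a sketch of the corresponding definition; the exact source may differ across Spark versions):

// Name-based access delegates to position-based access.
def getAs[T](fieldName: String): T = getAs[T](fieldIndex(fieldName))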
When we call toString we invoke Object.toString, which does not depend on the type, so the asInstanceOf[T] is discarded by the compiler, i.e.
row.getAs[Int](0).toString -> row.get(0).toString
We can confirm this by writing a simple piece of Scala code:
import org.apache.spark.sql._

object Test {
  val row = Row(null)
  row.getAs[Int](0).toString
}
and then compiling it:
$ scalac -classpath $SPARK_HOME/jars/'*' -print test.scala
[[syntax trees at end of cleanup]] // test.scala
package <empty> {
  object Test extends Object {
    private[this] val row: org.apache.spark.sql.Row = _;
    <stable> <accessor> def row(): org.apache.spark.sql.Row = Test.this.row;
    def <init>(): Test.type = {
      Test.super.<init>();
      Test.this.row = org.apache.spark.sql.Row.apply(scala.this.Predef.genericWrapArray(Array[Object]{null}));
      Test.this.row().getAs(0).toString();
      ()
    }
  }
}
Note that in the lowered code, getAs(0).toString() carries no cast at all. So the correct way is:
String.valueOf(row.getAs[Int](0))
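As a quick illustration of why this is safe (my sketch, not part of the original answer): passing the value to String.valueOf forces an actual unboxing, and Scala's BoxesRunTime.unboxToInt maps null to 0:

import org.apache.spark.sql.Row

val row = Row(null)
// The Int-typed argument forces unboxing; unboxToInt(null) yields 0,
// so this prints "0" instead of throwing a NullPointerException.
println(String.valueOf(row.getAs[Int](0)))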