Spark SQL converts bad records to null during datatype casting

I have the following DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val simpleData = Seq(Row("James ","","Smith","36636","M",3000),
  Row("Michael ","Rose","","40288","M",4000),
  Row("Robert ","","Williams","42114","M",4000),
  Row("Maria ","Anne","Jones","39192","F",4000),
  Row("Jen","Mary","Brown","Rose ","F",-1)
)
    
val simpleSchema = StructType(Array(
  StructField("firstname",StringType,true),
  StructField("middlename",StringType,true),
  StructField("lastname",StringType,true),
  StructField("id", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", IntegerType, true)
))
    
val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|Rose |     F|    -1|
+---------+----------+--------+-----+------+------+

I am running the sample code below to cast the string column id to an integer.

df.createOrReplaceTempView("EMP")
val df2 = spark.sql("select cast(id as INT) from EMP")

+-----+
|   id|
+-----+
|36636|
|40288|
|42114|
|39192|
| null|
+-----+

All the integer values are converted correctly, but "Rose" is converted to null.

How can I make Spark throw an exception when it hits a bad record? Is there a Spark configuration setting for this?

Also, if the query contains multiple casts, how can I get the exact name of the column where the problem occurred?

By default, Spark does not throw an exception when a cast fails; it returns null instead.

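As a custom way to catch these errors, you can write a UDF that throws when the cast would yield null. However, this will degrade the performance of your job, because Spark cannot optimize UDF execution.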
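A minimal sketch of such a UDF, under a few assumptions: the names `StrictCastExample` and `strictToInt` are hypothetical, the sample data is made up for illustration, and surrounding whitespace is trimmed before parsing. Passing the column name as a literal lets the error message say which column held the bad value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}

object StrictCastExample {
  // Hypothetical helper: parses a string as Int, or fails with a
  // message that names the offending column and value.
  def strictToInt(columnName: String, value: String): Option[Int] = {
    if (value == null) None // genuine nulls stay null
    else {
      try Some(value.trim.toInt) // assumption: ignore surrounding whitespace
      catch {
        case _: NumberFormatException =>
          throw new IllegalArgumentException(
            s"Cannot cast value '$value' in column '$columnName' to INT")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("strict-cast")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: two valid ids and one invalid one.
    val df = Seq("36636", "40288", "Rose ").toDF("id")

    val strictToIntUdf =
      udf((name: String, value: String) => strictToInt(name, value))

    // Nothing fails here yet: the UDF only runs when an action is triggered.
    val casted = df.select(strictToIntUdf(lit("id"), col("id")).as("id"))

    try casted.show()
    catch {
      case e: Exception => println(s"Cast failed: ${e.getMessage}")
    } finally spark.stop()
  }
}
```

Because the UDF runs on the executors, the failure only appears once an action such as `show()` is called, and the driver typically sees it wrapped in a Spark exception rather than as the bare `IllegalArgumentException`.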

Since Spark 3.0 and the fix in ticket SPARK-30292, setting the spark.sql.ansi.enabled configuration to true raises an exception when you try to cast an invalid string to a number:

spark.conf.set("spark.sql.ansi.enabled", "true")
df.createOrReplaceTempView("EMP")
val df2 = spark.sql("select cast(id as INT) from EMP")

This throws a NumberFormatException. For details, see https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#cast
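For the question about which column holds the bad value when a query has several casts, one option that avoids both UDFs and exceptions is to count, per column, the values that are non-null before the cast and null after it. The following is only a sketch: `FindBadCasts` and `badCastCounts` are hypothetical names, the sample data is made up, and it assumes ANSI mode is disabled so that a failed cast yields null instead of throwing.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}

object FindBadCasts {
  // Hypothetical helper: for each column, counts the non-null values
  // that turn into null when cast to targetType.
  def badCastCounts(df: DataFrame,
                    columns: Seq[String],
                    targetType: String): Map[String, Long] = {
    require(columns.nonEmpty, "at least one column is required")
    val aggs = columns.map { c =>
      // when(...) without otherwise gives null for non-matching rows,
      // and count skips nulls, so only failed casts are counted.
      count(when(col(c).isNotNull && col(c).cast(targetType).isNull, true)).as(c)
    }
    val row = df.agg(aggs.head, aggs.tail: _*).head()
    columns.zipWithIndex.map { case (c, i) => c -> row.getLong(i) }.toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("find-bad-casts")
      // Assumption: non-ANSI mode, so invalid casts return null.
      .config("spark.sql.ansi.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: "Rose " in id cannot be cast to an integer.
    val df = Seq(("36636", "3000"), ("Rose ", "4000"), ("42114", "-1"))
      .toDF("id", "salary")

    badCastCounts(df, Seq("id", "salary"), "int").foreach { case (name, n) =>
      if (n > 0) println(s"Column '$name' has $n value(s) that cannot be cast")
    }
    spark.stop()
  }
}
```

All columns are checked in a single aggregation pass, and the result can be used to fail the job with a message that names the offending columns before the real query runs.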