Spark SQL converts bad records to null while doing datatype casting
I have the following DataFrame:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val simpleData = Seq(Row("James ","","Smith","36636","M",3000),
  Row("Michael ","Rose","","40288","M",4000),
  Row("Robert ","","Williams","42114","M",4000),
  Row("Maria ","Anne","Jones","39192","F",4000),
  Row("Jen","Mary","Brown","Rose ","F",-1)
)
val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("id", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", IntegerType, true)
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData), simpleSchema)
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
| James | | Smith|36636| M| 3000|
| Michael | Rose| |40288| M| 4000|
| Robert | |Williams|42114| M| 4000|
| Maria | Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown|Rose | F| -1|
+---------+----------+--------+-----+------+------+
I am running the sample code below, in which I want to cast the string column to an integer.
df.createOrReplaceTempView("EMP")
val df2 = spark.sql("select cast(id as INT) from EMP")
+-----+
| id|
+-----+
|36636|
|40288|
|42114|
|39192|
| null|
+-----+
Here all the integer data was cast correctly, but "Rose" was converted to null.
Could you help me with how to throw an exception instead when there is a bad record?
Is there any Spark configuration setting for this?
Also, when a query contains multiple casts, how can I get the exact column name where the problem occurred?
Spark does not throw when a cast fails.
As a custom way to catch these errors, you can write a UDF that throws whenever the cast would produce null. However, this degrades the script's performance, because Spark cannot optimize UDF execution.
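A minimal sketch of such a UDF (the name strictCastToInt and the error message are our own; assumes a running SparkSession and the df defined above):

```scala
import org.apache.spark.sql.functions.udf

// Hypothetical UDF: fails loudly instead of silently producing null.
// Spark cannot push down or optimize this, so expect a slowdown.
val strictCastToInt = udf { (s: String) =>
  try s.trim.toInt
  catch {
    case _: NumberFormatException =>
      throw new IllegalArgumentException(s"Cannot cast '$s' to INT")
  }
}

val df2 = df.withColumn("id_int", strictCastToInt(df("id")))
// An action such as df2.show() will fail once the bad "Rose " row is
// processed; Spark wraps the IllegalArgumentException in a SparkException.
```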
Since Spark 3.0 and the fix for SPARK-30292, setting the spark.sql.ansi.enabled configuration to true raises an exception when you try to cast an invalid string to a number:
spark.conf.set("spark.sql.ansi.enabled", "true")
df.createOrReplaceTempView("EMP")
val df2 = spark.sql("select cast(id as INT) from EMP")
now throws a NumberFormatException. See https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html#cast for details.
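For finding the exact column when a query contains several casts, one approach (a sketch, not a built-in Spark feature) is to check each column for values that are non-null before the cast but null after it:

```scala
import org.apache.spark.sql.functions.col

// Columns being cast to INT in the query (example list; adjust to your query).
val castCols = Seq("id")

for (c <- castCols) {
  // A non-null value that casts to null is a bad record in this column.
  val bad = df.filter(col(c).isNotNull && col(c).cast("int").isNull)
  val n = bad.count()
  if (n > 0) println(s"Column '$c' has $n value(s) that cast(... as INT) turns into null")
}
```

This scans the data once per cast column, so it is best used as a diagnostic step rather than in every production run.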