Spark 数据框无法比较空值
Spark dataframe not able to Compare Null values
大家好,我有 2 个数据帧,我正在比较这两个数据帧的值,并根据值将值分配给一个新数据帧。
所有场景都工作正常期望空字段比较,即如果两个数据帧值都为空,那么它应该显示为 "varified" 但它给我的是 "not varified" 我正在分享我的数据帧数据和代码我' m 使用和下面最终数据框的结果。
scala> df1.show()
+---+-----+---+--------+------+-------+
| id| name|age|lastname| city|country|
+---+-----+---+--------+------+-------+
| 1|rohan| 26| sharma|mumbai| india|
| 2|rohan| 26| sharma| null| india|
| 3|rohan| 26| null|mumbai| india|
| 4|rohan| 26| sharma|mumbai| india|
+---+-----+---+--------+------+-------+
scala> df2.show()
+----+------+-----+----------+------+---------+
|o_id|o_name|o_age|o_lastname|o_city|o_country|
+----+------+-----+----------+------+---------+
| 1| rohan| 26| sharma|mumbai| india|
| 2| rohan| 26| sharma| null| india|
| 3| rohan| 26| sharma|mumbai| india|
| 4| rohan| 26| null|mumbai| india|
+----+------+-----+----------+------+---------+
val df3 = df1.join(df2, df1("id") === df2("o_id"))
.withColumn("result", when(df1("name") === df2("o_name") &&
df1("age") === df2("o_age") &&
df1("lastname") === df2("o_lastname") &&
df1("city") === df2("o_city") &&
df1("country") === df2("o_country"), "Varified")
.otherwise("Not Varified")).show()
+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
| id| name|age|lastname| city|country|o_id|o_name|o_age|o_lastname|o_city|o_country| result|
+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
| 1|rohan| 26| sharma|mumbai| india| 1| rohan| 26| sharma|mumbai| india| Varified|
| 2|rohan| 26| sharma| null| india| 2| rohan| 26| sharma| null| india|Not Varified|
| 3|rohan| 26| null|mumbai| india| 3| rohan| 26| sharma|mumbai| india|Not Varified|
| 4|rohan| 26| sharma|mumbai| india| 4| rohan| 26| null|mumbai| india|Not Varified|
+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
我希望 id '2' 也应该显示为 'Varified'。但是城市在两个列中都为空,然后显示为 'Not Varified'。
有人可以指导我如何修改我的 df3 查询以便它也可以检查 null 并且对于 id '2' 也可以在结果列中显示为 'Varified'。
在您的 when+otherwise
语句中添加 <=>
(或)||
运算符并检查 .isNull 用于 last_name and city
列。
null=null
returns null
我们无法匹配的原因。
spark.sql("select null=null").show()
//+-------------+
//|(NULL = NULL)|
//+-------------+
//| null|
//+-------------+
Using <=>,isnull():
spark.sql("select null<=>null, isnull(null) = isnull(null)").show()
//+---------------+---------------------------------+
//|(NULL <=> NULL)|((NULL IS NULL) = (NULL IS NULL))|
//+---------------+---------------------------------+
//| true| true|
//+---------------+---------------------------------+
Example:
df1.join(df2, df1("id") === df2("o_id")).
withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age") ) &&
(df1("lastname") === df2("o_lastname")|| (df1("lastname").isNull === df2("o_lastname").isNull)) &&
(df1("city") === df2("o_city")|| (df1("city").isNull === df2("o_city").isNull)) &&
(df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
show()
//or using <>
df1.join(df2, df1("id") === df2("o_id")).withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) && (df1("lastname") <=> df2("o_lastname")) && (df1("city") <=> df2("o_city")) && (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).show()
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//| id| name|age|lastname| city|country|o_id|o_name|o_age|o_lastname|o_city|o_country| result|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//| 1|rohan| 26| sharma|mumbai| india| 1| rohan| 26| sharma|mumbai| india| Varified|
//| 2|rohan| 26| sharma| null| india| 2| rohan| 26| sharma| null| india| Varified|
//| 3|rohan| 26| null|mumbai| india| 3| rohan| 26| sharma|mumbai| india|Not Varified|
//| 4|rohan| 26| sharma|mumbai| india| 4| rohan| 26| null|mumbai| india|Not Varified|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
使用<=>
代替===
val df3 = df1.join(df2, df1("id") === df2("o_id"))
.withColumn("result", when(df1("name") <=> df2("o_name") &&
df1("age") <=> df2("o_age") &&
df1("lastname") <=> df2("o_lastname") &&
df1("city") <=> df2("o_city") &&
df1("country") <=> df2("o_country"), "Varified")
.otherwise("Not Varified")).show()
spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show
结果
+-----------------+
|(city1 <=> city2)|
+-----------------+
| true |
+-----------------+
大家好,我有 2 个数据帧,我正在比较这两个数据帧的值,并根据值将值分配给一个新数据帧。 所有场景都工作正常期望空字段比较,即如果两个数据帧值都为空,那么它应该显示为 "varified" 但它给我的是 "not varified" 我正在分享我的数据帧数据和代码我' m 使用和下面最终数据框的结果。
scala> df1.show()
+---+-----+---+--------+------+-------+
| id| name|age|lastname| city|country|
+---+-----+---+--------+------+-------+
| 1|rohan| 26| sharma|mumbai| india|
| 2|rohan| 26| sharma| null| india|
| 3|rohan| 26| null|mumbai| india|
| 4|rohan| 26| sharma|mumbai| india|
+---+-----+---+--------+------+-------+
scala> df2.show()
+----+------+-----+----------+------+---------+
|o_id|o_name|o_age|o_lastname|o_city|o_country|
+----+------+-----+----------+------+---------+
| 1| rohan| 26| sharma|mumbai| india|
| 2| rohan| 26| sharma| null| india|
| 3| rohan| 26| sharma|mumbai| india|
| 4| rohan| 26| null|mumbai| india|
+----+------+-----+----------+------+---------+
val df3 = df1.join(df2, df1("id") === df2("o_id"))
.withColumn("result", when(df1("name") === df2("o_name") &&
df1("age") === df2("o_age") &&
df1("lastname") === df2("o_lastname") &&
df1("city") === df2("o_city") &&
df1("country") === df2("o_country"), "Varified")
.otherwise("Not Varified")).show()
+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
| id| name|age|lastname| city|country|o_id|o_name|o_age|o_lastname|o_city|o_country| result|
+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
| 1|rohan| 26| sharma|mumbai| india| 1| rohan| 26| sharma|mumbai| india| Varified|
| 2|rohan| 26| sharma| null| india| 2| rohan| 26| sharma| null| india|Not Varified|
| 3|rohan| 26| null|mumbai| india| 3| rohan| 26| sharma|mumbai| india|Not Varified|
| 4|rohan| 26| sharma|mumbai| india| 4| rohan| 26| null|mumbai| india|Not Varified|
+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
我希望 id '2' 也应该显示为 'Varified'。但是城市在两个列中都为空,然后显示为 'Not Varified'。 有人可以指导我如何修改我的 df3 查询以便它也可以检查 null 并且对于 id '2' 也可以在结果列中显示为 'Varified'。
在您的 when+otherwise
语句中添加 <=>
(或)||
运算符并检查 .isNull 用于 last_name and city
列。
null=null
returns null
我们无法匹配的原因。
spark.sql("select null=null").show()
//+-------------+
//|(NULL = NULL)|
//+-------------+
//| null|
//+-------------+
Using <=>,isnull():
spark.sql("select null<=>null, isnull(null) = isnull(null)").show()
//+---------------+---------------------------------+
//|(NULL <=> NULL)|((NULL IS NULL) = (NULL IS NULL))|
//+---------------+---------------------------------+
//| true| true|
//+---------------+---------------------------------+
Example:
df1.join(df2, df1("id") === df2("o_id")).
withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age") ) &&
(df1("lastname") === df2("o_lastname")|| (df1("lastname").isNull === df2("o_lastname").isNull)) &&
(df1("city") === df2("o_city")|| (df1("city").isNull === df2("o_city").isNull)) &&
(df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
show()
//or using <>
df1.join(df2, df1("id") === df2("o_id")).withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) && (df1("lastname") <=> df2("o_lastname")) && (df1("city") <=> df2("o_city")) && (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).show()
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//| id| name|age|lastname| city|country|o_id|o_name|o_age|o_lastname|o_city|o_country| result|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//| 1|rohan| 26| sharma|mumbai| india| 1| rohan| 26| sharma|mumbai| india| Varified|
//| 2|rohan| 26| sharma| null| india| 2| rohan| 26| sharma| null| india| Varified|
//| 3|rohan| 26| null|mumbai| india| 3| rohan| 26| sharma|mumbai| india|Not Varified|
//| 4|rohan| 26| sharma|mumbai| india| 4| rohan| 26| null|mumbai| india|Not Varified|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
使用<=>
代替===
val df3 = df1.join(df2, df1("id") === df2("o_id"))
.withColumn("result", when(df1("name") <=> df2("o_name") &&
df1("age") <=> df2("o_age") &&
df1("lastname") <=> df2("o_lastname") &&
df1("city") <=> df2("o_city") &&
df1("country") <=> df2("o_country"), "Varified")
.otherwise("Not Varified")).show()
spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show
结果
+-----------------+
|(city1 <=> city2)|
+-----------------+
| true |
+-----------------+