Spark DataFrame not able to compare null values

Question

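Hi all, I have two DataFrames. I compare their values and, based on the comparison, write a result into a new DataFrame. Every scenario works except the null comparison: if a field is null in both DataFrames, the result should be "Varified", but I get "Not Varified". Below are my DataFrames, the code I am using, and the resulting DataFrame.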

    scala> df1.show()
    +---+-----+---+--------+------+-------+
    | id| name|age|lastname|  city|country|
    +---+-----+---+--------+------+-------+
    |  1|rohan| 26|  sharma|mumbai|  india|
    |  2|rohan| 26|  sharma|  null|  india|
    |  3|rohan| 26|    null|mumbai|  india|
    |  4|rohan| 26|  sharma|mumbai|  india|
    +---+-----+---+--------+------+-------+
    scala> df2.show()
    +----+------+-----+----------+------+---------+
    |o_id|o_name|o_age|o_lastname|o_city|o_country|
    +----+------+-----+----------+------+---------+
    |   1| rohan|   26|    sharma|mumbai|    india|
    |   2| rohan|   26|    sharma|  null|    india|
    |   3| rohan|   26|    sharma|mumbai|    india|
    |   4| rohan|   26|      null|mumbai|    india|
    +----+------+-----+----------+------+---------+

    val df3 = df1.join(df2, df1("id") === df2("o_id"))
    .withColumn("result", when(df1("name") === df2("o_name") && 
    df1("age") === df2("o_age") && 
    df1("lastname") === df2("o_lastname") && 
    df1("city") === df2("o_city")  &&
    df1("country") === df2("o_country"), "Varified")
    .otherwise("Not Varified")).show()

    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    | id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    |  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
    |  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|Not Varified|
    |  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
    |  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+

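I expect id 2 to show 'Varified' as well, but because city is null in both DataFrames it shows 'Not Varified'. Can someone guide me on how to modify my df3 query so that it also handles nulls and shows 'Varified' in the result column for id 2?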

Answer 1

In your when/otherwise expression, either use the <=> operator, or add an || branch that checks .isNull on the lastname and city columns.

null = null returns null rather than true, so the when condition never matches and the row falls through to otherwise:

    spark.sql("select null=null").show()
    //+-------------+
    //|(NULL = NULL)|
    //+-------------+
    //|         null|
    //+-------------+

Using <=> or isnull():

    spark.sql("select null<=>null, isnull(null) = isnull(null)").show()
    //+---------------+---------------------------------+
    //|(NULL <=> NULL)|((NULL IS NULL) = (NULL IS NULL))|
    //+---------------+---------------------------------+
    //|           true|                             true|
    //+---------------+---------------------------------+
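
To make the difference between = and <=> concrete, here is a minimal sketch in plain Scala (no Spark required) that models SQL's three-valued logic, with None standing in for NULL. The object and function names (NullLogic, sqlEq, sqlEqNullSafe, sqlAnd, whenOtherwise) are made up for illustration and are not part of the Spark API; the sketch only mirrors the behaviour shown in the two queries above.

```scala
// Hypothetical model of SQL three-valued logic; None stands in for SQL NULL.
// None of these names belong to the Spark API.
object NullLogic {
  type Tri = Option[Boolean] // Some(true), Some(false), or None (unknown)

  // SQL `=`: the result is unknown as soon as either side is NULL.
  def sqlEq[A](a: Option[A], b: Option[A]): Tri =
    for (x <- a; y <- b) yield x == y

  // SQL `<=>`: never unknown; two NULLs compare as equal.
  def sqlEqNullSafe[A](a: Option[A], b: Option[A]): Boolean = a == b

  // SQL AND: false wins over unknown, unknown wins over true.
  def sqlAnd(l: Tri, r: Tri): Tri = (l, r) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // when(cond, a).otherwise(b): only a definite true selects `a`.
  def whenOtherwise(cond: Tri, a: String, b: String): String =
    if (cond.contains(true)) a else b

  def main(args: Array[String]): Unit = {
    val city1: Option[String] = None
    val city2: Option[String] = None
    // name matches, city is NULL on both sides (the id 2 row in the question)
    val withEq = sqlAnd(sqlEq(Option("rohan"), Option("rohan")), sqlEq(city1, city2))
    println(whenOtherwise(withEq, "Varified", "Not Varified")) // Not Varified
    val withNullSafe = sqlAnd(Some(true), Some(sqlEqNullSafe(city1, city2)))
    println(whenOtherwise(withNullSafe, "Varified", "Not Varified")) // Varified
  }
}
```

With plain equality the combined condition is unknown, so the otherwise branch is taken; with the null-safe comparison it is a definite true.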

Example:

    // For the nullable columns, "both null" counts as a match. Use isNull && isNull here:
    // isNull === isNull would also be true when both sides are non-null but different.
    df1.join(df2, df1("id") === df2("o_id")).
    withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) &&
    (df1("lastname") === df2("o_lastname") || (df1("lastname").isNull && df2("o_lastname").isNull)) &&
    (df1("city") === df2("o_city") || (df1("city").isNull && df2("o_city").isNull)) &&
    (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
    show()

    // or using <=> (null-safe equality) on the nullable columns
    df1.join(df2, df1("id") === df2("o_id")).
    withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) &&
    (df1("lastname") <=> df2("o_lastname")) && (df1("city") <=> df2("o_city")) &&
    (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
    show()

    //+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    //| id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
    //+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    //|  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
    //|  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|    Varified|
    //|  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
    //|  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
    //+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+

Answer 2

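Use <=> instead of ===. The <=> operator is null-safe equality: it returns true when both sides are null and false when only one side is null.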

    // Keep the DataFrame in df3; calling .show() in the same chain would assign Unit instead.
    val df3 = df1.join(df2, df1("id") === df2("o_id"))
      .withColumn("result", when(df1("name") <=> df2("o_name") &&
        df1("age") <=> df2("o_age") &&
        df1("lastname") <=> df2("o_lastname") &&
        df1("city") <=> df2("o_city") &&
        df1("country") <=> df2("o_country"), "Varified")
      .otherwise("Not Varified"))
    df3.show()

    spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show

Result:

    +-----------------+
    |(city1 <=> city2)|
    +-----------------+
    |             true|
    +-----------------+
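
When there are many columns to compare, the chain of <=> tests can be generated from a list of column-name pairs instead of being written out by hand. The following is only a sketch under assumptions: the object NullSafeCompare, the helper allMatch, and the demo method are hypothetical names, the data is a cut-down version of the question's tables, and a local SparkSession is assumed to be available.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.when

object NullSafeCompare {
  // Hypothetical helper (not a Spark API): ANDs a null-safe equality test
  // for every (left column, right column) pair. `pairs` must be non-empty,
  // because reduce throws on an empty collection.
  def allMatch(left: DataFrame, right: DataFrame, pairs: Seq[(String, String)]): Column =
    pairs.map { case (l, r) => left(l) <=> right(r) }.reduce(_ && _)

  // Builds two small DataFrames shaped like the question's data
  // and returns a map from id to the computed result.
  def demo(spark: SparkSession): Map[Int, String] = {
    import spark.implicits._
    val df1 = Seq[(Int, String, Option[String])](
      (1, "rohan", Some("mumbai")), (2, "rohan", None), (3, "rohan", None)
    ).toDF("id", "name", "city")
    val df2 = Seq[(Int, String, Option[String])](
      (1, "rohan", Some("mumbai")), (2, "rohan", None), (3, "rohan", Some("mumbai"))
    ).toDF("o_id", "o_name", "o_city")

    val pairs = Seq("name" -> "o_name", "city" -> "o_city")
    df1.join(df2, df1("id") === df2("o_id"))
      .withColumn("result",
        when(allMatch(df1, df2, pairs), "Varified").otherwise("Not Varified"))
      .select($"id", $"result")
      .as[(Int, String)]
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("null-safe-compare").getOrCreate()
    try demo(spark).toSeq.sortBy(_._1).foreach(println)
    finally spark.stop()
  }
}
```

In this sketch ids 1 and 2 come out as "Varified" (id 2 has null on both sides), while id 3 is "Not Varified" because only one side is null.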
