Spark SQL: collect_set while ignoring nulls in other columns

I need to union two results and apply collect_set, but I want to ignore nulls.

val df1 = Seq(
  ("1","Adam","Angra", "Anastasia")
).toDF("id","fname", "mname", "lname")
df1.createOrReplaceTempView("df1")

val df2 = Seq(
  ("1",null,null, "Bosma")
).toDF("id","fname", "mname", "lname")
df2.createOrReplaceTempView("df2")

The df2 DataFrame always has null fname and mname. When grouping by id, I need to combine the lname values into a list.

Current query:

select id,fname,mname,collect_set(lname) as lname from (select * from df1 union select * from df2) group by id,fname, mname

Actual output

id  fname   mname   lname
1   Adam    Angra   ["Anastasia"]
1   null    null    ["Bosma"]

Expected output

id  fname   mname   lname
1   Adam    Angra   ["Anastasia","Bosma"]

I need help getting the expected output above with a SQL query.

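Grouping by fname and mname puts the null row from df2 into its own group, which is why you get two rows. You can group by id only and use the first function with its ignore-nulls flag set to true to get fname and mname.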

val sql = """
    select id,first(fname, true) as fname,first(mname, true) as mname,collect_set(lname) as lname from
        (select * from df1 union select * from df2)
    group by id
"""
val df = spark.sql(sql)
df.show()
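
For comparison, the same aggregation can be sketched with the DataFrame API instead of a SQL string. This is a minimal sketch, not the only way to do it: the object name, the local SparkSession settings, and the use of Option[String] to give the null columns an explicit string type are assumptions added for illustration. Note that DataFrame union keeps duplicates (like SQL UNION ALL), which does not change the result here because collect_set deduplicates lname.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, first}

// Hypothetical wrapper object; in spark-shell only the body of main is needed.
object CollectSetIgnoreNulls {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-set-ignore-nulls")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      ("1", "Adam", "Angra", "Anastasia")
    ).toDF("id", "fname", "mname", "lname")

    // Option.empty[String] becomes a null value in a string-typed column.
    val df2 = Seq(
      ("1", Option.empty[String], Option.empty[String], "Bosma")
    ).toDF("id", "fname", "mname", "lname")

    val result = df1.union(df2) // keeps duplicates, unlike SQL UNION
      .groupBy("id")
      .agg(
        // ignoreNulls = true returns the first non-null value in each group
        first("fname", ignoreNulls = true).as("fname"),
        first("mname", ignoreNulls = true).as("mname"),
        collect_set("lname").as("lname")
      )

    result.show(false)
    spark.stop()
  }
}
```

Keep in mind that first depends on row order, which is not guaranteed after a shuffle. If an id can have more than one distinct non-null fname or mname, an aggregate such as max (which also skips nulls) gives a deterministic pick, and the element order inside the collect_set array is likewise not guaranteed.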