Spark SQL: collect_set while ignoring nulls in other columns
I need to union two results and collect a set, but I want to ignore nulls.
val df1 = Seq(
("1","Adam","Angra", "Anastasia")
).toDF("id","fname", "mname", "lname")
df1.createOrReplaceTempView("df1")
val df2 = Seq(
("1",null,null, "Bosma")
).toDF("id","fname", "mname", "lname")
df2.createOrReplaceTempView("df2")
The df2 DataFrame always has null fname and mname. When grouping by id, I need to combine the lname values into a list.
Current query:
select id,fname,mname,collect_set(lname) as lname from (select * from df1 union select * from df2) group by id,fname, mname
Actual output:
id fname mname lname
1 Adam Angra ["Anastasia"]
1 null null ["Bosma"]
Expected output:
id fname mname lname
1 Adam Angra ["Anastasia","Bosma"]
I need help getting the expected result above with a SQL query.
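You can group by id only and use the first function with its second argument set to true (ignore null values) to get fname and mname. Because fname and mname are no longer grouping keys, the null row from df2 falls into the same group as the row from df1, and collect_set gathers both lname values.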
val sql = """
select id,first(fname, true) as fname,first(mname, true) as mname,collect_set(lname) as lname from
(select * from df1 union select * from df2)
group by id
"""
val df = spark.sql(sql)
df.show()
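For comparison, the same aggregation can be sketched with the DataFrame API. This is only an illustration under a few assumptions that are not in the original post: a local SparkSession is created here (in spark-shell, spark already exists), the null columns are typed explicitly as String, and sort_array is added so the list order is stable. Note that DataFrame union keeps duplicate rows, unlike SQL UNION, but collect_set removes duplicate lname values anyway.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, first, sort_array}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("collect-set-ignore-null")
  .getOrCreate()
import spark.implicits._

// Tuples are typed explicitly so the null fields are String columns.
val df1 = Seq[(String, String, String, String)](
  ("1", "Adam", "Angra", "Anastasia")
).toDF("id", "fname", "mname", "lname")

val df2 = Seq[(String, String, String, String)](
  ("1", null, null, "Bosma")
).toDF("id", "fname", "mname", "lname")

val merged = df1.union(df2)
  .groupBy("id")
  .agg(
    // ignoreNulls = true skips the null fname/mname coming from df2
    first("fname", ignoreNulls = true).as("fname"),
    first("mname", ignoreNulls = true).as("mname"),
    // sort_array is an addition here, only to make the element order stable
    sort_array(collect_set("lname")).as("lname")
  )

merged.show(false)
```

With the sample data this gives a single row for id 1 with fname Adam, mname Angra, and lname containing Anastasia and Bosma. If a group could hold several different non-null fname or mname values, first would return whichever one it encounters first, so this approach fits best when, as in the question, only one source supplies those columns.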