How to add an array of structs to a struct of arrays of structs in Spark Scala
I have the following example:
val df_temp1 = Seq(
("1","Adam","Angra", "Anastasia")
).toDF("id","fname", "mname", "lname")
df_temp1.createOrReplaceTempView("df_temp1")
val df1 = spark.sql("""select id,named_struct('opi1',array(named_struct('data_description','fname','data_details',fname),named_struct('data_description','mname','data_details',mname),named_struct('data_description','lname','data_details',lname))) as pi, array(named_struct('data_description','fname','data_details',fname),named_struct('data_description','mname','data_details',mname), named_struct('data_description','lname','data_details',lname)) as opi2 from df_temp1""")
df1.printSchema
df1.show(false)
df1.createOrReplaceTempView("df1")
This gives the following schema:
root
|-- id: string (nullable = true)
|-- pi: struct (nullable = false)
| |-- opi1: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- data_description: string (nullable = false)
| | | |-- data_details: string (nullable = true)
|-- opi2: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- data_description: string (nullable = false)
| | |-- data_details: string (nullable = true)
and the following result:
+---+-----------------------------------------------------+---------------------------------------------------+
|id |pi |opi2 |
+---+-----------------------------------------------------+---------------------------------------------------+
|1 |{[{fname, Adam}, {mname, Angra}, {lname, Anastasia}]}|[{fname, Adam}, {mname, Angra}, {lname, Anastasia}]|
+---+-----------------------------------------------------+---------------------------------------------------+
I want opi2 to be nested inside pi alongside opi1, so the expected schema should look like this:
root
|-- id: string (nullable = true)
|-- pi: struct (nullable = false)
| |-- opi1: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- data_description: string (nullable = false)
| | | |-- data_details: string (nullable = true)
| |-- opi2: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- data_description: string (nullable = false)
| | | |-- data_details: string (nullable = true)
The expected output has both arrays, opi1 and opi2, inside pi, as shown here:
+---+----------------------------------------------------------------------------------------------------------+
|id |pi |
+---+----------------------------------------------------------------------------------------------------------+
|1 |{[{fname, Adam}, {mname, Angra}, {lname, Anastasia}], [{fname, Adam}, {mname, Angra}, {lname, Anastasia}]}|
+---+----------------------------------------------------------------------------------------------------------+
So basically I want to add an existing column to a struct. (By the way, I am using Spark 2.3, so I cannot use any functions introduced in Spark 2.4.)
Just create a new struct from pi.opi1 and opi2:
val df2 = spark.sql("select id, named_struct('opi1',pi.opi1, 'opi2', opi2) as pi from df1")
df2.show(false)
df2.printSchema
+---+----------------------------------------------------------------------------------------------------------+
|id |pi |
+---+----------------------------------------------------------------------------------------------------------+
|1 |{[{fname, Adam}, {mname, Angra}, {lname, Anastasia}], [{fname, Adam}, {mname, Angra}, {lname, Anastasia}]}|
+---+----------------------------------------------------------------------------------------------------------+
root
|-- id: string (nullable = true)
|-- pi: struct (nullable = false)
| |-- opi1: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- data_description: string (nullable = false)
| | | |-- data_details: string (nullable = true)
| |-- opi2: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- data_description: string (nullable = false)
| | | |-- data_details: string (nullable = true)
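The same rebuild can also be written with the DataFrame API instead of SQL. The following is a minimal sketch, assuming a local SparkSession; the names `source`, `details`, `nested`, and `df2Api` are hypothetical and only serve the illustration. It uses `struct`, `array`, `lit`, and `col` from `org.apache.spark.sql.functions`, none of which require Spark 2.4.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, lit, struct}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("nest-opi2").getOrCreate()
import spark.implicits._

val source = Seq(("1", "Adam", "Angra", "Anastasia")).toDF("id", "fname", "mname", "lname")

// Hypothetical helper: array of (data_description, data_details) structs, one per name column.
def details = array(
  Seq("fname", "mname", "lname").map { c =>
    struct(lit(c).as("data_description"), col(c).as("data_details"))
  }: _*
)

// Same shape as df1 in the question: pi.opi1 is nested, opi2 is top-level.
val nested = source.select(col("id"), struct(details.as("opi1")).as("pi"), details.as("opi2"))

// Rebuild pi from its existing field plus the top-level opi2 column.
val df2Api = nested.select(
  col("id"),
  struct(col("pi.opi1").as("opi1"), col("opi2").as("opi2")).as("pi")
)
df2Api.printSchema()

Aliasing each field explicitly keeps the field names inside pi predictable. If pi had more fields, each one would have to be listed in the same way.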