How to convert an array of arrays (string type) to a struct - Spark/Scala?
I have a DataFrame:
+---------------------------------------------------------------+---+
|family_name |id |
+---------------------------------------------------------------+---+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|
|[[Tom, Riddle, Single, 888-888-8888]] |id2|
+---------------------------------------------------------------+---+
root
|-- family_name: string (nullable = true)
|-- id: string (nullable = true)
I want to convert the column family_name to an array of named structs:
`family_name` array<struct<f_name:string,l_name:string,status:string,ph_no:string>>
I was able to convert family_name to an array, as shown below:
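import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.types.{ArrayType, StringType}
val sch = ArrayType(ArrayType(StringType))
val fam_array = data
  // brackets are regex metacharacters, so they are escaped with a double backslash
  .withColumn("family_name_clean", regexp_replace($"family_name", "\\[\\[", "["))
  .withColumn("family_name_clean_clean1", regexp_replace($"family_name_clean", "\\]\\]", "]"))
  // toArray is a helper that is not defined in the question
  .withColumn("ar", toArray($"family_name_clean_clean1"))
  //.withColumn("ar1", from_json($"ar", sch))
fam_array.show(false)
fam_array.printSchema()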
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|family_name |id |family_name_clean |family_name_clean_clean1 |ar |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]|[[John, Doe, Married, 999-999-9999], [Jane, Doe, Married, Wife, ]]|
|[[Tom, Riddle, Single, 888-888-8888]] |id2|[Tom, Riddle, Single, 888-888-8888]] |[Tom, Riddle, Single, 888-888-8888] |[[Tom, Riddle, Single, 888-888-8888]] |
+---------------------------------------------------------------+---+--------------------------------------------------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
root
|-- family_name: string (nullable = true)
|-- id: string (nullable = true)
|-- family_name_clean: string (nullable = true)
|-- family_name_clean_clean1: string (nullable = true)
|-- ar: array (nullable = true)
| |-- element: string (containsNull = true)
sch is the schema variable for the desired type.
How do I convert column ar to array<struct<>>?
Edit:
I am using Spark 2.3.2.
To create an array of structs from an array of strings, you can use the struct function, which builds a struct from a list of columns, combined with the element_at function, which extracts the element at a given index of an array.
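To solve your specific problem, as you correctly stated, you need to do two things:
- First, convert your string to an array of arrays of strings
- Then, use this array to build your structs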
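In Spark 3.0 and later
With Spark 3.0, we can perform all of these steps using Spark built-in functions.
For the first step, I would do the following:
- First, use the regexp_replace function to remove [[ and ]] from the family_name string
- Then, create the first array level by splitting this string with the split function
- Then, create the second array level by splitting each element of the previous array with the transform and split functions
For the second step, use the struct function to build a struct, picking elements from the arrays with the element_at function.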
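The string handling in the first step can be checked without a Spark session. The following is a minimal plain-Scala sketch that mirrors the three sub-steps with the same regular expressions; the object and method names (ParseSteps, stripOuter, firstLevel, secondLevel) are made up for illustration and are not part of Spark.

```scala
object ParseSteps {
  // Sub-step 1: remove the opening "[[" and the closing "]]"
  def stripOuter(s: String): String = s.replaceAll("\\[\\[|]]", "")

  // Sub-step 2: split at "],[" to get one string per person
  def firstLevel(s: String): Array[String] = stripOuter(s).split("],\\[")

  // Sub-step 3: split each person on ","; the limit -1 keeps a trailing
  // empty string, so "Wife," yields an empty last field instead of dropping it
  def secondLevel(s: String): Array[Array[String]] =
    firstLevel(s).map(_.split(",", -1))

  def main(args: Array[String]): Unit = {
    val raw = "[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]"
    secondLevel(raw).foreach(fields => println(fields.mkString("|")))
  }
}
```

Note that the fields keep the space that follows each comma in the input, because the split is on "," alone.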
So the complete code for Spark 3.0 and later is as follows, with data as the input DataFrame:
import org.apache.spark.sql.functions.{col, element_at, regexp_replace, split, struct, transform}
val result = data
  .withColumn(
    "family_name",
    transform(
      split( // first level split
        regexp_replace(col("family_name"), "\\[\\[|]]", ""), // remove [[ and ]]
        "],\\["
      ),
      x => split(x, ",") // split for each element in first level array
    )
  )
  .withColumn("family_name", transform(col("family_name"), x => struct(
    element_at(x, 1).as("f_name"), // index starts at 1
    element_at(x, 2).as("l_name"),
    element_at(x, 3).as("status"),
    element_at(x, -1).as("ph_no") // get last element of array
  )))
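Because the input separates fields with a comma followed by a space, the extracted values can carry a leading space. If that matters, one option is to wrap each element_at call in trim. The sketch below is a self-contained variant under that assumption; the object name TrimmedFamilyName, the build method and the local SparkSession settings are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, element_at, regexp_replace, split, struct, transform, trim}

object TrimmedFamilyName {
  // Same pipeline as above, with trim applied to every extracted field
  def build(data: DataFrame): DataFrame = {
    // array<array<string>>: one inner array per person
    val people = transform(
      split(regexp_replace(col("family_name"), "\\[\\[|]]", ""), "],\\["),
      person => split(person, ",")
    )
    data.withColumn("family_name", transform(people, x => struct(
      trim(element_at(x, 1)).as("f_name"),
      trim(element_at(x, 2)).as("l_name"),
      trim(element_at(x, 3)).as("status"),
      trim(element_at(x, -1)).as("ph_no") // last element of the array
    )))
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup)
    val spark = SparkSession.builder().master("local[1]").appName("trimmed-family-name").getOrCreate()
    import spark.implicits._

    val data = Seq(
      ("[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]", "id1"),
      ("[[Tom, Riddle, Single, 888-888-8888]]", "id2")
    ).toDF("family_name", "id")

    build(data).show(false)
    spark.stop()
  }
}
```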
In Spark 2.X
With Spark 2.X, we have to rely on a user-defined function. First, we need to define a case class that represents our struct:
case class FamilyName(
f_name: String,
l_name: String,
status: String,
ph_no: String
)
Then, we define our user-defined function and apply it to the input DataFrame:
import org.apache.spark.sql.functions.{col, udf}
val extract_array = udf((familyName: String) => familyName
  .replaceAll("\\[\\[|]]", "") // remove [[ and ]]
  .split("],\\[") // one string per person
  .map(person => {
    // limit -1 keeps trailing empty strings, e.g. the empty phone number after "Wife,"
    val explodedFamilyName = person.split(",", -1)
    FamilyName(
      f_name = explodedFamilyName(0),
      l_name = explodedFamilyName(1),
      status = explodedFamilyName(2),
      ph_no = explodedFamilyName(explodedFamilyName.length - 1)
    )
  })
)
val result = data.withColumn("family_name", extract_array(col("family_name")))
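The function above assumes that every row is non-null and that every person has at least three comma-separated fields. If the data may not satisfy this, the parsing logic could be made more defensive. The following is a plain-Scala sketch under that assumption; FamilyNameParser and parse are hypothetical names, the case class is repeated only to keep the block self-contained, and the trim call is an addition that the original function does not perform.

```scala
case class FamilyName(f_name: String, l_name: String, status: String, ph_no: String)

object FamilyNameParser {
  // Returns an empty sequence for null input and skips entries with fewer than four fields
  def parse(raw: String): Seq[FamilyName] =
    Option(raw).toSeq.flatMap { s =>
      s.replaceAll("\\[\\[|]]", "") // remove [[ and ]]
        .split("],\\[")             // one string per person
        .toSeq
        .map(_.split(",", -1).map(_.trim)) // keep trailing empty fields, trim spaces
        .collect {
          case f if f.length >= 4 => FamilyName(f(0), f(1), f(2), f(f.length - 1))
        }
    }

  def main(args: Array[String]): Unit = {
    println(parse("[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]"))
    println(parse(null))
  }
}
```

Keeping the parsing in an ordinary function makes it easy to unit test; it could then presumably be wrapped with udf(FamilyNameParser.parse _) in place of the inline lambda.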
Result
If you have the following data DataFrame:
+---------------------------------------------------------------+---+
|family_name |id |
+---------------------------------------------------------------+---+
|[[John, Doe, Married, 999-999-9999],[Jane, Doe, Married,Wife,]]|id1|
|[[Tom, Riddle, Single, 888-888-8888]] |id2|
+---------------------------------------------------------------+---+
You get the following result DataFrame:
+-----------------------------------------------------------------+---+
|family_name |id |
+-----------------------------------------------------------------+---+
|[{John, Doe, Married, 999-999-9999}, {Jane, Doe, Married, }]|id1|
|[{Tom, Riddle, Single, 888-888-8888}] |id2|
+-----------------------------------------------------------------+---+
With the following schema:
root
|-- family_name: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- f_name: string (nullable = true)
| | |-- l_name: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- ph_no: string (nullable = true)
|-- id: string (nullable = true)