Convert String to Map in Spark
Below is data from a csv file that uses | as the delimiter. I want to convert the string in the PersonalInfo column to a Map so that I can extract the information I need.
While converting the csv to parquet format, I tried to cast the String to a Map, but I get a data type mismatch error.
Below is the data for your reference. Any help is much appreciated.
Empcode EmpName PersonalInfo
1 abc """email"":""abc@gmail.com"",""Location"":""India"",""Gender"":""Male"""
2 xyz """email"":""xyz@gmail.com"",""Location"":""US"""
3 pqr """email"":""abc@gmail.com"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234"""
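For context, the file can be read with the | delimiter roughly like this (the path here is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-map").getOrCreate()

// "/path/to/employees.csv" is a placeholder; the delimiter option
// matches the | separator used in the file
val data = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv("/path/to/employees.csv")
```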
Thanks
If you want to create a map from the PersonalInfo column, starting from Spark 3.0 you can proceed as follows:
- Use the split function to split the string on "",""
- For each element of the resulting array of strings, use the split function again on "":"" to create sub-arrays
- Use the regexp_replace function to remove all remaining "" from the elements of the sub-arrays
- Use the struct function to build the map entries
- Use map_from_entries to build the map from your array of entries
The complete code is as follows:
import org.apache.spark.sql.functions.{col, map_from_entries, regexp_replace, split, struct, transform}

val result = data.withColumn("PersonalInfo",
  map_from_entries(
    transform(
      // split the string into key"":""value fragments
      split(col("PersonalInfo"), "\"\",\"\""),
      item => struct(
        // strip the remaining "" from the key and the value
        regexp_replace(split(item, "\"\":\"\"")(0), "\"\"", ""),
        regexp_replace(split(item, "\"\":\"\"")(1), "\"\"", "")
      )
    )
  )
)
With the following input_dataframe:
+-------+-------+---------------------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo |
+-------+-------+---------------------------------------------------------------------------------------------+
|1 |abc |""email"":""abc@gmail.com"",""Location"":""India"",""Gender"":""Male"" |
|2 |xyz |""email"":""xyz@gmail.com"",""Location"":""US"" |
|3 |pqr |""email"":""abc@gmail.com"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234""|
+-------+-------+---------------------------------------------------------------------------------------------+
you get the following result dataframe:
+-------+-------+------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo |
+-------+-------+------------------------------------------------------------------------------+
|1 |abc |{email -> abc@gmail.com, Location -> India, Gender -> Male} |
|2 |xyz |{email -> xyz@gmail.com, Location -> US} |
|3 |pqr |{email -> abc@gmail.com, Gender -> Female, Location -> Europe, Mobile -> 1234}|
+-------+-------+------------------------------------------------------------------------------+
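Once PersonalInfo is a map column, the fields you need can be extracted directly; for example (the column aliases below are my own):

```scala
import org.apache.spark.sql.functions.{col, element_at}

val emails = result.select(
  col("Empcode"),
  // element_at returns null when the key is absent
  // (e.g. Gender for Empcode 2)
  element_at(col("PersonalInfo"), "email").as("email"),
  element_at(col("PersonalInfo"), "Gender").as("Gender")
)
```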
A simpler way is to use the str_to_map function after removing the double quotes from the PersonalInfo column:
import org.apache.spark.sql.functions.expr

val df1 = df.withColumn(
  "PersonalInfo",
  // strip all " characters, then let str_to_map split on , and :
  expr("str_to_map(regexp_replace(PersonalInfo, '\"', ''))")
)
df1.show(false)
//+-------+-------+------------------------------------------------------------------------------+
//|Empcode|EmpName|PersonalInfo |
//+-------+-------+------------------------------------------------------------------------------+
//|1 |abc |{email -> abc@gmail.com, Location -> India, Gender -> Male} |
//|2 |xyz |{email -> xyz@gmail.com, Location -> US} |
//|3 |pqr |{email -> abc@gmail.com, Gender -> Female, Location -> Europe, Mobile -> 1234}|
//+-------+-------+------------------------------------------------------------------------------+
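As a mental model of what str_to_map does here once the quotes are gone, the same parsing can be sketched in plain Scala (the helper name strToMap is my own; Spark's real implementation runs on the executors):

```scala
// Plain-Scala illustration of str_to_map's default behaviour:
// split on "," into pairs, then on the first ":" into key and value.
def strToMap(s: String): Map[String, String] =
  s.split(",")
    .map { pair =>
      val Array(k, v) = pair.split(":", 2)
      k -> v
    }
    .toMap

val m = strToMap("email:abc@gmail.com,Location:India,Gender:Male")
// m("Location") == "India"
```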