Convert String to Map in Spark

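I have the data below in a csv file with delimiter |, and I want to convert the PersonalInfo column from String to Map so that I can extract the required information.

I tried converting the csv below to parquet format by casting the String column to Map, but I get a data type mismatch error.

Below is the data for your reference. Your help is much appreciated.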

Empcode EmpName PersonalInfo
1       abc     """email"":""abc@gmail.com"",""Location"":""India"",""Gender"":""Male"""
2       xyz     """email"":""xyz@gmail.com"",""Location"":""US"""
3       pqr     """email"":""abc@gmail.com"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234"""

Thanks

If you want to create a map from the PersonalInfo column, from Spark 3.0 onward you can proceed as follows:

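  • Split your string on "","" using the split function
  • For each element of the resulting array of strings, create a sub-array by splitting on "":"" with the split function
  • Remove all "" from the elements of the sub-arrays using the regexp_replace function
  • Build map entries using the struct function
  • Use map_from_entries to build the map from your array of entries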
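To see what each step does, the same sequence can be traced on a single value in plain Scala, without Spark. This is only a sketch: `ParsePersonalInfo` and `parse` are hypothetical names introduced for illustration, and the sample string is assumed to match the second row of the input dataframe.

```scala
object ParsePersonalInfo {
  // Hypothetical helper mirroring the steps above on one plain string.
  def parse(raw: String): Map[String, String] =
    raw
      .split("\"\",\"\"")                      // split pairs on "",""
      .map(_.split("\"\":\"\""))               // split each pair on "":""
      .map(_.map(_.replace("\"\"", "")))       // remove the leftover "" at the ends
      .collect { case Array(k, v) => k -> v }  // keep only well-formed key/value pairs
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumed sample: ""email"":""xyz@gmail.com"",""Location"":""US""
    val sample = "\"\"email\"\":\"\"xyz@gmail.com\"\",\"\"Location\"\":\"\"US\"\""
    println(parse(sample)) // prints Map(email -> xyz@gmail.com, Location -> US)
  }
}
```

The Spark version does the same work with column functions, so it runs on the executors instead of on one string at a time.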

The complete code is as follows:

import org.apache.spark.sql.functions.{col, map_from_entries, regexp_replace, split, struct, transform}

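// `data` is the input dataframe shown below
val result = data.withColumn("PersonalInfo",
  map_from_entries(
    transform(
      // split the string into key/value items on "",""
      split(col("PersonalInfo"), "\"\",\"\""),
      // split each item on "":"" and strip the leftover "" from key and value
      item => struct(
        regexp_replace(split(item, "\"\":\"\"")(0), "\"\"", ""),
        regexp_replace(split(item, "\"\":\"\"")(1), "\"\"", "")
      )
    )
  )
)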

With the following input dataframe:

+-------+-------+---------------------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo                                                                                 |
+-------+-------+---------------------------------------------------------------------------------------------+
|1      |abc    |""email"":""abc@gmail.com"",""Location"":""India"",""Gender"":""Male""                       |
|2      |xyz    |""email"":""xyz@gmail.com"",""Location"":""US""                                              |
|3      |pqr    |""email"":""abc@gmail.com"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234""|
+-------+-------+---------------------------------------------------------------------------------------------+

You get the following result dataframe:

+-------+-------+------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo                                                                  |
+-------+-------+------------------------------------------------------------------------------+
|1      |abc    |{email -> abc@gmail.com, Location -> India, Gender -> Male}                   |
|2      |xyz    |{email -> xyz@gmail.com, Location -> US}                                      |
|3      |pqr    |{email -> abc@gmail.com, Gender -> Female, Location -> Europe, Mobile -> 1234}|
+-------+-------+------------------------------------------------------------------------------+

A simple approach is to use the str_to_map function after removing the double quotes from the PersonalInfo column:

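import org.apache.spark.sql.functions.expr

// strip every double quote, then let str_to_map split pairs on ',' and keys from values on ':' (its defaults)
val df1 = df.withColumn(
  "PersonalInfo",
  expr("str_to_map(regexp_replace(PersonalInfo, '\"', ''))")
)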

df1.show(false)

//+-------+-------+------------------------------------------------------------------------------+
//|Empcode|EmpName|PersonalInfo                                                                  |
//+-------+-------+------------------------------------------------------------------------------+
//|1      |abc    |{email -> abc@gmail.com, Location -> India, Gender -> Male}                   |
//|2      |xyz    |{email -> xyz@gmail.com, Location -> US}                                      |
//|3      |pqr    |{email -> abc@gmail.com, Gender -> Female, Location -> Europe, Mobile -> 1234}|
//+-------+-------+------------------------------------------------------------------------------+
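
Once PersonalInfo is a map column, individual values can be read by key, which is what the question ultimately asks for. The following is a minimal sketch, not part of the original answers: `ExtractFromMap` and `extract` are hypothetical names, the small dataframe is a stand-in built by hand, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ExtractFromMap {
  // Hypothetical helper: reads two keys out of the PersonalInfo map column.
  def extract(df: DataFrame): DataFrame =
    df.select(
      col("Empcode"),
      col("PersonalInfo").getItem("email").as("email"),
      // with default (non-ANSI) settings, a missing key yields null
      col("PersonalInfo").getItem("Gender").as("Gender")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local run for illustration
      .appName("extract-from-map")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the map-typed dataframe produced above
    val withMap = Seq(
      (1, "abc", Map("email" -> "abc@gmail.com", "Location" -> "India", "Gender" -> "Male")),
      (2, "xyz", Map("email" -> "xyz@gmail.com", "Location" -> "US"))
    ).toDF("Empcode", "EmpName", "PersonalInfo")

    extract(withMap).show(false)
    spark.stop()
  }
}
```

The same lookup can be written as `col("PersonalInfo")("email")`; both forms resolve to a map key access on the column.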