How to drop multiple columns from a JSON body using Scala
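I have the following JSON structure in my DataFrame as the body attribute. I would like to drop multiple columns/attributes from Content based on a provided list. How can I do this in Scala?
Note that the list of attributes is variable in nature.
Suppose,
List of columns to drop: List(alias, firstName, lastName)
Input: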
"Content":{
"alias":"Jon",
"firstName":"Jonathan",
"lastName":"Mathew",
"displayName":"Jonathan Mathew",
"createdDate":"2021-08-10T13:06:35.866Z",
"updatedDate":"2021-08-10T13:06:35.866Z",
"isDeleted":false,
"address":"xx street",
"phone":"xxx90"
}
Output:
"Content":{
"displayName":"Jonathan Mathew",
"createdDate":"2021-08-10T13:06:35.866Z",
"updatedDate":"2021-08-10T13:06:35.866Z",
"isDeleted":false,
"address":"xx street",
"phone":"xxx90"
}
You can use drop to remove multiple columns at once:
val newDataframe = oldDataframe.drop("alias", "firstName", "lastName")
Documentation:
/**
* Returns a new Dataset with columns dropped.
* This is a no-op if schema doesn't contain column name(s).
*
* This method can only be used to drop top level columns. the colName string is treated literally
* without further interpretation.
*
* @group untypedrel
* @since 2.0.0
*/
@scala.annotation.varargs
def drop(colNames: String*): DataFrame
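Since the list of attributes in the question is variable, the names do not have to be hard-coded: a Scala collection can be expanded into the varargs parameter with `: _*`. The following is a minimal sketch, assuming a hypothetical flat DataFrame whose attributes are top-level columns (the sample row and the local SparkSession setup are illustrative and not from the question). As the documentation above states, this only affects top-level columns, so it will not remove fields nested inside a struct such as Content.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; spark-shell reuses its own session.
val spark = SparkSession.builder().master("local[*]").appName("drop-varargs").getOrCreate()
import spark.implicits._

// Hypothetical flat DataFrame: every attribute is a top-level column.
val oldDataframe = Seq(("Jon", "Jonathan", "Mathew", "Jonathan Mathew"))
  .toDF("alias", "firstName", "lastName", "displayName")

// The list can come from anywhere (config, arguments, ...); `: _*` expands it into varargs.
val colsToDrop = List("alias", "firstName", "lastName")
val newDataframe = oldDataframe.drop(colsToDrop: _*)

newDataframe.printSchema()
```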
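You can get the list of attributes from the DataFrame schema and then update the Content column by creating a struct that contains all attributes except those in the list of columns to drop.
Here is a complete working example: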
// Imports needed outside spark-shell (spark is an existing SparkSession)
import org.apache.spark.sql.functions.{col, struct}
import spark.implicits._
val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew","createdDate":"2021-08-10T13:06:35.866Z","updatedDate":"2021-08-10T13:06:35.866Z","isDeleted":false,"address":"xx street","phone":"xxx90"}}"""
val df = spark.read.json(Seq(jsonStr).toDS)
val attrToDrop = Seq("alias", "firstName", "lastName")
val contentAttrList = df.select("Content.*").columns
val df2 = df.withColumn(
  "Content",
  struct(
    contentAttrList
      .filter(!attrToDrop.contains(_))
      .map(c => col(s"Content.$c")): _*
  )
)
df2.printSchema
//root
// |-- Content: struct (nullable = false)
// |    |-- address: string (nullable = true)
// |    |-- createdDate: string (nullable = true)
// |    |-- displayName: string (nullable = true)
// |    |-- isDeleted: boolean (nullable = true)
// |    |-- phone: string (nullable = true)
// |    |-- updatedDate: string (nullable = true)
// |-- id: long (nullable = true)
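As an alternative sketch, assuming Spark 3.1 or later, Column.dropFields can remove nested fields directly, without listing the remaining attributes and rebuilding the struct by hand. The local SparkSession setup is an assumption to keep the snippet self-contained; in spark-shell the existing session is reused.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for a standalone run.
val spark = SparkSession.builder().master("local[*]").appName("drop-nested-fields").getOrCreate()
import spark.implicits._

val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew","createdDate":"2021-08-10T13:06:35.866Z","updatedDate":"2021-08-10T13:06:35.866Z","isDeleted":false,"address":"xx street","phone":"xxx90"}}"""
val df = spark.read.json(Seq(jsonStr).toDS)

// Variable list of nested attributes to remove from the Content struct.
val attrToDrop = Seq("alias", "firstName", "lastName")

// dropFields (Spark 3.1+) returns the struct column without the named fields.
val df2 = df.withColumn("Content", col("Content").dropFields(attrToDrop: _*))

df2.printSchema()
```

Both approaches leave the id column and the remaining Content fields in place; the struct-rebuilding version above also works on Spark versions before 3.1.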