How to drop multiple columns from a JSON body using Scala

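My DataFrame has the following JSON structure as its body attribute. I would like to drop multiple columns/attributes from the content based on a provided list. How can I do this in Scala?

Note that the list of attributes is variable in nature.

Suppose:

List of columns to drop: List(alias, firstName, lastName)

Input: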

  "Content":{
     "alias":"Jon",
     "firstName":"Jonathan",
     "lastName":"Mathew",
     "displayName":"Jonathan Mathew",
     "createdDate":"2021-08-10T13:06:35.866Z",
     "updatedDate":"2021-08-10T13:06:35.866Z",
     "isDeleted":false,
     "address":"xx street",
     "phone":"xxx90"
  }

Output:

  "Content":{
     "displayName":"Jonathan Mathew",
     "createdDate":"2021-08-10T13:06:35.866Z",
     "updatedDate":"2021-08-10T13:06:35.866Z",
     "isDeleted":false,
     "address":"xx street",
     "phone":"xxx90"
  }

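You can use drop to remove several columns at once. Note that, as the documentation quoted below states, it only works on top-level columns, so it will not remove fields nested inside the Content struct: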

val newDataframe = oldDataframe.drop("alias", "firstName", "lastName")

Documentation:

/**
   * Returns a new Dataset with columns dropped.
   * This is a no-op if schema doesn't contain column name(s).
   *
   * This method can only be used to drop top level columns. the colName string is treated literally
   * without further interpretation.
   *
   * @group untypedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def drop(colNames: String*): DataFrame 
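
Because drop is declared with varargs, a list whose contents are only known at runtime can be expanded with `: _*`. The following is a minimal sketch, assuming a local SparkSession and a hypothetical flat DataFrame named `people` in which the attributes are top-level columns. It does not reach fields nested inside a struct such as Content.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("drop-sketch").getOrCreate()
import spark.implicits._

// Hypothetical flat DataFrame whose attributes are top-level columns.
val people = Seq(("Jon", "Jonathan", "Mathew", "Jonathan Mathew"))
  .toDF("alias", "firstName", "lastName", "displayName")

// The list can be built at runtime; `: _*` expands it into the varargs parameter.
val colsToDrop = List("alias", "firstName", "lastName")
val trimmed = people.drop(colsToDrop: _*)

trimmed.printSchema()
```

Names in the list that are not present in the schema are ignored, which matches the "no-op" behaviour described in the documentation above.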

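You can get the list of attributes from the DataFrame schema, then update the Content column by building a struct that contains every attribute except those in your list of columns to drop.

Here is a complete working example: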

import org.apache.spark.sql.functions.{col, struct}
import spark.implicits._ // needed for .toDS; assumes a SparkSession named `spark`, as in spark-shell

val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew","createdDate":"2021-08-10T13:06:35.866Z","updatedDate":"2021-08-10T13:06:35.866Z","isDeleted":false,"address":"xx street","phone":"xxx90"}}"""

val df = spark.read.json(Seq(jsonStr).toDS)

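// Attributes to remove from the nested Content struct
val attrToDrop = Seq("alias", "firstName", "lastName")

// Names of all fields currently inside Content, read from the schema
val contentAttrList = df.select("Content.*").columns

// Rebuild Content as a struct of only the fields that are not in attrToDrop
val df2 = df.withColumn(
  "Content",
  struct(
    contentAttrList
      .filter(!attrToDrop.contains(_))
      .map(c => col(s"Content.$c")): _*
  )
)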

df2.printSchema
//root
// |-- Content: struct (nullable = false)
// |    |-- address: string (nullable = true)
// |    |-- createdDate: string (nullable = true)
// |    |-- displayName: string (nullable = true)
// |    |-- isDeleted: boolean (nullable = true)
// |    |-- phone: string (nullable = true)
// |    |-- updatedDate: string (nullable = true)
// |-- id: long (nullable = true)
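
As an alternative to rebuilding the struct by hand, Spark 3.1 and later provide `Column.dropFields`, which removes named fields from a struct column directly. The sketch below assumes Spark 3.1+ and a local SparkSession, and uses a shortened version of the sample JSON. The name `df3` is only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("dropFields-sketch").getOrCreate()
import spark.implicits._

// Shortened sample record with a nested Content struct.
val jsonStr = """{"id": 1,"Content":{"alias":"Jon","firstName":"Jonathan","lastName":"Mathew","displayName":"Jonathan Mathew","isDeleted":false}}"""
val df = spark.read.json(Seq(jsonStr).toDS)

val attrToDrop = Seq("alias", "firstName", "lastName")

// Column.dropFields (Spark 3.1+) returns the struct without the named fields.
val df3 = df.withColumn("Content", col("Content").dropFields(attrToDrop: _*))

df3.printSchema()
```

On older Spark versions this method is not available, so the struct-rebuilding approach shown above remains the portable option.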