Spark: get column names of nested JSON

Question

I am trying to get the column names from a nested JSON via DataFrames. The schema is as follows:

root
 |-- body: struct (nullable = true)
 |    |-- Sw1: string (nullable = true)
 |    |-- Sw2: string (nullable = true)
 |    |-- Sw3: string (nullable = true)
 |    |-- Sw420: string (nullable = true)
 |-- headers: struct (nullable = true)
 |    |-- endDate: string (nullable = true)
 |    |-- file: string (nullable = true)
 |    |-- startDate: string (nullable = true)

I can get the column names "body" and "headers" with df.columns(), but when I try to get the column names inside body (e.g. Sw1, Sw2, ...) with df.select("body").columns, it always gives me just the body column.

Any suggestions? :)

Answer 1

It's simple: df.select("body.Sw1", "body.Sw2")
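If the goal is the names themselves rather than the values, one option is to look up the `body` field in the schema and read its child field names. The following is a minimal sketch: the schema is rebuilt by hand to mirror the one in the question (an assumption, so that the snippet runs in a Spark shell without loading any data), and `bodyColumns` is an illustrative name. With a real DataFrame you would start from `df.schema` instead.

```scala
import org.apache.spark.sql.types._

// Hand-built stand-in for df.schema, mirroring the schema in the question.
val body = StructType(Seq("Sw1", "Sw2", "Sw3", "Sw420").map(n => StructField(n, StringType)))
val headers = StructType(Seq("endDate", "file", "startDate").map(n => StructField(n, StringType)))
val schema = StructType(Seq(StructField("body", body), StructField("headers", headers)))

// Look up the "body" field and, if it is a struct, list its child names.
val bodyColumns: Array[String] = schema("body").dataType match {
  case s: StructType => s.fieldNames
  case _             => Array.empty[String]
}

println(bodyColumns.mkString(", "))
```

With a live DataFrame, `df.select("body.*").columns` should give the same names, since `body.*` expands the struct's fields into top-level columns.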

Answer 2

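If the question is how to find the nested column names, you can do so by inspecting the DataFrame's schema. The schema is represented as a StructType, which can contain fields of other DataType objects (including other nested structs). If you want to discover all the fields, you have to walk this tree recursively. For example: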

import org.apache.spark.sql.types._
def findFields(path: String, dt: DataType): Unit = dt match {
  case s: StructType => 
    s.fields.foreach(f => findFields(path + "." + f.name, f.dataType))
  case other => 
    println(s"$path: $other")
}

This walks the tree and prints all the leaf fields and their types:

val df = sqlContext.read.json(sc.parallelize("""{"a": {"b": 1}}""" :: Nil))
findFields("", df.schema)

prints: .a.b: LongType
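If the names are needed as values rather than as printed output, the same recursion can return them. This is a sketch of that variation, not part of the original answer: `leafPaths` is a hypothetical helper name, and the schema is built by hand (equivalent to `{"a": {"b": 1}}`) so that no data has to be loaded.

```scala
import org.apache.spark.sql.types._

// Returns the dotted path of every leaf field instead of printing it.
// `prefix` is empty at the root, so no leading dot is produced.
def leafPaths(dt: DataType, prefix: String = ""): Seq[String] = dt match {
  case s: StructType =>
    s.fields.toSeq.flatMap { f =>
      val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
      leafPaths(f.dataType, path)
    }
  case _ => Seq(prefix)
}

// Hand-built schema equivalent to {"a": {"b": 1}}.
val schema = StructType(Seq(
  StructField("a", StructType(Seq(StructField("b", LongType))))
))

val paths = leafPaths(schema)
println(paths.mkString(", "))
```

Because the paths carry no leading dot, they can usually be passed to `df.select` as column references, provided the field names themselves contain no dots.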

Answer 3

To get the nested column names, use code like the following.

Call it from the main method like this:

findFields(df,df.schema)

The method:

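import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

def findFields(df: DataFrame, dt: DataType): Unit = {
  // Top-level fields of the schema
  val topLevelFields = dt.asInstanceOf[StructType].fields
  for (field <- topLevelFields) {
    // Throws a ClassCastException if this column is not a struct
    val innerFields = field.dataType.asInstanceOf[StructType].fields
    for (f <- innerFields) {
      println("Inner Columns of " + field.name + " -->>" + f.name)
    }
  }
}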

Note: this only works when the top-level columns are all struct types.
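One way around that limitation is to pattern-match on the field type instead of casting, so that non-struct columns are simply skipped. This is a sketch under assumptions, not the answer's own method: `innerColumns` is a hypothetical helper, and the schema is a made-up example with one struct column and one plain string column.

```scala
import org.apache.spark.sql.types._

// Maps each top-level struct column to the names of its direct children.
// Columns that are not structs are skipped instead of failing on a cast.
def innerColumns(schema: StructType): Map[String, Seq[String]] =
  schema.fields.collect {
    case StructField(name, inner: StructType, _, _) => name -> inner.fieldNames.toSeq
  }.toMap

// Hypothetical schema: "body" is a struct, "id" is a plain string.
val schema = StructType(Seq(
  StructField("body", StructType(Seq(
    StructField("Sw1", StringType),
    StructField("Sw2", StringType)
  ))),
  StructField("id", StringType)
))

val result = innerColumns(schema)
println(result)
```

Like the method above, this looks only one level down; deeper nesting still needs a recursive walk.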

Answer 4

If the nested JSON contains an array of StructType, the code below can be used (it is an extension of the code given by Michael Armbrust):

import org.apache.spark.sql.types._

def findFields(path: String, dt: DataType): Unit = dt match {
  case s: StructType => 
    s.fields.foreach(f => findFields(path + "." + f.name, f.dataType))
  case s: ArrayType => 
    findFields(path, s.elementType)
  case other => 
    println(s"$path")
}

