PySpark

Question

I want to convert a list of objects into rows and store their attributes as columns.

{
  "heading": 1,
  "columns": [
    {
      "col1": "a",
      "col2": "b",
      "col3": "c"
    },
    {
      "col1": "d",
      "col2": "e",
      "col3": "f"
    }
  ]
}

Desired result

heading | col1 | col2 | col3
1       | a    | b    | c
1       | d    | e    | f

I am currently flattening my data (which excludes the "columns" column):

df = target_table.relationalize('roottable', temp_path)

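However, for this use case I need the "columns" column. I have seen examples that use arrays_zip and explode. Do I need to iterate over each object, or is there a simpler way to extract each object and turn it into a row?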

Answer 1

Using the Spark SQL built-in function inline or inline_outer is probably the easiest way to handle this (use inline_outer when columns may be NULL):

From the Apache Hive documentation:

Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top level elements in the struct), one row per struct from the array. (As of Hive 0.10.)

df.selectExpr('heading', 'inline_outer(columns)').show()
+-------+----+----+----+
|heading|col1|col2|col3|
+-------+----+----+----+
|      1|   a|   b|   c|
|      1|   d|   e|   f|
+-------+----+----+----+

PySpark - Convert list of JSON objects to rows

pyspark-sql