PySpark: Write CSV from JSON file with struct column

Question

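I am reading a .json file with the structure shown below, and I need to generate a CSV with the data in columnar form. I know I can't write an array-type object directly to a CSV. I used the explode function to extract the fields I need and keep them in columnar form, but I get an error when I use explode and then write the DataFrame to CSV. From what I understand, it isn't possible to do this with two variables in the same select. Can anyone suggest an alternative?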

from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[1]")
    .appName("sample")
    .getOrCreate())

df = (spark.read.option("multiline", "true")
    .json("data/origin/crops.json"))

# Failing attempt: two explode calls in one select, one of them on the struct column
df2 = (df.select(explode('history').alias('history'), explode('trial').alias('trial'))
    .select('history.started_at', 'history.finished_at', col('id'), 'trial.is_trial', 'trial.ws10_max'))

(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))

root
 |-- history: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- finished_at: string (nullable = true)
 |    |    |-- started_at: string (nullable = true)
 |-- id: long (nullable = true)
 |-- trial: struct (nullable = true)
 |    |-- is_trial: boolean (nullable = true)
 |    |-- ws10_max: double (nullable = true)

I am trying to get a result like this:

started_at	finished_at	is_trial	ws10_max
First	row	row
Second	row	row

Thanks!

Answer 1

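Use explode on the array and select("struct.*") on the struct. explode only accepts array or map columns, so it cannot be applied to the struct column trial, and Spark allows only one generator such as explode per select clause.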

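from pyspark.sql.functions import explode

# Explode the array first, then expand both structs into top-level columns
(df.select("trial", "id", explode('history').alias('history'))
   .select('id', 'history.*', 'trial.*'))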
