Databricks Delta Lake - Reading data from JSON file

Question

I am currently learning Databricks and using a combination of Python (PySpark) and SQL for data transformations.

At the moment I have a JSON file in the following format:

{
    "issuccess": true,
    "jobProcess": "usersList",
    "data": {
        "members": [
            {
                "id": "bot1",     
                "name": "databot",
                "active": true,
                "profile": {
                    "title": "Test Bot",
                    "phone": "1234"
                 },
                 "is_mailbox_active": true
             },
             {
                ....
             }
         ]
     }
}

I am able to load this data into a temporary view with the following Python (PySpark) logic:

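# Read the multi-line JSON file into a DataFrame
usersData = spark \
               .read \
               .option("multiLine", True) \
               .option("mode", "PERMISSIVE") \
               .json(r"C:\Test\data.json")  # raw string keeps the backslashes literal

# createOrReplaceTempView returns None, so it is called separately
usersData.createOrReplaceTempView("vw_TestView")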

In the above, the vw_TestView data is in struct format:

Column      DataType
issuccess   boolean
jobProcess  string
data        struct<members:array<struct<id:string, ....>

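As output, I only need to select/display the members data from the array in the 'data' column, in the correct format.

Running select * from the view is expected to return a 'results too large....' error. Also, since I will eventually need to select specific content from the 'data' column, how do I build an appropriate select query for the above view?

The select query output must look like this: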

id    name     profile
bot1  databot  { "title": "Test Bot","phone": "1234"}
bot2  userbot  { "title": "User Bot","phone": "7890"}

How can this be achieved?

I tried running

%sql select data.members.* from vw_TestView

but this is not supported for the data type of the 'data.members' column, and it fails with the following error message:

Can only star expand struct data types. ..........

Answer 1

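The problem is that members is an array, and only struct types can be star-expanded. In this case you need to perform the following operations:

1. Select the members field, which is nested under data, using select("data.members")
2. Explode the members array with the explode function, which produces one row per array element
3. Extract the data from the underlying struct

Like this: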

select col.* from (select explode(data.members) as col from vw_TestView)
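
To make the two steps in this query concrete, here is a plain-Python analogy (not Spark code) of what explode followed by col.* does to one row of the view. The names row, explode_members and flat_rows are made up for illustration, and the sample is trimmed to the id, name and profile fields shown in the question.

```python
import json

# Trimmed copy of the sample document from the question
# (bot2 values are taken from the expected output table).
row = {
    "issuccess": True,
    "jobProcess": "usersList",
    "data": {
        "members": [
            {"id": "bot1", "name": "databot",
             "profile": {"title": "Test Bot", "phone": "1234"}},
            {"id": "bot2", "name": "userbot",
             "profile": {"title": "User Bot", "phone": "7890"}},
        ]
    },
}


def explode_members(view_row):
    """Mimic explode(data.members) followed by col.* for a single row."""
    for member in view_row["data"]["members"]:  # explode: one output row per element
        yield dict(member)                      # col.*: struct fields become columns


flat_rows = list(explode_members(row))
for r in flat_rows:
    print(r["id"], r["name"], json.dumps(r["profile"]))
```

The star expansion fails on data.members directly because it is an array; after explode, each element is a single struct, which is what col.* can expand.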

P.S. All of this can also be done directly in PySpark.
