How to generate a pyarrow schema for dynamic values

I am trying to write a Parquet schema for my JSON messages, which need to be written back to a GCS bucket using apache_beam.

My JSON looks like this:

data = {
    "name": "user_1",
    "result": [
        {
            "subject": "maths",
            "marks": 99
        },
        {
            "subject": "science",
            "marks": 76
        }
    ],
    "section": "A"
}

The "result" array in the example above can hold any number of entries, with a minimum of one.

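This is the schema you need. A pa.list_ type does not fix the number of elements, so the same schema covers a "result" array of any length: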

import pyarrow as pa

schema = pa.schema(
    [
        pa.field("name", pa.string()),
        pa.field(
            "result",
            pa.list_(
                pa.struct(
                    [
                        pa.field("subject", pa.string()),
                        pa.field("marks", pa.int32()),
                    ]
                )
            ),
        ),
        pa.field("section", pa.string()),
    ]
)

If your file contains one record per line:

{"name": "user_1", "result": [{"subject": "maths", "marks": 99}, {"subject": "science", "marks": 76}], "section": "A"}
{"name": "user_2", "result": [{"subject": "maths", "marks": 10}, {"subject": "science", "marks": 75}], "section": "A"}

You can load it with:

from pyarrow import json as pa_json
table = pa_json.read_json('filename.json', parse_options=pa_json.ParseOptions(explicit_schema=schema))