How to generate a pyarrow schema for dynamic values
I am trying to write a Parquet schema for my JSON message, which needs to be written back to a GCS bucket using apache_beam.
My JSON looks like this:
data = {
    "name": "user_1",
    "result": [
        {
            "subject": "maths",
            "marks": 99
        },
        {
            "subject": "science",
            "marks": 76
        }
    ],
    "section": "A"
}
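The result array in the example above can hold any number of entries, with a minimum of one.
Here is the schema you need. A pyarrow list type does not fix the number of elements, so it covers a result array of any length: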
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("name", pa.string()),
        pa.field(
            "result",
            pa.list_(
                pa.struct(
                    [
                        pa.field("subject", pa.string()),
                        pa.field("marks", pa.int32()),
                    ]
                )
            ),
        ),
        pa.field("section", pa.string()),
    ]
)
If your file contains one record per line:
{"name": "user_1", "result": [{"subject": "maths", "marks": 99}, {"subject": "science", "marks": 76}], "section": "A"}
{"name": "user_2", "result": [{"subject": "maths", "marks": 10}, {"subject": "science", "marks": 75}], "section": "A"}
You can load it with:
from pyarrow import json as pa_json
table = pa_json.read_json('filename.json', parse_options=pa_json.ParseOptions(explicit_schema=schema))