How to convert a 10G JSON file to Avro?
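I have a JSON file of about 10 GB. Each line contains exactly one JSON document. What is the best way to convert it to Avro? Ideally, I would like to keep several documents (say 10M) per file. I believe Avro supports storing multiple documents in the same file.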
You should be able to use the Avro tools' fromjson command (see here for more information and examples). You'll probably want to split your file into 10M chunks beforehand (for example using split(1)).
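If split(1) is not available, the splitting step can be sketched in Python. This is a minimal sketch, not part of avro-tools: the function name `split_jsonl`, the `lines_per_chunk` parameter and the chunk naming scheme are all assumptions made for illustration. It cuts the input on line boundaries only, so no JSON document is ever broken across two chunks.

```python
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_chunk):
    """Split a line-delimited JSON file into chunks of at most lines_per_chunk lines.

    Returns the chunk paths in order. The naming scheme
    (<stem>.part00000.json, ...) is an arbitrary choice for this sketch.
    """
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    out = None
    try:
        # Binary mode copies each line untouched and streams the input,
        # so memory use depends on the longest line, not on the file size.
        with src.open("rb") as infile:
            for line_number, line in enumerate(infile):
                if line_number % lines_per_chunk == 0:
                    # Start a new chunk every lines_per_chunk lines.
                    if out is not None:
                        out.close()
                    path = out_dir / f"{src.stem}.part{len(chunks):05d}.json"
                    out = path.open("wb")
                    chunks.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunks
```

Calling `split_jsonl("test.json", "chunks", 10_000_000)` would then produce one chunk per ten million documents, assuming "10M" in the question means a document count rather than a size in bytes.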
The easiest way to convert a large JSON file to Avro is to use avro-tools from the Avro website. Once you have created a simple schema, you can convert the file directly:
java -jar avro-tools-1.7.7.jar fromjson --schema-file cpc.avsc --codec deflate test.1g.json > test.1g.deflate.avro
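When there are many input files, the same command can be driven from a small script. The sketch below only wraps the command line shown above; the helper names `fromjson_command` and `convert` are hypothetical, and it assumes `java` is on the PATH and the avro-tools jar has already been downloaded.

```python
import subprocess


def fromjson_command(jar, schema, src, codec="deflate"):
    """Build the argument list for the avro-tools fromjson call shown above."""
    return [
        "java", "-jar", str(jar), "fromjson",
        "--schema-file", str(schema),
        "--codec", codec,
        str(src),
    ]


def convert(jar, schema, src, dst, codec="deflate"):
    """Run avro-tools fromjson on src and write its standard output to dst.

    Raises subprocess.CalledProcessError if avro-tools exits with an error.
    """
    with open(dst, "wb") as out:
        subprocess.run(
            fromjson_command(jar, schema, src, codec),
            stdout=out,
            check=True,
        )
```

For example, `convert("avro-tools-1.7.7.jar", "cpc.avsc", "test.1g.json", "test.1g.deflate.avro")` should be equivalent to the shell command above, with the redirection handled by Python instead of the shell.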
Example schema:
{
"type": "record",
"name": "cpc_schema",
"namespace": "com.streambright.avro",
"fields": [{
"name": "section",
"type": "string",
"doc": "Section of the CPC"
}, {
"name": "class",
"type": "string",
"doc": "Class of the CPC"
}, {
"name": "subclass",
"type": "string",
"doc": "Subclass of the CPC"
}, {
"name": "main_group",
"type": "string",
"doc": "Main-group of the CPC"
}, {
"name": "subgroup",
"type": "string",
"doc": "Subgroup of the CPC"
}, {
"name": "classification_value",
"type": "string",
"doc": "Classification value of the CPC"
}, {
"name": "doc_number",
"type": "string",
"doc": "Patent doc_number"
}, {
"name": "updated_at",
"type": "string",
"doc": "Document update time"
}],
"doc": "A basic schema for CPC codes"
}
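Before converting 10 GB of data, it may be worth checking a sample of lines against the schema, since a line with a missing field or a non-string value is likely to make the conversion fail partway through. The following is a rough sketch that only understands flat record schemas whose fields are all of type "string", like the one above; the function names `find_bad_fields` and `first_bad_line` are made up for this example and are not part of Avro.

```python
import json


def find_bad_fields(schema, record):
    """Return the names of schema fields that are missing from record or not strings.

    Only flat record schemas with "string" fields are supported by this sketch.
    """
    bad = []
    for field in schema["fields"]:
        if field["type"] != "string":
            raise ValueError(f"unsupported field type: {field['type']!r}")
        if not isinstance(record.get(field["name"]), str):
            bad.append(field["name"])
    return bad


def first_bad_line(schema, lines):
    """Return (line_number, bad_fields) for the first offending line, or None.

    lines is any iterable of JSON strings, for example an open file.
    Line numbers start at 1.
    """
    for line_number, line in enumerate(lines, start=1):
        bad = find_bad_fields(schema, json.loads(line))
        if bad:
            return line_number, bad
    return None
```

With the schema saved as cpc.avsc, a check over the first lines of the input could look like `first_bad_line(json.load(open("cpc.avsc")), itertools.islice(open("test.1g.json"), 1000))`, which needs `import itertools` as well.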