Avro schema not respecting alias in schema definition
Avro schema schema.avsc:
{
  "namespace": "standard",
  "type": "record",
  "name": "agent",
  "aliases": ["agents"],
  "fields": [
    {
      "name": "id",
      "type": ["string", "null"]
    },
    {
      "name": "name",
      "type": ["string", "null"],
      "aliases": ["title", "nickname"]
    }
  ]
}
Python script main.py:
from fastavro import writer, reader
from fastavro.schema import load_schema
import jsonlines

schema = load_schema('schema.avsc')
avro_data = 'agent.avro'
data = jsonlines.open('data.jsonl')

# Write the JSON Lines records to an Avro file
with open(avro_data, 'wb') as fout:
    writer(fout, schema, data, validator=True)

# Read the records back with the same schema
with open(avro_data, 'rb') as fin:
    for i in reader(fin, schema):
        print(i)
When my JSON Lines file data.jsonl looks like this:
{"id":"1","name":"foo"}
{"id":"2","name":"bar"}
my Python script returns:
{'id': '1', 'name': 'foo'}
{'id': '2', 'name': 'bar'}
However, if my JSON Lines file data.jsonl looks like this:
{"id":"1","title":"foo"}
{"id":"2","title":"bar"}
my Python script returns:
{'id': '1', 'name': None}
{'id': '2', 'name': None}
Any idea why the name column is not respecting the aliases attribute I defined for that field in the Avro schema file?
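Aliases are used when you have data written with an old schema that you want to read with a new schema. Your example uses only one schema for both writing and reading, so the aliases never come into play. The title key in your input does not match any field name in the schema at write time, which is why, as your output shows, name ends up as None.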
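To make the idea concrete, here is a rough pure-Python sketch of what reader-side alias matching amounts to. It is a simplified illustration, not fastavro's actual implementation; the function name resolve_record is hypothetical, and falling back to a default of None for unmatched fields is an assumption made to keep the sketch short.

```python
# Simplified illustration of reader-side alias matching.
# resolve_record is a hypothetical helper, not part of fastavro.

def resolve_record(written_record, reader_fields):
    """Map a record keyed by the writer's field names onto the reader's fields."""
    resolved = {}
    for field in reader_fields:
        # The reader accepts its own field name or any of its declared aliases.
        candidates = [field["name"], *field.get("aliases", [])]
        for key in candidates:
            if key in written_record:
                resolved[field["name"]] = written_record[key]
                break
        else:
            # Assumption: fall back to the field's default (None if absent).
            resolved[field["name"]] = field.get("default")
    return resolved


# Fields of the reader ("new") schema, with title declared as an alias of name.
new_fields = [
    {"name": "id", "type": ["string", "null"]},
    {"name": "name", "type": ["string", "null"], "aliases": ["title"]},
]

# A record as it was written with the "old" schema, which has a title field.
old_record = {"id": "1", "title": "foo"}

result = resolve_record(old_record, new_fields)
print(result)  # {'id': '1', 'name': 'foo'}
```

The point is that the matching happens between two schemas at read time. Nothing comparable happens between your input dictionaries and a single schema at write time.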
Let's use the following two schemas in the example. Here is the "old" schema, which uses a title field:
old_schema.avsc
{
  "namespace": "standard",
  "type": "record",
  "name": "agent",
  "aliases": ["agents"],
  "fields": [
    {
      "name": "id",
      "type": ["string", "null"]
    },
    {
      "name": "title",
      "type": ["string", "null"]
    }
  ]
}
And here is the new schema, in which the new name field declares the old title field as an alias:
new_schema.avsc
{
  "namespace": "standard",
  "type": "record",
  "name": "agent",
  "aliases": ["agents"],
  "fields": [
    {
      "name": "id",
      "type": ["string", "null"]
    },
    {
      "name": "name",
      "type": ["string", "null"],
      "aliases": ["title"]
    }
  ]
}
If we use your second data.jsonl, which looks like this:
{"id":"1","title":"foo"}
{"id":"2","title":"bar"}
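then we can use a slightly modified version of your main.py that writes the data with the old schema and then passes the new schema to reader so that the aliases are respected: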
from fastavro import writer, reader
from fastavro.schema import load_schema
import jsonlines

old_schema = load_schema('old_schema.avsc')
new_schema = load_schema('new_schema.avsc')
avro_data = 'agent.avro'
data = jsonlines.open('data.jsonl')

# Data is written with the old schema
with open(avro_data, 'wb') as fout:
    writer(fout, old_schema, data, validator=True)

# And read with the new schema (the writer's schema is stored in the file header)
with open(avro_data, 'rb') as fin:
    for i in reader(fin, new_schema):
        print(i)
Now the output is correct:
{'id': '1', 'name': 'foo'}
{'id': '2', 'name': 'bar'}
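If you really have only one schema and the input file itself uses an alias as the key, another option is to rename the keys yourself before the records reach writer. This is only a sketch of that workaround, not a fastavro feature: build_alias_map and normalize_keys are hypothetical helper names, and the JSON lines are inlined here instead of being read from data.jsonl so that the snippet runs on its own.

```python
import json

# Hypothetical helpers; fastavro does not provide these.

def build_alias_map(schema):
    """Return {alias: field_name} for every field of a record schema."""
    alias_map = {}
    for field in schema["fields"]:
        for alias in field.get("aliases", []):
            alias_map[alias] = field["name"]
    return alias_map


def normalize_keys(record, alias_map):
    """Rename any key that is a known alias to its schema field name."""
    return {alias_map.get(key, key): value for key, value in record.items()}


# The single schema from the question, as a Python dict.
schema = {
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "aliases": ["agents"],
    "fields": [
        {"name": "id", "type": ["string", "null"]},
        {"name": "name", "type": ["string", "null"],
         "aliases": ["title", "nickname"]},
    ],
}

# Stand-in for the lines of the second data.jsonl.
lines = ['{"id":"1","title":"foo"}', '{"id":"2","title":"bar"}']

alias_map = build_alias_map(schema)
records = [normalize_keys(json.loads(line), alias_map) for line in lines]
print(records)  # [{'id': '1', 'name': 'foo'}, {'id': '2', 'name': 'bar'}]
```

The resulting records list could then be passed to writer in place of the jsonlines reader. Note that a record containing both name and title would have one value overwrite the other, so this only suits inputs that use exactly one of the keys.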