Avro schema not respecting alias in schema definition

Avro schema, schema.avsc:

{
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "aliases":["agents"],
    "fields": [
        {
            "name": "id",
            "type": ["string", "null"]
        },
        {
            "name": "name",
            "type": ["string", "null"],
            "aliases":["title", "nickname"]
        }
    ]
}

Python script, main.py:

from fastavro import writer, reader
from fastavro.schema import load_schema
import jsonlines

schema = load_schema('schema.avsc')
avro_data = 'agent.avro'
data = jsonlines.open('data.jsonl')

with open(avro_data, 'wb') as fout:
    writer(fout, schema, data, validator=True)

with open(avro_data, 'rb') as fin:
    for i in reader(fin, schema):
        print(i)

When my JSON Lines file data.jsonl looks like this:

{"id":"1","name":"foo"}
{"id":"2","name":"bar"}

my Python script returns:

{'id': '1', 'name': 'foo'}
{'id': '2', 'name': 'bar'}

However, if my JSON Lines file data.jsonl looks like this:

{"id":"1","title":"foo"}
{"id":"2","title":"bar"}

my Python script returns:

{'id': '1', 'name': None}
{'id': '2', 'name': None}

Any idea why the name column is not respecting the aliases attribute I defined for that field in the Avro schema file?

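Aliases are used when data was written with an old schema and you want to read it with a new schema. Your example uses only one schema for both writing and reading, so the aliases never come into play.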
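To make the idea concrete, here is a simplified pure-Python sketch of how a reader might match its own fields against the fields the data was written with. It is an illustration only, not fastavro's actual implementation; resolve_record and the two field lists are hypothetical names.

```python
def resolve_record(record, writer_fields, reader_fields):
    """Map a record written with writer_fields onto reader_fields, honoring aliases."""
    resolved = {}
    for field in reader_fields:
        # Candidate names: the reader's own field name first, then its aliases.
        candidates = [field["name"], *field.get("aliases", [])]
        match = next((n for n in candidates if n in writer_fields), None)
        if match is not None:
            resolved[field["name"]] = record[match]
        else:
            # Simplification: a real Avro reader signals an error when the
            # field is missing and no default is declared.
            resolved[field["name"]] = field.get("default")
    return resolved


# Field names the data was written with (the "old" schema).
writer_fields = ["id", "title"]
# Fields the reader expects (the "new" schema), including aliases.
reader_fields = [
    {"name": "id"},
    {"name": "name", "aliases": ["title", "nickname"]},
]

print(resolve_record({"id": "1", "title": "foo"}, writer_fields, reader_fields))
# {'id': '1', 'name': 'foo'}
```

The point is that the aliases are consulted only while comparing two sets of field names. When the same schema is used on both sides, there is nothing for an alias to match.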

Let's use the following two schemas in an example. Here is an "old" schema that uses the title field:

old_schema.avsc

{
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "aliases":["agents"],
    "fields": [
        {
            "name": "id",
            "type": ["string", "null"]
        },
        {
            "name": "title",
            "type": ["string", "null"]
        }
    ]
}

And here is a new schema, where we want the new name field to be an alias of the old title field:

new_schema.avsc

{
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "aliases":["agents"],
    "fields": [
        {
            "name": "id",
            "type": ["string", "null"]
        },
        {
            "name": "name",
            "type": ["string", "null"],
            "aliases":["title"]
        }
    ]
}

If we use your second data.jsonl, which looks like this:

{"id":"1","title":"foo"}
{"id":"2","title":"bar"}

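Then we can use a slightly modified version of your main.py that writes the data with the old schema and passes the new schema to reader, so that the aliases are respected: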

from fastavro import writer, reader
from fastavro.schema import load_schema
import jsonlines

old_schema = load_schema('old_schema.avsc')
new_schema = load_schema('new_schema.avsc')
avro_data = 'agent.avro'
data = jsonlines.open('data.jsonl')

# Data is written with the old schema
with open(avro_data, 'wb') as fout:
    writer(fout, old_schema, data, validator=True)

# And read with new schema
with open(avro_data, 'rb') as fin:
    for i in reader(fin, new_schema):
        print(i)

Now the output is correct:

{'id': '1', 'name': 'foo'}
{'id': '2', 'name': 'bar'}
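If the goal is instead to keep a single schema and accept input rows keyed by an alias, one option is to rename the keys yourself before calling writer, since the output in the question shows the writer does not do this for you. A minimal sketch, assuming the schema is available as a plain dict shaped like schema.avsc (for example via json.load); apply_field_aliases is a hypothetical helper, not a fastavro function:

```python
def apply_field_aliases(record, schema):
    """Return a copy of record with aliased keys renamed to the schema's field names."""
    renamed = dict(record)
    for field in schema["fields"]:
        name = field["name"]
        if name in renamed:
            continue  # the canonical name is already present, leave it alone
        for alias in field.get("aliases", []):
            if alias in renamed:
                # Move the value from the alias key to the canonical key.
                renamed[name] = renamed.pop(alias)
                break
    return renamed


# Same structure as schema.avsc from the question.
schema = {
    "namespace": "standard",
    "type": "record",
    "name": "agent",
    "fields": [
        {"name": "id", "type": ["string", "null"]},
        {"name": "name", "type": ["string", "null"], "aliases": ["title", "nickname"]},
    ],
}

rows = [{"id": "1", "title": "foo"}, {"id": "2", "title": "bar"}]
for row in rows:
    print(apply_field_aliases(row, schema))
# {'id': '1', 'name': 'foo'}
# {'id': '2', 'name': 'bar'}
```

The renamed rows could then be passed to writer with the single schema. This is plain preprocessing of the input dicts rather than Avro schema resolution, so it only makes sense when the input keys vary and the stored data should always use the canonical field names.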