Avro: multiple records of the same type in a single schema

I would like to use the same record type multiple times in an Avro schema. Consider this schema definition:

{
    "type": "record",
    "name": "OrderBook",
    "namespace": "my.types",
    "doc": "Test order update",
    "fields": [
        {
            "name": "bids",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        },
        {
            "name": "asks",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        }
    ]
}

This is not a valid Avro schema, and the Avro schema parser fails with:

org.apache.avro.SchemaParseException: Can't redefine: my.types.OrderBookVolume

I can work around this by making the type unique, moving OrderBookVolume into two different namespaces:

{
    "type": "record",
    "name": "OrderBook",
    "namespace": "my.types",
    "doc": "Test order update",
    "fields": [
        {
            "name": "bids",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types.bid",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        },
        {
            "name": "asks",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types.ask",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        }
    ]
}

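This is not a workable solution, because Avro code generation then produces two different classes. That is very annoying if I want to use the type for anything beyond serialization and deserialization.

This problem is related to this issue: Avro Spark issue #73

That issue added differentiation of nested records with the same name by prepending the outer record name to the namespace. Their use case may be purely storage-related, so it may work for them but not for us.

Does anybody know a better solution? Is this a hard limitation of Avro?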

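It is not well documented, but Avro allows you to reference a previously defined name by using its fully qualified name (namespace plus name). In your case, the following schema results in only one class being generated, referenced by each array. It also DRYs up the schema nicely.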

{
    "type": "record",
    "name": "OrderBook",
    "namespace": "my.types",
    "doc": "Test order update",
    "fields": [
        {
            "name": "bids",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types.bid",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        },
        {
            "name": "asks",
            "type": {
                "type": "array",
                "items": "my.types.bid.OrderBookVolume"
            }
        }
    ]
}

As the spec states:

A schema or protocol may not contain multiple definitions of a fullname.
Further, a name must be defined before it is used ("before" in the
depth-first, left-to-right traversal of the JSON parse tree, where the
types attribute of a protocol is always deemed to come "before" the
messages attribute.)
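
To make the rule concrete, here is a minimal Python sketch of that depth-first, left-to-right walk. It is a toy model, not Avro's actual parser: the function `check_names`, the helper `order_book`, and the simplified handling (records, arrays, primitives, and string references only; no unions, maps, enums, or aliases) are assumptions made for illustration.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def check_names(schema, namespace=None, defined=None):
    """Walk a schema depth-first, left to right; return the set of defined fullnames.

    Raises ValueError if a fullname is defined twice or used before it is defined.
    """
    if defined is None:
        defined = set()
    if isinstance(schema, str):
        if schema in PRIMITIVES:
            return defined
        # A reference: a bare name is resolved against the enclosing namespace.
        fullname = schema if "." in schema or not namespace else f"{namespace}.{schema}"
        if fullname not in defined:
            raise ValueError(f"Undefined name: {fullname}")
        return defined
    kind = schema["type"]
    if kind == "array":
        return check_names(schema["items"], namespace, defined)
    if kind == "record":
        name = schema["name"]
        if "." in name:
            # A dotted name is already a fullname; it also sets the namespace.
            fullname, namespace = name, name.rsplit(".", 1)[0]
        else:
            namespace = schema.get("namespace", namespace)
            fullname = f"{namespace}.{name}" if namespace else name
        if fullname in defined:
            raise ValueError(f"Can't redefine: {fullname}")
        defined.add(fullname)
        for field in schema["fields"]:
            check_names(field["type"], namespace, defined)
        return defined
    # Object form wrapping a primitive or a reference, e.g. {"type": "double"}.
    return check_names(kind, namespace, defined)


VOLUME = {
    "type": "record",
    "name": "OrderBookVolume",
    "fields": [{"name": "price", "type": "double"}, {"name": "volume", "type": "double"}],
}


def order_book(asks_items):
    """Build the OrderBook schema with a configurable item type for 'asks'."""
    return {
        "type": "record", "name": "OrderBook", "namespace": "my.types",
        "fields": [
            {"name": "bids", "type": {"type": "array", "items": VOLUME}},
            {"name": "asks", "type": {"type": "array", "items": asks_items}},
        ],
    }


# Referencing the earlier definition by fullname passes the check.
print(sorted(check_names(order_book("my.types.OrderBookVolume"))))

# Repeating the full definition is rejected.
try:
    check_names(order_book(VOLUME))
except ValueError as err:
    print(err)
```

In this model the first call prints both fullnames, and the second prints `Can't redefine: my.types.OrderBookVolume`, mirroring the message from the question.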

For example:

{
    "type": "record",
    "namespace": "my.types",
    "name": "OrderBook",
    "fields": [
        {
            "name": "bids",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "fields": [
                        {"name": "price", "type": "double"},
                        {"name": "volume", "type": "double"}
                    ]
                }
            }
        },
        {
            "name": "asks",
            "type": {
                "type": "array",
                "items": "my.types.OrderBookVolume"
            }
        }
    ]
}

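The first occurrence is the full schema of OrderBookVolume. After that, it can be referenced by its fullname: my.types.OrderBookVolume.

It is also worth noting that you do not need to set a namespace on every record. A record inherits the namespace of its parent; specifying one explicitly overrides the inherited namespace.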
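
The inheritance rule can be sketched in a few lines of Python. The helper `fullname` and its parameter names are hypothetical, written only to illustrate how the spec derives a fullname from a name, an optional own namespace, and the enclosing namespace; this is not code from the Avro library.

```python
def fullname(name, namespace=None, enclosing=None):
    """Derive an Avro-style fullname for a named type (illustrative only)."""
    if "." in name:
        # A dotted name is already a fullname; any namespace given is ignored.
        return name
    # An explicit namespace wins; otherwise inherit from the enclosing schema.
    ns = namespace if namespace is not None else enclosing
    return f"{ns}.{name}" if ns else name


# Inherited from the parent record's namespace.
print(fullname("OrderBookVolume", enclosing="my.types"))
# An explicit namespace overrides the inherited one.
print(fullname("OrderBookVolume", namespace="my.types.bid", enclosing="my.types"))
# A dotted name stands on its own.
print(fullname("my.types.OrderBookVolume", namespace="ignored", enclosing="other"))
```

The three calls correspond to the cases discussed above: the nested record without a namespace, the record with `my.types.bid` set explicitly, and a reference written as a fullname.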