Apache Avro generates wrong Avro schema from Java POJO with @AvroSchema

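I have a simple POJO with date fields that will be stored as Avro in storage before being imported into Google BigQuery. The dates are converted to long, and I am trying to use @AvroSchema to override the schema generation for the date fields so that BigQuery understands what type the fields are.

The simple POJO: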

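import java.io.Serializable;
import org.apache.avro.reflect.AvroSchema;

public class SomeAvroMessage implements Serializable {
    // Override the reflected schema so the long is tagged as a timestamp.
    @AvroSchema("{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}")
    private long tm;
    @AvroSchema("{\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}")
    private long created;

    public SomeAvroMessage() {
    }
}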

This results in the following Avro schema:

{"type":"record","name":"SomeAvroMessage",
"namespace":"some.namespace",
"fields":[
      {"name":"tm","type":{"type":"long","logicalType":"timestamp-millis"}},
      {"name":"created","type":{"type":"long","logicalType":"timestamp-millis"}}
]}

These fields look wrong to me; I expected just {"name":"tm","type":"long","logicalType":"timestamp-millis"}

This is used in Google Dataflow, written in Java with Apache Beam 2.22.

Am I missing something?

{"name":"tm","type":{"type":"long","logicalType":"timestamp-millis"}} is correct. If we expand it into clearer pseudocode, it looks like this:

Field {
  name: "tm",
  type: Schema {
    type: "long",
    logicalType: "timestamp-millis"
  }
}

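As you can see, the field has a name and a type. The type of an Avro field must itself be an Avro schema. The logicalType attribute lives inside that schema, not next to it at the field level.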
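To make the nesting concrete, here is a small stdlib-only sketch. ToySchema and ToyField are hypothetical stand-ins invented for this illustration, not Avro classes; they only model where each attribute ends up when a field is rendered as JSON.

```java
public class LogicalTypeNesting {

    /** Hypothetical stand-in for an Avro schema: a base type plus an optional logicalType. */
    static final class ToySchema {
        final String type;
        final String logicalType; // null when the schema carries no logical type

        ToySchema(String type, String logicalType) {
            this.type = type;
            this.logicalType = logicalType;
        }

        String toJson() {
            if (logicalType == null) {
                // A plain primitive can be written in shorthand as just its name.
                return "\"" + type + "\"";
            }
            // Extra attributes require the object form of the schema.
            return "{\"type\":\"" + type + "\",\"logicalType\":\"" + logicalType + "\"}";
        }
    }

    /** Hypothetical stand-in for a record field: a name plus the schema of its value. */
    static final class ToyField {
        final String name;
        final ToySchema schema;

        ToyField(String name, ToySchema schema) {
            this.name = name;
            this.schema = schema;
        }

        String toJson() {
            // The field's "type" slot holds the whole schema, attributes included.
            return "{\"name\":\"" + name + "\",\"type\":" + schema.toJson() + "}";
        }
    }

    public static void main(String[] args) {
        ToyField tm = new ToyField("tm", new ToySchema("long", "timestamp-millis"));
        // prints {"name":"tm","type":{"type":"long","logicalType":"timestamp-millis"}}
        System.out.println(tm.toJson());

        ToyField count = new ToyField("count", new ToySchema("long", null));
        // prints {"name":"count","type":"long"}
        System.out.println(count.toJson());
    }
}
```

The first line of output has the same shape as the generated fields in the question: because logicalType belongs to the schema, it can only appear inside the object that fills the field's type slot.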

The Avro documentation describes logical types the same way:

A logical type is an Avro primitive or complex type with extra attributes to represent a derived type. The attribute logicalType must always be present for a logical type, and is a string with the name of one of the logical types listed later in this section. Other attributes may be defined for particular logical types.

The documentation also gives an example of the date type in an Avro schema:

{
  "type": "int",
  "logicalType": "date"
}

In short, your schema is correct, and whenever you need a logical type you can build your schema the same way.