BigQueryIO.Write withJsonSchema 序列化
BigQueryIO.Write Serialization of withJsonSchema
我的 Beam 管道中有一个 BigQueryIO.Write 阶段,它是通过调用 .withJsonSchema(String)
:
构建的
inputStream.apply(
"save-to-bigquery",
BigQueryIO.<Event>write()
.withJsonSchema(jsonSchema)
.to((ValueInSingleWindow<Event> input) ->
new TableDestination(
"table_name$" + PARTITION_SELECTOR.print(input.getValue().getMetadata().getTimestamp()),
null)
)
.withFormatFunction((ConsumerApiRequest event) ->
new TableRow()
.set("id", event.getMetadata().getUuid())
.set("insertId", event.getMetadata().getUuid())
.set("account_id", event.getAccountId())
...
.set("timestamp", ISODateTimeFormat.dateHourMinuteSecondMillis()
.print(event.getMetadata().getTimestamp())))
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
);
我是 运行 通过 DataflowRunner
执行此阶段的,我收到以下错误:
java.lang.IllegalArgumentException:
com.google.api.client.json.JsonParser.parseValue(JsonParser.java:889)
com.google.api.client.json.JsonParser.parse(JsonParser.java:382)
com.google.api.client.json.JsonParser.parse(JsonParser.java:336)
com.google.api.client.json.JsonParser.parse(JsonParser.java:312)
com.google.api.client.json.JsonFactory.fromString(JsonFactory.java:187)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers.fromJsonString(BigQueryHelpers.java:156)
org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinationsHelpers$ConstantSchemaDestinations.getSchema(DynamicDestinationsHelpers.java:163)
org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinationsHelpers$ConstantSchemaDestinations.getSchema(DynamicDestinationsHelpers.java:150)
org.apache.beam.sdk.io.gcp.bigquery.CreateTables.processElement(CreateTables.java:103)
Caused by: java.lang.IllegalArgumentException: expected collection or array type but got class com.google.api.services.bigquery.model.TableSchema
com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:69)
com.google.api.client.json.JsonParser.parseValue(JsonParser.java:723)
com.google.api.client.json.JsonParser.parse(JsonParser.java:382)
com.google.api.client.json.JsonParser.parse(JsonParser.java:336)
com.google.api.client.json.JsonParser.parse(JsonParser.java:312)
com.google.api.client.json.JsonFactory.fromString(JsonFactory.java:187)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers.fromJsonString(BigQueryHelpers.java:156)
org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinationsHelpers$ConstantSchemaDestinations.getSchema(DynamicDestinationsHelpers.java:163)
org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinationsHelpers$ConstantSchemaDestinations.getSchema(DynamicDestinationsHelpers.java:150)
org.apache.beam.sdk.io.gcp.bigquery.CreateTables.processElement(CreateTables.java:103)
org.apache.beam.sdk.io.gcp.bigquery.CreateTables$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:233)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:48)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
com.google.cloud.dataflow.worker.SimpleParDoFn.output(SimpleParDoFn.java:183)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:211)
org.apache.beam.runners.core.SimpleDoFnRunner.access0(SimpleDoFnRunner.java:66)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:436)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:424)
org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite.processElement(PrepareWrite.java:62)
org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite$DoFnInvoker.invokeProcessElement(Unknown Source)
.....
似乎 JSON 在管道创建/序列化时被正确读取,但在执行时反序列化的 JSON 表示被传递以代替 JSON 字符串。我通过 Guava Resources
class:
读取资源文件生成 JSON 字符串
String jsonSchema;
try {
jsonSchema = Resources.toString(Resources.getResource("path_to_json_schema"), Charsets.UTF_8);
} catch (IOException e) {
throw new RuntimeException("Failed to load JSON schema", e);
}
我该如何解决这个序列化问题?
查看引发异常的代码,这似乎是 JSON 解析失败 - 您的 JSON 模式很可能格式不正确。根据 the documentation,它应该看起来像这样:
{
"fields": [
{
"name": string,
"type": string,
"mode": string,
"fields": [
(TableFieldSchema)
],
"description": string
}
]
}
例如:
{
"fields": [
{
"name": "foo",
"type": "INTEGER"
},
{
"name": "bar",
"type": "STRING",
}
]
}
查看失败的 JSON 解析器代码,我怀疑您缺少外部 {"fields": ...}
并且您的 JSON 字符串仅包含 [...]
.
我的 Beam 管道中有一个 BigQueryIO.Write 阶段,它是通过调用 .withJsonSchema(String)
:
inputStream.apply(
"save-to-bigquery",
BigQueryIO.<Event>write()
.withJsonSchema(jsonSchema)
.to((ValueInSingleWindow<Event> input) ->
new TableDestination(
"table_name$" + PARTITION_SELECTOR.print(input.getValue().getMetadata().getTimestamp()),
null)
)
.withFormatFunction((ConsumerApiRequest event) ->
new TableRow()
.set("id", event.getMetadata().getUuid())
.set("insertId", event.getMetadata().getUuid())
.set("account_id", event.getAccountId())
...
.set("timestamp", ISODateTimeFormat.dateHourMinuteSecondMillis()
.print(event.getMetadata().getTimestamp())))
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
);
我是 运行 通过 DataflowRunner
执行此阶段的,我收到以下错误:
java.lang.IllegalArgumentException:
com.google.api.client.json.JsonParser.parseValue(JsonParser.java:889)
com.google.api.client.json.JsonParser.parse(JsonParser.java:382)
com.google.api.client.json.JsonParser.parse(JsonParser.java:336)
com.google.api.client.json.JsonParser.parse(JsonParser.java:312)
com.google.api.client.json.JsonFactory.fromString(JsonFactory.java:187)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers.fromJsonString(BigQueryHelpers.java:156)
org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinationsHelpers$ConstantSchemaDestinations.getSchema(DynamicDestinationsHelpers.java:163)
org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinationsHelpers$ConstantSchemaDestinations.getSchema(DynamicDestinationsHelpers.java:150)
org.apache.beam.sdk.io.gcp.bigquery.CreateTables.processElement(CreateTables.java:103)
Caused by: java.lang.IllegalArgumentException: expected collection or array type but got class com.google.api.services.bigquery.model.TableSchema
com.google.api.client.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
com.google.api.client.util.Preconditions.checkArgument(Preconditions.java:69)
com.google.api.client.json.JsonParser.parseValue(JsonParser.java:723)
com.google.api.client.json.JsonParser.parse(JsonParser.java:382)
com.google.api.client.json.JsonParser.parse(JsonParser.java:336)
com.google.api.client.json.JsonParser.parse(JsonParser.java:312)
com.google.api.client.json.JsonFactory.fromString(JsonFactory.java:187)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers.fromJsonString(BigQueryHelpers.java:156)
org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinationsHelpers$ConstantSchemaDestinations.getSchema(DynamicDestinationsHelpers.java:163)
org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinationsHelpers$ConstantSchemaDestinations.getSchema(DynamicDestinationsHelpers.java:150)
org.apache.beam.sdk.io.gcp.bigquery.CreateTables.processElement(CreateTables.java:103)
org.apache.beam.sdk.io.gcp.bigquery.CreateTables$DoFnInvoker.invokeProcessElement(Unknown Source)
org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:177)
org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:141)
com.google.cloud.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:233)
com.google.cloud.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:48)
com.google.cloud.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
com.google.cloud.dataflow.worker.SimpleParDoFn.output(SimpleParDoFn.java:183)
org.apache.beam.runners.core.SimpleDoFnRunner.outputWindowedValue(SimpleDoFnRunner.java:211)
org.apache.beam.runners.core.SimpleDoFnRunner.access0(SimpleDoFnRunner.java:66)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:436)
org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.output(SimpleDoFnRunner.java:424)
org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite.processElement(PrepareWrite.java:62)
org.apache.beam.sdk.io.gcp.bigquery.PrepareWrite$DoFnInvoker.invokeProcessElement(Unknown Source)
.....
似乎 JSON 在管道创建/序列化时被正确读取,但在执行时反序列化的 JSON 表示被传递以代替 JSON 字符串。我通过 Guava Resources
class:
String jsonSchema;
try {
jsonSchema = Resources.toString(Resources.getResource("path_to_json_schema"), Charsets.UTF_8);
} catch (IOException e) {
throw new RuntimeException("Failed to load JSON schema", e);
}
我该如何解决这个序列化问题?
查看引发异常的代码,这似乎是 JSON 解析失败 - 您的 JSON 模式很可能格式不正确。根据 the documentation,它应该看起来像这样:
{
"fields": [
{
"name": string,
"type": string,
"mode": string,
"fields": [
(TableFieldSchema)
],
"description": string
}
]
}
例如:
{
"fields": [
{
"name": "foo",
"type": "INTEGER"
},
{
"name": "bar",
"type": "STRING",
}
]
}
查看失败的 JSON 解析器代码,我怀疑您缺少外部 {"fields": ...}
并且您的 JSON 字符串仅包含 [...]
.