Retrieving BigQuery validation errors when loading JSONL data via the Python API

How can I retrieve more information about validation errors when loading a JSONL file into BigQuery? (The question is about how to retrieve the error details, not about how to fix this particular error.)

Example code:

import logging

from google.cloud.bigquery import (
    LoadJobConfig,
    QueryJobConfig,
    Client,
    SourceFormat,
    WriteDisposition
)

LOGGER = logging.getLogger(__name__)

# variables depending on the environment
filename = '...'
gcp_project_id = '...'
dataset_name = '...'
table_name = '...'
schema = [ ... ]

# loading data
client = Client(project=gcp_project_id)
dataset_ref = client.dataset(dataset_name)
table_ref = dataset_ref.table(table_name)
job_config = LoadJobConfig()
job_config.source_format = SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = WriteDisposition.WRITE_APPEND
job_config.schema = schema
LOGGER.info('loading from %s', filename)
with open(filename, "rb") as source_file:
    job = client.load_table_from_file(
        source_file, destination=table_ref, job_config=job_config
    )

    # Waits for the load job to complete
    job.result()

Here I am using bigquery-schema-generator to generate the schema (otherwise BigQuery's schema auto-detection would only look at the first 100 rows).
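
As an illustration, the generated schema file could be plugged into the load configuration roughly like this (a sketch, assuming a reasonably recent google-cloud-bigquery version and that the generator's output was saved as schema.json; that filename is only illustrative):

from google.cloud import bigquery

client = bigquery.Client(project=gcp_project_id)

# parse the schema file produced by bigquery-schema-generator into
# a list of SchemaField objects
schema = client.schema_from_json('schema.json')

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema=schema,
)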

Running this may then fail with the following error message (a google.api_core.exceptions.BadRequest):

400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.

The errors property provides basically no new information:

[{'reason': 'invalid',
  'message': 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'}]

I also looked at the exception's __dict__, but it does not reveal anything further.
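
Roughly, the inspection looked like this (a sketch; job and the load configuration are as in the example above):

from google.api_core.exceptions import BadRequest

try:
    job.result()
except BadRequest as exc:
    print(exc.errors)     # same generic 'JSON table encountered too many errors' entry
    print(exc.__dict__)   # nothing beyond what is already in the message and errors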

Trying to load the table using the bq command line (in this case without an explicit schema) produces a more useful message:

BigQuery error in load operation: Error processing job '...': Provided Schema does not match Table <table name>. Field <field name> has changed type from TIMESTAMP to DATE

My question now is: how can I retrieve such a useful message via the Python API?

Solution based on the accepted answer

This is a copy-and-paste workaround that can be added so that the more detailed information is shown by default (it may have downsides).

import google.cloud.exceptions
import google.cloud.bigquery.job


def get_improved_bad_request_exception(
    job: google.cloud.bigquery.job.LoadJob
) -> google.cloud.exceptions.BadRequest:
    # collect the detailed error entries from the finished job and build a
    # BadRequest whose message contains all of them
    errors = job.errors
    result = google.cloud.exceptions.BadRequest(
        '; '.join([error['message'] for error in errors]),
        errors=errors
    )
    result._job = job
    return result


def wait_for_load_job(
    job: google.cloud.bigquery.job.LoadJob
):
    try:
        job.result()
    except google.cloud.exceptions.BadRequest as exc:
        # re-raise with the more detailed errors taken from the job itself
        raise get_improved_bad_request_exception(job) from exc

Calling wait_for_load_job(job) instead of job.result() directly will then result in a more useful exception (both in its error message and in its errors property), for example:
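
Reusing client, table_ref and job_config from the question above, the load could be wrapped like this (a minimal sketch):

with open(filename, "rb") as source_file:
    job = client.load_table_from_file(
        source_file, destination=table_ref, job_config=job_config
    )

# raises the improved BadRequest with the detailed error messages on failure
wait_for_load_job(job)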

To be able to display a more useful error message, you can import google.api_core.exceptions.BadRequest to catch the exception, and then use the LoadJob errors attribute to get the detailed error messages from the job.

from google.api_core.exceptions import BadRequest
...
...
try:
    load_job.result()  # Waits for the job to complete.
except BadRequest:
    for error in load_job.errors:
        print(error["message"])  # each error is a dictionary

To test this, I used the sample code from BQ load json data and changed the input file to produce an error. In the file, I changed the value of "post_abbr" from a string to an array.

File used:

{"name": "Alabama", "post_abbr": "AL"}
{"name": "Alaska", "post_abbr":  "AK"}
{"name": "Arizona", "post_abbr": [65,2]}

After applying the code snippet above, see the output below. The last error message shows the actual error about "post_abbr" receiving an array for a non-repeated field.

Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 3; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 78: Array specified for non-repeated field: post_abbr.