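Retrieving BigQuery validation errors when loading JSONL data via the Python API
How can I retrieve more information about validation errors when loading a JSONL file into BigQuery? (The question is about retrieving the error details, not about fixing the underlying problem.)
Example code: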
import logging

from google.cloud.bigquery import (
    Client,
    LoadJobConfig,
    SourceFormat,
    WriteDisposition,
)

LOGGER = logging.getLogger(__name__)

# variables depending on the environment
filename = '...'
gcp_project_id = '...'
dataset_name = '...'
table_name = '...'
schema = [ ... ]

# loading data
client = Client(project=gcp_project_id)
dataset_ref = client.dataset(dataset_name)
table_ref = dataset_ref.table(table_name)
job_config = LoadJobConfig()
job_config.source_format = SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = WriteDisposition.WRITE_APPEND
job_config.schema = schema
LOGGER.info('loading from %s', filename)
with open(filename, "rb") as source_file:
    job = client.load_table_from_file(
        source_file, destination=table_ref, job_config=job_config
    )
    # Wait for the load job to complete; raises on failure
    job.result()
Here I am using bigquery-schema-generator to generate the schema (otherwise BigQuery only looks at the first 100 rows).
Running this may fail with the following error message (google.api_core.exceptions.BadRequest):
400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Looking at the errors attribute provides essentially no new information:
[{'reason': 'invalid',
'message': 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'}]
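I also looked at the __dict__ of the exception, but that did not reveal any further information.
Trying to load the table with the bq command line (without an explicit schema in this case) produces a much more helpful message: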
BigQuery error in load operation: Error processing job '...': Provided Schema does not match Table <table name>. Field <field name> has changed type from TIMESTAMP to DATE
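My question now is: how can I retrieve such a helpful message from the Python API?
Solution based on the accepted answer
This is a copy-and-paste workaround that can be added to show more information by default. (It may have downsides.)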
import google.cloud.exceptions
import google.cloud.bigquery.job


def get_improved_bad_request_exception(
    job: google.cloud.bigquery.job.LoadJob
) -> google.cloud.exceptions.BadRequest:
    errors = job.errors
    result = google.cloud.exceptions.BadRequest(
        '; '.join([error['message'] for error in errors]),
        errors=errors
    )
    result._job = job
    return result


def wait_for_load_job(
    job: google.cloud.bigquery.job.LoadJob
):
    try:
        job.result()
    except google.cloud.exceptions.BadRequest as exc:
        raise get_improved_bad_request_exception(job) from exc
Calling wait_for_load_job(job) instead of calling job.result() directly then raises a more useful exception (both the error message and the errors attribute).
To show a more helpful error message, you can import google.api_core.exceptions.BadRequest to catch the exception, and then use the LoadJob attribute errors to get the detailed error messages from the job.
from google.api_core.exceptions import BadRequest
...
...
try:
    load_job.result()  # Waits for the job to complete.
except BadRequest:
    for error in load_job.errors:
        print(error["message"])  # error is of type dictionary
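For testing, I used the sample code BQ load json data and changed the input file to produce an error. In the file, I changed the value of "post_abbr" from a string to an array value.
File used: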
{"name": "Alabama", "post_abbr": "AL"}
{"name": "Alaska", "post_abbr": "AK"}
{"name": "Arizona", "post_abbr": [65,2]}
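With the snippet above applied, the output is shown below. The last error message shows the actual error: "post_abbr" received an array for a non-repeated field.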
Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 3; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 78: Array specified for non-repeated field: post_abbr.
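Since the useful detail is the last entry while the first two are generic summaries, it can help to reorder the messages before logging them. The following is only a minimal sketch: the helper name summarize_load_errors and the "too many errors" substring heuristic are assumptions made for illustration, not part of the BigQuery client library. It works on a plain list of dictionaries shaped like load_job.errors, so it runs without a BigQuery connection.

```python
from typing import Iterable, List, Mapping, Optional


def summarize_load_errors(
    errors: Optional[Iterable[Mapping[str, str]]]
) -> List[str]:
    """Return de-duplicated error messages, specific ones first.

    Hypothetical helper: `errors` is expected to look like
    `load_job.errors`, i.e. dictionaries with a "message" key.
    """
    seen = set()
    messages = []
    for error in errors or []:
        message = error.get("message", "")
        if message and message not in seen:
            seen.add(message)
            messages.append(message)
    # Assumption: generic summary messages contain "too many errors".
    specific = [m for m in messages if "too many errors" not in m]
    generic = [m for m in messages if "too many errors" in m]
    return specific + generic


# Messages copied from the output above.
sample_errors = [
    {"reason": "invalid", "message": "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 3; errors: 1. Please look into the errors[] collection for more details."},
    {"reason": "invalid", "message": "Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0"},
    {"reason": "invalid", "message": "Error while reading data, error message: JSON parsing error in row starting at position 78: Array specified for non-repeated field: post_abbr."},
]

for line in summarize_load_errors(sample_errors):
    print(line)
```

In the except BadRequest branch, summarize_load_errors(load_job.errors) could replace the plain loop, so the row-level parsing error is printed before the generic summaries.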