BigQuery Python client - meaning of timeout parameter, and how to set query result timeout

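This question is about the timeout parameter of the result method of QueryJob objects in the BigQuery Python client.

The meaning of timeout seems to have changed relative to version 1.24.0.

For example, the documentation for QueryJob's result in version 1.24.0 describes timeout as: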

The number of seconds to wait for the underlying HTTP transport before using retry. If multiple requests are made under the hood, timeout is interpreted as the approximate total time of all requests.

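As I understand it, this could be used to limit the total time that a call to result waits for the results.

For example, consider the following script: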

import logging

from google.cloud import bigquery

# Set logging level to DEBUG in order to see the HTTP requests
# being made by urllib3
logging.basicConfig(level=logging.DEBUG)

PROJECT_ID = "project_id" # replace by actual project ID

client = bigquery.Client(project=PROJECT_ID)

QUERY = ('SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
        'WHERE state = "TX" '
        'LIMIT 100')
TIMEOUT = 30  # in seconds
query_job = client.query(QUERY)  # API request - starts the query
assert query_job.state == 'RUNNING'

# Waits for the query to finish
iterator = query_job.result(timeout=TIMEOUT)
rows = list(iterator)

assert query_job.state == 'DONE'

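As I understand it, if all the API calls involved in fetching the results add up to more than 30 seconds, the call to result gives up. So timeout here limits the total execution time of the result call.

Later versions, however, introduced a change. For example, the documentation for result in 1.27.2 describes timeout as: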

The number of seconds to wait for the underlying HTTP transport before using retry. If multiple requests are made under the hood, timeout applies to each individual request.

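If I understand this correctly, the meaning of the example above changes completely, and the call to result could take more than 30 seconds.

My questions are:

  1. What is the difference if I run the script above with the new version of result compared with the old version?
  2. What is the currently recommended use case for passing a timeout value to result?
  3. What is the currently recommended way to time out after a total amount of time spent waiting for query results?

Thanks.

As you can see in the fix: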

A transport layer timeout is made independent of the query timeout, i.e. the maximum time to wait for the query to complete.

The query timeout is used by the blocking poll so that the backend does not block for too long when polling for job completion, but the transport can have different timeout requirements, and we do not want it to be raising sometimes unnecessary timeout errors.

  • Apply timeout to each of the underlying requests

As job methods do not split the timeout anymore between all requests a method might make, the Client methods are adjusted in the same way.

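So the basic difference is that, in the previous version, if several requests were made in the underlying layers, they shared the 30-second timeout. In other words, if the first request took 20 seconds, the second one would time out after 10 seconds. In the new version, each request gets its own 30 seconds.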
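To make the difference concrete, here is a small pure-Python model of the two behaviours. It does not call BigQuery at all; the function name `simulate` and the request durations are made up for illustration, and the real client's internals are more involved than this.

```python
def simulate(durations, timeout, shared_budget):
    """Return the index of the first request that would time out, or None.

    durations: seconds each hypothetical request takes (illustrative values).
    shared_budget=True models one timeout budget split across all requests
    (the older behaviour); False models the timeout applying to each
    request separately (the newer behaviour).
    """
    remaining = timeout
    for index, duration in enumerate(durations):
        allowed = remaining if shared_budget else timeout
        if duration > allowed:
            return index
        if shared_budget:
            remaining -= duration
    return None


durations = [20, 15]  # first request takes 20 s, second takes 15 s
print(simulate(durations, 30, shared_budget=True))   # 1: second request exceeds the 10 s left
print(simulate(durations, 30, shared_budget=False))  # None: each request fits within its own 30 s
```

With a per-request timeout, the total wall-clock time in this model is 35 seconds even though timeout is 30.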

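As for use cases, it basically depends on your application. If you can't afford to wait a long time for a request that may have been lost, you can reduce the timeout.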
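If what you need is a cap on the total time spent waiting, one possible approach is to poll the job yourself against a deadline. The sketch below is not an official recommendation: `wait_with_deadline` is a hypothetical helper, and it only assumes that the job object exposes `done()` and `cancel()`, as QueryJob does.

```python
import time


def wait_with_deadline(job, total_timeout, poll_interval=1.0):
    """Poll job.done() until the job finishes or total_timeout seconds pass.

    Hypothetical helper: `job` is any object with done() and cancel().
    Each done() call may itself take some time, so the cap is approximate.
    """
    deadline = time.monotonic() + total_timeout
    while not job.done():
        if time.monotonic() >= deadline:
            job.cancel()  # best-effort request to stop the job
            raise TimeoutError(
                "job did not finish within %s seconds" % total_timeout
            )
        # Sleep until the next poll, but never past the deadline.
        time.sleep(min(poll_interval, max(0.0, deadline - time.monotonic())))
    return job
```

With the script from the question, this could be used as `rows = list(wait_with_deadline(client.query(QUERY), TIMEOUT).result())`. Fetching the rows after the job is done still issues requests of its own, which this helper does not bound.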