BigQuery Python client - meaning of timeout parameter, and how to set query result timeout
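This question is about the timeout parameter of the result method of QueryJob objects in the BigQuery Python client.
The meaning of timeout seems to have changed relative to version 1.24.0.
For example, the documentation for QueryJob's result in version 1.24.0 states that timeout is: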
The number of seconds to wait for the underlying HTTP transport before using retry. If multiple requests are made under the hood, timeout is interpreted as the approximate total time of all requests.
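As I understand it, this could be used to limit the total time that a call to result waits for the results.
For example, consider the following script: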
import logging
from google.cloud import bigquery
# Set logging level to DEBUG in order to see the HTTP requests
# being made by urllib3
logging.basicConfig(level=logging.DEBUG)
PROJECT_ID = "project_id" # replace by actual project ID
client = bigquery.Client(project=PROJECT_ID)
QUERY = ('SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
'WHERE state = "TX" '
'LIMIT 100')
TIMEOUT = 30 # in seconds
query_job = client.query(QUERY) # API request - starts the query
assert query_job.state == 'RUNNING'
# Waits for the query to finish
iterator = query_job.result(timeout=TIMEOUT)
rows = list(iterator)
assert query_job.state == 'DONE'
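As I understand it, if all the API calls involved in fetching the results add up to more than 30 seconds, the call to result gives up. So here timeout limits the total execution time of the result method call.
Later versions, however, introduced a change. For example, the documentation for result in 1.27.2 states that timeout is: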
The number of seconds to wait for the underlying HTTP transport before using retry. If multiple requests are made under the hood, timeout applies to each individual request.
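If I understand this correctly, the example above changes its meaning completely, and the call to result could potentially take more than 30 seconds.
My questions are:
- If I run the script above with the new version of result, what is the difference compared with the old version?
- What is the currently recommended use case for passing a timeout value to result?
- What is the currently recommended way to time out after a total time spent waiting for the results of a query?
Thanks.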
As you can see in the fix:
A transport layer timeout is made independent of the query timeout,
i.e. the maximum time to wait for the query to complete.
The query timeout is used by the blocking poll so that the backend
does not block for too long when polling for job completion, but the
transport can have different timeout requirements, and we do not want
it to be raising sometimes unnecessary timeout errors.
- Apply timeout to each of the underlying requests
As job methods do not split the timeout anymore between all requests a
method might make, the Client methods are adjusted in the same way.
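So the basic difference is that in the previous version, if several requests were made in the layer below, they shared the 30-second timeout. In other words, if the first request took 20 seconds, the second request would time out after 10 seconds.
In the new version, each request gets its own 30 seconds.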
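To make the difference concrete, here is a small simulation of the two behaviours. It does not call BigQuery at all; the helper name per_request_timeouts and the request durations are made up for illustration, and the model is only a rough approximation of what the client does internally.

```python
def per_request_timeouts(durations, timeout, shared):
    """Return the timeout each simulated request would be given.

    durations: how long each simulated request takes, in seconds.
    shared=True models the older behaviour (one budget split across requests).
    shared=False models the newer behaviour (full timeout for every request).
    """
    remaining = timeout
    allowed = []
    for duration in durations:
        limit = remaining if shared else timeout
        allowed.append(limit)
        if duration > limit:
            break  # this request would hit its timeout
        if shared:
            remaining -= duration  # later requests get what is left
    return allowed


# Two hypothetical requests taking 20 s and 15 s, with timeout=30.
old = per_request_timeouts([20, 15], timeout=30, shared=True)
new = per_request_timeouts([20, 15], timeout=30, shared=False)
print(old)  # [30, 10]
print(new)  # [30, 30]
```

Under the shared budget the second request only gets the 10 seconds left over, while under the per-request model it gets the full 30 seconds, so the whole call can run longer than 30 seconds.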
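As for the use case, it basically depends on your application. If you cannot afford to wait long for a request that may have been lost, you can reduce the timeout.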
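If what you need is a cap on the total waiting time, one possible approach (a sketch, not an officially recommended method) is to poll for completion yourself against a deadline. The helper wait_until_done below is hypothetical and takes any zero-argument callable, so it can be tried without a BigQuery project; with a real job you could pass query_job.done, assuming it behaves as in the client versions discussed here.

```python
import time


def wait_until_done(is_done, total_timeout, poll_interval=1.0):
    """Poll is_done() until it returns True or total_timeout seconds elapse.

    Returns True if the work finished in time, False otherwise.
    The limit is approximate: a single is_done() call may itself take
    a while (for a real job it is an HTTP request).
    """
    deadline = time.monotonic() + total_timeout
    while True:
        if is_done():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))


# Usage with a real job (not executed here):
#
#     query_job = client.query(QUERY)
#     if wait_until_done(query_job.done, total_timeout=30):
#         rows = list(query_job.result())
#     else:
#         query_job.cancel()  # optional: ask the backend to stop the job
```

The timeout passed to result can then stay a per-request transport timeout, while the overall limit is enforced by the loop.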