Connecting to AWS Athena through JDBC with PyCharm - fetchSize issue

I have connected to AWS Athena using PyCharm Professional. The connection succeeds, but whenever I run a query I get:

The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.

I downloaded the Athena JDBC driver from the AWS Athena JDBC documentation.

What could be the problem?

I think you should set a suitable value for the fetch size in DataGrip's settings.

One thing to consider regarding fetch size, JDBC, and AWS Athena: there appears to be a semi-documented but well-known limit of 1000 rows per fetch. I know that the popular PyAthenaJDBC library sets it as its default fetch size, so that may be part of your problem.

I can reproduce the fetch size error when I try to fetch more than 1000 rows at a time:

from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://SOMEBUCKET/',
               region_name='us-east-1')
cur = conn.cursor()

# With the default fetch size (1000), this works fine.
cur.execute('SELECT * FROM SOMEDATABASE.big_table LIMIT 5000')
results = cur.fetchall()
print(len(results))

# Note: the cursor class actually guards arraysize with a setter to
#       keep users from setting illegal fetch sizes, so we bypass it
#       by writing to the private attribute directly.
cur._arraysize = 1001  # one greater than the default limit
cur.execute('SELECT * FROM athena_test.big_table LIMIT 5000')
results = cur.fetchall()  # raises the fetch size error

java.sql.SQLExceptionPyRaisable: java.sql.SQLException: The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.
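One way to stay under the limit, assuming the standard DB-API 2.0 cursor interface that PyAthenaJDBC implements, is to leave the fetch size at its default and page through the result set with fetchmany(). A minimal sketch:

from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://SOMEBUCKET/',
               region_name='us-east-1')
cur = conn.cursor()
cur.execute('SELECT * FROM SOMEDATABASE.big_table LIMIT 5000')

rows = []
while True:
    # fetchmany() defaults to cur.arraysize (1000 here), which stays
    # within Athena's per-fetch limit; an empty batch means we're done.
    batch = cur.fetchmany()
    if not batch:
        break
    rows.extend(batch)
print(len(rows))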

Possible solutions include:

  1. Run the query in the web GUI, then download the result set manually.
  2. Develop your query in the editor/IDE of your choice (DataGrip, the Athena web GUI, etc.) and pass the query string to Athena through the Python SDK, then wait for the query to complete and fetch the result set.
  3. Execute the query and paginate over the results (see the boto3 sketch after this list).
  4. If you are calling your SQL from Python (which I infer from the PyCharm tag), you can use a library like PyAthenaJDBC to handle the page size for you (see the example above).
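For option 3, here is a minimal pagination sketch using boto3. The execution ID is a hypothetical placeholder (in practice it comes from an earlier start_query_execution call, as in the workflow below), and the 1000-row page size mirrors Athena's per-call cap for GetQueryResults:

import boto3

client = boto3.client('athena')

# Hypothetical ID; in practice this comes from start_query_execution.
execution_id = 'YOUR-QUERY-EXECUTION-ID'

# The get_query_results paginator follows NextToken for us.
paginator = client.get_paginator('get_query_results')

rows = []
for page in paginator.paginate(QueryExecutionId=execution_id,
                               PaginationConfig={'PageSize': 1000}):
    # Note: the first row of the first page is the column header row.
    rows.extend(page['ResultSet']['Rows'])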

For many of my Python scripts, I use a workflow similar to the following.

import boto3
import time

sql = 'SELECT * FROM athena_test.big_table'

database = 'SOMEDATABASE'
bucket_name = 'SOMEBUCKET'
output_path = '/home/zerodf/temp/somedata.csv'

client = boto3.client('athena')
config = {'OutputLocation': 's3://' + bucket_name + '/',
          'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}}

execution_results = client.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={'Database': database},
    ResultConfiguration=config)

# Athena writes the result set to S3 as <QueryExecutionId>.csv
execution_id = execution_results['QueryExecutionId']
remote_file = execution_id + '.csv'

# Poll until the query finishes; bail out if it fails or is cancelled,
# otherwise this loop would spin forever.
while True:
    query_execution_results = client.get_query_execution(
        QueryExecutionId=execution_id)
    state = query_execution_results['QueryExecution']['Status']['State']
    if state == 'SUCCEEDED':
        break
    elif state in ('FAILED', 'CANCELLED'):
        raise RuntimeError('Query %s ended in state %s' % (execution_id, state))
    time.sleep(60)

s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file(remote_file, output_path)

Obviously, production code is more complicated.