Python odbc cursor: keeping persistent state after executing a query

Question

Suppose we have table_1, which lives in database_1.

import pyodbc
connection = pyodbc.connect(dsn='hive', autocommit=True)
cursor = connection.cursor()
cursor.execute("USE database_1")
cursor.execute("SELECT * FROM table_1")

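This gives a table-not-found error, because by the time the cursor executes the next query, the database in use has been reset to default. Is there a way to keep consistent state across execute statements, or to bundle multiple queries, to avoid this? I'm particularly interested in being able to set the number of mappers/reducers and have that setting persist when the next query is executed. I know an alternative is to use Python to open a shell connection to Hive and execute an hql file, but I'd rather not do that.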

Answer 1

I'd suggest trying a few things:

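In most cases, if not all, you can specify the database to use in the connection string.
The documentation says that the 'execute' command returns the cursor itself, so calls can be chained. I would still try: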

cursor.execute("USE database_1").execute("SELECT * FROM table_1")

(just in case the documentation is wrong)

This might actually work:

cursor.execute("USE database_1")

cursor.commit()

cursor.execute("SELECT * FROM table_1")

Please post an update if it works.
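To compare the plain sequence with the commit-in-between variant, a small helper can run the statements in order on one cursor. This is only a sketch: `run_in_order` is a hypothetical name, and it assumes a cursor exposing `execute`, `commit` and `fetchall` as pyodbc's does. Whether the Hive driver actually keeps the `USE` state between calls is the open question here, so treat this as a way to test it, not as a fix.

```python
def run_in_order(cursor, statements, commit_between=False):
    """Execute statements one after another on the same cursor.

    Returns the rows produced by the final statement. Whether session
    state (for example USE database_1) survives between calls depends
    on the driver; this helper only fixes the order of the calls.
    """
    if not statements:
        raise ValueError("need at least one statement")
    last = len(statements) - 1
    for index, sql in enumerate(statements):
        cursor.execute(sql)
        # Optionally commit after every statement except the last one.
        if commit_between and index < last:
            cursor.commit()
    return cursor.fetchall()


# Possible usage (assumes the 'hive' DSN from the question):
# import pyodbc
# connection = pyodbc.connect(dsn='hive', autocommit=True)
# rows = run_in_order(
#     connection.cursor(),
#     ["USE database_1", "SELECT * FROM table_1"],
#     commit_between=True,
# )
```

Running it once with `commit_between=False` and once with `commit_between=True` shows whether the commit makes any difference for this driver.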

Answer 2

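From my reading of the pyodbc documentation, there doesn't appear to be any specific support for Hive. If you're open to using a different library, pyhs2 specifically supports connections to HiveServer2 (Hive 0.11 or newer, I think). It can be installed with pip (pip install pyhs2), but at least on my Mint Linux 17 box I also had to install libpython-dev and libsasl2-dev first.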

I mocked up a trivial approximation of your scenario in Hive (table_1 in database_1 rather than in default):

hive> use default;
OK
Time taken: 0.324 seconds
hive> select * from table_1;
FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'table_1'
hive> use database_1;
OK
Time taken: 0.333 seconds
hive> describe table_1;
OK
content                     string                              
Time taken: 0.777 seconds, Fetched: 1 row(s)
hive> select * from table_1;
OK
this
is
some
sample
data
Time taken: 0.23 seconds, Fetched: 5 row(s)

Here is a basic script that uses pyhs2 to connect to Hive:

# Python 2.7
import pyhs2
from pyhs2.error import Pyhs2Exception

hql = "SELECT * FROM table_1"
with pyhs2.connect(
  host='localhost', port=10000, authMechanism="PLAIN", user="root",
  database="default"  # Of course it's possible just to specify database_1 here
) as db:
  with db.cursor() as cursor:

    try:
      print "Trying default database"
      cursor.execute(hql)
      for row in cursor.fetch(): print row
    except Pyhs2Exception as error:
      print(str(error))

    print "Switching databases to database_1"
    cursor.execute("use database_1")
    cursor.execute(hql)
    for row in cursor.fetch(): print row

Here is the resulting output:

Trying default database
"Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'table_1'"
Switching databases to database_1
['this']
['is']
['some']
['sample']
['data']

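As I noted in the code comment, it's entirely possible to open the connection directly against database_1 rather than default, but I wanted to mimic what the code in your question does (and to demonstrate that you can switch databases after the connection is open).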

Anyway, hopefully this gives you some food for thought if you're open to a non-pyodbc solution.

Answer 3

I learned that you can set the number of reducers in the ODBC connection string, e.g.

string = 'dsn=hive/driver/path;mapred.reduce.tasks=100;....'
connection = pyodbc.connect(string, autocommit=True)

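This lets you apply the specific settings you want to the connection. It doesn't solve the problem of switching databases, but it does cover the other case of passing settings to Hive, which was most of my problem.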
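If several settings are needed, the connection string can be assembled from a dictionary instead of being written by hand. The sketch below is an illustration only: `build_connection_string` and `connect_with_settings` are hypothetical helpers, and which keys the Hive ODBC driver honours is driver-specific (only `mapred.reduce.tasks` is taken from the example above).

```python
def build_connection_string(dsn, settings=None):
    """Join a DSN and optional key=value settings with semicolons."""
    parts = ["dsn={}".format(dsn)]
    for key, value in (settings or {}).items():
        parts.append("{}={}".format(key, value))
    return ";".join(parts)


def connect_with_settings(dsn, settings=None):
    """Open a pyodbc connection using the assembled string."""
    # Imported lazily so the string builder can be used without pyodbc.
    import pyodbc
    return pyodbc.connect(build_connection_string(dsn, settings), autocommit=True)


print(build_connection_string("hive", {"mapred.reduce.tasks": 100}))
# prints: dsn=hive;mapred.reduce.tasks=100
```

Keeping the string builder separate from the connect call makes it easy to print and inspect exactly what is being sent to the driver.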
