PYMSSQL/SQL Server 2014: is there a limit to the length of a list of PKs to use as a subquery?

Question

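I have implemented a Python script to sort millions of documents (generated by a .NET web application and all stored in a single directory) into subfolders following this scheme: year/month/batch, because the tasks these documents come from were originally processed in batches. The script queries SQL Server 2014, which holds all the data needed for each document, in particular the month and year in which it was created. It then uses the shutil module to move the PDF. So I first run a query to get the list of batches for a given month and year: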

queryBatches = '''SELECT DISTINCT IDBATCH
                FROM [DBNAME].[dbo].[WORKS]
                WHERE YEAR(DATETIMEWORK)={} AND MONTH(DATETIMEWORK)={}'''.format(year, month)

Then I execute:

for batch in batches:
  query = '''SELECT IDWORK, IDBATCH, NAMEDOCUMENT
             FROM [DBNAME].[dbo].[WORKS]
             WHERE NAMEDOCUMENT IS NOT NULL and
                   NAMEDOCUMENT not like '/%/%/%/%.pdf' and 
                   YEAR(DATETIMEWORK)={} and 
                   MONTH(DATETIMEWORK)={} and 
                   IDBATCH={}'''.format(year,month,batch[0])

The resulting records are collected in a cursor, as described in the PYMSSQL usage documentation. So I go on with:

IDWorksUpdate = []
row = cursor.fetchone()
while row:

  if moveDocument(...):
    IDWorksUpdate.append(row[0])
  row = cursor.fetchone()

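When the loop ends, IDWorksUpdate holds the PKs of all the WORKS rows whose documents were successfully moved into a subfolder. I then close the cursor and the connection and instantiate new ones. Finally I execute: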

subquery = '('+', '.join(str(x) for x in IDWorksUpdate)+')'
query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(year,month,idbatch,subquery)

newConn = pymssql.connect(server='localhost', database='DBNAME')
newCursor = newConn.cursor()

try:
    newCursor.execute(query)
    newConn.commit()
except:
    newConn.rollback()
    log.write('Error on updating documents names in database of works {}/{} of batch {}'.format(year,month,idbatch))
finally:
    newCursor.close()
    del newCursor
    newConn.close()

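This morning I saw that the update query had failed for only a few batches, even though the documents had been moved correctly into the subdirectories. The batch in question had more than 55000 documents to move, so maybe IDWorksUpdate overflowed and that broke the construction of the final update query? I did not think 55000 was a very large list of integers. The problem is that, in PYMSSQL, we cannot have more than one connection/cursor to the same database at a time, so I cannot update the records while the corresponding files are being moved. That is why I build a list of the PKs of the works whose files were moved correctly and update them at the end with a new connection/cursor. What could have happened? Am I doing something wrong?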

Update

I have just written a simple script that reproduces the query to be executed to update the records, and this is the error I get from SQL Server:

The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.

This is the query:

UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE IDWORK IN (list of 55157 PKs)

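The table is very large (about 14 million records). But I need that list of PKs, because only the tasks whose documents have been correctly processed and moved may be updated. I cannot simply run: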

UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT = '/2016/12/1484/'+NAMEDOCUMENT WHERE YEAR(DATETIMEWORK)=2016 and 
MONTH(DATETIMEWORK)=12 and IDBATCH=1484

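This is because our server was hit by a cryptolocker attack, and I must process and move only the files that still exist, while waiting for the others to be released. Should I split the list into sublists? How?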

Update 2

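It seems the following could be a solution: I split the list of PKs into chunks of 10000 (a purely experimental number), then I execute as many queries as there are chunks, each with one chunk as its IN list.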

def updateDB(listID, y, m, b, log):

    newConn = pymssql.connect(server='localhost', database='DBNAME')
    newCursor = newConn.cursor()

    if len(listID) <= 10000:

        # Small list: a single UPDATE with the whole list in the IN clause
        subquery = '('+', '.join(str(x) for x in listID)+')'
        query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT= \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y,m,b,subquery)

        try:
            newCursor.execute(query)
            newConn.commit()
        except:
            newConn.rollback()
            log.write('...')
            log.write('\n\n')
        finally:
            newCursor.close()
            del newCursor
            newConn.close()
    else:
        # Large list: split into chunks of 10000 PKs and run one UPDATE per chunk
        chunksPK = [listID[i:i + 10000] for i in range(0, len(listID), 10000)]

        for sublistPK in chunksPK:

            subquery = '('+', '.join(str(x) for x in sublistPK)+')'
            query = '''UPDATE [DBNAME].[dbo].[WORKS] SET NAMEDOCUMENT= \'/{}/{}/{}/\'+NAMEDOCUMENT WHERE IDWORK IN {}'''.format(y,m,b,subquery)

            try:
                newCursor.execute(query)
                newConn.commit()
            except:
                newConn.rollback()
                log.write('Could not execute partial {}'.format(query))
                log.write('\n\n')

        newCursor.close()
        del newCursor
        newConn.close()

Could this be a good/safe solution?
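For reference, the chunking idea can probably be written with a single code path, so that short and long lists are handled the same way. The following is only a minimal sketch: the helper names `chunked` and `build_update` are hypothetical, the chunk size of 10000 is the same experimental number as above, and the `int()` casts assume that the PKs, year, month and batch are all integers.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_update(ids, year, month, batch):
    """Build one UPDATE statement for a single chunk of integer primary keys."""
    # int() makes sure only numeric values end up in the SQL text
    id_list = ', '.join(str(int(x)) for x in ids)
    prefix = '/{}/{}/{}/'.format(int(year), int(month), int(batch))
    return ("UPDATE [DBNAME].[dbo].[WORKS] "
            "SET NAMEDOCUMENT = '{}' + NAMEDOCUMENT "
            "WHERE IDWORK IN ({})".format(prefix, id_list))


# Example with made-up PKs: 25000 ids give chunks of 10000, 10000 and 5000
statements = [build_update(chunk, 2016, 12, 1484)
              for chunk in chunked(list(range(1, 25001)), 10000)]
print(len(statements))
```

Each statement would then be executed and committed in turn, as in the loop above. A list shorter than the chunk size simply produces one statement, so the separate `if` branch is not needed.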

Answer 1

As stated in the MSDN documentation for

IN (Transact-SQL)

Explicitly including an extremely large number of values (many thousands of values separated by commas) within the parentheses, in an IN clause can consume resources and return errors 8623 or 8632. To work around this problem, store the items in the IN list in a table, and use a SELECT subquery within an IN clause.

(The error message you quoted is error 8623.)

Putting the IN list values into a temporary table and then using

... WHERE IDWORK IN (SELECT keyValue FROM #inListTable)

strikes me as more straightforward than the "chunking" approach you describe.
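A rough sketch of what the temporary-table approach might look like from Python follows. It is not tested against a real server. The function name `update_via_temp_table` and the `insert_batch` parameter are hypothetical, and the code assumes `cursor` is a DB-API cursor that accepts `%s` placeholders (as pymssql cursors do) and that every statement runs on the same connection, since a `#` temporary table is only visible to the session that created it.

```python
def update_via_temp_table(cursor, ids, year, month, batch, insert_batch=1000):
    """Load the PKs into #inListTable, then update WORKS with an IN subquery."""
    # Assumption: `cursor` follows the DB-API and uses %s placeholders
    cursor.execute("CREATE TABLE #inListTable (keyValue INT NOT NULL)")

    # Insert the keys in moderately sized batches (batch size is arbitrary)
    rows = [(int(x),) for x in ids]
    for start in range(0, len(rows), insert_batch):
        cursor.executemany(
            "INSERT INTO #inListTable (keyValue) VALUES (%s)",
            rows[start:start + insert_batch])

    prefix = '/{}/{}/{}/'.format(year, month, batch)
    cursor.execute(
        "UPDATE [DBNAME].[dbo].[WORKS] "
        "SET NAMEDOCUMENT = %s + NAMEDOCUMENT "
        "WHERE IDWORK IN (SELECT keyValue FROM #inListTable)",
        (prefix,))

    # Clean up so the function can be called again on the same connection
    cursor.execute("DROP TABLE #inListTable")


# Possible usage with the names from the question (requires a live server):
# newConn = pymssql.connect(server='localhost', database='DBNAME')
# update_via_temp_table(newConn.cursor(), IDWorksUpdate, year, month, idbatch)
# newConn.commit()
```

The function deliberately leaves commit and rollback to the caller, so the existing try/except/finally handling around the update can stay as it is.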

python

sql

sql-server

batch-processing

pymssql