Does Python's cursor.execute() load all the data?

Question

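I am trying to query a large table (10 million rows) without running out of memory, but I am new to Python and confused by the differing opinions about execute(), the cursor iterator, and fetchone().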

Can I assume that cursor.execute() does not load all the data into memory, and that a single row is loaded only when I call fetchone()?

from mysql.connector import MySQLConnection


def query():
    # conf is assumed to be a dict of connection settings defined elsewhere
    conn = MySQLConnection(host=conf['host'],
                           port=conf['port'],
                           user=conf['user'],
                           password=conf['password'],
                           database=conf['database'])
    cursor = conn.cursor(buffered=True)
    cursor.execute('SELECT * FROM TABLE')  # 10 million rows

Is iterating over the cursor like this the same as calling fetchone()?

for row in cursor:
    print(row)

Is my code snippet safe for handling 10 million rows? If not, how can I iterate over the data safely without running out of memory?

Answer 1

From the MySQL documentation:

The fetchone() method is used by fetchall() and fetchmany(). It is also used when a cursor is used as an iterator.

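The following example shows two equivalent ways to process a query result. The first uses fetchone() in a while loop, and the second uses the cursor as an iterator: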

# Using a while loop
cursor.execute("SELECT * FROM employees")
row = cursor.fetchone()
while row is not None:
  print(row)
  row = cursor.fetchone()

# Using the cursor as iterator
cursor.execute("SELECT * FROM employees")
for row in cursor:
  print(row)

It also states:

You must fetch all rows for the current query before executing new statements using the same connection.

If you are worried about performance, call fetchmany(n) in a while loop until all results have been fetched, like this:

def result_iter(cursor, arraysize=1000):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        # Fetch at most arraysize rows per call
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result

This behavior conforms to PEP 249, which describes how and which methods DB connectors should implement. A partial answer is given in this thread.
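Because PEP 249 specifies fetchmany() for any compliant driver, the batching loop can be tried without a MySQL server. The following is a minimal sketch that uses the standard-library sqlite3 module as a stand-in; the employees table, its 2,500 generated rows, and the helper name iter_batches are all assumptions made for illustration. With mysql.connector the same loop only limits client-side memory when the cursor is not buffered, as Answer 2 points out.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of at most batch_size rows until the result set is exhausted."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield batch


# In-memory SQLite database standing in for the real MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO employees (name) VALUES (?)",
    ((f"employee-{i}",) for i in range(2500)),
)

cursor = conn.cursor()
cursor.execute("SELECT id, name FROM employees ORDER BY id")

batch_sizes = []
total_rows = 0
for batch in iter_batches(cursor, batch_size=1000):
    # Only the current batch is held by this loop at any time.
    batch_sizes.append(len(batch))
    total_rows += len(batch)

conn.close()
```

Yielding whole batches rather than single rows makes the upper bound on rows held by the loop explicit; flattening with `yield from batch` would give the row-by-row behavior of the snippet above.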

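Basically, how fetchall, fetchmany, and fetchone are implemented is up to the library developers and depends on the database's capabilities. For fetchmany and fetchone, however, the unfetched (remaining) results stay on the server side until another call requests them or the cursor object is destroyed.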

So, in summary, I think it is safe to assume that calling the execute method does not, in this case (mysqldb), dump all the data from the query into memory.

Answer 2

My first suggestion is to use from mysql.connector import connect, which uses the C extension (CMySQLConnection) by default, instead of from mysql.connector import MySQLConnection (pure Python).

If for some reason you want to use the pure Python version, you can pass use_pure=True to connect().

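My second suggestion is to paginate the results. If you use a buffered cursor, it fetches the entire result set from the server, and I am not sure that is what you want with 10 million rows.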
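One way to paginate is to ask for one page per query and resume after the last primary key seen, so that no single result set is large. The sketch below is only an illustration under stated assumptions: it uses the standard-library sqlite3 module instead of MySQL so it can run anywhere, and the employees table, its positive integer id primary key, and the helper name iter_pages are hypothetical. With mysql.connector the parameter placeholders would be %s rather than ?.

```python
import sqlite3


def iter_pages(conn, page_size):
    """Yield one list of rows per query, resuming after the last id seen."""
    last_id = 0  # assumes ids are positive integers
    while True:
        page = conn.execute(
            "SELECT id, name FROM employees WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield page
        last_id = page[-1][0]  # id of the last row on this page


# In-memory SQLite database standing in for the real MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO employees (name) VALUES (?)",
    ((f"employee-{i}",) for i in range(2500)),
)

page_sizes = []
last_ids = []
for page in iter_pages(conn, page_size=1000):
    page_sizes.append(len(page))
    last_ids.append(page[-1][0])

conn.close()
```

Filtering on the key instead of using a growing OFFSET means each query can start from an index position rather than skipping over rows it has already returned; whether that matters in practice depends on the table and its indexes.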

Here are some references:

https://dev.mysql.com/doc/refman/8.0/en/limit-optimization.html

https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursorbuffered.html

