Processing each row of a large database table in Python

Question

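Context

I have a function in Python that computes a score for a row of my table. I want to combine the scores of all rows arithmetically (for example, compute the sum or the average of the scores).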

def compute_score(row):
  # some complicated python code that would be painful to convert into SQL-equivalent
  return score

The obvious first approach is simply to read in all the data:

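import psycopg2
from psycopg2 import sql

def sum_scores(dbname, tablename):
  conn = psycopg2.connect(dbname=dbname)
  cur = conn.cursor()
  # A table name cannot be bound as a query parameter; compose it as an identifier
  cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename)))
  rows = cur.fetchall()  # loads the entire result set into memory
  total = 0  # avoid shadowing the built-in sum()
  for row in rows:
    total += compute_score(row)
  conn.close()
  return total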

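Problem

I want to be able to handle as much data as my database can hold. That may be more than fits in Python's memory, so it seems to me that fetchall() would not work in that case.

Proposed solutions

I am considering 3 approaches, all aiming to process a few records at a time:

Using fetchone()

Processing records one at a time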

def sum_scores(dbname, tablename):
  ...
  total = 0
  # fetchone() returns None once the result set is exhausted
  row = cur.fetchone()
  while row is not None:
    total += compute_score(row)
    row = cur.fetchone()
  ...
  return total

Batch processing with fetchmany(n)

def sum_scores(dbname, tablename):
  ...
  batch_size = 1000  # tunable; an int rather than the float 1e3
  total = 0
  batch = cur.fetchmany(batch_size)
  while batch:
    for row in batch:
      total += compute_score(row)
    batch = cur.fetchmany(batch_size)
  ...
  return total

Relying on the cursor's iterator

def sum_scores(dbname, tablename):
  ...
  total = 0
  for row in cur:
    total += compute_score(row)
  ...
  return total

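Questions

Is my reasoning correct that all 3 proposed solutions pull in only manageably sized chunks of data at a time? Or do they suffer from the same problem as fetchall()?
Which of the 3 proposed solutions would work for large datasets (i.e. compute the correct score combination without crashing in the process)?
How does the cursor's iterator (proposed solution #3) actually pull data into Python's memory? One row at a time, in batches, or all at once?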

Answer 1

All 3 solutions will work, and each will bring only a subset of the results into memory.
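As a quick sanity check that the three fetch styles combine the scores identically, here is a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for psycopg2 so that it runs without a server (an assumption made for illustration; it says nothing about psycopg2's memory behaviour). The table t, its columns, and the placeholder compute_score are hypothetical.

```python
import sqlite3

def compute_score(row):
    # Hypothetical placeholder for the real scoring function
    return row[0] + row[1]

def total_fetchone(cur):
    # Solution #1: one row per call; fetchone() returns None at the end
    total = 0
    row = cur.fetchone()
    while row is not None:
        total += compute_score(row)
        row = cur.fetchone()
    return total

def total_fetchmany(cur, batch_size=1000):
    # Solution #2: fixed-size batches; an empty list signals the end
    total = 0
    batch = cur.fetchmany(batch_size)
    while batch:
        total += sum(compute_score(row) for row in batch)
        batch = cur.fetchmany(batch_size)
    return total

def total_iter(cur):
    # Solution #3: rely on the cursor's iterator
    return sum(compute_score(row) for row in cur)

def run_all(n_rows=2500):
    # Build a throwaway in-memory table and run each style on a fresh cursor
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     [(i, 2 * i) for i in range(n_rows)])
    totals = []
    for fn in (total_fetchone, total_fetchmany, total_iter):
        cur = conn.cursor()
        cur.execute("SELECT a, b FROM t")
        totals.append(fn(cur))
        cur.close()
    conn.close()
    return totals
```

Calling run_all() returns a list of three totals, one per fetch style, which should all be equal.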

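If you pass a name to the cursor (which makes psycopg2 create a server-side cursor), iterating over the cursor (proposed solution #3) works the same way as proposed solution #2. Iterating over the cursor fetches itersize records at a time (2000 by default).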

Solutions #2 and #3 will be much faster than #1 because there is far less connection overhead.
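The named-cursor variant of solution #3 could look roughly like the sketch below. Per the psycopg2 documentation, a cursor created without a name is client-side and transfers the whole result set to the client when execute() runs, so the name is what keeps memory bounded. The function name score_stats, the cursor name "score_cursor", and the query argument are hypothetical choices for illustration; conn is assumed to be an open psycopg2 connection.

```python
def score_stats(conn, query, compute_score, itersize=2000):
    """Stream rows through a named cursor and return (total, mean).

    conn is assumed to be an open psycopg2 connection; all names here
    are illustrative, not part of the original question or answer.
    """
    total = 0
    count = 0
    # Passing name=... requests a server-side (named) cursor
    with conn.cursor(name="score_cursor") as cur:
        # Number of rows fetched per network round trip while iterating
        cur.itersize = itersize
        cur.execute(query)
        for row in cur:
            total += compute_score(row)
            count += 1
    # The mean is undefined for an empty result set
    mean = total / count if count else None
    return total, mean

# Hypothetical usage, assuming a reachable database and a table named scores:
#   import psycopg2
#   conn = psycopg2.connect(dbname="mydb")
#   total, mean = score_stats(conn, "SELECT * FROM scores", compute_score)
#   conn.close()
```

Keeping a running total and count means the average is available without storing any rows.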

http://initd.org/psycopg/docs/cursor.html#fetch

python

psycopg2

bigdata
