How to create a frequency table and update it as chunks of data are loaded, in Python?

Question

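I have a huge table (billions of rows), and I need to analyze two of its numeric variables by a) creating a frequency table and b) creating a distribution plot.

VarA ranges from 0.00 to 1.00 (in increments of 0.01)
VarB is distributed around 0.00 (in increments of 0.01)

I want to read the data 1,000 rows at a time and update the frequency table after each chunk. I tried the following code: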


import pandas as pd

c_size = 1000

result = {'A': dict(), 'B': dict()}

def update_dict(key, val):
    if val not in result[key]:
        result[key][val] = 1
    else:
        result[key][val] += 1

for data_chunk in pd.read_csv('data.csv', names=['ValA','ValB'], skiprows=10, chunksize=c_size):
    for row in data_chunk:
        valA, valB = row
        update_dict('A', valA)
        update_dict('B', valB)

print(result['A'])
print(result['B'])

Edit

Updated the code based on @peter-du's suggestion.

Answer 1

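To make the data easier to work with, I think you should store the processed values in a dictionary rather than in two separate tables.

To iterate over billions of rows 1,000 at a time, I think you should use a generator (see https://realpython.com/introduction-to-python-generators/).

I show a small example below.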

input_data = [[1, 2], [2, 3], [3, 4], [2, 4]]
result = {'A': dict(), 'B': dict()}

def update_dict(key, val):
    # Increment the count for val, creating the entry the first time it is seen.
    if val not in result[key]:
        result[key][val] = 1
    else:
        result[key][val] += 1


# Since the list is not too big, I use a for loop to iterate.
# However, you can apply the generator mechanism to the code below.
for row in input_data:
    valA, valB = row
    update_dict('A', valA)
    update_dict('B', valB)

print(result['A'])
# Output: {1: 1, 2: 2, 3: 1}

print(result['B'])
# Output: {2: 1, 3: 1, 4: 2}

# Then, you can use these two dictionaries to create two separate tables
# You can also join two tables together using Pandas data frame
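
The generator mechanism mentioned above could look roughly like the sketch below. It is only an illustration under a few assumptions: `iter_chunks` and `count_frequencies` are hypothetical helper names, the rows can come from any iterable (a list here, a file reader in practice), and `collections.Counter` stands in for the hand-written `update_dict`.

```python
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size=1000):
    """Yield lists of up to chunk_size rows from any iterable of rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return  # the source is exhausted
        yield chunk


def count_frequencies(rows, chunk_size=1000):
    """Build one frequency table per column, updating it chunk by chunk."""
    result = {'A': Counter(), 'B': Counter()}
    for chunk in iter_chunks(rows, chunk_size):
        # Counter.update adds one to the count of every value it is given.
        result['A'].update(val_a for val_a, _ in chunk)
        result['B'].update(val_b for _, val_b in chunk)
    return result


input_data = [[1, 2], [2, 3], [3, 4], [2, 4]]
result = count_frequencies(input_data, chunk_size=2)
print(dict(result['A']))  # {1: 1, 2: 2, 3: 1}
print(dict(result['B']))  # {2: 1, 3: 1, 4: 2}
```

Only one chunk is held in memory at a time, so the same loop should work for a source far larger than RAM as long as the number of distinct values stays small.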

To draw the distribution plot, I suggest using seaborn (https://seaborn.pydata.org/generated/seaborn.distplot.html) to produce nice-looking plots.
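
With billions of rows, the raw observations probably cannot be handed to a plotting function directly, so one option is to plot the finished frequency table as bars instead. The sketch below assumes matplotlib is installed and uses it in place of seaborn (as far as I know, `distplot` is deprecated in recent seaborn releases). `frequency_to_bars` and `plot_frequency_table` are hypothetical helper names, and the default output path is a placeholder.

```python
def frequency_to_bars(freq):
    """Return (values, relative_frequencies), sorted by value."""
    total = sum(freq.values())
    values = sorted(freq)
    heights = [freq[v] / total for v in values]
    return values, heights


def plot_frequency_table(freq, bar_width=0.01, path='distribution.png'):
    """Save a bar chart of the frequency table to an image file."""
    # Imported here so the data preparation above works without matplotlib.
    import matplotlib.pyplot as plt

    values, heights = frequency_to_bars(freq)
    fig, ax = plt.subplots()
    # One bar per distinct value; the width matches the 0.01 increment.
    ax.bar(values, heights, width=bar_width)
    ax.set_xlabel('value')
    ax.set_ylabel('relative frequency')
    fig.savefig(path)
    plt.close(fig)


values, heights = frequency_to_bars({0.02: 1, 0.00: 2, 0.01: 1})
print(values)   # [0.0, 0.01, 0.02]
print(heights)  # [0.5, 0.25, 0.25]
```

Calling `plot_frequency_table(result['A'])` on one of the dictionaries built earlier would then write the chart to disk.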

Answer 2

Here is the working code. Thanks to @Peter-Du for the help.

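import numpy as np
import pandas as pd

c_size = 1000
result = {'VarA': dict(), 'VarB': dict()}

def update_dict(key, val):
    # Increment the count for val, creating the entry the first time it is seen.
    if val not in result[key]:
        result[key][val] = 1
    else:
        result[key][val] += 1

# Read the file lazily, c_size rows at a time.
reader = pd.read_csv('file.csv', names=['VarA','VarB'], skiprows=10, chunksize=c_size)
for data_chunk in reader:
    # .values yields the rows; iterating the DataFrame itself yields column labels.
    for row in data_chunk.values:
        valA, valB = row
        # Round to two decimals so each value lands in its 0.01 bin.
        update_dict('VarA', np.round(valA, 2))
        update_dict('VarB', np.round(valB, 2))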
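
A possible variation on the code above, in case the row-by-row Python loop turns out to be slow: count each chunk with pandas' `value_counts` and merge the per-chunk counts into a `collections.Counter`. This is a sketch, not a benchmarked replacement. `chunked_frequencies` is a hypothetical helper, the in-memory CSV stands in for the real file, and storing keys as integer hundredths (so 0.12 becomes 12) is my own assumption to avoid using floats as dictionary keys.

```python
import io
from collections import Counter

import pandas as pd


def chunked_frequencies(source, columns, chunk_size=1000, scale=100):
    """Count values per column, reading `source` chunk by chunk.

    Keys are integers in units of 1/scale (hundredths by default).
    """
    freq = {col: Counter() for col in columns}
    reader = pd.read_csv(source, names=columns, chunksize=chunk_size)
    for chunk in reader:
        for col in columns:
            # Map each value to its integer bin, e.g. 0.12 -> 12.
            bins = (chunk[col] * scale).round().astype(int)
            # value_counts tallies the chunk; Counter.update adds the tallies.
            freq[col].update(bins.value_counts().to_dict())
    return freq


# Small in-memory stand-in for the real CSV file (assumed: no header row).
sample = io.StringIO("0.12,-0.01\n0.12,0.00\n0.50,0.01\n0.51,-0.01\n")
freq = chunked_frequencies(sample, ['VarA', 'VarB'], chunk_size=2)
print(freq['VarA'][12], freq['VarB'][-1])  # 2 2
```

Dividing a key by `scale` recovers the original value when the table is reported or plotted.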


python

large-files

large-data
