Merging two large dataframes on a specific column with a progress bar showing

Question

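I have two large datasets, one 2.6 GB and the other 1 GB. I have managed to read both of them in as DataFrames.

Next I want to create a new DataFrame in which I match the unique IDs of the two datasets and drop the rows whose ID has no match in the other dataset.

I have tried merging a small number of rows and I think it works, but I want to merge the whole thing and also show a progress bar. I am using Jupyter Notebook and Python 3.

matrikkel2019 is the unique ID shared by both datasets. I want to keep the columns from both datasets, but only the rows whose matrikkel2019 ID appears in both.

Code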

from tqdm import tqdm_notebook

tqdm_notebook().pandas() 

merge = energydata.merge(dwellingData, left_on = "matrikkel2019", right_on="matrikkel2019").progress_apply()

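I tried passing lambda x: x**2 to progress_apply, but I get the errors TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int' and "Invalid arguments".

The main problem is that the merge takes far too long and my PC with 8 GB of RAM is struggling, so I have no idea how long it will take or whether it will finish at all.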

Answer 1

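tqdm integrates with pandas and can show a progress bar through progress_apply, which you can chain onto a merge.

The code is taken from this question: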

import numpy as np
import pandas as pd
from tqdm import tqdm

df1 = pd.DataFrame({'lkey': 1000*['a', 'b', 'c', 'd'],'lvalue': np.random.randint(0,int(1e8),4000)})
df2 = pd.DataFrame({'rkey': 1000*['a', 'b', 'c', 'd'],'rvalue': np.random.randint(0, int(1e8),4000)})

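# Activate the pandas integration in tqdm (adds progress_apply to DataFrames)
tqdm.pandas()
# Call progress_apply with a dummy lambda; the bar tracks this apply step,
# which runs on the result once the merge has returned
df1.merge(df2, left_on='lkey', right_on='rkey').progress_apply(lambda x: x)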

For your code, together with the imports above, it should just be:

tqdm.pandas()
merge = energydata.merge(dwellingData, left_on = "matrikkel2019", right_on="matrikkel2019").progress_apply(lambda x: x)
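One caveat worth spelling out: in the one-liner above, the merge has already finished by the time progress_apply starts, so the bar does not tell you how far the merge itself has got. If you want the bar to follow the merge, and want to keep peak memory lower on an 8 GB machine, one option is to merge the left frame in slices. The following is only a minimal sketch under some assumptions: merge_in_chunks, chunk_size, and the two small demo frames are names made up for illustration, and the right-hand frame still has to fit in memory.

```python
import pandas as pd
from tqdm import tqdm


def merge_in_chunks(left, right, key, chunk_size=100_000):
    """Inner-merge `left` with `right` on `key`, one slice of `left` at a time.

    Hypothetical helper for illustration; not part of pandas or tqdm.
    """
    pieces = []
    # Each iteration merges one slice of `left`, so the bar advances
    # as actual merge work completes.
    for start in tqdm(range(0, len(left), chunk_size), desc="merging"):
        chunk = left.iloc[start:start + chunk_size]
        # how="inner" keeps only the rows whose key exists in both frames.
        pieces.append(chunk.merge(right, on=key, how="inner"))
    if not pieces:
        # `left` was empty: fall back to a plain merge to get the right columns.
        return left.merge(right, on=key, how="inner")
    return pd.concat(pieces, ignore_index=True)


# Small stand-in frames; the real energydata and dwellingData would go here.
energy_demo = pd.DataFrame(
    {"matrikkel2019": [1, 2, 3, 4, 5], "kwh": [10, 20, 30, 40, 50]}
)
dwelling_demo = pd.DataFrame({"matrikkel2019": [2, 4, 6], "rooms": [3, 5, 2]})

merged = merge_in_chunks(energy_demo, dwelling_demo, key="matrikkel2019", chunk_size=2)
print(merged)
```

With your data the call would look like merge_in_chunks(energydata, dwellingData, key="matrikkel2019"). A smaller chunk_size gives a smoother bar and lower peak memory at the cost of more merge calls, so the right value is something to experiment with.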
