使用 python 合并两个非常大的 pandas 数据帧的最快方法

Quickest way to merge two very large pandas dataframes using python

我有多组非常大的 csv 文件,需要根据唯一 ID 进行合并。我将此唯一 ID 设置为索引,该索引基于我的 Origin 和 Destination 列的串联。

数据帧 1:

Origin Destination Value
70478 70 478 0.002779
70479 70 479 0.001673
70480 70 480 0.000427
70481 70 481 0.001503
70482 70 482 0.01215
70483 70 483 0.004507
70484 70 484 0.001871
70485 70 485 0.006522
70486 70 486 0.004786
70487 70 487 0.026566

数据帧 2:

Origin Destination Value
70478 70 478 135.974365
70479 70 479 130.936752
70480 70 480 111.191734
70481 70 481 98.170746
70482 70 482 88.257645
70483 70 483 102.095566
70484 70 484 103.585373
70485 70 485 114.298431
70486 70 486 97.331055
70487 70 487 85.754776

我的最终table应该如下(需求=df1的值;时间=df2的值;Demand_Time=Time/Demand ):

Origin Destination Demand Time Demand_Time
0 70 478 0.002779 135.974365 0.377858
1 70 479 0.001673 130.936752 0.219041
2 70 480 0.000427 111.191734 0.047494
3 70 481 0.001503 98.170746 0.147536
4 70 482 0.01215 88.257645 1.072321
5 70 483 0.004507 102.095566 0.460115
6 70 484 0.001871 103.585373 0.193806
7 70 485 0.006522 114.298431 0.74551
8 70 486 0.004786 97.331055 0.465854
9 70 487 0.026566 85.754776 2.278125

我在 df1 和 df2 之间做了一个 .compare,它产生了以下新数据帧:

Origin Destination Value
self other self other self other
70478 70 70 478 478 0.002779 135.974365
70479 70 70 479 479 0.001673 130.936752
70480 70 70 480 480 0.000427 111.191734
70481 70 70 481 481 0.001503 98.170746
70482 70 70 482 482 0.01215 88.257645
70483 70 70 483 483 0.004507 102.095566
70484 70 70 484 484 0.001871 103.585373
70485 70 70 485 485 0.006522 114.298431
70486 70 70 486 486 0.004786 97.331055
70487 70 70 487 487 0.026566 85.754776

然后我创建一个新的最终 pd.DataFrame df,迭代我上面的比较 table 和 .append 到我最终的新 df。

迭代和追加的最后一部分在非常大的 table 秒(每个有几十万条记录)上花费了很长时间 - 每次大约 1.5 小时。

有没有办法更有效地完成最后一部分?

谢谢。

代码示例:

import pandas as pd

# Replicating sample df1 (.read_csv from csv file 1)
df_1_data = [[70, 478, 0.0027788935694843],
             [70, 479, 0.0016728754853829],
             [70, 480, 0.0004271405050531],
             [70, 481, 0.0015028485795482],
             [70, 482, 0.0121498983353376],
             [70, 483, 0.0045067127794027],
             [70, 484, 0.0018709792057052],
             [70, 485, 0.0065224897116422],
             [70, 486, 0.0047862790524959],
             [70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))
print(df_1)

# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
             [70, 479, 130.9367523],
             [70, 480, 111.1917343],
             [70, 481, 98.17074585],
             [70, 482, 88.25764465],
             [70, 483, 102.0955658],
             [70, 484, 103.5853729],
             [70, 485, 114.2984314],
             [70, 486, 97.33105469],
             [70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))
print(df_2)

df_compare = df_1.compare(df_2, keep_shape=True, keep_equal=True)
print(df_compare)

df_out = pd.DataFrame(columns=['Origin', 'Destination', 'Demand', 'Time', 'Demand_Time'])
for index, row in df_compare.iterrows():
    df_out = df_out.append({'Origin': int(row['Origin']['self']), 'Destination': int(row['Destination']['self']),
                            'Demand': row['Value']['self'], 'Time': row['Value']['other'],
                            'Demand_Time': row['Value']['self'] * row['Value']['other']}, ignore_index=True)
print(df_out)

print('\nCOMPLETED')

IIUC,你可以使用:

out = (df1.rename(columns={'Value': 'Demand'})
          .assign(Time=df2['Value'], Demand_Time=df2['Value'] * df1['Value'])
          .reset_index(drop=True))
print(out)

# Output
   Origin  Destination    Demand        Time  Demand_Time
0      70          478  0.002779  135.974365     0.377873
1      70          479  0.001673  130.936752     0.219057
2      70          480  0.000427  111.191734     0.047479
3      70          481  0.001503   98.170746     0.147551
4      70          482  0.012150   88.257645     1.072330
5      70          483  0.004507  102.095566     0.460145
6      70          484  0.001871  103.585373     0.193808
7      70          485  0.006522  114.298431     0.745454
8      70          486  0.004786   97.331055     0.465826
9      70          487  0.026566   85.754776     2.278161

如果我正确理解请求,我会结合使用 pandas 和 numby 来及时获得您想要的结果

import datetime
import numpy as np

df_1_data = [[70, 478, 0.0027788935694843],
             [70, 479, 0.0016728754853829],
             [70, 480, 0.0004271405050531],
             [70, 481, 0.0015028485795482],
             [70, 482, 0.0121498983353376],
             [70, 483, 0.0045067127794027],
             [70, 484, 0.0018709792057052],
             [70, 485, 0.0065224897116422],
             [70, 486, 0.0047862790524959],
             [70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))

# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
             [70, 479, 130.9367523],
             [70, 480, 111.1917343],
             [70, 481, 98.17074585],
             [70, 482, 88.25764465],
             [70, 483, 102.0955658],
             [70, 484, 103.5853729],
             [70, 485, 114.2984314],
             [70, 486, 97.33105469],
             [70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))

df_1.columns = [['Origin', 'Destination', 'Demand']]
df_2.columns = [['Origin', 'Destination', 'Time']]
df_merge = df_1.merge(df_2, how = 'inner')
df_merge['Demand_Time'] = df_merge['Time'].values / df_merge['Demand'].values
df_merge