Quickest way to merge two very large pandas DataFrames using Python
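I have several sets of very large CSV files that need to be merged on a unique ID. I set this unique ID as the index; it is built by concatenating my Origin and Destination columns.
DataFrame 1: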
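| | Origin | Destination | Value |
|---|---|---|---|
| 70478 | 70 | 478 | 0.002779 |
| 70479 | 70 | 479 | 0.001673 |
| 70480 | 70 | 480 | 0.000427 |
| 70481 | 70 | 481 | 0.001503 |
| 70482 | 70 | 482 | 0.01215 |
| 70483 | 70 | 483 | 0.004507 |
| 70484 | 70 | 484 | 0.001871 |
| 70485 | 70 | 485 | 0.006522 |
| 70486 | 70 | 486 | 0.004786 |
| 70487 | 70 | 487 | 0.026566 |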
DataFrame 2:
| | Origin | Destination | Value |
|---|---|---|---|
| 70478 | 70 | 478 | 135.974365 |
| 70479 | 70 | 479 | 130.936752 |
| 70480 | 70 | 480 | 111.191734 |
| 70481 | 70 | 481 | 98.170746 |
| 70482 | 70 | 482 | 88.257645 |
| 70483 | 70 | 483 | 102.095566 |
| 70484 | 70 | 484 | 103.585373 |
| 70485 | 70 | 485 | 114.298431 |
| 70486 | 70 | 486 | 97.331055 |
| 70487 | 70 | 487 | 85.754776 |
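My final table should look like this (Demand = Value from df1; Time = Value from df2; Demand_Time = Demand * Time, which is what the values below and my code compute):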
| | Origin | Destination | Demand | Time | Demand_Time |
|---|---|---|---|---|---|
| 0 | 70 | 478 | 0.002779 | 135.974365 | 0.377858 |
| 1 | 70 | 479 | 0.001673 | 130.936752 | 0.219041 |
| 2 | 70 | 480 | 0.000427 | 111.191734 | 0.047494 |
| 3 | 70 | 481 | 0.001503 | 98.170746 | 0.147536 |
| 4 | 70 | 482 | 0.01215 | 88.257645 | 1.072321 |
| 5 | 70 | 483 | 0.004507 | 102.095566 | 0.460115 |
| 6 | 70 | 484 | 0.001871 | 103.585373 | 0.193806 |
| 7 | 70 | 485 | 0.006522 | 114.298431 | 0.74551 |
| 8 | 70 | 486 | 0.004786 | 97.331055 | 0.465854 |
| 9 | 70 | 487 | 0.026566 | 85.754776 | 2.278125 |
I ran a .compare between df1 and df2, which produced the following new DataFrame:
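| | Origin (self) | Origin (other) | Destination (self) | Destination (other) | Value (self) | Value (other) |
|---|---|---|---|---|---|---|
| 70478 | 70 | 70 | 478 | 478 | 0.002779 | 135.974365 |
| 70479 | 70 | 70 | 479 | 479 | 0.001673 | 130.936752 |
| 70480 | 70 | 70 | 480 | 480 | 0.000427 | 111.191734 |
| 70481 | 70 | 70 | 481 | 481 | 0.001503 | 98.170746 |
| 70482 | 70 | 70 | 482 | 482 | 0.01215 | 88.257645 |
| 70483 | 70 | 70 | 483 | 483 | 0.004507 | 102.095566 |
| 70484 | 70 | 70 | 484 | 484 | 0.001871 | 103.585373 |
| 70485 | 70 | 70 | 485 | 485 | 0.006522 | 114.298431 |
| 70486 | 70 | 70 | 486 | 486 | 0.004786 | 97.331055 |
| 70487 | 70 | 70 | 487 | 487 | 0.026566 | 85.754776 |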
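I then create a new final pd.DataFrame, iterate over the comparison table above, and .append each row to it.
This last step of iterating and appending takes a very long time on very large tables (a few hundred thousand records each): roughly 1.5 hours per run.
Is there a more efficient way to do this last part?
Thanks.
Code example: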
import pandas as pd

# Replicating sample df1 (.read_csv from csv file 1)
df_1_data = [[70, 478, 0.0027788935694843],
             [70, 479, 0.0016728754853829],
             [70, 480, 0.0004271405050531],
             [70, 481, 0.0015028485795482],
             [70, 482, 0.0121498983353376],
             [70, 483, 0.0045067127794027],
             [70, 484, 0.0018709792057052],
             [70, 485, 0.0065224897116422],
             [70, 486, 0.0047862790524959],
             [70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))
print(df_1)

# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
             [70, 479, 130.9367523],
             [70, 480, 111.1917343],
             [70, 481, 98.17074585],
             [70, 482, 88.25764465],
             [70, 483, 102.0955658],
             [70, 484, 103.5853729],
             [70, 485, 114.2984314],
             [70, 486, 97.33105469],
             [70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))
print(df_2)

df_compare = df_1.compare(df_2, keep_shape=True, keep_equal=True)
print(df_compare)

df_out = pd.DataFrame(columns=['Origin', 'Destination', 'Demand', 'Time', 'Demand_Time'])
for index, row in df_compare.iterrows():
    df_out = df_out.append({'Origin': int(row['Origin']['self']), 'Destination': int(row['Destination']['self']),
                            'Demand': row['Value']['self'], 'Time': row['Value']['other'],
                            'Demand_Time': row['Value']['self'] * row['Value']['other']}, ignore_index=True)
print(df_out)
print('\nCOMPLETED')
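A note on where the time likely goes: every `.append` call returns a brand-new DataFrame, so the loop copies the growing result once per row. `DataFrame.append` was also deprecated in pandas 1.4 and removed in 2.0. If you want to keep the `.compare` step, the rough sketch below pulls whole columns out of the comparison frame in one go instead of iterating. The two-row frames and their values are made up for illustration only.

```python
import pandas as pd

# Tiny stand-in frames (hypothetical values); the real ones come from read_csv.
df_1 = pd.DataFrame({"Origin": [70, 70], "Destination": [478, 479], "Value": [0.5, 0.25]})
df_2 = pd.DataFrame({"Origin": [70, 70], "Destination": [478, 479], "Value": [4.0, 16.0]})
df_1 = df_1.set_index(df_1["Origin"].astype(str) + df_1["Destination"].astype(str))
df_2 = df_2.set_index(df_2["Origin"].astype(str) + df_2["Destination"].astype(str))

# Same call as in the question; the columns become (name, "self"/"other") pairs.
df_compare = df_1.compare(df_2, keep_shape=True, keep_equal=True)

# Select each needed column as a whole Series rather than row by row.
df_out = pd.DataFrame({
    "Origin": df_compare[("Origin", "self")].astype(int),
    "Destination": df_compare[("Destination", "self")].astype(int),
    "Demand": df_compare[("Value", "self")],
    "Time": df_compare[("Value", "other")],
}).reset_index(drop=True)

# One vectorized multiplication replaces the per-row arithmetic.
df_out["Demand_Time"] = df_out["Demand"] * df_out["Time"]
print(df_out)
```

Note that `.compare` only accepts frames with identical row and column labels, so this variant assumes both files contain exactly the same Origin/Destination pairs.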
IIUC, you can use:
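# df_1 and df_2 are the frames from the question; assign() aligns the new columns on the shared index
out = (df_1.rename(columns={'Value': 'Demand'})
           .assign(Time=df_2['Value'], Demand_Time=df_2['Value'] * df_1['Value'])
           .reset_index(drop=True))
print(out)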
# Output
Origin Destination Demand Time Demand_Time
0 70 478 0.002779 135.974365 0.377873
1 70 479 0.001673 130.936752 0.219057
2 70 480 0.000427 111.191734 0.047479
3 70 481 0.001503 98.170746 0.147551
4 70 482 0.012150 88.257645 1.072330
5 70 483 0.004507 102.095566 0.460145
6 70 484 0.001871 103.585373 0.193808
7 70 485 0.006522 114.298431 0.745454
8 70 486 0.004786 97.331055 0.465826
9 70 487 0.026566 85.754776 2.278161
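One caveat worth flagging: this relies on both frames sharing the same index, and a key made by concatenating the two numbers as strings can be ambiguous (70 and 478 give the same string as 704 and 78). Below is a minimal sketch of the same index-aligned idea using a two-level index on Origin and Destination instead. The sample rows are hypothetical and were picked only to trigger the collision.

```python
import pandas as pd

# Hypothetical rows: (70, 478) and (704, 78) both concatenate to "70478".
df1 = pd.DataFrame({"Origin": [70, 704], "Destination": [478, 78], "Value": [0.5, 0.25]})
df2 = pd.DataFrame({"Origin": [704, 70], "Destination": [78, 478], "Value": [16.0, 4.0]})

string_keys = df1["Origin"].astype(str) + df1["Destination"].astype(str)
print(string_keys.duplicated().any())  # the concatenated key is not unique here

# Index both frames by the pair itself, so each key stays unambiguous.
demand = df1.set_index(["Origin", "Destination"])["Value"].rename("Demand")
time = df2.set_index(["Origin", "Destination"])["Value"].rename("Time")

# Join on the shared two-level index; row order in df2 does not matter.
out = demand.to_frame().join(time, how="inner")
out["Demand_Time"] = out["Demand"] * out["Time"]
out = out.reset_index()
print(out)
```

With `how="inner"`, pairs that appear in only one file are dropped; `how="left"` would keep them with NaN in the Time column.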
If I understand the request correctly, I would combine pandas and NumPy to get the result you want in a reasonable time:
import pandas as pd

# Replicating sample df1 (.read_csv from csv file 1)
df_1_data = [[70, 478, 0.0027788935694843],
             [70, 479, 0.0016728754853829],
             [70, 480, 0.0004271405050531],
             [70, 481, 0.0015028485795482],
             [70, 482, 0.0121498983353376],
             [70, 483, 0.0045067127794027],
             [70, 484, 0.0018709792057052],
             [70, 485, 0.0065224897116422],
             [70, 486, 0.0047862790524959],
             [70, 487, 0.0265655759721994]]
df_1 = pd.DataFrame(df_1_data, columns=['Origin', 'Destination', 'Value'])
df_1 = df_1.set_index(df_1['Origin'].astype(str) + df_1['Destination'].astype(str))

# Replicating sample df2 (.read_csv from csv file 2)
df_2_data = [[70, 478, 135.9743652],
             [70, 479, 130.9367523],
             [70, 480, 111.1917343],
             [70, 481, 98.17074585],
             [70, 482, 88.25764465],
             [70, 483, 102.0955658],
             [70, 484, 103.5853729],
             [70, 485, 114.2984314],
             [70, 486, 97.33105469],
             [70, 487, 85.754776]]
df_2 = pd.DataFrame(df_2_data, columns=['Origin', 'Destination', 'Value'])
df_2 = df_2.set_index(df_2['Origin'].astype(str) + df_2['Destination'].astype(str))

# Rename with flat lists; a nested list would create MultiIndex columns
df_1.columns = ['Origin', 'Destination', 'Demand']
df_2.columns = ['Origin', 'Destination', 'Time']

# Merge on the shared key columns; the result gets a fresh 0..n-1 index
df_merge = df_1.merge(df_2, how='inner', on=['Origin', 'Destination'])

# .values exposes the underlying NumPy arrays; the product matches the
# expected table and the loop in the question (Demand * Time)
df_merge['Demand_Time'] = df_merge['Demand'].values * df_merge['Time'].values
print(df_merge)
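Since the real data comes from CSV files, here is a rough end-to-end sketch of the same merge that starts from `read_csv`. The in-memory `io.StringIO` buffers stand in for the files, and the contents, the dtype choices and the variable names are all assumptions for illustration. Passing explicit dtypes can reduce memory use on large files, and `validate="one_to_one"` makes the merge raise if a key pair is duplicated in either file.

```python
import io
import pandas as pd

# In-memory stand-ins for the two CSV files (hypothetical contents).
demand_csv = io.StringIO("Origin,Destination,Value\n70,478,0.5\n70,479,0.25\n")
time_csv = io.StringIO("Origin,Destination,Value\n70,478,4.0\n70,479,16.0\n")

# Explicit dtypes avoid type inference; int32 is an assumption about the ID range.
dtypes = {"Origin": "int32", "Destination": "int32", "Value": "float64"}
demand = pd.read_csv(demand_csv, dtype=dtypes).rename(columns={"Value": "Demand"})
time = pd.read_csv(time_csv, dtype=dtypes).rename(columns={"Value": "Time"})

# Merge directly on the two key columns; no concatenated string ID is needed.
merged = demand.merge(time, on=["Origin", "Destination"],
                      how="inner", validate="one_to_one")

# Vectorized product over the whole column.
merged["Demand_Time"] = merged["Demand"] * merged["Time"]
print(merged)
```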