如何每行仅合并一次数据帧
How to merge data frames only once per row
抱歉,我对这一切还很陌生,所以请原谅我可能会问的任何愚蠢问题。
我正在尝试使用 pandas、
合并来自 salesforce 的几个报告
我有两个报告:
Labour Report
和 Plant Report
并想生产:Combined Report
我已经尝试合并、连接和对齐数据帧,但是,因为两者共享相同的索引,所以我的工厂报告的第一行重复了 5 次,即:
Incorrectly Merged Report
我的下一个想法是使用 .align(),然后使用 .duplicated() 到 return 布尔系列,显示重复的行,并清除重复行的内容。如果我要使用这种方法,我怎样才能清除一行的内容?
而且这看起来效率很低而且很难做到,有没有更优雅的解决方案?
干杯
这是一些样本数据
劳动-report.csv
DJR Number,Date,Foreman,Labour Name,Start,Finish,Hours Worked
1,1/01/2020,Emersen,Tai,6:30,15:00,8:30
1,1/01/2020,Emersen,Rachel,6:30,15:00,8:30
1,1/01/2020,Emersen,Andrew,6:30,15:00,8:30
1,1/01/2020,Emersen,Liam,6:30,15:00,8:30
2,2/01/2020,Scott,Rachel,6:30,15:00,8:30
3,3/01/2020,Michael,Andrew,6:30,15:00,8:30
工厂-report.csv
DJR Number,Date,Foreman,Plant,Plant Hours
1,1/01/2020,Emersen,1,5
2,1/01/2020,Scott,2,5
3,1/01/2020,Michael,2,5
可以使用以下代码。逻辑是添加一个额外的列来标记该行是否需要连接。然后我们可以在 pandas 上进行左合并,然后在合并后删除这个额外的列
import pandas as pd
import numpy as np
labour_report = pd.read_csv('labour-report.csv')
plant_report = pd.read_csv('plant-report.csv')
should_join_row_lbr=[]
foremen = {}
for index, labour_report_row in labour_report.iterrows():
foreman = labour_report_row['Foreman']
if foreman in foremen:
should_join_row_lbr.append(0) # Dont join this row
else:
foremen[foreman]=1
# Join this row as this is the first record for the foreman
should_join_row_lbr.append(1)
labour_report['Should Join Row']=should_join_row_lbr
should_join_row_plnt = [1]*plant_report['DJR Number'].count()
plant_report['Should Join Row']=should_join_row_plnt
# Do a left join with Should Join column as well,
# Only the records that have value 1 will be joined from labour_report
combined_report = pd.merge(labour_report,plant_report,on=['Foreman','Should Join Row'],how='left')
combined_report = combined_report.drop(columns=['Should Join Row'])
combined_report = combined_report.replace(np.nan, '', regex=True)
print(combined_report)
combined_report.to_csv('combined_report.csv')
应该产生结果
DJR Number_x Date_x Foreman Labour Name Start Finish Hours Worked DJR Number_y Date_y Plant Plant Hours
0 1 1/01/2020 Emersen Tai 6:30 15:00 8:30 1 1/01/2020 1 5
1 1 1/01/2020 Emersen Rachel 6:30 15:00 8:30
2 1 1/01/2020 Emersen Andrew 6:30 15:00 8:30
3 1 1/01/2020 Emersen Liam 6:30 15:00 8:30
4 2 2/01/2020 Scott Rachel 6:30 15:00 8:30 2 1/01/2020 2 5
5 3 3/01/2020 Michael Andrew 6:30 15:00 8:30 3 1/01/2020 2 5
假设您不需要在最终结果中使用重复的 'DJR Number' 列。我们可以根据 DJR 编号加入两个 Dataframe。重复的函数将 return 一系列关于它们是否重复的布尔值。如果它们为 True,我们为其分配空值。
merged = labour.merge(plant,on='DJR Number')
merged.loc[merged['DJR Number'].duplicated(),['Date_y','Foreman_y','Plant','Plant Hours']] = np.NaN
如果您想要更清晰的结果,请改为合并三列。
merged = labour.merge(plant,on=['DJR Number','Date','Foreman'])
merged.loc[merged['DJR Number'].duplicated(),['Plant','Plant Hours']] = np.NaN
这不会导致任何重复列。
请参阅 Pandas Duplicated 文档。
抱歉,我对这一切还很陌生,所以请原谅我可能会问的任何愚蠢问题。
我正在尝试使用 pandas、
合并来自 salesforce 的几个报告我有两个报告:
Labour Report 和 Plant Report
并想生产:Combined Report
我已经尝试合并、连接和对齐数据帧,但是,因为两者共享相同的索引,所以我的工厂报告的第一行重复了 5 次,即:
Incorrectly Merged Report
我的下一个想法是使用 .align(),然后使用 .duplicated() 到 return 布尔系列,显示重复的行,并清除重复行的内容。如果我要使用这种方法,我怎样才能清除一行的内容?
而且这看起来效率很低而且很难做到,有没有更优雅的解决方案?
干杯
这是一些样本数据
劳动-report.csv
DJR Number,Date,Foreman,Labour Name,Start,Finish,Hours Worked
1,1/01/2020,Emersen,Tai,6:30,15:00,8:30
1,1/01/2020,Emersen,Rachel,6:30,15:00,8:30
1,1/01/2020,Emersen,Andrew,6:30,15:00,8:30
1,1/01/2020,Emersen,Liam,6:30,15:00,8:30
2,2/01/2020,Scott,Rachel,6:30,15:00,8:30
3,3/01/2020,Michael,Andrew,6:30,15:00,8:30
工厂-report.csv
DJR Number,Date,Foreman,Plant,Plant Hours
1,1/01/2020,Emersen,1,5
2,1/01/2020,Scott,2,5
3,1/01/2020,Michael,2,5
可以使用以下代码。逻辑是添加一个额外的列来标记该行是否需要连接。然后我们可以在 pandas 上进行左合并,然后在合并后删除这个额外的列
import pandas as pd
import numpy as np
labour_report = pd.read_csv('labour-report.csv')
plant_report = pd.read_csv('plant-report.csv')
should_join_row_lbr=[]
foremen = {}
for index, labour_report_row in labour_report.iterrows():
foreman = labour_report_row['Foreman']
if foreman in foremen:
should_join_row_lbr.append(0) # Dont join this row
else:
foremen[foreman]=1
# Join this row as this is the first record for the foreman
should_join_row_lbr.append(1)
labour_report['Should Join Row']=should_join_row_lbr
should_join_row_plnt = [1]*plant_report['DJR Number'].count()
plant_report['Should Join Row']=should_join_row_plnt
# Do a left join with Should Join column as well,
# Only the records that have value 1 will be joined from labour_report
combined_report = pd.merge(labour_report,plant_report,on=['Foreman','Should Join Row'],how='left')
combined_report = combined_report.drop(columns=['Should Join Row'])
combined_report = combined_report.replace(np.nan, '', regex=True)
print(combined_report)
combined_report.to_csv('combined_report.csv')
应该产生结果
DJR Number_x Date_x Foreman Labour Name Start Finish Hours Worked DJR Number_y Date_y Plant Plant Hours
0 1 1/01/2020 Emersen Tai 6:30 15:00 8:30 1 1/01/2020 1 5
1 1 1/01/2020 Emersen Rachel 6:30 15:00 8:30
2 1 1/01/2020 Emersen Andrew 6:30 15:00 8:30
3 1 1/01/2020 Emersen Liam 6:30 15:00 8:30
4 2 2/01/2020 Scott Rachel 6:30 15:00 8:30 2 1/01/2020 2 5
5 3 3/01/2020 Michael Andrew 6:30 15:00 8:30 3 1/01/2020 2 5
假设您不需要在最终结果中使用重复的 'DJR Number' 列。我们可以根据 DJR 编号加入两个 Dataframe。重复的函数将 return 一系列关于它们是否重复的布尔值。如果它们为 True,我们为其分配空值。
merged = labour.merge(plant,on='DJR Number')
merged.loc[merged['DJR Number'].duplicated(),['Date_y','Foreman_y','Plant','Plant Hours']] = np.NaN
如果您想要更清晰的结果,请改为合并三列。
merged = labour.merge(plant,on=['DJR Number','Date','Foreman'])
merged.loc[merged['DJR Number'].duplicated(),['Plant','Plant Hours']] = np.NaN
这不会导致任何重复列。
请参阅 Pandas Duplicated 文档。