How to self-join a pandas dataframe on multiple columns and create a new frame with a new column (new column only has info from right side)
I have the following dataset.
ID LineID TeamID ShiftID DateTime Production Theoretical Scrap
1 3 1 NULL 18/6/2020 4:00 482.5291 511.2351
2 2 1 NULL 18/6/2020 5:00 467.8704 519.9842
3 1 1 NULL 18/6/2020 5:00 390.5945 480.2252
2186 3 1 NULL 18/6/2020 5:00 0 0.5
2520 2 1 NULL 18/6/2020 5:00 0 21
2840 1 1 NULL 18/6/2020 6:00 0 12
4 1 1 NULL 18/6/2020 6:00 389.2222 480.2252
5 3 1 NULL 18/6/2020 6:00 516.0907 511.2351
6 2 1 NULL 18/6/2020 6:00 450.5216 519.9842
7 3 1 NULL 18/6/2020 6:00 397.9998 511.2351
8 2 1 NULL 18/6/2020 7:00 456.9486 519.9842
9 1 1 NULL 18/6/2020 7:00 414.6932 480.2252
1939 2 1 NULL 18/6/2020 7:00 0 24
2462 3 1 NULL 18/6/2020 7:00 0 3
3075 1 1 NULL 18/6/2020 7:00 0 3.5
......
It sits in an Excel/CSV file.
I can use Python pandas and SQL. I want to do a self-join, but I don't know how to go about it.
I want to join on DateTime, LineID and TeamID, so the scrap values get merged into the blank area. Then I want the ID from the scrap rows to become a new "ScrapID" column, e.g.
ID LineID TeamID ShiftID DateTime Production Theoretical Scrap ScrapID
1 3 1 NULL 18/6/2020 4:00 482.5291 511.2351
2 2 1 NULL 18/6/2020 5:00 467.8704 519.9842 21 2520
3 1 1 NULL 18/6/2020 5:00 390.5945 480.2252 12 2186
4 1 1 NULL 18/6/2020 6:00 389.2222 480.2252
5 3 1 NULL 18/6/2020 6:00 516.0907 511.2351
6 2 1 NULL 18/6/2020 6:00 450.5216 519.9842
7 3 1 NULL 18/6/2020 6:00 397.9998 511.2351
8 2 1 NULL 18/6/2020 7:00 456.9486 519.9842 24 1939
9 1 1 NULL 18/6/2020 7:00 414.6932 480.2252 3.5 3075
......
I don't know how to do this.
I tried
df2 = df[df['Scrap'] > 0]
pd.merge(df, df2, left_on = ['LineID','TeamID','Date'], right_on = ['LineID','TeamID','Date'], how = 'left')
but that only makes the frame longer and doubles the number of columns.
I also tried
df2 = df[df['Scrap'] > 0]
pd.merge(df, df2[['Date','ID','LineID','TeamID','Scrap']], left_on = ['Date','LineID','TeamID'], right_on = ['Date','LineID','TeamID'], how = 'left')
but I got duplicates of some columns with some odd filling, and I'm not sure why.
ID_x LineID TeamID ShiftID Date Production Theoretical Scrap_x ID_y Scrap_y
0 1 3 1 NaN 2018-06-18 04:00:00 482.5291 511.2351 0.0 NaN NaN
1 2 2 1 NaN 2018-06-18 05:00:00 467.8704 519.9842 0.0 2520.0 21.00
2 2 2 1 NaN 2018-06-18 05:00:00 467.8704 519.9842 0.0 3063.0 2.50
3 3 1 1 NaN 2018-06-18 05:00:00 390.5945 480.2252 0.0 NaN NaN
4 2186 3 1 NaN 2018-06-18 05:00:00 0.000000 0.000000 0.5 2186.0 0.50
You should first split the dataframe according to whether the Scrap
column holds a positive value, and then join the two parts:
df1 = df.loc[~(df['Scrap'] > 0), ['LineID', 'TeamID', 'ShiftID',
                                  'DateTime', 'Production', 'Theoretical']]
df2 = df.loc[df['Scrap'] > 0, ['ID', 'LineID', 'TeamID', 'DateTime',
                               'Scrap']]
result = df1.merge(df2, how='left', on=['LineID', 'TeamID', 'DateTime'])
In my tests this gives:
LineID TeamID ShiftID DateTime Production Theoretical ID Scrap
0 3 1 NaN 2020-06-18 04:00:00 482.5291 511.2351 NaN NaN
1 2 1 NaN 2020-06-18 05:00:00 467.8704 519.9842 2520.0 21.0
2 1 1 NaN 2020-06-18 05:00:00 390.5945 480.2252 NaN NaN
3 1 1 NaN 2020-06-18 06:00:00 389.2222 480.2252 2840.0 12.0
4 3 1 NaN 2020-06-18 06:00:00 516.0907 511.2351 NaN NaN
5 2 1 NaN 2020-06-18 06:00:00 450.5216 519.9842 NaN NaN
6 3 1 NaN 2020-06-18 06:00:00 397.9998 511.2351 NaN NaN
7 2 1 NaN 2020-06-18 07:00:00 456.9486 519.9842 1939.0 24.0
8 1 1 NaN 2020-06-18 07:00:00 414.6932 480.2252 3075.0 3.5
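The merge above keeps the scrap row's ID under the name ID and drops the production row's own ID. A minimal, self-contained sketch (using a hypothetical four-row reconstruction of the sample data; ShiftID is omitted since it is NULL throughout) that keeps both IDs and renames the scrap-side ID to ScrapID before merging, matching the column names asked for in the question:

```python
import pandas as pd

# Hypothetical reconstruction of four rows of the sample data.
# Scrap is 0 on production rows and positive on scrap rows.
df = pd.DataFrame({
    'ID':          [2, 3, 2186, 2520],
    'LineID':      [2, 1, 3, 2],
    'TeamID':      [1, 1, 1, 1],
    'DateTime':    pd.to_datetime(['2020-06-18 05:00'] * 4),
    'Production':  [467.8704, 390.5945, 0.0, 0.0],
    'Theoretical': [519.9842, 480.2252, 0.0, 0.0],
    'Scrap':       [0.0, 0.0, 0.5, 21.0],
})

# Production rows on the left, keeping their own ID.
prod = df.loc[df['Scrap'] == 0, ['ID', 'LineID', 'TeamID', 'DateTime',
                                 'Production', 'Theoretical']]

# Scrap rows on the right; rename their ID to ScrapID so the merge
# produces the desired column instead of ID_x / ID_y suffixes.
scrap = (df.loc[df['Scrap'] > 0,
                ['ID', 'LineID', 'TeamID', 'DateTime', 'Scrap']]
           .rename(columns={'ID': 'ScrapID'}))

result = prod.merge(scrap, how='left',
                    on=['LineID', 'TeamID', 'DateTime'])
print(result)
```

Production rows with no matching scrap row (here ID 3, LineID 1) simply get NaN in ScrapID and Scrap, which reproduces the blank cells in the desired output.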