How to self-join a pandas dataframe on multiple columns and create a new frame with a new column (new column only has info from right side)

I have the following dataset.

ID      LineID  TeamID  ShiftID DateTime        Production  Theoretical  Scrap
1       3       1       NULL    18/6/2020 4:00  482.5291    511.2351    
2       2       1       NULL    18/6/2020 5:00  467.8704    519.9842
3       1       1       NULL    18/6/2020 5:00  390.5945    480.2252    
2186    3       1       NULL    18/6/2020 5:00  0                        0.5
2520    2       1       NULL    18/6/2020 5:00  0                        21
2840    1       1       NULL    18/6/2020 6:00  0                        12
4       1       1       NULL    18/6/2020 6:00  389.2222    480.2252        
5       3       1       NULL    18/6/2020 6:00  516.0907    511.2351    
6       2       1       NULL    18/6/2020 6:00  450.5216    519.9842    
7       3       1       NULL    18/6/2020 6:00  397.9998    511.2351    
8       2       1       NULL    18/6/2020 7:00  456.9486    519.9842    
9       1       1       NULL    18/6/2020 7:00  414.6932    480.2252        
1939    2       1       NULL    18/6/2020 7:00  0                        24
2462    3       1       NULL    18/6/2020 7:00  0                        3
3075    1       1       NULL    18/6/2020 7:00  0                        3.5
......
......
......

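It is stored in an Excel / CSV file.

I can use Python pandas and SQL. I want to do a self-join, but I don't know how to go about it.

I want to join on DateTime, LineID and TeamID, so that the scrap values are merged into the blank cells of the matching production rows. Then I want the ID of each row that has scrap to become a new "ScrapID" column, e.g.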

ID      LineID  TeamID  ShiftID DateTime        Production  Theoretical  Scrap ScrapID
1       3       1       NULL    18/6/2020 4:00  482.5291    511.2351     
2       2       1       NULL    18/6/2020 5:00  467.8704    519.9842     21    2520
3       1       1       NULL    18/6/2020 5:00  390.5945    480.2252     12    2186
4       1       1       NULL    18/6/2020 6:00  389.2222    480.2252        
5       3       1       NULL    18/6/2020 6:00  516.0907    511.2351    
6       2       1       NULL    18/6/2020 6:00  450.5216    519.9842    
7       3       1       NULL    18/6/2020 6:00  397.9998    511.2351    
8       2       1       NULL    18/6/2020 7:00  456.9486    519.9842     24    1939
9       1       1       NULL    18/6/2020 7:00  414.6932    480.2252     3.5   3075 
......
......
......

I don't know how to do this.

I tried

df2 = df[df['Scrap'] > 0]

pd.merge(df, df2, left_on = ['LineID','TeamID','Date'], right_on = ['LineID','TeamID','Date'], how = 'left')

But this only makes the frame longer and doubles the number of columns.

I also tried

df2 = df[df['Scrap'] > 0]

pd.merge(df, df2[['Date','ID','LineID','TeamID','Scrap']], left_on = ['Date','LineID','TeamID'], right_on = ['Date','LineID','TeamID'], how = 'left')

But I get duplicates of some columns, filled in a strange way, and I'm not sure why.

    ID_x    LineID  TeamID  ShiftID Date    Production  Theoretical Scrap_x ID_y    Scrap_y
0   1   3   1   NaN 2018-06-18 04:00:00 482.5291    511.2351    0.0 NaN NaN
1   2   2   1   NaN 2018-06-18 05:00:00 467.8704    519.9842    0.0 2520.0  21.00
2   2   2   1   NaN 2018-06-18 05:00:00 467.8704    519.9842    0.0 3063.0  2.50
3   3   1   1   NaN 2018-06-18 05:00:00 390.5945    480.2252    0.0 NaN NaN
4   2186    3   1   NaN 2018-06-18 05:00:00 0.000000    0.000000    0.5 2186.0  0.50

You should first split the dataframe according to whether the Scrap column holds a positive value, and then join the two parts:

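# Rows without positive scrap (NaN included): the production side of the join
df1 = df.loc[~(df['Scrap'] > 0), ['LineID', 'TeamID', 'ShiftID',
                                  'DateTime', 'Production', 'Theoretical']]
# Rows with positive scrap: keep only the join keys, the ID and the scrap value
df2 = df.loc[df['Scrap'] > 0, ['ID', 'LineID', 'TeamID', 'DateTime', 'Scrap']]
# A left join keeps every production row, whether or not it has matching scrap
result = df1.merge(df2, how='left', on=['LineID', 'TeamID', 'DateTime'])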

In my test, it gives:

   LineID  TeamID  ShiftID            DateTime  Production  Theoretical      ID  Scrap
0       3       1      NaN 2020-06-18 04:00:00    482.5291     511.2351     NaN    NaN
1       2       1      NaN 2020-06-18 05:00:00    467.8704     519.9842  2520.0   21.0
2       1       1      NaN 2020-06-18 05:00:00    390.5945     480.2252     NaN    NaN
3       1       1      NaN 2020-06-18 06:00:00    389.2222     480.2252  2840.0   12.0
4       3       1      NaN 2020-06-18 06:00:00    516.0907     511.2351     NaN    NaN
5       2       1      NaN 2020-06-18 06:00:00    450.5216     519.9842     NaN    NaN
6       3       1      NaN 2020-06-18 06:00:00    397.9998     511.2351     NaN    NaN
7       2       1      NaN 2020-06-18 07:00:00    456.9486     519.9842  1939.0   24.0
8       1       1      NaN 2020-06-18 07:00:00    414.6932     480.2252  3075.0    3.5
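
The output above drops the production row's own ID and leaves the scrap row's ID under the name ID, whereas the question asks for a separate ScrapID column. A minimal sketch of one way to get that layout follows. It uses the same split-and-merge idea on a small hand-built sample whose values are copied from the first rows of the question. The names `keys`, `has_scrap`, `production` and `scrap` are illustrative and not part of the original code, and the cast to the nullable `Int64` dtype is an optional extra that assumes a pandas version that supports it.

```python
import numpy as np
import pandas as pd

# Hand-built sample modeled on the question's first rows (not the full file).
df = pd.DataFrame({
    "ID": [1, 2, 3, 2186, 2520],
    "LineID": [3, 2, 1, 3, 2],
    "TeamID": [1, 1, 1, 1, 1],
    "DateTime": pd.to_datetime([
        "2020-06-18 04:00", "2020-06-18 05:00", "2020-06-18 05:00",
        "2020-06-18 05:00", "2020-06-18 05:00",
    ]),
    "Production": [482.5291, 467.8704, 390.5945, 0.0, 0.0],
    "Theoretical": [511.2351, 519.9842, 480.2252, np.nan, np.nan],
    "Scrap": [np.nan, np.nan, np.nan, 0.5, 21.0],
})

keys = ["LineID", "TeamID", "DateTime"]
has_scrap = df["Scrap"] > 0  # NaN compares as False, so blank cells count as "no scrap"

# Production side: keep its own ID, drop the empty Scrap column so the merge
# has no overlapping non-key columns (which is what produces _x / _y suffixes).
production = df.loc[~has_scrap].drop(columns="Scrap")

# Scrap side: only the join keys, the scrap value, and the ID renamed to ScrapID.
scrap = df.loc[has_scrap, ["ID", *keys, "Scrap"]].rename(columns={"ID": "ScrapID"})

result = production.merge(scrap, how="left", on=keys)

# Optional: unmatched rows force ScrapID to float; a nullable integer reads better.
result["ScrapID"] = result["ScrapID"].astype("Int64")
print(result)
```

If several scrap rows share the same LineID, TeamID and DateTime, a left join repeats the production row once per match. This is what happened in the question's second attempt, where ID 2 appears with both 2520 and 3063. In that case the scrap side would need to be aggregated or de-duplicated before merging, depending on what a single ScrapID should mean.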