How to merge these dataframes

Question

I have two dataframes of the form:

df1: 
     note_id  start_sentence  end_sentence  
0    476766             328           452   
1    476766             941          1065   
2    500941             377           522   
3    500941             797           963   
4    500941            1722          1917

and

df2:
    note_id  start  end  
0    476766  300    327   
1    500941  700    796

I want to merge these so that the join is on note_id, the difference between start and start_sentence is minimized, and start_sentence - start is positive; ideally, something like

df1.merge(df2, on=['note_id']).query('min(start_sentence - start)&(start_sentence > start)')

But of course this doesn't work.

The end goal would be a full outer join like:

  note_id  start_sentence  end_sentence    start      end
0    476766             328           452    300      327
1    476766             941          1065     NA       NA
2    500941             377           522     NA       NA
3    500941             797           963    700      796
4    500941            1722          1917     NA       NA

I know how to do this by iterating over the rows, but since I have hundreds of thousands of rows, it is slow.

Answer 1

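I would just do an inner merge and then update the values using all of your conditions.

Specifically, compute the difference only where it is > 0, then within each note_id group fill start and end with np.nan on every row except the one with the smallest difference.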

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'note_id': [476766, 476766, 500941, 500941, 500941],
 'start_sentence': [328, 941, 377, 797, 1722],
 'end_sentence': [452, 1065, 522, 963, 1917]})

df2 = pd.DataFrame({'note_id': [476766, 500941], 'start': [300, 700], 'end': [327, 796]})

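# Inner merge: pair every df1 row with the df2 row for the same note_id
df = pd.merge(df1,df2, on='note_id')

# Record the difference only where start_sentence - start is positive
df.loc[(df['start_sentence']-df['start']).gt(0),'diff'] = df['start_sentence']-df['start']
# Within each note_id, keep start/end only on the row with the smallest difference
df.loc[~df.index.isin(df.groupby('note_id')['diff'].idxmin()), ['start','end']] = np.nan
df.drop(columns='diff', inplace=True)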

print(df)

Output

   note_id  start_sentence  end_sentence  start    end
0   476766             328           452  300.0  327.0
1   476766             941          1065    NaN    NaN
2   500941             377           522    NaN    NaN
3   500941             797           963  700.0  796.0
4   500941            1722          1917    NaN    NaN
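
As an alternative, not part of the answer above, pd.merge_asof can do the "nearest start_sentence strictly after start" match directly, without first building every note_id pairing. The sketch below assumes each df2 row should attach to at most one df1 row; the names matched and result are only illustrative. If two df2 rows resolve to the same start_sentence, the final left merge will duplicate that df1 row.

```python
import pandas as pd

df1 = pd.DataFrame({
    'note_id': [476766, 476766, 500941, 500941, 500941],
    'start_sentence': [328, 941, 377, 797, 1722],
    'end_sentence': [452, 1065, 522, 963, 1917],
})
df2 = pd.DataFrame({'note_id': [476766, 500941], 'start': [300, 700], 'end': [327, 796]})

# merge_asof needs both frames sorted by their "on" keys.
# direction='forward' picks the first start_sentence at or after start;
# allow_exact_matches=False makes that strictly after (start_sentence > start).
matched = pd.merge_asof(
    df2.sort_values('start'),
    df1[['note_id', 'start_sentence']].sort_values('start_sentence'),
    left_on='start',
    right_on='start_sentence',
    by='note_id',
    direction='forward',
    allow_exact_matches=False,
)

# df2 rows with no later start_sentence get NaN here; drop them before joining back.
matched = matched.dropna(subset=['start_sentence'])

# The left merge keeps every df1 row and leaves start/end missing where nothing matched.
result = df1.merge(matched, on=['note_id', 'start_sentence'], how='left')
print(result)
```

Because the last step is a left merge from df1, note_id values that appear only in df1 are kept too, which the inner merge would drop.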
