Filling dataframe column with the latest values of another column

Question

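I have two dataframes, list1 and list2, each with a different number of rows and arbitrary, unordered index labels. list1 has about 240,000 rows and list2 has about 390,000 rows. Both are sorted by the ['time'] column from earliest to latest. They look roughly like this: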

list1

     time    rates
299  09:31   1.30
1230 10:34   2.42
32   13:40   1.49
     ...   ...

list2

     time    Symbol    IV
78   10:31   aqb       7
121  10:59   cdd       3
3240 11:19   oty       4
393  13:54   zqb       8
44   14:13   omu       1
     ...

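Every row in list2 has a ['time'] value. I want each row in list2 to carry the latest ['rates'] value from list1 whose time is not later than the row's own ['time'] value. Until a newer rate appears in list1, the same ['rates'] value can be filled into consecutive rows of list2 (sorry, I know this is confusing). An example of the desired result, with an explanation, is shown below.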

Desired result

     time    Symbol    IV    rates
78   10:31   aqb       7     1.30
121  10:59   cdd       3     2.42
3240 11:19   oty       4     2.42
393  13:54   zqb       8     1.49
44   14:13   omu       1     1.49

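The first row in list1 is from 09:31 and the second is from 10:34. The first row in list2 is at 10:31, so it should be filled with the ['rates'] value from 09:31 rather than the one from 10:34, because 10:34 is later than 10:31. The next row in list2 is at 10:59. The latest row in list1 that is not after 10:59 is the one at 10:34, so its value, 2.42, is filled in. The same applies to the third row in list2, at 11:19.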

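How can I do this without a for loop that slowly walks every row with iterrows() and runs a pile of if/else checks like the ones above? With several hundred thousand rows in each dataframe, that would take forever. Thanks!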

Answer 1

I simply merged the two dataframes on ['time'] with an indicator, then sorted the new dataframe by ['time']:

list2 = list2.merge(list1, how='outer', on=['time'], indicator=True)
list2 = list2.sort_values(['time'])

Then I filled the rows marked 'left_only', which therefore have NaN ['rates'] values, with the latest value from the rows marked 'right_only', using:

list2 = list2.ffill()

Then I dropped the rows that came from list1:

list2 = list2.loc[list2['_merge'] != 'right_only']
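
For anyone who wants to try these steps end to end, here is a minimal self-contained sketch on the sample data from the question. It is only an illustration under a few assumptions: the times are zero-padded HH:MM strings (so sorting them as text is chronological), only the rates column is forward-filled instead of the whole frame, and the names merged and result are made up for this example.

```python
import pandas as pd

# Sample frames copied from the question, including its arbitrary index labels.
list1 = pd.DataFrame(
    {"time": ["09:31", "10:34", "13:40"], "rates": [1.30, 2.42, 1.49]},
    index=[299, 1230, 32],
)
list2 = pd.DataFrame(
    {
        "time": ["10:31", "10:59", "11:19", "13:54", "14:13"],
        "Symbol": ["aqb", "cdd", "oty", "zqb", "omu"],
        "IV": [7, 3, 4, 8, 1],
    },
    index=[78, 121, 3240, 393, 44],
)

# The outer merge keeps every row of both frames; _merge records each row's origin.
merged = list2.merge(list1, how="outer", on="time", indicator=True)
# Assumption: zero-padded HH:MM strings sort chronologically as plain text.
merged = merged.sort_values("time", kind="stable")
# Carry the last seen rate forward into the rows that came only from list2.
merged["rates"] = merged["rates"].ffill()
# Drop the helper rows that exist only in list1, then the indicator column.
result = merged.loc[merged["_merge"] != "right_only"].drop(columns="_merge")
# The outer merge introduced NaN into IV (making it float), so cast it back.
result["IV"] = result["IV"].astype(int)
result = result.reset_index(drop=True)
print(result)
```

Note that the merge discards the original index labels of list2, so the result here gets a fresh 0..n-1 index.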

Answer 2

Use merge_asof:

import pandas as pd

# df1 and df2 correspond to list1 and list2 in the question
df1['time'] = pd.to_datetime(df1['time'], format='%H:%M')
df2['time'] = pd.to_datetime(df2['time'], format='%H:%M')
pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='backward')
Out[79]: 
                 time Symbol  IV  rates
0 1900-01-01 10:31:00    aqb   7   1.30
1 1900-01-01 10:59:00    cdd   3   2.42
2 1900-01-01 11:19:00    oty   4   2.42
3 1900-01-01 13:54:00    zqb   8   1.49
4 1900-01-01 14:13:00    omu   1   1.49
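
As the output shows, merge_asof returns a fresh 0..n-1 index, and the parsed time column carries a placeholder date of 1900-01-01. If the original index labels and the HH:MM strings from the question's desired result matter, one possible variation is sketched below. The temporary column name key and the variable names left, right, merged, and result are assumptions made for this illustration, not part of the answer above.

```python
import pandas as pd

# Sample frames copied from the question, including its arbitrary index labels.
list1 = pd.DataFrame(
    {"time": ["09:31", "10:34", "13:40"], "rates": [1.30, 2.42, 1.49]},
    index=[299, 1230, 32],
)
list2 = pd.DataFrame(
    {
        "time": ["10:31", "10:59", "11:19", "13:54", "14:13"],
        "Symbol": ["aqb", "cdd", "oty", "zqb", "omu"],
        "IV": [7, 3, 4, 8, 1],
    },
    index=[78, 121, 3240, 393, 44],
)

# Build a temporary datetime key and leave the original 'time' strings untouched.
# reset_index() moves list2's labels into a column named 'index' so they survive the merge.
left = list2.assign(key=pd.to_datetime(list2["time"], format="%H:%M")).reset_index()
right = list1.assign(key=pd.to_datetime(list1["time"], format="%H:%M"))[["key", "rates"]]

# merge_asof needs both sides sorted by the key; 'backward' picks the latest
# list1 row whose key is not later than the list2 row's key.
merged = pd.merge_asof(
    left.sort_values("key"),
    right.sort_values("key"),
    on="key",
    direction="backward",
)

# Restore the original labels and drop the helper column.
result = merged.set_index("index").drop(columns="key").rename_axis(None)
print(result)
```

A list2 row earlier than every list1 row would get NaN in rates, since there is no earlier rate to carry over.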


python

calculated-columns

dataframe

pandas