Find the maximum value in a column of one dataframe given the value constraints of two columns in another dataframe
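I have a dataframe df1 with two columns that represent the start and end time of a task. I have another dataframe df2 with two columns that represent a time and the stock available at that time. I want to create another column in df1 called max_stock that holds the maximum value of stock within the time range given by ST and ET of df1. For example, the first task has a start time of 7/11/2021 1:00 and an end time of 7/11/2021 2:00, so the value of max_stock is the maximum of the values in the stock column of df2 at times 7/11/2021 1:00, 7/11/2021 1:30 and 7/11/2021 2:00, which are 10, 26 and 48 respectively, giving 48.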
df1
ST ET
7/11/2021 1:00 7/11/2021 2:00
7/11/2021 2:00 7/11/2021 3:00
7/11/2021 3:00 7/11/2021 4:00
7/11/2021 4:00 7/11/2021 5:00
7/11/2021 5:00 7/11/2021 6:00
7/11/2021 6:00 7/11/2021 7:00
7/11/2021 7:00 7/11/2021 8:00
7/11/2021 8:00 7/11/2021 9:00
7/11/2021 9:00 7/11/2021 10:00
df2
Time stock
7/11/2021 1:00 10
7/11/2021 1:30 26
7/11/2021 2:00 48
7/11/2021 2:30 35
7/11/2021 3:00 32
7/11/2021 3:30 80
7/11/2021 4:00 31
7/11/2021 4:30 81
7/11/2021 5:00 65
7/11/2021 5:30 83
7/11/2021 6:00 40
7/11/2021 6:30 84
7/11/2021 7:00 41
7/11/2021 7:30 15
7/11/2021 8:00 65
7/11/2021 8:30 18
7/11/2021 9:00 80
7/11/2021 9:30 12
7/11/2021 10:00 5
Required df
ST ET max_stock
7/11/2021 1:00 7/11/2021 2:00 48.00
7/11/2021 2:00 7/11/2021 3:00 48.00
7/11/2021 3:00 7/11/2021 4:00 80.00
7/11/2021 4:00 7/11/2021 5:00 81.00
7/11/2021 5:00 7/11/2021 6:00 83.00
7/11/2021 6:00 7/11/2021 7:00 84.00
7/11/2021 7:00 7/11/2021 8:00 65.00
7/11/2021 8:00 7/11/2021 9:00 80.00
7/11/2021 9:00 7/11/2021 10:00 80.00
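To make the requirement concrete, here is a minimal sketch that rebuilds the sample data and computes max_stock one row at a time with a boolean mask. It assumes ST, ET and Time are parsed as datetimes and that both ends of the window are inclusive, as in the example above; the helper name max_stock_in_window is hypothetical. It is meant as a readable baseline, not a fast solution.

```python
import pandas as pd

# Rebuild the sample frames from the question; times are real datetimes.
df1 = pd.DataFrame({
    "ST": pd.date_range("2021-07-11 01:00", periods=9, freq="60min"),
    "ET": pd.date_range("2021-07-11 02:00", periods=9, freq="60min"),
})
df2 = pd.DataFrame({
    "Time": pd.date_range("2021-07-11 01:00", periods=19, freq="30min"),
    "stock": [10, 26, 48, 35, 32, 80, 31, 81, 65, 83,
              40, 84, 41, 15, 65, 18, 80, 12, 5],
})


def max_stock_in_window(row):
    # Hypothetical helper: keep df2 rows whose Time lies in [ST, ET]
    # (inclusive on both ends) and take the largest stock value.
    in_window = df2["Time"].between(row["ST"], row["ET"])
    return df2.loc[in_window, "stock"].max()


df1["max_stock"] = df1.apply(max_stock_in_window, axis=1)
print(df1)
```

This scans df2 once per row of df1, so it is only reasonable for small inputs.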
One option is conditional_join from pyjanitor, which simulates the greater-than and less-than conditions before grouping and aggregating:
# pip install pyjanitor
import pandas as pd
import janitor

(df1.conditional_join(
        df2,
        ('ST', 'Time', '<='),
        ('ET', 'Time', '>='))
    .groupby(['ST', 'ET'], as_index = False)
    .stock
    .max()
)
ST ET stock
0 2021-07-11 01:00:00 2021-07-11 02:00:00 48
1 2021-07-11 02:00:00 2021-07-11 03:00:00 48
2 2021-07-11 03:00:00 2021-07-11 04:00:00 80
3 2021-07-11 04:00:00 2021-07-11 05:00:00 81
4 2021-07-11 05:00:00 2021-07-11 06:00:00 83
5 2021-07-11 06:00:00 2021-07-11 07:00:00 84
6 2021-07-11 07:00:00 2021-07-11 08:00:00 65
7 2021-07-11 08:00:00 2021-07-11 09:00:00 80
8 2021-07-11 09:00:00 2021-07-11 10:00:00 80
Alternatively, you could use a cartesian join and a filter (for large dataframes, this may be memory inefficient):
(df1.merge(df2, how='cross')
    .query('ST <= Time <= ET')
    .groupby(['ST', 'ET'], as_index = False)
    .stock
    .max()
)
ST ET stock
0 2021-07-11 01:00:00 2021-07-11 02:00:00 48
1 2021-07-11 02:00:00 2021-07-11 03:00:00 48
2 2021-07-11 03:00:00 2021-07-11 04:00:00 80
3 2021-07-11 04:00:00 2021-07-11 05:00:00 81
4 2021-07-11 05:00:00 2021-07-11 06:00:00 83
5 2021-07-11 06:00:00 2021-07-11 07:00:00 84
6 2021-07-11 07:00:00 2021-07-11 08:00:00 65
7 2021-07-11 08:00:00 2021-07-11 09:00:00 80
8 2021-07-11 09:00:00 2021-07-11 10:00:00 80
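To illustrate the memory concern with the cartesian join, the sketch below counts how many rows the cross join produces on the sample data and how many survive the filter. It rebuilds the sample frames itself and uses boolean masks in place of query, which is assumed to be equivalent here; the names crossed and kept are hypothetical.

```python
import pandas as pd

# Sample frames from the question, with times as datetimes.
df1 = pd.DataFrame({
    "ST": pd.date_range("2021-07-11 01:00", periods=9, freq="60min"),
    "ET": pd.date_range("2021-07-11 02:00", periods=9, freq="60min"),
})
df2 = pd.DataFrame({
    "Time": pd.date_range("2021-07-11 01:00", periods=19, freq="30min"),
    "stock": [10, 26, 48, 35, 32, 80, 31, 81, 65, 83,
              40, 84, 41, 15, 65, 18, 80, 12, 5],
})

# how='cross' (pandas 1.2 or later) pairs every df1 row with every df2 row.
crossed = df1.merge(df2, how="cross")

# Keep only the pairs where Time falls inside [ST, ET].
kept = crossed[(crossed["ST"] <= crossed["Time"]) & (crossed["Time"] <= crossed["ET"])]

print(len(df1) * len(df2), len(crossed), len(kept))
```

The intermediate frame has len(df1) * len(df2) rows before filtering, which is what grows quickly as either input gets larger.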
Another option is to use an interval index (the process is longer here, because the generated intervals overlap):
box = pd.IntervalIndex.from_arrays(df1.ST, df1.ET, closed='both')
df1.index = box
# create temporary Series
temp = (df2.Time
        .apply(lambda x: box[box.get_loc(x)])
        .explode(ignore_index = False)
       )
temp.name = 'interval'
# lump back to main dataframe (df2)
temp = pd.concat([df2, temp], axis = 1)
# aggregate:
temp = temp.groupby('interval').stock.max()
# join back to df1 to get final output
df1.join(temp).reset_index(drop=True)
ST ET stock
0 2021-07-11 01:00:00 2021-07-11 02:00:00 48
1 2021-07-11 02:00:00 2021-07-11 03:00:00 48
2 2021-07-11 03:00:00 2021-07-11 04:00:00 80
3 2021-07-11 04:00:00 2021-07-11 05:00:00 81
4 2021-07-11 05:00:00 2021-07-11 06:00:00 83
5 2021-07-11 06:00:00 2021-07-11 07:00:00 84
6 2021-07-11 07:00:00 2021-07-11 08:00:00 65
7 2021-07-11 08:00:00 2021-07-11 09:00:00 80
8 2021-07-11 09:00:00 2021-07-11 10:00:00 80
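The overlap mentioned above comes from closed='both': consecutive tasks share an endpoint, so a boundary time such as 02:00 belongs to two intervals. The sketch below checks this on the sample start and end times using IntervalIndex.contains and is_overlapping; the variable names are hypothetical.

```python
import pandas as pd

# Start and end times of the nine tasks from the question.
st = pd.date_range("2021-07-11 01:00", periods=9, freq="60min")
et = pd.date_range("2021-07-11 02:00", periods=9, freq="60min")
box = pd.IntervalIndex.from_arrays(st, et, closed="both")

boundary = pd.Timestamp("2021-07-11 02:00")  # end of task 0, start of task 1
interior = pd.Timestamp("2021-07-11 01:30")  # strictly inside task 0

# contains() returns one boolean per interval; summing counts the matches.
hits_boundary = int(box.contains(boundary).sum())
hits_interior = int(box.contains(interior).sum())
overlapping = bool(box.is_overlapping)

print(hits_boundary, hits_interior, overlapping)
```

Because a boundary time can match more than one interval, the lookup in the answer may return several intervals for one row of df2, which is why the explode step is needed.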