Get the overlap duration between date intervals based on a condition
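I have two dataframes, each with start/end datetime columns and a value column. They do not have the same number of rows, and overlapping intervals are not necessarily on the same row/index.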
df1
start_datetime end_datetime value
08:50 09:50 5
09:52 10:10 6
10:50 11:30 2
df2
start_datetime end_datetime value
08:51 08:59 3
09:52 10:02 9
10:03 10:30 1
11:03 11:39 1
13:10 13:15 0
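I want to compute the total duration during which df1 and df2 intervals overlap, counting only the overlaps where df1.value > df2.value.
Several df1 intervals can overlap a single df2 interval, and the condition holds for only some of them.
I tried something like: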
from datetime import timedelta
import pandas as pd

time = timedelta()
for i, row1 in df1.iterrows():
    t1 = pd.Interval(row1.start_datetime, row1.end_datetime)
    for j, row2 in df2.iterrows():
        t2 = pd.Interval(row2.start_datetime, row2.end_datetime)
        if t1.overlaps(t2) and row1.value > row2.value:
            # the overlap runs from the later start to the earlier end
            latest_start = max(row1.start_datetime, row2.start_datetime)
            earliest_end = min(row1.end_datetime, row2.end_datetime)
            delta = earliest_end - latest_start
            time += delta
I can loop over every df1 row and test it against all of df2, but that is not optimized.
Expected output (example):
Timedelta('0 days 00:99:99')
Here is my solution:
- Create the dataframes:
import pandas as pd

df1 = pd.DataFrame(
    {"start_datetime1": ["08:50", "09:52", "10:50"],
     "end_datetime1": ["09:50", "10:10", "11:30"],
     "value1": [5, 6, 2]})
df2 = pd.DataFrame(
    {"start_datetime2": ["08:51", "09:52", "10:03", "11:03", "13:10"],
     "end_datetime2": ["08:59", "10:02", "10:30", "11:39", "13:15"],
     "value2": [3, 9, 1, 1, 0]})
# parse the time strings into datetimes so they can be compared and subtracted
df2["start_datetime2"] = pd.to_datetime(df2["start_datetime2"])
df2["end_datetime2"] = pd.to_datetime(df2["end_datetime2"])
df1["start_datetime1"] = pd.to_datetime(df1["start_datetime1"])
df1["end_datetime1"] = pd.to_datetime(df1["end_datetime1"])
- Merge the dataframes to make the comparison easier. The combined dataframe contains every possible pairing of rows:
df1['temp'] = 1 #temporary keys to create all combinations
df2['temp'] = 1
df_combined = pd.merge(df1,df2,on='temp').drop('temp',axis=1)
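On pandas 1.2 or newer, the same all-combinations merge can likely be written without the temporary key by passing how="cross". A minimal sketch, assuming the column names from the step above; the times are left as plain strings for brevity, and the name df_cross is only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "start_datetime1": ["08:50", "09:52", "10:50"],
    "end_datetime1": ["09:50", "10:10", "11:30"],
    "value1": [5, 6, 2],
})
df2 = pd.DataFrame({
    "start_datetime2": ["08:51", "09:52", "10:03", "11:03", "13:10"],
    "end_datetime2": ["08:59", "10:02", "10:30", "11:39", "13:15"],
    "value2": [3, 9, 1, 1, 0],
})

# how="cross" pairs every row of df1 with every row of df2
# (assumption: pandas >= 1.2, where this option was introduced)
df_cross = df1.merge(df2, how="cross")
print(df_cross.shape)  # (15, 6): 3 x 5 row pairs, 6 columns
```

Note that the result grows as len(df1) * len(df2), so this approach suits small to medium inputs.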
- Compare the values with a lambda function. The overlap of two intervals runs from the later start to the earlier end:
df_combined['Result'] = df_combined.apply(
    lambda row: min(row["end_datetime1"], row["end_datetime2"]) -
                max(row["start_datetime1"], row["start_datetime2"])
    if pd.Interval(row['start_datetime1'], row['end_datetime1']).overlaps(
        pd.Interval(row['start_datetime2'], row['end_datetime2'])) and
       row["value1"] > row["value2"]
    else pd.Timedelta(0), axis=1)
df_combined
Result:
total_timedelta = df_combined['Result'].sum()
0 days 00:42:00
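Since the question asks for something more optimized than row-by-row work, the per-row lambda could probably be replaced with column-wise operations. This is a sketch under stated assumptions, not the answer's own method: it assumes pandas 1.2+ for how="cross", parses the times with an explicit "%H:%M" format, and takes each pair's overlap as the earlier end minus the later start. The names pairs, duration and mask are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "start_datetime1": ["08:50", "09:52", "10:50"],
    "end_datetime1": ["09:50", "10:10", "11:30"],
    "value1": [5, 6, 2],
})
df2 = pd.DataFrame({
    "start_datetime2": ["08:51", "09:52", "10:03", "11:03", "13:10"],
    "end_datetime2": ["08:59", "10:02", "10:30", "11:39", "13:15"],
    "value2": [3, 9, 1, 1, 0],
})

# Parse the time strings; the explicit format is an assumption about the data
for frame, cols in ((df1, ["start_datetime1", "end_datetime1"]),
                    (df2, ["start_datetime2", "end_datetime2"])):
    for col in cols:
        frame[col] = pd.to_datetime(frame[col], format="%H:%M")

# Every df1 row paired with every df2 row
pairs = df1.merge(df2, how="cross")

# Element-wise later start and earlier end for each pair
latest_start = np.maximum(pairs["start_datetime1"].to_numpy(),
                          pairs["start_datetime2"].to_numpy())
earliest_end = np.minimum(pairs["end_datetime1"].to_numpy(),
                          pairs["end_datetime2"].to_numpy())
duration = pd.Series(earliest_end - latest_start, index=pairs.index)

# Keep pairs that really overlap (positive duration) and satisfy the condition
mask = (duration > pd.Timedelta(0)) & (pairs["value1"] > pairs["value2"])
total_timedelta = duration[mask].sum()
print(total_timedelta)
```

A non-positive duration means the two intervals do not overlap, so no separate pd.Interval check is needed here.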