Merge three Pandas dataframes between datetimes and add the corresponding column
Given three dataframes df1, df2 and df3, how can I join them so that each df3 timestamp falls between the start and end times in df1 and df2?
I need to merge the JobID into df3 depending on whether the df3 'Timestamp' lies between 'Start Time' and 'End Time' in df1 or df2, while also matching the node (No.).
df1 (1230 rows * 3 columns)
Node Start Time End Time JobID
A 00:03:50 00:05:45 12345
A 00:06:10 00:07:39 56789
A 00:08:30 00:10:45 34567
.
.
.
df2 (1130 rows * 3 columns)
Node Start Time End Time JobID
B 00:02:30 00:07:35 13579
B 00:08:56 00:09:39 24680
B 00:10:32 00:13:47 14680
.
.
.
df3 (4002 rows * 3 columns)
Node Timestamp
A 00:05:42
A 00:09:50
A 00:11:27
B 00:04:48
B 00:09:59
B 00:10:32
.
.
.
.
Expected output:
df3 (4002 rows * 3 columns)
No. Timestamp Job ID
A 00:05:42 12345
A 00:09:50 34567
A 00:11:27 NaN
B 00:04:48 13579
B 00:09:59 NaN
B 00:10:32 14680
.
.
.
.
You can use .merge() and filter with .between(), as follows:
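import pandas as pd
# Pair every df1 interval with every df3 row on the same node, then keep timestamps inside the interval
df1_3 = df1.merge(df3, on='Node')
df1_3_filtered = df1_3[df1_3['Timestamp'].between(df1_3['Start Time'], df1_3['End Time'])]
# Same for df2
df2_3 = df2.merge(df3, on='Node')
df2_3_filtered = df2_3[df2_3['Timestamp'].between(df2_3['Start Time'], df2_3['End Time'])]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_out = pd.concat([df1_3_filtered, df2_3_filtered])[['Node', 'JobID', 'Timestamp']]
# Left-merge back onto df3 so unmatched timestamps keep NaN
df_out = df3.merge(df_out, how='left')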
Result:
print(df_out)
Node Timestamp JobID
0 A 00:05:42 12345.0
1 A 00:09:50 34567.0
2 A 00:11:27 NaN
3 B 00:04:48 13579.0
4 B 00:09:59 NaN
5 B 00:10:32 14680.0
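For anyone who wants to try the merge-and-filter approach end to end, here is a minimal self-contained sketch built from the sample rows in the question. Two things in it are my own assumptions rather than part of the answer: the time strings are parsed with pd.to_timedelta so the comparison is numeric, and the helper name match_jobs is hypothetical.

```python
import pandas as pd

# Sample rows copied from the question
df1 = pd.DataFrame({
    "Node": ["A", "A", "A"],
    "Start Time": ["00:03:50", "00:06:10", "00:08:30"],
    "End Time": ["00:05:45", "00:07:39", "00:10:45"],
    "JobID": [12345, 56789, 34567],
})
df2 = pd.DataFrame({
    "Node": ["B", "B", "B"],
    "Start Time": ["00:02:30", "00:08:56", "00:10:32"],
    "End Time": ["00:07:35", "00:09:39", "00:13:47"],
    "JobID": [13579, 24680, 14680],
})
df3 = pd.DataFrame({
    "Node": ["A", "A", "A", "B", "B", "B"],
    "Timestamp": ["00:05:42", "00:09:50", "00:11:27",
                  "00:04:48", "00:09:59", "00:10:32"],
})

# Assumption: parse HH:MM:SS strings as Timedelta so comparisons are numeric
for frame in (df1, df2):
    frame["Start Time"] = pd.to_timedelta(frame["Start Time"])
    frame["End Time"] = pd.to_timedelta(frame["End Time"])
df3["Timestamp"] = pd.to_timedelta(df3["Timestamp"])


def match_jobs(intervals: pd.DataFrame, stamps: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: pair rows on Node, keep timestamps inside the interval."""
    pairs = intervals.merge(stamps, on="Node")
    inside = pairs["Timestamp"].between(pairs["Start Time"], pairs["End Time"])
    return pairs.loc[inside, ["Node", "Timestamp", "JobID"]]


matched = pd.concat([match_jobs(df1, df3), match_jobs(df2, df3)], ignore_index=True)
# Left merge keeps every df3 row; timestamps outside all intervals get NaN
df_out = df3.merge(matched, on=["Node", "Timestamp"], how="left")
print(df_out)
```

Note that .between() is inclusive on both ends by default, which is why B 00:10:32 matches the job that starts at exactly 00:10:32.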
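Edit
If you have several dataframes with the same structure as df1 and df2 and want to merge them with df3, you can do it this way:
Put all the dataframes into the list List_dfs below:
List_dfs = [df1, df2] # put all your dataframes of same structure here
Then run the code below. df_out will hold the merged and filtered result for all of these dataframes: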
filtered_parts = []  # accumulate the filtered result of each dataframe
for df in List_dfs:
    dfx_3 = df.merge(df3, on='Node')
    dfx_3_filtered = dfx_3[dfx_3['Timestamp'].between(dfx_3['Start Time'], dfx_3['End Time'])]
    filtered_parts.append(dfx_3_filtered)  # list append; DataFrame.append was removed in pandas 2.0
df_all_filtered = pd.concat(filtered_parts)
df_out = df_all_filtered[['Node', 'JobID', 'Timestamp']]
df_out = df3.merge(df_out, how='left')
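Since every dataframe in the list has the same columns, the loop can probably be collapsed by concatenating the interval tables first and merging once. The sketch below shows that variant on a tiny made-up subset of the question's rows; the helper name assign_jobs and the variable names are hypothetical, and the time columns are left as zero-padded HH:MM:SS strings, which compare in the right order as plain strings.

```python
import pandas as pd


def assign_jobs(interval_frames, stamps):
    """Hypothetical helper: label each row of `stamps` with the JobID
    whose [Start Time, End Time] interval on the same Node contains it."""
    # Stack all interval tables (they share the same columns)
    intervals = pd.concat(interval_frames, ignore_index=True)
    # Pair every interval with every timestamp on the same node
    pairs = intervals.merge(stamps, on="Node")
    inside = pairs["Timestamp"].between(pairs["Start Time"], pairs["End Time"])
    matched = pairs.loc[inside, ["Node", "Timestamp", "JobID"]]
    # Left merge keeps unmatched timestamps with NaN
    return stamps.merge(matched, on=["Node", "Timestamp"], how="left")


# Small demo using a few rows from the question
df_a = pd.DataFrame({"Node": ["A"], "Start Time": ["00:03:50"],
                     "End Time": ["00:05:45"], "JobID": [12345]})
df_b = pd.DataFrame({"Node": ["B"], "Start Time": ["00:02:30"],
                     "End Time": ["00:07:35"], "JobID": [13579]})
stamps = pd.DataFrame({"Node": ["A", "B", "B"],
                       "Timestamp": ["00:05:42", "00:04:48", "00:09:59"]})

result = assign_jobs([df_a, df_b], stamps)
print(result)
```

Either way, the merge on 'Node' builds every interval/timestamp pair per node before filtering, so memory use grows with the product of the row counts on each node.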
Another approach is to resample your shift data to one-second resolution and then merge on the resampled data.
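def resample_shifts(dataframe: pd.DataFrame, indices: list,
                    start_col: str, end_col: str) -> pd.DataFrame:
    # Expand each (start, end) interval into one row per second
    seconds = (dataframe.set_index(indices)
               .apply(lambda x: pd.date_range(x[start_col], x[end_col], freq='s'), axis=1)
               .explode()
               .rename('Timestamp')
               .reset_index())
    # explode() can leave object dtype; cast back so the merge key matches df3
    seconds['Timestamp'] = pd.to_datetime(seconds['Timestamp'])
    return seconds

# Column names follow the question's data ('Start Time', 'End Time')
df1a = resample_shifts(df1, ['Node', 'JobID'], 'Start Time', 'End Time')
df2a = resample_shifts(df2, ['Node', 'JobID'], 'Start Time', 'End Time')
df3['Timestamp'] = pd.to_datetime(df3['Timestamp'])
# Right merge keeps every df3 row; timestamps outside all intervals get NaN
df3a = pd.merge(pd.concat([df1a, df2a]), df3, on=['Node', 'Timestamp'], how='right')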
print(df3a)
Node JobID Timestamp
0 A 12345.0 2021-06-28 00:05:42
1 A 34567.0 2021-06-28 00:09:50
2 A NaN 2021-06-28 00:11:27
3 B 13579.0 2021-06-28 00:04:48
4 B NaN 2021-06-28 00:09:59
5 B 14680.0 2021-06-28 00:10:32