How to insert new rows in pandas based on an hour-difference criterion
I have the following DataFrame:
Matricule Startdate Starthour Enddate Endhour
0 5357 2019-01-08 14:21:06 2019-01-08 14:34:42
1 5357 2019-01-08 15:29:23 2019-01-08 15:33:43
2 5357 2019-01-08 19:51:11 2019-01-08 20:02:48
3 5357 2019-03-08 20:05:49 2019-03-08 21:04:52
4 aaaa 2019-01-08 14:17:51 2019-01-08 14:32:10
5 aaaa 2019-01-08 18:21:16 2019-01-08 18:39:26
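For anyone who wants to reproduce the problem, the sample above could be built roughly like this. This is only a sketch: it assumes every column is stored as a string (the question does not state the dtypes) and that Matricule is text, since it mixes values like 5357 and aaaa.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: all columns are plain strings; the real dtypes are not given.
df = pd.DataFrame(
    {
        "Matricule": ["5357", "5357", "5357", "5357", "aaaa", "aaaa"],
        "Startdate": ["2019-01-08", "2019-01-08", "2019-01-08",
                      "2019-03-08", "2019-01-08", "2019-01-08"],
        "Starthour": ["14:21:06", "15:29:23", "19:51:11",
                      "20:05:49", "14:17:51", "18:21:16"],
        "Enddate": ["2019-01-08", "2019-01-08", "2019-01-08",
                    "2019-03-08", "2019-01-08", "2019-01-08"],
        "Endhour": ["14:34:42", "15:33:43", "20:02:48",
                    "21:04:52", "14:32:10", "18:39:26"],
    }
)
print(df)
```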
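I am trying to build a table in which a new row is inserted between two consecutive rows whenever the gap between the end time of the first row and the start time of the second row is greater than 30 minutes.
The inserted row has the same attributes as the previous row. Here is an example: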
Matricule Startdate Starthour Enddate Endhour
0 5357 2019-01-08 14:21:06 2019-01-08 14:34:42
1 5357 2019-01-08 14:34:42 2019-01-08 15:04:42
2 5357 2019-01-08 15:29:23 2019-01-08 15:33:43
3 5357 2019-01-08 15:33:43 2019-01-08 16:03:43
4 5357 2019-01-08 19:51:11 2019-01-08 20:02:48
5 5357 2019-03-08 20:05:49 2019-03-08 21:04:52
6 aaaa 2019-01-08 14:17:51 2019-01-08 14:32:10
7 aaaa 2019-01-08 14:32:10 2019-01-08 15:02:10
8 aaaa 2019-01-08 18:21:16 2019-01-08 18:39:26
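First, I created new columns that combine the date and the time into a single datetime object:
import numpy as np
import pandas as pd
# astype(str) keeps the concatenation working even if the columns are not strings
df['start'] = df['Startdate'].astype(str) + " " + df['Starthour'].astype(str)
df['start'] = pd.to_datetime(df['start'])
df['end'] = df['Enddate'].astype(str) + " " + df['Endhour'].astype(str)
df['end'] = pd.to_datetime(df['end'])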
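If you would rather not rely on format inference, the same combination can be written with an explicit format string. This is a small sketch on a hypothetical two-row frame named toy; it assumes the dates are year-month-day, which the sample suggests but does not confirm.

```python
import pandas as pd

# Hypothetical two-row frame, used only to show the parsing step.
toy = pd.DataFrame(
    {
        "Startdate": ["2019-01-08", "2019-03-08"],
        "Starthour": ["14:21:06", "20:05:49"],
    }
)

# Assumption: dates are YYYY-MM-DD and times are HH:MM:SS.
toy["start"] = pd.to_datetime(
    toy["Startdate"].astype(str) + " " + toy["Starthour"].astype(str),
    format="%Y-%m-%d %H:%M:%S",
)
print(toy.dtypes)
```

With an explicit format, a value that does not match raises an error instead of being parsed in an unexpected way.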
Next, compute the gap to the next record, making sure the frame is sorted first:
df = df.sort_values(['Matricule','start'])
df['gap_to_next'] = (df['start'].shift(-1) - df['end'])
Clear the gap on the last row of each Matricule, where the next row belongs to a different Matricule:
cut = df['Matricule'] != df['Matricule'].shift(-1)
df.loc[cut, 'gap_to_next'] = np.nan
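As a possible alternative to the shift-then-mask steps above, the shift can be done per group so that the last row of each Matricule gets NaT directly. A minimal sketch on a hypothetical four-row frame named toy:

```python
import pandas as pd

# Hypothetical frame with already-parsed start/end columns.
toy = pd.DataFrame(
    {
        "Matricule": ["5357", "5357", "aaaa", "aaaa"],
        "start": pd.to_datetime(
            ["2019-01-08 14:21:06", "2019-01-08 15:29:23",
             "2019-01-08 14:17:51", "2019-01-08 18:21:16"]
        ),
        "end": pd.to_datetime(
            ["2019-01-08 14:34:42", "2019-01-08 15:33:43",
             "2019-01-08 14:32:10", "2019-01-08 18:39:26"]
        ),
    }
)

toy = toy.sort_values(["Matricule", "start"])
# Shifting within each group never pulls a start time from another Matricule,
# so the last row of every group ends up with NaT.
toy["gap_to_next"] = toy.groupby("Matricule")["start"].shift(-1) - toy["end"]
print(toy)
```

This does the same job as the cut mask; which one reads better is a matter of taste.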
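Define a boolean Series marking where a new row needs to be inserted. I used the 30 minutes from your request, but also added a check that the gap is less than 1 day, because one case in your sample seems to imply that. Adjust as needed: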
should_insert_next = ( (df['gap_to_next'] > pd.Timedelta(30, 'min')) & (df['gap_to_next'] < pd.Timedelta(24, 'hr')) )
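To see how the two thresholds behave, here is a small sketch on a hypothetical Series of gaps. The first three values are the gaps that occur for Matricule 5357 in the sample; the NaT stands for a group boundary and the 10-minute gap is made up.

```python
import pandas as pd

# Hypothetical gaps: 54m41s, 4h17m28s, about 59 days, a group boundary, 10 minutes.
gaps = pd.Series(
    [
        pd.Timedelta(minutes=54, seconds=41),
        pd.Timedelta(hours=4, minutes=17, seconds=28),
        pd.Timedelta(days=59, minutes=3, seconds=1),
        pd.NaT,
        pd.Timedelta(minutes=10),
    ],
    dtype="timedelta64[ns]",
)

low = pd.Timedelta(30, "min")
high = pd.Timedelta(24, "h")

# Comparisons against NaT evaluate to False, so group boundaries are skipped.
mask = (gaps > low) & (gaps < high)
print(mask.tolist())  # [True, True, False, False, False]
```

Only gaps strictly between 30 minutes and 24 hours are flagged, which is why the jump to 2019-03-08 does not produce a new row.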
Copy just those rows:
new_rows = df[should_insert_next].copy()
Use these rows as templates and adjust their times to those of the rows you want to insert. You seem to want each new record to last 30 minutes:
new_rows['start'] = new_rows['end']
new_rows['end'] = new_rows['start'] + pd.Timedelta(30, 'min')
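Write the adjusted times back to the original date and hour columns. If those columns are not strings in your data, you can add a step after this to convert them to whatever type they are: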
new_rows['Startdate'] = new_rows['start'].dt.strftime("%Y-%m-%d")
new_rows['Enddate'] = new_rows['end'].dt.strftime("%Y-%m-%d")
new_rows['Starthour'] = new_rows['start'].dt.strftime("%H:%M:%S")
new_rows['Endhour'] = new_rows['end'].dt.strftime("%H:%M:%S")
Finally, concatenate the old and new rows and re-sort:
final = pd.concat([df, new_rows])
final = final.sort_values(['Matricule','start'])
final = final.drop(columns=['gap_to_next','start','end'])
final = final.reset_index(drop=True)
That gives:
print(final)
Matricule Startdate Starthour Enddate Endhour
0 5357 2019-01-08 14:21:06 2019-01-08 14:34:42
1 5357 2019-01-08 14:34:42 2019-01-08 15:04:42
2 5357 2019-01-08 15:29:23 2019-01-08 15:33:43
3 5357 2019-01-08 15:33:43 2019-01-08 16:03:43
4 5357 2019-01-08 19:51:11 2019-01-08 20:02:48
5 5357 2019-03-08 20:05:49 2019-03-08 21:04:52
6 aaaa 2019-01-08 14:17:51 2019-01-08 14:32:10
7 aaaa 2019-01-08 14:32:10 2019-01-08 15:02:10
8 aaaa 2019-01-08 18:21:16 2019-01-08 18:39:26
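For convenience, the steps above could be wrapped into a single function. This is only a sketch under a few assumptions: the function name insert_gap_rows and its parameters min_gap, max_gap and fill are made up for illustration, the date and hour columns are strings, and a per-group shift is used in place of the separate cut step.

```python
import pandas as pd


def insert_gap_rows(df, min_gap="30min", max_gap="24h", fill="30min"):
    """Add a filler row after each record whose gap to the next record of
    the same Matricule is strictly between min_gap and max_gap."""
    min_gap, max_gap, fill = (pd.Timedelta(x) for x in (min_gap, max_gap, fill))
    out = df.copy()
    out["start"] = pd.to_datetime(out["Startdate"].astype(str) + " " + out["Starthour"].astype(str))
    out["end"] = pd.to_datetime(out["Enddate"].astype(str) + " " + out["Endhour"].astype(str))
    out = out.sort_values(["Matricule", "start"])

    # Gap to the next record of the same Matricule (NaT on each group's last row).
    gap = out.groupby("Matricule")["start"].shift(-1) - out["end"]
    new_rows = out[(gap > min_gap) & (gap < max_gap)].copy()

    # Each filler row starts where its template ended and lasts `fill`.
    new_rows["start"] = new_rows["end"]
    new_rows["end"] = new_rows["start"] + fill
    new_rows["Startdate"] = new_rows["start"].dt.strftime("%Y-%m-%d")
    new_rows["Enddate"] = new_rows["end"].dt.strftime("%Y-%m-%d")
    new_rows["Starthour"] = new_rows["start"].dt.strftime("%H:%M:%S")
    new_rows["Endhour"] = new_rows["end"].dt.strftime("%H:%M:%S")

    final = pd.concat([out, new_rows]).sort_values(["Matricule", "start"])
    return final.drop(columns=["start", "end"]).reset_index(drop=True)


# Sample data from the question (assumed to be stored as strings).
df = pd.DataFrame(
    {
        "Matricule": ["5357", "5357", "5357", "5357", "aaaa", "aaaa"],
        "Startdate": ["2019-01-08", "2019-01-08", "2019-01-08",
                      "2019-03-08", "2019-01-08", "2019-01-08"],
        "Starthour": ["14:21:06", "15:29:23", "19:51:11",
                      "20:05:49", "14:17:51", "18:21:16"],
        "Enddate": ["2019-01-08", "2019-01-08", "2019-01-08",
                    "2019-03-08", "2019-01-08", "2019-01-08"],
        "Endhour": ["14:34:42", "15:33:43", "20:02:48",
                    "21:04:52", "14:32:10", "18:39:26"],
    }
)

result = insert_gap_rows(df)
print(result)
```

Keeping the thresholds as parameters makes it easy to drop the 24-hour cap or change the length of the inserted records without touching the body of the function.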