Creating missing time ranges in pandas
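I have a pandas DataFrame in which each row corresponds to a period of time for a given record. If a record has more than one period, there are gaps between them. I would like to fill in all the missing periods that fall between the end of the first period and the start of the last period.
My data looks like this: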
import pandas as pd
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
   record start_time  stop_time
0       1 2001-01-01 2001-01-15
1       1 2001-02-01 2001-02-28
2       2 2000-01-01 2001-01-31
3       2 2001-05-31 2001-08-16
4       2 2001-09-01 2001-09-30
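There are time gaps between rows 0 and 1 (the stop time is 2001-01-15 and the next start time is 2001-02-01, which leaves 16 uncovered days), as well as between rows 2 and 3, and rows 3 and 4. Gaps can only occur between the first and last row of a given record.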
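To make the gap arithmetic concrete, here is a minimal sketch that counts the uncovered days after each row. The names `next_start` and `missing_days` are only illustrative, and the sketch assumes rows are already ordered by start time within each record.

```python
import pandas as pd

df = pd.DataFrame({
    "record": [1, 1, 2, 2, 2],
    "start_time": pd.to_datetime(
        ["2001-01-01", "2001-02-01", "2000-01-01", "2001-05-31", "2001-09-01"]
    ),
    "stop_time": pd.to_datetime(
        ["2001-01-15", "2001-02-28", "2001-01-31", "2001-08-16", "2001-09-30"]
    ),
})

# Start of the next period within the same record (NaT on each record's last row).
next_start = df.groupby("record")["start_time"].shift(-1)

# Days not covered between this row's stop and the next row's start.
# Subtract 1 because the next start day itself is already covered.
missing_days = (next_start - df["stop_time"]).dt.days - 1

print(missing_days)
```

Rows where `missing_days` is NaN are the last row of a record, so they have no following period to compare against.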
What I want to achieve is:
   record start_time  stop_time
0       1 2001-01-01 2001-01-15
1       1 2001-01-16 2001-01-31
2       1 2001-02-01 2001-02-28
3       2 2000-01-01 2001-01-31
4       2 2001-02-01 2001-05-30
5       2 2001-05-31 2001-08-16
6       2 2001-08-17 2001-08-31
7       2 2001-09-01 2001-09-30
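That is, I want to add rows whose start and stop times fit those gaps. In the example above, record 1 would get one new row with a start date of 2001-01-16 and a stop date of 2001-01-31.
The full dataset has over 2 million rows across 1.5 million records, so I am looking for a reasonably efficient, vectorized solution in pandas that does not use apply.
Maybe something like this?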
import pandas as pd
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
one_day = pd.Timedelta('1d')
missing_dates = []
for record, df_per_record in df.groupby('record'):
    start_time = pd.to_datetime(df_per_record.start_time)
    stop_time = pd.to_datetime(df_per_record.stop_time)
    # Express both columns as whole days since this record's first start
    reference_date = pd.Timestamp(df_per_record.start_time.iloc[0])
    start_time_in_days = (start_time - reference_date) // one_day
    stop_time_in_days = (stop_time - reference_date) // one_day
    # Days from each stop to the next start; a value > 1 means days are missing
    dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
    mask = dates_diff > 1
    missing_start_dates = stop_time.iloc[:-1][mask] + one_day
    # Gap lengths selected with the same mask as the start dates
    missing_stop_dates = missing_start_dates + ((dates_diff[mask] - 2) * one_day)
    missing_dates.append(pd.DataFrame({"record": record, "start_time": missing_start_dates, "stop_time": missing_stop_dates}))
print(pd.concat([df] + missing_dates).sort_values(["record", "start_time"]))
Edit:
Version #2, this time without a for loop:
import pandas as pd
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
one_day = pd.Timedelta('1d')
# Assumes df is already sorted by record, then by start_time
start_time = pd.to_datetime(df.start_time)
stop_time = pd.to_datetime(df.stop_time)
reference_date = pd.Timestamp(df.start_time.iloc[0])
start_time_in_days = (start_time - reference_date) // one_day
stop_time_in_days = (stop_time - reference_date) // one_day
# Compare each row with the next one; only pairs within the same record count
is_same_record = df.record.iloc[1:].values == df.record.iloc[:-1].values
dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
mask = (dates_diff > 1) & is_same_record
missing_start_dates = stop_time.iloc[:-1][mask] + one_day
# Gap lengths selected with the same mask as the start dates
missing_stop_dates = missing_start_dates + ((dates_diff[mask] - 2) * one_day)
missing_dates = pd.DataFrame({"record": df.record.iloc[:-1][mask], "start_time": missing_start_dates, "stop_time": missing_stop_dates})
print(pd.concat([df, missing_dates]).sort_values(["record", "start_time"]).reset_index(drop=True))
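For comparison, the same idea can probably be written a little more compactly with `groupby(...).shift(-1)`, which looks up the next start time within each record directly. This is only a sketch of an alternative, not a benchmarked solution; the names `next_start`, `has_gap`, `gaps` and `filled` are made up for illustration, and it assumes periods within a record do not overlap.

```python
import pandas as pd

df = pd.DataFrame({
    "record": [1, 1, 2, 2, 2],
    "start_time": pd.to_datetime(
        ["2001-01-01", "2001-02-01", "2000-01-01", "2001-05-31", "2001-09-01"]
    ),
    "stop_time": pd.to_datetime(
        ["2001-01-15", "2001-02-28", "2001-01-31", "2001-08-16", "2001-09-30"]
    ),
})

one_day = pd.Timedelta(days=1)

# Make sure periods are in chronological order within each record.
df = df.sort_values(["record", "start_time"]).reset_index(drop=True)

# Next period's start within the same record; NaT on each record's last row.
next_start = df.groupby("record")["start_time"].shift(-1)

# A gap exists where the next period starts more than one day after this stop.
# Comparisons involving NaT evaluate to False, so last rows are excluded.
has_gap = (next_start - df["stop_time"]) > one_day

# Each gap runs from the day after this stop to the day before the next start.
gaps = pd.DataFrame({
    "record": df.loc[has_gap, "record"],
    "start_time": df.loc[has_gap, "stop_time"] + one_day,
    "stop_time": next_start[has_gap] - one_day,
})

filled = (
    pd.concat([df, gaps])
    .sort_values(["record", "start_time"])
    .reset_index(drop=True)
)
print(filled)
```

Because the shift is done per group, there is no need for a separate same-record check, and no conversion to integer day counts is required since the subtraction is done on the timestamps themselves.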