How can I insert a blank row in pandas and properly increment the index multiple times?
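I have 2 pandas DataFrames. Both have the same columns but a different number of rows, depending on which rows are missing. One of the columns is Date, in the format 29/09/2020 13.22.57.
For simplicity, and because it is irrelevant here, the day, month and year are sometimes omitted below.
A date in df may match one in df_2 exactly, or differ by an acceptable delay within a preset threshold, 2s in this case.
Sample data for df['Date']: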
13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
13.26.54
Sample data for df_2['Date']:
13.24.19
13.24.35
13.25.23
13.26.13
13.26.38
Expected:
df['Date']:
13.22.57
13.23.13
13.23.44
13.24.02
13.24.19
13.24.35
0
13.25.23
0
13.26.13
13.26.38
df_2['Date']:
13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
0
13.26.54
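The increment may happen on df or on df_2, depending on which one is missing the row at that time. In the end both should have the same number of rows: a row with no match in the other DataFrame gets a 0 value there, and the values below it are shifted down by one.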
DataFrames:
import pandas as pd
d = {'Date': ['13.24.19', '13.24.35','13.25.07', '13.25.23','13.26.00', '13.26.13','13.26.54'], 'col2': [1, 2, 3, 4, 5, 6, 7]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], format='%H.%M.%S')
d2 = {'Date': ['13.24.19', '13.24.35','13.25.23', '13.26.13','13.26.38'], 'col2': [1, 2, 3, 4, 5]}
df_2 = pd.DataFrame(data=d2)
df_2['Date'] = pd.to_datetime(df_2['Date'], format='%H.%M.%S')
This should work for each DataFrame:
df.loc[df.shape[0]] = [None for _ in range(len(df.columns))]
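The line above appends the blank row at the end of the frame. If the blank row has to land in the middle, with every later row shifting down by one, a minimal sketch is to reindex with a label that does not exist and then renumber. `insert_blank_row` is a hypothetical helper name, not part of pandas, and it assumes the frame can be given a default 0..n-1 index.

```python
import pandas as pd


def insert_blank_row(frame: pd.DataFrame, position: int) -> pd.DataFrame:
    """Return a copy of `frame` with an all-missing row inserted before `position`."""
    frame = frame.reset_index(drop=True)
    # -1 is not an existing label, so reindex creates a row of NaN/NaT for it.
    labels = [*range(position), -1, *range(position, len(frame))]
    return frame.reindex(labels).reset_index(drop=True)


frame = pd.DataFrame({
    "Date": pd.to_datetime(["13.24.19", "13.24.35", "13.25.23"], format="%H.%M.%S"),
    "col2": [1, 2, 3],
})
with_gap = insert_blank_row(frame, 2)
print(with_gap)
```

The integer column becomes float because it now holds a NaN; the gap can be replaced with 0 afterwards if that is the value wanted.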
This worked for me.
Note: this assumes len(df) > len(df_2).
import pandas as pd

a = {"Date": [
    "29-09-2020 13:22:57",
    "29-09-2020 13:23:12",
    "29-09-2020 13:23:44",
    "29-09-2020 13:24:01",
    "29-09-2020 13:24:19",
    "29-09-2020 13:24:35",
    "29-09-2020 13:25:07",
    "29-09-2020 13:25:23",
    "29-09-2020 13:26:00",
    "29-09-2020 13:26:13",
    "29-09-2020 13:26:54",
]}
b = {"Date": [
    "29-09-2020 13:22:57",
    "29-09-2020 13:23:13",
    "29-09-2020 13:23:44",
    "29-09-2020 13:24:02",
    "29-09-2020 13:24:19",
    "29-09-2020 13:24:35",
    "29-09-2020 13:25:23",
    "29-09-2020 13:26:13",
    "29-09-2020 13:26:38",
]}
# The strings are day-first (DD-MM-YYYY), so say so explicitly.
df = pd.DataFrame(a)
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df_2 = pd.DataFrame(b)
df_2["Date"] = pd.to_datetime(df_2["Date"], dayfirst=True)

def add_zero(dataframe, index, increment):
    # A fractional label (e.g. 5.5) sorts between its neighbours;
    # sorting and resetting the index then shifts the later rows down by one.
    dataframe.loc[index + increment] = 0
    dataframe = dataframe.sort_index().reset_index(drop=True)
    return dataframe

flag = True
idx = 0
while flag:
    # One frame has run out: pad it with a trailing 0 row and stop.
    if idx >= len(df_2["Date"]):
        df_2 = add_zero(df_2, idx, 0.5)
        break
    if idx >= len(df["Date"]):
        df = add_zero(df, idx, 0.5)
        break
    print(idx)
    print(df['Date'][idx])
    print(df_2['Date'][idx])
    # Signed gap in seconds between the two timestamps at this position.
    diff = (df['Date'][idx] - df_2['Date'][idx]).total_seconds()
    print(f"Diff: {diff}")
    if diff > 2:
        # df is ahead, so df is missing this timestamp: insert 0 before idx.
        df = add_zero(df, idx, -0.5)
        print("greater")
    elif diff < -2:
        # df_2 is ahead, so df_2 is missing this timestamp.
        df_2 = add_zero(df_2, idx, -0.5)
        print("smaller")
    else:
        print("Acceptable")
    idx = idx + 1
    if idx >= max(len(df_2), len(df)):
        flag = False
Output:
                   Date                Date2
0   2020-09-29 13:22:57  2020-09-29 13:22:57
1   2020-09-29 13:23:12  2020-09-29 13:23:13
2   2020-09-29 13:23:44  2020-09-29 13:23:44
3   2020-09-29 13:24:01  2020-09-29 13:24:02
4   2020-09-29 13:24:19  2020-09-29 13:24:19
5   2020-09-29 13:24:35  2020-09-29 13:24:35
6   2020-09-29 13:25:07                    0
7   2020-09-29 13:25:23  2020-09-29 13:25:23
8   2020-09-29 13:26:00                    0
9   2020-09-29 13:26:13  2020-09-29 13:26:13
10                    0  2020-09-29 13:26:38
11  2020-09-29 13:26:54                    0
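The loop above edits both frames while it scans them. As a rough sketch of the same walk without mutating anything, the two sorted timestamp lists can be traversed with two pointers, emitting a gap marker whenever one side has no partner within the tolerance. `align_dates`, its `gap` parameter and the variable names are hypothetical, and the sketch assumes both inputs are already sorted in ascending order.

```python
import pandas as pd


def align_dates(left, right, tolerance=pd.Timedelta("2s"), gap=0):
    """Pair up two sorted timestamp lists; unmatched positions get `gap`."""
    out_left, out_right = [], []
    i = j = 0
    while i < len(left) and j < len(right):
        diff = left[i] - right[j]
        if abs(diff) <= tolerance:
            # Close enough: keep both on the same row.
            out_left.append(left[i])
            out_right.append(right[j])
            i += 1
            j += 1
        elif diff < -tolerance:
            # left is earlier, so right is missing this timestamp.
            out_left.append(left[i])
            out_right.append(gap)
            i += 1
        else:
            # right is earlier, so left is missing this timestamp.
            out_left.append(gap)
            out_right.append(right[j])
            j += 1
    # Whatever remains on one side has no partner on the other.
    for stamp in left[i:]:
        out_left.append(stamp)
        out_right.append(gap)
    for stamp in right[j:]:
        out_left.append(gap)
        out_right.append(stamp)
    return out_left, out_right


fmt = "%H.%M.%S"
left = pd.to_datetime(
    ["13.22.57", "13.23.12", "13.23.44", "13.24.01", "13.24.19", "13.24.35",
     "13.25.07", "13.25.23", "13.26.00", "13.26.13", "13.26.54"], format=fmt).tolist()
right = pd.to_datetime(
    ["13.22.57", "13.23.13", "13.23.44", "13.24.02", "13.24.19", "13.24.35",
     "13.25.23", "13.26.13", "13.26.38"], format=fmt).tolist()

aligned_left, aligned_right = align_dates(left, right)
result = pd.DataFrame({"Date": aligned_left, "Date2": aligned_right})
print(result)
```

Because it never compares a gap marker with a timestamp, this version does not depend on which frame is longer.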
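IIUC, you can perform a double merge.
First, a merge_asof with direction='nearest' and a tolerance of 2s, to align the values of the second DataFrame with those of the first.
Then a classic outer merge, to bring back the rows of the second DataFrame that found no match.
Finally, use bfill to build a single reference date column and sort by it.
NB. merge_asof requires both DataFrames to be sorted by date beforehand.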
df2 = (pd.merge_asof(df, df_2.add_suffix('_2'),
                     left_on='Date', right_on='Date_2',
                     direction='nearest', tolerance=pd.Timedelta('2s'))
         .merge(df_2[['Date']].add_suffix('_2'), on='Date_2', how='outer')
      )
df2 = df2.loc[df2[['Date', 'Date_2']].bfill(axis=1)['Date'].sort_values().index]
Output:
                  Date  col2              Date_2  col2_2
0  1900-01-01 13:22:57   1.0 1900-01-01 13:22:57     1.0
1  1900-01-01 13:23:12   2.0 1900-01-01 13:23:12     2.0
2  1900-01-01 13:23:44   3.0 1900-01-01 13:23:44     3.0
3  1900-01-01 13:24:01   4.0 1900-01-01 13:24:01     4.0
4  1900-01-01 13:24:19   5.0 1900-01-01 13:24:19     5.0
5  1900-01-01 13:24:35   6.0 1900-01-01 13:24:35     6.0
6  1900-01-01 13:25:07   7.0                 NaT     NaN
9  1900-01-01 13:25:23   8.0 1900-01-01 13:25:23     7.0
7  1900-01-01 13:26:00   9.0                 NaT     NaN
10 1900-01-01 13:26:13  10.0 1900-01-01 13:26:13     8.0
11                 NaT   NaN 1900-01-01 13:26:38     NaN
8  1900-01-01 13:26:54  11.0                 NaT     NaN
Input used:
d = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.07', '13.25.23','13.26.00', '13.26.13','13.26.54'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], format='%H.%M.%S')
d2 = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.23', '13.26.13','13.26.38'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df_2 = pd.DataFrame(data=d2)
df_2['Date'] = pd.to_datetime(df_2['Date'], format='%H.%M.%S')
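The merged frame marks gaps with NaT, whereas the question asks for 0 and for two columns of equal length. A possible follow-up, sketched below on the assumption that an object-dtype column mixing timestamps and 0 is acceptable, repeats the double merge and then swaps the gaps for 0. The 1-second offsets in `d2` (13.23.13 and 13.24.02) are borrowed from the question to exercise the tolerance, and the names `aligned`, `date_1` and `date_2` are illustrative only. The sort key here is built with `fillna` rather than `bfill(axis=1)`, which picks the same first available timestamp per row.

```python
import pandas as pd

d = {'Date': ['13.22.57', '13.23.12', '13.23.44', '13.24.01', '13.24.19', '13.24.35',
              '13.25.07', '13.25.23', '13.26.00', '13.26.13', '13.26.54'],
     'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
d2 = {'Date': ['13.22.57', '13.23.13', '13.23.44', '13.24.02', '13.24.19', '13.24.35',
               '13.25.23', '13.26.13', '13.26.38'],
      'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], format='%H.%M.%S')
df_2 = pd.DataFrame(data=d2)
df_2['Date'] = pd.to_datetime(df_2['Date'], format='%H.%M.%S')

# Step 1: nearest match within 2s. Step 2: outer merge to recover unmatched df_2 rows.
aligned = (pd.merge_asof(df, df_2.add_suffix('_2'),
                         left_on='Date', right_on='Date_2',
                         direction='nearest', tolerance=pd.Timedelta('2s'))
             .merge(df_2[['Date']].add_suffix('_2'), on='Date_2', how='outer'))

# Sort rows by the first timestamp available on each row, then renumber from 0.
order = aligned['Date'].fillna(aligned['Date_2']).sort_values().index
aligned = aligned.loc[order].reset_index(drop=True)

# Replace the gaps with 0; casting to object lets one column hold both kinds of value.
date_1 = aligned['Date'].astype(object).where(aligned['Date'].notna(), 0)
date_2 = aligned['Date_2'].astype(object).where(aligned['Date_2'].notna(), 0)
print(pd.DataFrame({'Date': date_1, 'Date_2': date_2}))
```

Keeping NaT and converting only for display is usually easier to work with, since a column containing 0 is no longer a datetime column.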