How can I insert a blank row in pandas and properly increment the index multiple times?

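I have 2 pandas dataframes with the same columns but a different number of rows, depending on which rows are missing. One of the columns is Date, in the format 29/09/2020 13.22.57; for simplicity, and because it is irrelevant here, the day, month and year are sometimes omitted below. A date in df either matches the one in df_2 exactly, or differs by an acceptable delay within a threshold that we preset, 2s in this case.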
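To make the matching rule concrete, here is a minimal sketch of the 2s check. The names `TOLERANCE` and `is_match` are hypothetical and only serve as illustration; the sketch assumes both values are already parsed as pandas timestamps.

```python
import pandas as pd

# Assumed threshold from the description above: 2 seconds.
TOLERANCE = pd.Timedelta(seconds=2)


def is_match(a, b, tolerance=TOLERANCE):
    """Return True when two timestamps differ by at most `tolerance`."""
    return abs(a - b) <= tolerance


close = is_match(pd.Timestamp("2020-09-29 13:23:12"),
                 pd.Timestamp("2020-09-29 13:23:13"))
far = is_match(pd.Timestamp("2020-09-29 13:25:07"),
               pd.Timestamp("2020-09-29 13:25:23"))
```

Here `close` covers a 1s delay, which is inside the threshold, while `far` covers a 16s gap, which is treated as a missing row.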

Sample data for df['Date']:

13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
13.26.54

Sample data for df_2['Date']:

13.24.19    
13.24.35                        
13.25.23                        
13.26.13    
13.26.38

Expected

df['Date']:

13.22.57    
13.23.13    
13.23.44    
13.24.02    
13.24.19    
13.24.35
0                       
13.25.23
0                       
13.26.13    
13.26.38



df_2['Date']:


13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
0
13.26.54

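The shift can happen in either df or df_2, depending on which one is missing the row. In the end both should have the same number of rows: wherever a row has no match there will now be a 0 value, and the rows below it are moved down by one.

Dataframes: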

import pandas as pd

d = {'Date': ['13.24.19', '13.24.35','13.25.07', '13.25.23','13.26.00', '13.26.13','13.26.54'], 'col2': [1, 2, 3, 4, 5, 6, 7]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], format='%H.%M.%S')

d2 = {'Date': ['13.24.19', '13.24.35','13.25.23', '13.26.13','13.26.38'], 'col2': [1, 2, 3, 4, 5]}
df_2 = pd.DataFrame(data=d2)
df_2['Date'] = pd.to_datetime(df_2['Date'], format='%H.%M.%S')

This should work for each dataframe:

df.loc[df.shape[0]] = [None for _ in range(len(df.columns))]
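The line above appends a blank row at the end of the dataframe. If the blank row has to go at an arbitrary position, with the rows below it moving down by one, something like the following sketch could be used. `insert_blank_row` is a hypothetical helper, not a pandas function, and it assumes the dataframe can be renumbered with a default 0..n-1 index.

```python
import pandas as pd


def insert_blank_row(frame, position):
    """Return a copy of `frame` with an all-missing row at `position`."""
    out = frame.reset_index(drop=True)
    n = len(out)
    # Shift the labels at or after `position` down by one, leaving a gap.
    out.index = [i if i < position else i + 1 for i in range(n)]
    # Reindexing over the full range fills the gap with NaN / NaT.
    return out.reindex(range(n + 1))


df = pd.DataFrame({
    "Date": pd.to_datetime(["13.24.19", "13.24.35", "13.25.23"],
                           format="%H.%M.%S"),
    "col2": [1, 2, 3],
})
padded = insert_blank_row(df, 2)
```

Note that the blank row holds NaT/NaN rather than 0, so the Date column keeps its datetime dtype; integer columns are upcast to float to hold the missing value.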

This worked for me. Note: it assumes len(df) > len(df_2).

import datetime

import pandas as pd

a={"Date": [
"29-09-2020 13:22:57",
"29-09-2020 13:23:12",
"29-09-2020 13:23:44",
"29-09-2020 13:24:01",
"29-09-2020 13:24:19",
"29-09-2020 13:24:35",
"29-09-2020 13:25:07",
"29-09-2020 13:25:23",
"29-09-2020 13:26:00",
"29-09-2020 13:26:13",
"29-09-2020 13:26:54",
]}
b={"Date":[
    "29-09-2020 13:22:57",    
    "29-09-2020 13:23:13",    
    "29-09-2020 13:23:44",    
    "29-09-2020 13:24:02",    
    "29-09-2020 13:24:19",    
    "29-09-2020 13:24:35",                        
    "29-09-2020 13:25:23",                        
    "29-09-2020 13:26:13",    
    "29-09-2020 13:26:38",
]
}
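df=pd.DataFrame(a)
# The strings are day-first, so pass the format explicitly instead of relying on inference.
df["Date"]=pd.to_datetime(df["Date"], format="%d-%m-%Y %H:%M:%S")
df_2=pd.DataFrame(b)
df_2["Date"]=pd.to_datetime(df_2["Date"], format="%d-%m-%Y %H:%M:%S")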


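def add_zero(dataframe,index,increment):
    # Write a row of zeros at a fractional label (e.g. 5.5 or 4.5) so that
    # sorting the index places it between two existing rows.
    dataframe.loc[index+increment]=0
    # Sort and renumber so the labels are consecutive integers again.
    dataframe = dataframe.sort_index().reset_index(drop=True)
    return dataframe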

flag=True
idx=0
while flag:
    if idx >= len(df_2["Date"]):
        df_2=add_zero(df_2,idx,0.5)
        break
    if idx >= len(df["Date"]):
        df=add_zero(df,idx,0.5)
        break
    print(idx)
    print(df['Date'][idx])
    print(df_2['Date'][idx])
    diff=datetime.timedelta.total_seconds(df['Date'][idx] - df_2['Date'][idx])
    print(f"Diff: {diff}")
    if diff > 2:
        df=add_zero(df,idx,-0.5)
        print("greater")
    elif diff < -2:
        df_2=add_zero(df_2,idx,-0.5)
        print("smaller")
    else:
        print("Acceptable")

    idx=idx+1

    if idx>=max(len(df_2),len(df)):
        flag=False

Output

    Date                Date2
0   2020-09-29 13:22:57 2020-09-29 13:22:57
1   2020-09-29 13:23:12 2020-09-29 13:23:13
2   2020-09-29 13:23:44 2020-09-29 13:23:44
3   2020-09-29 13:24:01 2020-09-29 13:24:02
4   2020-09-29 13:24:19 2020-09-29 13:24:19
5   2020-09-29 13:24:35 2020-09-29 13:24:35
6   2020-09-29 13:25:07 0
7   2020-09-29 13:25:23 2020-09-29 13:25:23
8   2020-09-29 13:26:00 0
9   2020-09-29 13:26:13 2020-09-29 13:26:13
10  0                   2020-09-29 13:26:38
11  2020-09-29 13:26:54 0
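The loop itself does not build this two-column table. One plausible way to get such a side-by-side view, shown here as an assumption rather than as the step actually used, is to concatenate the two padded Date columns and rename the second one. The small frames below are made-up stand-ins for df and df_2 after padding.

```python
import pandas as pd

# Stand-ins for the padded frames: 0 marks a row that has no match.
left = pd.DataFrame({"Date": [pd.Timestamp("2020-09-29 13:25:07"),
                              pd.Timestamp("2020-09-29 13:25:23"),
                              0]})
right = pd.DataFrame({"Date": [0,
                               pd.Timestamp("2020-09-29 13:25:23"),
                               pd.Timestamp("2020-09-29 13:26:38")]})

# Both frames have the same length, so the rows line up by position.
side_by_side = pd.concat([left["Date"], right["Date"].rename("Date2")], axis=1)
```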

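IIUC, you can perform a double merge.

First, a merge_asof with direction='nearest' and a tolerance of 2s, to align the values of the second dataframe with those of the first.

Then a classical outer merge, to add back the values of the second dataframe that found no match.

Finally, sort by date, using bfill to get a single column to use as the reference.

NB. merge_asof requires both dataframes to be sorted by date beforehand.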

df2 = (pd.merge_asof(df, df_2.add_suffix('_2'),
              left_on='Date', right_on='Date_2',
              direction='nearest', tolerance=pd.Timedelta('2s'))
   .merge(df_2[['Date']].add_suffix('_2'), on='Date_2', how='outer')
)
df2 = df2.loc[df2[['Date', 'Date_2']].bfill(axis=1)['Date'].sort_values().index]

Output:

                  Date  col2              Date_2  col2_2
0  1900-01-01 13:22:57   1.0 1900-01-01 13:22:57     1.0
1  1900-01-01 13:23:12   2.0 1900-01-01 13:23:12     2.0
2  1900-01-01 13:23:44   3.0 1900-01-01 13:23:44     3.0
3  1900-01-01 13:24:01   4.0 1900-01-01 13:24:01     4.0
4  1900-01-01 13:24:19   5.0 1900-01-01 13:24:19     5.0
5  1900-01-01 13:24:35   6.0 1900-01-01 13:24:35     6.0
6  1900-01-01 13:25:07   7.0                 NaT     NaN
9  1900-01-01 13:25:23   8.0 1900-01-01 13:25:23     7.0
7  1900-01-01 13:26:00   9.0                 NaT     NaN
10 1900-01-01 13:26:13  10.0 1900-01-01 13:26:13     8.0
11                 NaT   NaN 1900-01-01 13:26:38     NaN
8  1900-01-01 13:26:54  11.0                 NaT     NaN

Input used:

d = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.07', '13.25.23','13.26.00', '13.26.13','13.26.54'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], format='%H.%M.%S')

d2 = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.23', '13.26.13','13.26.38'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df_2 = pd.DataFrame(data=d2)
df_2['Date'] = pd.to_datetime(df_2['Date'], format='%H.%M.%S')
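
The double merge can also be wrapped in a small helper that sorts both inputs first, since merge_asof requires sorted keys. This is only a sketch: `align_frames` and its `tolerance` parameter are hypothetical names, the sort key uses fillna in place of the bfill step, and it assumes both frames have a datetime column called Date. The sample values are made up for illustration.

```python
import pandas as pd


def align_frames(left, right, tolerance="2s"):
    """Align `right` to `left` on Date within `tolerance`, keeping unmatched rows."""
    # merge_asof needs both inputs sorted by their key.
    left = left.sort_values("Date").reset_index(drop=True)
    right_2 = right.sort_values("Date").reset_index(drop=True).add_suffix("_2")
    merged = pd.merge_asof(left, right_2,
                           left_on="Date", right_on="Date_2",
                           direction="nearest",
                           tolerance=pd.Timedelta(tolerance))
    # Add back the rows of `right` that found no partner in `left`.
    merged = merged.merge(right_2[["Date_2"]], on="Date_2", how="outer")
    # Order by whichever date is present in each row.
    key = merged["Date"].fillna(merged["Date_2"])
    order = key.sort_values(kind="stable").index
    return merged.loc[order].reset_index(drop=True)


left = pd.DataFrame({
    "Date": pd.to_datetime(["13.24.35", "13.25.07", "13.25.23", "13.26.54"],
                           format="%H.%M.%S"),
    "col2": [1, 2, 3, 4],
})
right = pd.DataFrame({
    "Date": pd.to_datetime(["13.24.36", "13.25.23", "13.26.38"],
                           format="%H.%M.%S"),
    "col2": [1, 2, 3],
})
aligned = align_frames(left, right)
```

As in the output above, a row that exists only in the second dataframe keeps its Date_2 but has no col2_2 value, because only the Date_2 column takes part in the outer merge.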