Pandas DataFrame merge line by line

Question

I have to merge two DataFrames into one, row by row (one row from the first df, then one row from the second df). I also need the fastest possible run time.

Input:

df1:
      timestamp   radar_name
0        101         Front
1        102         Front
2        103         Front
3        104         Front

df2:
      timestamp    radar_name
0        101          Rear
1        102          Rear
2        103          Rear
3        104          Rear

Output:

merged_df:

      timestamp   radar_name
0        101         Front
1        101         Rear
2        102         Front
3        102         Rear
4        103         Front
5        103         Rear
6        104         Front
7        104         Rear

So far I have implemented two approaches:

1. Iterating over the files with a for loop - run time is about 1 minute 50 seconds

for row_cnt in range(len(first_half)):
    merged_file.loc[merged_file_index] = first_half.loc[row_cnt]
    merged_file_index += 1
    merged_file.loc[merged_file_index] = second_half.loc[row_cnt]
    merged_file_index += 1

2. Concatenating df1 and df2, then sorting by timestamp - run time is about 1 minute

frames=[df1,df2]
merged_file_2=pd.concat(frames)
merged_file_2.sort_values(by=['timestamp'],inplace=True)
merged_file_2.reset_index(inplace=True)
merged_file_2.drop(columns=['index'],inplace=True)

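The time for a single file is manageable, but I have 100 such files to merge and the job runs many times, so the time adds up.

Is there any other way to speed up the merge?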

Answer 1

You can sort by index using a stable sort:

df3 = (pd.concat([df1, df2])
         .sort_index(kind='stable')
         .reset_index(drop=True)
      )

Output:

   timestamp radar_name
0        101      Front
1        101       Rear
2        102      Front
3        102       Rear
4        103      Front
5        103       Rear
6        104      Front
7        104       Rear
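
The index sort above pairs rows by their index label, so it relies on both frames carrying the default 0..n-1 index. The following is a minimal sketch of what one might do when that assumption does not hold (for example after filtering); the custom index values are made up purely for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"timestamp": [101, 102, 103, 104], "radar_name": "Front"})
df2 = pd.DataFrame({"timestamp": [101, 102, 103, 104], "radar_name": "Rear"})

# Simulate frames whose index is not 0..n-1 (illustrative values only).
df1.index = [10, 11, 12, 13]
df2.index = [20, 21, 22, 23]

# Reset each index so that row i of df1 and row i of df2 share label i,
# then let the stable sort keep df1's row ahead of df2's row in each pair.
frames = [d.reset_index(drop=True) for d in (df1, df2)]
df3 = (pd.concat(frames)
         .sort_index(kind="stable")
         .reset_index(drop=True))

print(df3)
```

Without the per-frame reset, the sort would simply place all of df1 before all of df2 here, because every label in df1 is smaller than every label in df2.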

Or slice with a pre-computed index:

import numpy as np

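# kind='stable' guarantees that, for equal positions, df1's row stays ahead of df2's
idx = np.argsort(np.r_[np.arange(df1.shape[0]), np.arange(df2.shape[0])], kind='stable')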
# array([0, 4, 1, 5, 2, 6, 3, 7])

df3 = pd.concat([df1, df2]).iloc[idx]

Variant for an arbitrary number of DataFrames:

dfs = [df1, df2]

idx = np.argsort(np.concatenate([np.arange(d.shape[0]) for d in dfs]), kind='stable')

df3 = pd.concat(dfs, ignore_index=True).iloc[idx]

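Output of the two-DataFrame version (the original index labels are kept; chain .reset_index(drop=True) if you need a fresh 0..7 index):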

   timestamp radar_name
0        101      Front
0        101       Rear
1        102      Front
1        102       Rear
2        103      Front
2        103       Rear
3        104      Front
3        104       Rear
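
As a rough sketch, the index-slicing idea can be wrapped in a small helper. The name `interleave_frames` and the third "Side" frame are hypothetical and only serve the illustration; the helper also resets the index at the end, which the snippets above do not do:

```python
import numpy as np
import pandas as pd

def interleave_frames(dfs):
    """Interleave the rows of several DataFrames by position (hypothetical helper)."""
    # Position of each row within its own frame: 0..n1-1, 0..n2-1, ...
    positions = np.concatenate([np.arange(len(d)) for d in dfs])
    # A stable argsort groups equal positions together while keeping
    # the frames in the order in which they were passed in.
    idx = np.argsort(positions, kind="stable")
    return pd.concat(dfs, ignore_index=True).iloc[idx].reset_index(drop=True)

df1 = pd.DataFrame({"timestamp": [101, 102, 103], "radar_name": "Front"})
df2 = pd.DataFrame({"timestamp": [101, 102, 103], "radar_name": "Rear"})
df3 = pd.DataFrame({"timestamp": [101, 102], "radar_name": "Side"})  # made-up third frame

merged = interleave_frames([df1, df2, df3])
print(merged)
```

Frames of unequal length are tolerated: once a shorter frame runs out of rows, the remaining positions are filled from the longer frames only.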

Answer 2

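If you simplify your existing code a little, you should already see a big improvement. Also, if you pass ignore_index to the sort, you do not need to drop the index afterwards, which speeds things up.

You can also experiment with the kind parameter of sort_values. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

I would do it like this: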

merged_file_3=pd.concat([df1,df2]).sort_values(by=['timestamp'], kind="stable",ignore_index=True)
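
Since run time is the main concern, it may be worth measuring the candidates on data shaped like your own files. The sketch below uses synthetic frames; the row count `n` is an assumption, and the function names are just labels for the three approaches discussed in this thread. No timings are claimed here, as they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # assumed number of rows per frame; adjust to match the real files
ts = np.arange(n)
df1 = pd.DataFrame({"timestamp": ts, "radar_name": "Front"})
df2 = pd.DataFrame({"timestamp": ts, "radar_name": "Rear"})

def by_index():
    # Stable sort on the shared 0..n-1 index labels.
    return pd.concat([df1, df2]).sort_index(kind="stable").reset_index(drop=True)

def by_timestamp():
    # Stable sort on the timestamp column; relies on matching, sorted timestamps.
    return pd.concat([df1, df2]).sort_values(by=["timestamp"], kind="stable", ignore_index=True)

def by_argsort():
    # Pre-computed positional index, then a single iloc lookup.
    idx = np.argsort(np.r_[np.arange(len(df1)), np.arange(len(df2))], kind="stable")
    return pd.concat([df1, df2], ignore_index=True).iloc[idx].reset_index(drop=True)

for fn in (by_index, by_timestamp, by_argsort):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per call")
```

All three functions should return the same interleaved frame for this input, so the comparison is purely about speed.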
