Python: Filling a dataframe sequentially based on matching ids (inefficient code)

Question

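The tables and code at the bottom may be more helpful than the description.

I have a working solution, but I think it is very inefficient (see the bottom).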

Problem:

I have two dataframes, df_1 and df_2. They share a matching column, match_id.

df_2 has a date column that I am trying to write into df_1.

Every match_id in df_2 exists in df_1.

I need to write the dates into df_1 sequentially, based on match_id.

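Importantly, I want each match_id row in df_2 to be used only once during matching.

If df_2 does not have enough rows for a given match_id to fill all the matching rows in df_1, leave the remaining rows in df_1 blank.

It is much easier if I show it: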

df_1:

index	match_id	date
0	45
1	45
2	45
3	45
4	46
5	46
6	47

df_2:

index	match_id	date
0	45	01/01/22
1	45	02/01/22
2	46	02/01/22
3	46	05/01/22

Output (updated df_1):

index	match_id	date
0	45	01/01/22
1	45	02/01/22
2	45
3	45
4	46	02/01/22
5	46	05/01/22
6	47

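I have a working solution, but I am sure there must be a much more time/resource-efficient way to do this in practice (I am still new to Python and fairly new to coding) so that it runs on larger datasets: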

import pandas as pd

# Sample data: df_1 has an empty date column to fill, df_2 holds the dates
data_1 = [[45, ""], [45, ""], [45, ""], [45, ""], [46, ""], [46, ""], [47, ""]]
df_1 = pd.DataFrame(data_1, columns=['match_id', 'date'])

# Dates match the df_2 table shown above
data_2 = [[45, "01/01/22"], [45, "02/01/22"], [46, "02/01/22"], [46, "05/01/22"]]
df_2 = pd.DataFrame(data_2, columns=['match_id', 'date'])

# For each row in df_1, scan df_2 for the first unused row with the same match_id
for i_df_1, r_df_1 in df_1.iterrows():
    for i_df_2, r_df_2 in df_2.iterrows():
        if r_df_1["match_id"] == r_df_2["match_id"]:
            # Add data into the payment transaction dataframe
            df_1.loc[i_df_1, "date"] = r_df_2["date"]

            # Drop the used row from df_2 so it does not get used again
            df_2 = df_2.drop(i_df_2)

            break

Answer 1

You can use groupby.cumcount to compute an additional key and use it in a merge:

df_3 = (df_1
 #.drop(columns='date') # uncomment if df_1 already has an empty date column
 .merge(df_2,
        left_on=['match_id', df_1.groupby('match_id').cumcount()],
        right_on=['match_id', df_2.groupby('match_id').cumcount()],
        how='left'
       )
 #.drop(columns='key_1') # uncomment if unwanted
)

Output:

   match_id  key_1      date
0        45      0  01/01/22
1        45      1  02/01/22
2        45      2       NaN
3        45      3       NaN
4        46      0  02/01/22
5        46      1  05/01/22
6        47      0       NaN
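For reference, here is a minimal end-to-end sketch of the same cumcount-and-merge idea, run on the question's sample data where df_1 already has an empty date column. The helper column name `occurrence` and the variable names `left`, `right`, and `result` are assumptions made for this illustration. They do not come from the question or the answer above.

```python
import pandas as pd

# Sample data from the question; df_1 starts with an empty date column.
df_1 = pd.DataFrame({"match_id": [45, 45, 45, 45, 46, 46, 47], "date": ""})
df_2 = pd.DataFrame({
    "match_id": [45, 45, 46, 46],
    "date": ["01/01/22", "02/01/22", "02/01/22", "05/01/22"],
})

# Number each row within its match_id group: 0, 1, 2, ... per id.
# "occurrence" is a hypothetical helper column name chosen for this sketch.
left = df_1.drop(columns="date").assign(
    occurrence=df_1.groupby("match_id").cumcount()
)
right = df_2.assign(occurrence=df_2.groupby("match_id").cumcount())

# A left merge on (match_id, occurrence) pairs the n-th row of each id in
# df_1 with the n-th row of that id in df_2, so each df_2 row is used at
# most once. validate="one_to_one" raises if the key pairs are not unique.
result = (
    left.merge(right, on=["match_id", "occurrence"], how="left",
               validate="one_to_one")
        .drop(columns="occurrence")
)

print(result)
```

Rows of df_1 with no partner in df_2 (indexes 2, 3 and 6 here) come back with NaN in the date column. If empty strings are preferred, `result["date"].fillna("")` should convert them. Naming the helper column explicitly avoids the auto-generated key_1 column, at the cost of one extra `assign` per frame.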

