Merging two DataFrames using a range of columns (right on ID and left on multiple IDs)

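I want to create one dataset from two DataFrames by joining on id. The problem is that in the second DataFrame the id is not in a single column: its value can sit in any of several columns.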

merged=pd.merge(df1, df2, left_on=['id','month','year'], right_on=['id_name','id_surname','id_first_name', 'month','year'], how="left")

All id variables are alphanumeric.

But I get this error message:

ValueError: len(right_on) must equal len(left_on)

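Ideally, I would like to test whether id appears in any of the other three id columns and merge on that column accordingly. Perhaps something like Excel's VLOOKUP function, which looks up a key value within the range of a table array. Any ideas?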
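For context on the error itself: pd.merge pairs the entries of left_on and right_on by position, so the two lists need the same length. The following is a minimal sketch that reproduces the message; the tiny frames left and right are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical miniature frames, only to reproduce the error
left = pd.DataFrame({"id": ["a1"], "month": ["Jan"]})
right = pd.DataFrame({"id_name": ["a1"], "id_surname": ["zz"], "month": ["Jan"]})

error_message = None
try:
    # Two keys on the left, three on the right: the keys cannot be paired up
    pd.merge(
        left,
        right,
        left_on=["id", "month"],
        right_on=["id_name", "id_surname", "month"],
        how="left",
    )
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

So the several id columns have to be reduced to one key column (or reshaped into one) before merging.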

Suppose we have the following two DataFrames:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_A": ["test", "test_", "test__"]
    }
)


df2 = pd.DataFrame(
    {
        "id_name": [1, np.nan, np.nan],
        "id_surname": [np.nan, 2, np.nan],
        "id_first_name": [np.nan, np.nan, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_B": ["check", "check_", "check__"]
    }
)

The second DataFrame will be:

   id_name  id_surname  id_first_name month  year column_B
0      1.0         NaN            NaN   Jan  2022   check
1      NaN         2.0            NaN   Mar  2020   check_
2      NaN         NaN            3.0   Apr  2021   check__

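You can create a new column id for the second DataFrame by keeping the non-NaN values from the three columns id_name, id_surname, and id_first_name. Start from the id_name column, fill its NaNs with the non-NaN values of id_surname, then fill any remaining NaNs with the non-NaN values of id_first_name. The code to do this is: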

df2["id"] = df2["id_name"].fillna(df2["id_surname"]).fillna(df2["id_first_name"])

This creates the column id for df2:

   id_name  id_surname  id_first_name month  year column_B   id
0      1.0         NaN            NaN   Jan  2022   check    1.0
1      NaN         2.0            NaN   Mar  2020   check_   2.0
2      NaN         NaN            3.0   Apr  2021   check__  3.0
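If there are more than a few id columns, chaining fillna gets long. One possible alternative, sketched here with made-up alphanumeric ids (the values "a1", "b2", "c3" and the list name id_cols are assumptions for illustration), is to back-fill along each row and take the first column, which then holds the first non-NaN id of that row:

```python
import numpy as np
import pandas as pd

# Illustrative data with alphanumeric ids; not the asker's real values
df2 = pd.DataFrame(
    {
        "id_name": ["a1", np.nan, np.nan],
        "id_surname": [np.nan, "b2", np.nan],
        "id_first_name": [np.nan, np.nan, "c3"],
    }
)

id_cols = ["id_name", "id_surname", "id_first_name"]

# Back-fill across columns (axis=1), then keep the first column:
# each row ends up with its first non-NaN value among id_cols
df2["id"] = df2[id_cols].bfill(axis=1).iloc[:, 0]

print(df2["id"].tolist())
```

Like the fillna chain, this keeps only one id per row, with priority given by the order of id_cols.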

Finally, you can merge with:

merged = pd.merge(
    df1,
    df2,
    on=["id", "month", "year"],
    how="left",
)

The result will be:

   id month  year column_A  id_name  id_surname  id_first_name column_B
0   1   Jan  2022     test      1.0         NaN            NaN   check
1   2   Mar  2020    test_      NaN         2.0            NaN   check_
2   3   Apr  2021   test__      NaN         NaN            3.0   check__
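The fillna approach assumes each row of df2 carries a single usable id. If a row may hold different ids in several columns and a match on any of them should count, one option is to reshape df2 to long format with melt before merging. This is only a sketch: the sample values (including the extra id "x9") and the column name id_source are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "id": ["a1", "b2", "c3"],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_A": ["test", "test_", "test__"],
    }
)

# The last row has two ids; a fillna chain would keep "x9" and miss "c3"
df2 = pd.DataFrame(
    {
        "id_name": ["a1", np.nan, "x9"],
        "id_surname": [np.nan, "b2", np.nan],
        "id_first_name": [np.nan, np.nan, "c3"],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_B": ["check", "check_", "check__"],
    }
)

id_cols = ["id_name", "id_surname", "id_first_name"]

# One row per (original row, id column) pair; drop the empty slots
long_df2 = df2.melt(
    id_vars=["month", "year", "column_B"],
    value_vars=id_cols,
    var_name="id_source",  # records which column the id came from
    value_name="id",
).dropna(subset=["id"])

merged = df1.merge(long_df2, on=["id", "month", "year"], how="left")
print(merged)
```

If the same id can appear in two columns of one row, the melted frame would contain duplicate keys, so a drop_duplicates on ["id", "month", "year"] before merging may be needed.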