Merging two DataFrames using a range of columns (right on ID and left on multiple IDs)
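I want to build one dataset from two DataFrames using id. The problem is that in the second DataFrame, id is not held in a single column: its values can sit in any of several columns.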
merged = pd.merge(df1, df2, left_on=['id', 'month', 'year'], right_on=['id_name', 'id_surname', 'id_first_name', 'month', 'year'], how="left")
All id variables are alphanumeric.
But I get this error message:
ValueError: len(right_on) must equal len(left_on)
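Ideally, I would like to test whether id appears in one of the three other id columns and merge on that column accordingly. Perhaps something like the vlookup() function from Excel, which looks up a key value within the range of a table array. Any ideas?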
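The error follows from how pd.merge pairs its keys: the first entry of left_on is matched against the first entry of right_on, the second against the second, and so on, so both lists need the same length. Below is a minimal sketch that reproduces the mismatch. The frames left and right and their values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical one-row frames with alphanumeric ids, for illustration only.
left = pd.DataFrame({"id": ["a1"], "month": ["Jan"], "year": ["2022"]})
right = pd.DataFrame(
    {
        "id_name": ["a1"],
        "id_surname": [None],
        "id_first_name": [None],
        "month": ["Jan"],
        "year": ["2022"],
    }
)

message = None
try:
    # Three keys on the left, five on the right: the lists cannot be paired up.
    pd.merge(
        left,
        right,
        left_on=["id", "month", "year"],
        right_on=["id_name", "id_surname", "id_first_name", "month", "year"],
        how="left",
    )
except ValueError as exc:
    message = str(exc)

print(message)
```

So the three id columns cannot simply be listed next to a single id key; they have to be reduced to one key column first.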
Suppose we have the following two DataFrames:
import pandas as pd
import numpy as np

# Left frame: one id column plus the month/year keys.
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_A": ["test", "test_", "test__"],
    }
)

# Right frame: the id is spread over three columns.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df2 = pd.DataFrame(
    {
        "id_name": [1, np.nan, np.nan],
        "id_surname": [np.nan, 2, np.nan],
        "id_first_name": [np.nan, np.nan, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_B": ["check", "check_", "check__"],
    }
)
The second DataFrame is:
   id_name  id_surname  id_first_name month  year column_B
0      1.0         NaN            NaN   Jan  2022    check
1      NaN         2.0            NaN   Mar  2020   check_
2      NaN         NaN            3.0   Apr  2021  check__
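You can create a new column id in the second DataFrame by keeping every non-NaN value from the three columns id_name, id_surname, and id_first_name. Start with id_name, fill its NaN values with the non-NaN values of id_surname, then fill any remaining NaN values with the non-NaN values of id_first_name. The code for this is: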
df2["id"] = df2["id_name"].fillna(df2["id_surname"]).fillna(df2["id_first_name"])
This creates the column id in df2:
   id_name  id_surname  id_first_name month  year column_B   id
0      1.0         NaN            NaN   Jan  2022    check  1.0
1      NaN         2.0            NaN   Mar  2020   check_  2.0
2      NaN         NaN            3.0   Apr  2021  check__  3.0
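If there are more than three candidate columns, chaining fillna calls gets long. One possible alternative, sketched here as an assumption rather than as part of the original answer, is to back-fill across the columns and keep the first one, which gives the first non-NaN value in each row. The names id_columns, chained, and coalesced are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Only the three id columns from the example above.
df2 = pd.DataFrame(
    {
        "id_name": [1, np.nan, np.nan],
        "id_surname": [np.nan, 2, np.nan],
        "id_first_name": [np.nan, np.nan, 3],
    }
)

# Hypothetical helper list: candidate columns in priority order.
id_columns = ["id_name", "id_surname", "id_first_name"]

# The chained fillna from the answer, kept for comparison.
chained = df2["id_name"].fillna(df2["id_surname"]).fillna(df2["id_first_name"])

# bfill(axis=1) pulls values leftwards along each row, so the first column
# ends up holding the first non-NaN value of that row.
coalesced = df2[id_columns].bfill(axis=1).iloc[:, 0]

print(coalesced.tolist())
```

Both approaches give the same values on this data. Column order decides which value wins when a row has more than one non-NaN id.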
Finally, you can merge with:
merged = pd.merge(
    df1,
    df2,
    left_on=["id", "month", "year"],
    right_on=["id", "month", "year"],
    how="left",
)
The result will be:
   id month  year column_A  id_name  id_surname  id_first_name column_B
0   1   Jan  2022     test      1.0         NaN            NaN    check
1   2   Mar  2020    test_      NaN         2.0            NaN   check_
2   3   Apr  2021   test__      NaN         NaN            3.0  check__
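The fillna approach keeps only one id per row. If a row of df2 can hold different ids in several columns, a lookup closer to vlookup() might reshape df2 to long form so that every non-NaN id becomes its own candidate key. The following is a sketch under that assumption: the alphanumeric values, the extra id "x7", and the names id_columns, long_df2, and id_source are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "id": ["a1", "b2", "c3"],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_A": ["test", "test_", "test__"],
    }
)

# Hypothetical data: the second row carries two different ids ("x7" and "b2").
df2 = pd.DataFrame(
    {
        "id_name": ["a1", "x7", np.nan],
        "id_surname": [np.nan, "b2", np.nan],
        "id_first_name": [np.nan, np.nan, "c3"],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_B": ["check", "check_", "check__"],
    }
)

id_columns = ["id_name", "id_surname", "id_first_name"]

# One row per (original row, id column) pair; rows with no id are dropped.
long_df2 = df2.melt(
    id_vars=["month", "year", "column_B"],
    value_vars=id_columns,
    var_name="id_source",
    value_name="id",
).dropna(subset=["id"])

# Every id found in any of the three columns can now match df1["id"].
merged = df1.merge(long_df2, on=["id", "month", "year"], how="left")

print(merged[["id", "column_B", "id_source"]])
```

Here "b2" still matches even though id_name is filled with "x7" on the same row, and id_source records which column supplied the match. If the same id can appear in two columns of one row, the long frame would contain duplicates, so dropping them before merging may be needed.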