Pandas 合并数据框的条件取决于列中的值
Pandas merge dataframe with conditions depends on value in a column
我们将不胜感激。
我有 2 个 DataFrame。
第一个数据框由activity个人schedule
的时间表组成,如下:
PersonID Person Origin Destination
3-1 1 A B
3-1 1 B A
13-1 1 C D
13-1 1 D C
13-2 2 A B
13-2 2 B A
我还有另一个 DataFrame,household
,包含 person/agent.
的详细信息
PersonID1 Age1 Gender1 PersonID2 Age2 Gender2
3-1 20 M NaN NaN NaN
13-1 45 F 13-2 17 M
我想使用 pd.merge 对这两个执行 VLOOKUP。由于查找(合并)将取决于此人的 ID,因此我尝试了一个条件。
def merging(row):
if row['Person'] == 1:
row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age1', 'Gender1'])
else:
row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age2','Gender2'])
return row
schedule_merged = schedule.apply(merging, axis=1)
但是,由于某种原因,它就是行不通。错误显示 ValueError: len(right_on) must equal len(left_on)
。我的目标是最终做出这样的数据:
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M
我想我弄乱了 pd.merge
行。虽然在 Excel 中使用 VLOOKUP 可能更有效,但它对我的 PC 来说太重了,因为我必须将它应用于十万个数据。我怎么能正确地做到这一点?谢谢!
如果真实数据集不比给定的示例更复杂,我会这样做。否则我会建议查看 pd.melt() 以获得更复杂的旋转。
import pandas as pd
import numpy as np
# Create Dummy schedule DataFrame
d = {'PersonID': ['3-1', '3-1', '13-1', '13-1', '13-2', '13-2'], 'Person': ['1', '1', '1', '1', '2', '2'], 'Origin': ['A', 'B', 'C', 'D', 'A', 'B'], 'Destination': ['B', 'A', 'D', 'C', 'B', 'A']}
schedule = pd.DataFrame(data=d)
schedule
# Create Dummy houshold DataFrame
d = {'PersonID1': ['3-1', '13-1'], 'Age1': ['20', '45'], 'Gender1': ['M', 'F'], 'PersonID2': [np.nan, '13-2'], 'Age2': [np.nan, '17'], 'Gender2': [np.nan, 'M']}
household = pd.DataFrame(data=d)
household
# Select columns for PersonID1 and rename columns
household1 = household[['PersonID1', 'Age1', 'Gender1']]
household1.columns = ['PersonID', 'Age', 'Gender']
# Select columns for PersonID1 and rename columns
household2 = household[['PersonID2', 'Age2', 'Gender2']]
household2.columns = ['PersonID', 'Age', 'Gender']
# Concat them together
household_new = pd.concat([household1, household2])
# Merge houshold and schedule df together on PersonID
schedule = schedule.merge(household_new, how='left', left_on='PersonID', right_on='PersonID', validate='many_to_one')
输出
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M
我们将不胜感激。
我有 2 个 DataFrame。
第一个数据框由activity个人schedule
的时间表组成,如下:
PersonID Person Origin Destination
3-1 1 A B
3-1 1 B A
13-1 1 C D
13-1 1 D C
13-2 2 A B
13-2 2 B A
我还有另一个 DataFrame,household
,包含 person/agent.
PersonID1 Age1 Gender1 PersonID2 Age2 Gender2
3-1 20 M NaN NaN NaN
13-1 45 F 13-2 17 M
我想使用 pd.merge 对这两个执行 VLOOKUP。由于查找(合并)将取决于此人的 ID,因此我尝试了一个条件。
def merging(row):
if row['Person'] == 1:
row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age1', 'Gender1'])
else:
row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age2','Gender2'])
return row
schedule_merged = schedule.apply(merging, axis=1)
但是,由于某种原因,它就是行不通。错误显示 ValueError: len(right_on) must equal len(left_on)
。我的目标是最终做出这样的数据:
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M
我想我弄乱了 pd.merge
行。虽然在 Excel 中使用 VLOOKUP 可能更有效,但它对我的 PC 来说太重了,因为我必须将它应用于十万个数据。我怎么能正确地做到这一点?谢谢!
如果真实数据集不比给定的示例更复杂,我会这样做。否则我会建议查看 pd.melt() 以获得更复杂的旋转。
import pandas as pd
import numpy as np
# Create Dummy schedule DataFrame
d = {'PersonID': ['3-1', '3-1', '13-1', '13-1', '13-2', '13-2'], 'Person': ['1', '1', '1', '1', '2', '2'], 'Origin': ['A', 'B', 'C', 'D', 'A', 'B'], 'Destination': ['B', 'A', 'D', 'C', 'B', 'A']}
schedule = pd.DataFrame(data=d)
schedule
# Create Dummy houshold DataFrame
d = {'PersonID1': ['3-1', '13-1'], 'Age1': ['20', '45'], 'Gender1': ['M', 'F'], 'PersonID2': [np.nan, '13-2'], 'Age2': [np.nan, '17'], 'Gender2': [np.nan, 'M']}
household = pd.DataFrame(data=d)
household
# Select columns for PersonID1 and rename columns
household1 = household[['PersonID1', 'Age1', 'Gender1']]
household1.columns = ['PersonID', 'Age', 'Gender']
# Select columns for PersonID1 and rename columns
household2 = household[['PersonID2', 'Age2', 'Gender2']]
household2.columns = ['PersonID', 'Age', 'Gender']
# Concat them together
household_new = pd.concat([household1, household2])
# Merge houshold and schedule df together on PersonID
schedule = schedule.merge(household_new, how='left', left_on='PersonID', right_on='PersonID', validate='many_to_one')
输出
PersonID Person Origin Destination Age Gender
3-1 1 A B 20 M
3-1 1 B A 20 M
13-1 1 C D 45 F
13-1 1 D C 45 F
13-2 2 A B 17 M
13-2 2 B A 17 M