Pandas: merge DataFrames conditionally, depending on the value in a column

Any help would be appreciated.

I have two DataFrames.

The first DataFrame, schedule, holds each person's activity schedule, as follows:

PersonID   Person      Origin    Destination
3-1          1          A           B
3-1          1          B           A
13-1         1          C           D
13-1         1          D           C
13-2         2          A           B
13-2         2          B           A

I also have another DataFrame, household, which contains the details of each person/agent:

PersonID1    Age1    Gender1     PersonID2     Age2    Gender2
3-1           20       M           NaN         NaN       NaN
13-1          45       F          13-2          17        M

I want to do a VLOOKUP-style lookup across the two using pd.merge. Since the lookup (merge) depends on the person's ID, I tried to make it conditional.

def merging(row):
   if row['Person'] == 1:
       row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age1', 'Gender1'])
   else:
       row = pd.merge(row, household, how='left', left_on=['PersonID'], right_on=['Age2','Gender2'])
   return row

schedule_merged = schedule.apply(merging, axis=1)

However, for some reason it just doesn't work. The error says ValueError: len(right_on) must equal len(left_on). My goal is to end up with data like this:

PersonID   Person      Origin    Destination    Age    Gender
3-1          1          A           B           20       M
3-1          1          B           A           20       M
13-1         1          C           D           45       F
13-1         1          D           C           45       F
13-2         2          A           B           17       M
13-2         2          B           A           17       M

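I think I messed up the pd.merge line. Using VLOOKUP in Excel might be more straightforward, but it is too heavy for my PC because I have to apply it to about a hundred thousand rows. How can I do this correctly? Thanks!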
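On the error itself: pd.merge pairs the entries of left_on and right_on position by position, so the two lists must have the same length, and they should name the lookup key (the ID), not the columns to be fetched. The following is a minimal sketch of that behavior, assuming trimmed-down copies of the two frames from the question (the variable names error_message and first_slot are only for illustration):

```python
import numpy as np
import pandas as pd

# Trimmed-down copies of the question's frames (illustrative only)
schedule = pd.DataFrame({
    'PersonID': ['3-1', '13-1', '13-2'],
    'Origin': ['A', 'C', 'A'],
})
household = pd.DataFrame({
    'PersonID1': ['3-1', '13-1'], 'Age1': [20, 45], 'Gender1': ['M', 'F'],
    'PersonID2': [np.nan, '13-2'], 'Age2': [np.nan, 17], 'Gender2': [np.nan, 'M'],
})

# One key on the left, two on the right: pandas cannot pair them up
error_message = None
try:
    pd.merge(schedule, household, how='left',
             left_on=['PersonID'], right_on=['Age1', 'Gender1'])
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Matching ID against ID works, but it only covers the first person slot;
# the row for '13-2' finds no match in PersonID1 and is left as NaN
first_slot = pd.merge(schedule, household[['PersonID1', 'Age1', 'Gender1']],
                      how='left', left_on='PersonID', right_on='PersonID1')
print(first_slot)
```

This is why a single merge against the wide household table cannot cover both person slots: the IDs live in two different columns.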

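If the real dataset is no more complex than the given example, I would do it like this. Otherwise, I would suggest looking at pd.melt() for more complex reshaping.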

import pandas as pd
import numpy as np

# Create Dummy schedule DataFrame
d = {'PersonID': ['3-1', '3-1', '13-1', '13-1', '13-2', '13-2'], 'Person': ['1', '1', '1', '1', '2', '2'], 'Origin': ['A', 'B', 'C', 'D', 'A', 'B'], 'Destination': ['B', 'A', 'D', 'C', 'B', 'A']}
schedule = pd.DataFrame(data=d)
schedule

# Create Dummy household DataFrame
d = {'PersonID1': ['3-1', '13-1'], 'Age1': ['20', '45'], 'Gender1': ['M', 'F'], 'PersonID2': [np.nan, '13-2'], 'Age2': [np.nan, '17'], 'Gender2': [np.nan, 'M']}
household = pd.DataFrame(data=d)
household

# Select the columns for the first person slot and rename them
household1 = household[['PersonID1', 'Age1', 'Gender1']].copy()
household1.columns = ['PersonID', 'Age', 'Gender']
# Select the columns for the second person slot and rename them
household2 = household[['PersonID2', 'Age2', 'Gender2']].copy()
household2.columns = ['PersonID', 'Age', 'Gender']

# Stack both slots into one long lookup table, dropping empty slots (NaN PersonID)
household_new = pd.concat([household1, household2], ignore_index=True).dropna(subset=['PersonID'])

# Merge household and schedule together on PersonID
schedule = schedule.merge(household_new, how='left', on='PersonID', validate='many_to_one')

Output

PersonID   Person      Origin    Destination    Age    Gender
3-1          1          A           B           20       M
3-1          1          B           A           20       M
13-1         1          C           D           45       F
13-1         1          D           C           45       F
13-2         2          A           B           17       M
13-2         2          B           A           17       M
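
If a household can have more than two person slots, renaming and concatenating each slot by hand gets tedious. One possible alternative is pd.wide_to_long, which reshapes all the numbered column groups in one call. This is a sketch under the assumption that the columns always follow the PersonIDn / Agen / Gendern naming pattern; the helper column names HouseholdRow and Slot are made up for this example.

```python
import numpy as np
import pandas as pd

schedule = pd.DataFrame({
    'PersonID': ['3-1', '3-1', '13-1', '13-1', '13-2', '13-2'],
    'Person': ['1', '1', '1', '1', '2', '2'],
    'Origin': ['A', 'B', 'C', 'D', 'A', 'B'],
    'Destination': ['B', 'A', 'D', 'C', 'B', 'A'],
})
household = pd.DataFrame({
    'PersonID1': ['3-1', '13-1'], 'Age1': ['20', '45'], 'Gender1': ['M', 'F'],
    'PersonID2': [np.nan, '13-2'], 'Age2': [np.nan, '17'], 'Gender2': [np.nan, 'M'],
})

# wide_to_long needs a column that uniquely identifies each wide row
# ('HouseholdRow' is a hypothetical name chosen for this sketch)
wide = household.reset_index().rename(columns={'index': 'HouseholdRow'})

# Collapse PersonID1/PersonID2, Age1/Age2, Gender1/Gender2 into one row per person
household_long = (
    pd.wide_to_long(wide, stubnames=['PersonID', 'Age', 'Gender'],
                    i='HouseholdRow', j='Slot')
    .dropna(subset=['PersonID'])  # drop empty person slots
    .reset_index()
)

# Same lookup as before: many schedule rows to one household row per PersonID
merged = schedule.merge(household_long[['PersonID', 'Age', 'Gender']],
                        how='left', on='PersonID', validate='many_to_one')
print(merged)
```

Adding a third slot (PersonID3, Age3, Gender3) would then need no code changes, as long as the column names keep the same pattern.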