基于 Python DataFrame 中多列的 Vlookup
Vlookup based on multiple columns in Python DataFrame
我有两个数据框。我正在尝试从第二个数据帧中查找 'Hobby' 列并更新第一个数据帧的 'Interests' 列。请注意列:Key、Employee 和 Industry 应该在两个数据框之间完全匹配,但是对于城市,即使城市的第一部分在两个数据框之间匹配也应该是可以接受的。虽然,它在 Excel 中是非常先进的,但在 Python 上实现它看起来有点复杂。关于如何进行的任何提示都会非常有帮助。 (请参阅下面的屏幕截图了解预期的输出。)
data1=[['AC32456','NYC-URBAN','JON','BANKING','SINGING'],['AD45678','WDC-RURAL','XING','FINANCE','DANCING'],
['DE43216', 'LONDON-URBAN','EDWARDS','IT','READING'],['RT45327','SINGAPORE-URBAN','WOLF','SPORTS','WALKING'],
['Rs454457','MUMBAI-RURAL','NEMBIAR','IT','ZUDO']]
data2=[['AH56245','NYC','MIKE','BANKING','BIKING'],['AD45678','WDC','XING','FINANCE','TREKKING'],
['DE43216', 'LONDON-URBAN','EDWARDS','FINANCE','SLEEPING'],['RT45327','SINGAPORE','WOLF','SPORTS','DANCING'],
['RS454457','MUMBAI','NEMBIAR','IT','ZUDO']]
List1=['Key','City','Employee', 'Industry', 'Interests']
List2=['Key','City','Employee', 'Industry', 'Hobby']
df1=pd.DataFrame(data1, columns=List1)
df2=pd.DataFrame(data2,columns=List2)
将 df1
的索引设置为 Key
(您可以将索引设置为您想要匹配的任何内容)并使用 update:
# get the first part of the city
df1['City_key'] = df1['City'].str.split('-', expand=True)[0]
df2['City_key'] = df2['City'].str.split('-', expand=True)[0]
# set index
df1 = df1.set_index(['Key', 'Employee', 'Industry', 'City_key'])
# update
df1['Interests'].update(df2.set_index(['Key', 'Employee', 'Industry', 'City_key'])['Hobby'])
# reset index and drop the City_key column
new_df = df1.reset_index().drop(columns=['City_key'])
Key Employee Industry City Interests
0 AC32456 JON BANKING NYC-URBAN SINGING
1 AD45678 XING FINANCE WDC-RURAL TREKKING
2 DE43216 EDWARDS IT LONDON-URBAN READING
3 RT45327 WOLF SPORTS SINGAPORE-URBAN DANCING
4 Rs454457 NEMBIAR IT MUMBAI-RURAL ZUDO
我有两个数据框。我正在尝试从第二个数据帧中查找 'Hobby' 列并更新第一个数据帧的 'Interests' 列。请注意列:Key、Employee 和 Industry 应该在两个数据框之间完全匹配,但是对于城市,即使城市的第一部分在两个数据框之间匹配也应该是可以接受的。虽然,它在 Excel 中是非常先进的,但在 Python 上实现它看起来有点复杂。关于如何进行的任何提示都会非常有帮助。 (请参阅下面的屏幕截图了解预期的输出。)
data1=[['AC32456','NYC-URBAN','JON','BANKING','SINGING'],['AD45678','WDC-RURAL','XING','FINANCE','DANCING'],
['DE43216', 'LONDON-URBAN','EDWARDS','IT','READING'],['RT45327','SINGAPORE-URBAN','WOLF','SPORTS','WALKING'],
['Rs454457','MUMBAI-RURAL','NEMBIAR','IT','ZUDO']]
data2=[['AH56245','NYC','MIKE','BANKING','BIKING'],['AD45678','WDC','XING','FINANCE','TREKKING'],
['DE43216', 'LONDON-URBAN','EDWARDS','FINANCE','SLEEPING'],['RT45327','SINGAPORE','WOLF','SPORTS','DANCING'],
['RS454457','MUMBAI','NEMBIAR','IT','ZUDO']]
List1=['Key','City','Employee', 'Industry', 'Interests']
List2=['Key','City','Employee', 'Industry', 'Hobby']
df1=pd.DataFrame(data1, columns=List1)
df2=pd.DataFrame(data2,columns=List2)
将 df1
的索引设置为 Key
(您可以将索引设置为您想要匹配的任何内容)并使用 update:
# get the first part of the city
df1['City_key'] = df1['City'].str.split('-', expand=True)[0]
df2['City_key'] = df2['City'].str.split('-', expand=True)[0]
# set index
df1 = df1.set_index(['Key', 'Employee', 'Industry', 'City_key'])
# update
df1['Interests'].update(df2.set_index(['Key', 'Employee', 'Industry', 'City_key'])['Hobby'])
# reset index and drop the City_key column
new_df = df1.reset_index().drop(columns=['City_key'])
Key Employee Industry City Interests
0 AC32456 JON BANKING NYC-URBAN SINGING
1 AD45678 XING FINANCE WDC-RURAL TREKKING
2 DE43216 EDWARDS IT LONDON-URBAN READING
3 RT45327 WOLF SPORTS SINGAPORE-URBAN DANCING
4 Rs454457 NEMBIAR IT MUMBAI-RURAL ZUDO