Optimal conditional joining of pandas dataframe
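I have a situation where I am trying to join `df_a` to `df_b`.
In reality, these dataframes have shapes (389944, 121) and (1098118, 60).
I need to conditionally join these two dataframes if any of the conditions below is true. If several are true, the row only needs to be joined once: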
df_a.player == df_b.handle
df_a.website == df_b.url
df_a.website == df_b.web_addr
df_a.website == df_b.notes
As an example...
df_a:
player | website | merch |
---|---|---|
michael jordan | www.michaeljordan.com | Y |
Lebron James | www.kingjames.com | Y |
Kobe Bryant | www.mamba.com | Y |
Larry Bird | www.larrybird.com | Y |
luka Doncic | www.77.com | N |
df_b:
platform | url | web_addr | notes | handle | followers | following |
---|---|---|---|---|---|---|
Twitter | https://twitter.com/luka7doncic | www.77.com | | luka7doncic | 1500000 | 347 |
Twitter | www.larrybird.com | https://en.wikipedia.org/wiki/Larry_Bird | www.larrybird.com | | | |
Twitter | | https://www.michaeljordansworld.com/ | www.michaeljordan.com | | | |
Twitter | https://twitter.com/kobebryant | https://granitystudios.com/ | https://granitystudios.com/ | Kobe Bryant | 14900000 | 514 |
Twitter | fooman.com | thefoo.com | foobar | foobarman | 1 | 1 |
Twitter | www.whosebug.com | | | | | |
Ideally, `df_a` gets left joined to `df_b` to bring in the `handle`, `followers`, and `following` fields:
player | website | merch | handle | followers | following |
---|---|---|---|---|---|
michael jordan | www.michaeljordan.com | Y | nh | 0 | 0 |
Lebron James | www.kingjames.com | Y | null | null | null |
Kobe Bryant | www.mamba.com | Y | Kobe Bryant | 14900000 | 514 |
Larry Bird | www.larrybird.com | Y | nh | 0 | 0 |
luka Doncic | www.77.com | N | luka7doncic | 1500000 | 347 |
A minimal, reproducible example is below:
import pandas as pd, numpy as np
df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan', 1: 'Lebron James', 2: 'Kobe Bryant', 3: 'Larry Bird', 4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com', 1: 'www.kingjames.com', 2: 'www.mamba.com', 3: 'www.larrybird.com', 4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter', 1: 'Twitter', 2: 'Twitter', 3: 'Twitter', 4: 'Twitter', 5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic', 1: 'www.larrybird.com', 2: np.nan, 3: 'https://twitter.com/kobebryant', 4: 'fooman.com', 5: 'www.whosebug.com'}, 'web_addr': {0: 'www.77.com', 1: 'https://en.wikipedia.org/wiki/Larry_Bird', 2: 'https://www.michaeljordansworld.com/', 3: 'https://granitystudios.com/', 4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan, 1: 'www.larrybird.com', 2: 'www.michaeljordan.com', 3: 'https://granitystudios.com/', 4: 'foobar', 5: np.nan}, 'handle': {0: 'luka7doncic', 1: 'nh', 2: 'nh', 3: 'Kobe Bryant', 4: 'foobarman', 5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})
cols_to_join = ['url', 'web_addr', 'notes']
on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')
res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)
final
However, this yields incorrect results with duplicated columns.
How can I do this more efficiently and with correct results?
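You can pass lists to `left_on` and `right_on`. Note that with lists, `merge` matches a row only when all of the listed key pairs are equal at the same time, so this joins on the conditions combined with AND rather than on any one of them: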
final = df_a.merge(
    right=df_b,
    left_on=['player', 'website', 'website', 'website'],
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)
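How `merge` treats a list of keys can be checked on a toy example. The sketch below uses made-up frames (`left`, `right` and their values are assumptions for illustration, not the question's data) and compares a two-key merge with the two single-key merges.

```python
import pandas as pd

# Toy frames (hypothetical data) with two candidate key pairs.
left = pd.DataFrame({"player": ["a", "b"], "website": ["x.com", "y.com"]})
right = pd.DataFrame({
    "handle": ["a", "zzz"],
    "url": ["nope.com", "y.com"],
    "followers": [10, 20],
})

# Multi-key merge: a row matches only if player == handle AND website == url.
both = left.merge(right, left_on=["player", "website"],
                  right_on=["handle", "url"], how="left")

# Single-key merges: each condition on its own does find a partner.
by_handle = left.merge(right, left_on="player", right_on="handle", how="left")
by_url = left.merge(right, left_on="website", right_on="url", how="left")

print(both["followers"].isna().all())         # True: no row satisfies both conditions
print(by_handle["followers"].notna().sum())   # 1: player "a" matches handle "a"
print(by_url["followers"].notna().sum())      # 1: website "y.com" matches url "y.com"
```

So a list of keys behaves like a composite key. An any-of join needs the single-key matches to be combined afterwards.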
Use:
# the question's df_a already contains the output columns; drop them to get the real input
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)
# melt df_b so that url, web_addr and notes are stacked into a single 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
# the same website can now appear several times; keep only the row with the most followers
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')
print (df2)
    followers  following       handle  platform  variable  \
9    14900000        514  Kobe Bryant   Twitter  web_addr
3    14900000        514  Kobe Bryant   Twitter       url
6     1500000        347  luka7doncic   Twitter  web_addr
12    1500000        347  luka7doncic   Twitter     notes
0     1500000        347  luka7doncic   Twitter       url
10          1          1    foobarman   Twitter  web_addr
4           1          1    foobarman   Twitter       url
16          1          1    foobarman   Twitter     notes
5           0          0           nh   Twitter       url
7           0          0           nh   Twitter  web_addr
8           0          0           nh   Twitter  web_addr
1           0          0           nh   Twitter       url
14          0          0           nh   Twitter     notes

                                     website
9                https://granitystudios.com/
3             https://twitter.com/kobebryant
6                                 www.77.com
12                                       NaN
0            https://twitter.com/luka7doncic
10                                thefoo.com
4                                 fooman.com
16                                    foobar
5                           www.whosebug.com
7   https://en.wikipedia.org/wiki/Larry_Bird
8       https://www.michaeljordansworld.com/
1                          www.larrybird.com
14                     www.michaeljordan.com
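The reshaping step can be easier to follow on a tiny frame. This is only an illustrative sketch: `toy`, its column values, and the names `long_form` and `deduped` are assumptions, not part of the answer's data.

```python
import pandas as pd

# Hypothetical two-row frame with two address-like columns.
toy = pd.DataFrame({
    "handle": ["h1", "h2"],
    "url": ["a.com", "b.com"],
    "notes": ["c.com", "a.com"],
})

# melt stacks url and notes into one 'website' column,
# giving one row per (original row, source column) pair.
long_form = toy.melt(id_vars=["handle"], value_vars=["url", "notes"],
                     var_name="variable", value_name="website")
print(len(long_form))  # 4 = 2 rows x 2 melted columns

# drop_duplicates keeps the first occurrence of each website,
# so the row order beforehand decides which handle wins for "a.com".
deduped = long_form.drop_duplicates("website")
print(deduped)
```

This is why the answer sorts by `followers` before dropping duplicates: the sort fixes which `df_b` row survives when one address appears in several rows or columns.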
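# merge twice: once on player/handle, once on website
# as long as neither merge duplicates rows of df_a, both results have the same row order
# and index, so values still missing after the website merge can be filled from the handle merge
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')
dffin = dffin2.fillna(dffin1)
print (dffin)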
           player                website  merch   followers  following  \
0  michael jordan  www.michaeljordan.com      Y         0.0        0.0
1    Lebron James      www.kingjames.com      Y         NaN        NaN
2     Kobe Bryant          www.mamba.com      Y  14900000.0      514.0
3      Larry Bird      www.larrybird.com      Y         0.0        0.0
4     luka Doncic             www.77.com      N   1500000.0      347.0

        handle
0           nh
1          NaN
2  Kobe Bryant
3           nh
4  luka7doncic
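As a possible alternative, the any-of rule can be written as a sequence of lookups, where each row of `df_a` takes its values from the first condition that finds a partner. The sketch below is one way to do that under stated assumptions: `join_on_any` is a hypothetical helper (not a pandas function), the carried columns must not already exist in the left frame, and the sample frames are retyped from the question with the three output columns left out of `df_a`.

```python
import numpy as np
import pandas as pd

def join_on_any(left, right, pairs, carry):
    """Left-join the `carry` columns of `right` onto `left`.

    `pairs` is a list of (left_col, right_col) tuples tried in order; each row
    of `left` takes values from the first pair that finds a match.
    Hypothetical helper for illustration, not part of pandas.
    """
    # Object dtype so strings and numbers can share the staging frame.
    filled = pd.DataFrame(index=left.index, columns=carry, dtype=object)
    matched = pd.Series(False, index=left.index)
    for left_col, right_col in pairs:
        # One row per distinct key, so a left row can never be duplicated.
        lookup = (right.dropna(subset=[right_col])
                       .drop_duplicates(right_col)
                       .set_index(right_col, drop=False))
        hit = left[left_col].isin(lookup.index) & ~matched
        if hit.any():
            keys = left.loc[hit, left_col].to_numpy()
            filled.loc[hit, carry] = lookup.loc[keys, carry].to_numpy()
            matched |= hit
    return left.join(filled)

# Sample data retyped from the question.
df_a = pd.DataFrame({
    "player": ["michael jordan", "Lebron James", "Kobe Bryant", "Larry Bird", "luka Doncic"],
    "website": ["www.michaeljordan.com", "www.kingjames.com", "www.mamba.com",
                "www.larrybird.com", "www.77.com"],
    "merch": ["Y", "Y", "Y", "Y", "N"],
})
df_b = pd.DataFrame({
    "platform": ["Twitter"] * 6,
    "url": ["https://twitter.com/luka7doncic", "www.larrybird.com", np.nan,
            "https://twitter.com/kobebryant", "fooman.com", "www.whosebug.com"],
    "web_addr": ["www.77.com", "https://en.wikipedia.org/wiki/Larry_Bird",
                 "https://www.michaeljordansworld.com/", "https://granitystudios.com/",
                 "thefoo.com", np.nan],
    "notes": [np.nan, "www.larrybird.com", "www.michaeljordan.com",
              "https://granitystudios.com/", "foobar", np.nan],
    "handle": ["luka7doncic", "nh", "nh", "Kobe Bryant", "foobarman", "nh"],
    "followers": [1500000, 0, 0, 14900000, 1, 0],
    "following": [347, 0, 0, 514, 1, 0],
})

pairs = [("player", "handle"), ("website", "url"),
         ("website", "web_addr"), ("website", "notes")]
result = join_on_any(df_a, df_b, pairs, ["handle", "followers", "following"])
print(result)
```

Two design choices are worth flagging. The order of `pairs` sets the priority (here `player`/`handle` is tried first, whereas the answer above lets the website match win and fills the rest from the handle merge), and `drop_duplicates` keeps the first `df_b` row per key, so sorting `df_b` beforehand controls which row is used. The carried columns come back with object dtype, and this sketch has not been benchmarked on frames of the size mentioned in the question.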