Pandas: merge two DataFrames where df1.column equals, or lies between, df2.column1 and df2.column2
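I can't seem to find an example of this anywhere. I can't tell whether I need to use an index, a boolean mask, or whether a direct merge is possible. I've tried variations of .isin and .between, without success.
Scenario:
Two DataFrames with no common index: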
df1 = pd.DataFrame({'printId': ['x','y', 'z', 'a'],'locCode': [0.9, 1.5, 4.0, 7.8]})
df2 = pd.DataFrame({'assetId': ['1','1a', '2', '2a', '3', '4'], 'locStart': [0.9, 0.9, 1, 1, 4, 8], 'locEnd': [0.9, 0.9, 3, 3, 5, 13]})
Desired result:
df3 = pd.DataFrame({'printId': ['x','x', 'y', 'y', 'z', 'a', 'NaN'], 'locCode': ['0.9', '0.9', '1.5', '1.5', '4.0', '7.8', 'NaN'], 'assetID': ['1', '1a', '2', '2a', '3', 'NaN', '4'], 'locStart': ['0.9', '0.9', '1.0', '1.0', '4.0', 'NaN', '8.0'], 'locEnd':['0.9', '0.9', '3.0', '3.0', '4.0', 'NaN', '13.0']})
df3
How would the pros solve this?
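Edited: on closer inspection, the original answer does not work.
- Where df2 has duplicate locStart/locEnd records but unique assetId values (rows 0, 1 and rows 2, 3), df1 will not merge.
Try this.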
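import pandas as pd
import numpy as np

df1 = pd.DataFrame({'printId': ['01','1A', '2B', '3C'],'locCode': [0.9, 1.5, 4.0, 7.8], 'a1': ['foo', 'foo', 'foo', 'foo']})
df2 = pd.DataFrame({'assetId': ['oo', 'aa', 'zz', 'xx'], 'locCodeStart': [0, 1, 4, 8], 'locCodeEnd': [0.4, 3, 5, 13]})

# object dtype, so string asset ids can be assigned into the column later
df1['assetId'] = pd.Series(np.nan, index=df1.index, dtype=object)
for ind, row in df1.iterrows():
    loc_code = row['locCode']
    # half-open interval test: locCodeStart <= loc_code < locCodeEnd
    temp = (df2['locCodeStart'] <= loc_code) & (loc_code < df2['locCodeEnd'])
    df2_index = temp[temp].index
    # assign only when exactly one interval matches
    if len(df2_index) == 1:
        # a single .loc call avoids chained assignment
        df1.loc[ind, 'assetId'] = df2.loc[df2_index[0], 'assetId']

# outer merge on the shared assetId column
pd.merge(df1, df2, how='outer')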
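The row-by-row lookup above can also be written without an explicit loop. The following is only a sketch of an alternative, not the answerer's method: it assumes the intervals in df2 do not overlap (true for this sample data), because `IntervalIndex.get_indexer` raises an error on overlapping intervals. The names `intervals`, `pos`, and `asset_ids` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'printId': ['01', '1A', '2B', '3C'],
                    'locCode': [0.9, 1.5, 4.0, 7.8]})
df2 = pd.DataFrame({'assetId': ['oo', 'aa', 'zz', 'xx'],
                    'locCodeStart': [0, 1, 4, 8],
                    'locCodeEnd': [0.4, 3, 5, 13]})

# closed='left' mirrors the loop's test: start <= code < end
intervals = pd.IntervalIndex.from_arrays(df2['locCodeStart'].astype(float),
                                         df2['locCodeEnd'].astype(float),
                                         closed='left')

# position of the matching interval for each locCode, -1 when none matches
# (assumption: intervals must not overlap, otherwise this call raises)
pos = intervals.get_indexer(df1['locCode'])

# look up the asset id by position and leave NaN where nothing matched
asset_ids = df2['assetId'].to_numpy(dtype=object)
df1['assetId'] = np.where(pos >= 0, asset_ids[pos], np.nan)

print(df1)
```

With overlapping or duplicated intervals, as in the question's df2, this shortcut does not apply and one of the approaches below is needed.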
This also works if df2 has duplicate locStart/locEnd values (with unique assetId).
It also handles duplicate printId or locCode values in df1.
df1 = pd.DataFrame({'printId': ['x','y', 'z', 'a'],'locCode': [0.9, 1.5, 4.0, 7.8]})
df2 = pd.DataFrame({'assetId': ['1','1a', '2', '2a', '3', '4'], 'locStart': [0.9, 0.9, 1, 1, 4, 8], 'locEnd': [0.9, 0.9, 3, 3, 5, 13]})

merge_id = []
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for i, code in df1.locCode.items():
    filled = True  # stays True while no df2 row has matched this code
    partial = []
    for j, row in df2.iterrows():
        # closed interval test: locStart <= code <= locEnd
        if code >= row.locStart and code <= row.locEnd:
            filled = False
            partial.append(j)
    if filled:
        # -1 is a sentinel that matches no index label of df2
        partial.append(-1)
    merge_id.append(partial)

df1['merge_id'] = merge_id
# one row per matched df2 index, then an outer merge against df2's index
df = df1.explode('merge_id').merge(df2, right_index=True, left_on='merge_id', how='outer')
df = df.reset_index(drop=True).drop('merge_id', axis=1)
print(df)
  printId  locCode assetId  locStart  locEnd
0       x      0.9       1       0.9     0.9
1       x      0.9      1a       0.9     0.9
2       z      4.0       3       4.0     5.0
3       y      1.5       2       1.0     3.0
4       y      1.5      2a       1.0     3.0
5       a      7.8     NaN       NaN     NaN
6     NaN      NaN       4       8.0    13.0
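For small inputs, the same closed-interval matching can be sketched without Python loops by building every pair of rows and filtering. This is an alternative under stated assumptions, not the answer's code: it needs `how='cross'` (pandas 1.2 or later) and it materializes len(df1) * len(df2) rows, so it only suits modest sizes. The names `pairs`, `hits`, `left_only`, and `right_only` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'printId': ['x', 'y', 'z', 'a'],
                    'locCode': [0.9, 1.5, 4.0, 7.8]})
df2 = pd.DataFrame({'assetId': ['1', '1a', '2', '2a', '3', '4'],
                    'locStart': [0.9, 0.9, 1, 1, 4, 8],
                    'locEnd': [0.9, 0.9, 3, 3, 5, 13]})

# every (df1 row, df2 row) pair; the original row labels are kept
# as index_1 / index_2 so unmatched rows can be found afterwards
pairs = df1.reset_index().merge(df2.reset_index(), how='cross',
                                suffixes=('_1', '_2'))

# keep the pairs where locStart <= locCode <= locEnd (inclusive on both ends)
hits = pairs[pairs['locCode'].between(pairs['locStart'], pairs['locEnd'])]

# rows of either frame that matched nothing, to emulate an outer join
left_only = df1[~df1.index.isin(hits['index_1'])]
right_only = df2[~df2.index.isin(hits['index_2'])]

result = pd.concat([hits.drop(columns=['index_1', 'index_2']),
                    left_only, right_only], ignore_index=True)
print(result)
```

The matched rows come out in df1 order, followed by the unmatched df1 rows and then the unmatched df2 rows.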
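You can handle the two cases separately. The first case is a simple merge, for the rows where df1.locCode is present in df2.locStart. For the second case, create bins in both df1 and df2 from all the locStart and locEnd values, then merge on those bins after dropping the cases already handled by the first merge: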
import numpy as np
import pandas as pd
# df1 and df2 are the frames defined in the question

## handle the case where locCode of df1 is equal to locStart of df2
# get the rows in this case
mask_merge = df1['locCode'].isin(df2['locStart'].unique())
# handle them with a direct merge
m1 = df1[mask_merge].merge(df2, right_on='locStart', left_on='locCode', how='inner')

## handle the other cases
# get the sorted unique values from both loc columns of df2
l_unique = pd.concat([df2['locStart'], df2['locEnd']]).sort_values().unique()
bins = [-np.inf] + l_unique.tolist() + [np.inf]
# add a cat column to both dataframes with pd.cut, using the same bins
df2['cat'] = pd.cut(df2['locEnd'], bins=bins, labels=range(len(l_unique)+1))
df1['cat'] = pd.cut(df1['locCode'], bins=bins, labels=range(len(l_unique)+1))
# mask for the assets whose start and end differ
mask_startEnd = df2['locStart'].ne(df2['locEnd'])
# mask for the assets not already merged above
mask_df2merged = ~df2['locStart'].isin(df1['locCode'])
# merge the remaining rows
m2 = df1[~mask_merge].merge(df2[mask_startEnd&mask_df2merged], on='cat', how='outer')
# concat both and drop column cat
res = pd.concat([m1, m2], axis=0, ignore_index=True).drop('cat', axis=1)
print(res)
  printId  locCode assetId  locStart  locEnd
0       x      0.9       1       0.9     0.9
1       x      0.9      1a       0.9     0.9
2       z      4.0       3       4.0     5.0
3       y      1.5       2       1.0     3.0
4       y      1.5      2a       1.0     3.0
5       a      7.8     NaN       NaN     NaN
6     NaN      NaN       4       8.0    13.0
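To see why the binning step lines rows up, it may help to look at the bin labels on their own. The sketch below repeats only the `pd.cut` part on the question's data; `code_cat` and `end_cat` are names introduced here, and the labels are passed as a list rather than a range purely as a precaution.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'printId': ['x', 'y', 'z', 'a'],
                    'locCode': [0.9, 1.5, 4.0, 7.8]})
df2 = pd.DataFrame({'assetId': ['1', '1a', '2', '2a', '3', '4'],
                    'locStart': [0.9, 0.9, 1, 1, 4, 8],
                    'locEnd': [0.9, 0.9, 3, 3, 5, 13]})

# all interval boundaries of df2, sorted: 0.9, 1, 3, 4, 5, 8, 13
l_unique = pd.concat([df2['locStart'], df2['locEnd']]).sort_values().unique()
bins = [-np.inf] + l_unique.tolist() + [np.inf]
labels = list(range(len(l_unique) + 1))

# bins are right-closed: (-inf, 0.9] -> 0, (0.9, 1] -> 1, (1, 3] -> 2, ...
code_cat = pd.cut(df1['locCode'], bins=bins, labels=labels)
end_cat = pd.cut(df2['locEnd'], bins=bins, labels=labels)

print(code_cat.tolist())  # bin of each df1.locCode
print(end_cat.tolist())   # bin of each df2.locEnd
```

Here locCode 1.5 and locEnd 3 both fall in bin 2, which is what lets row y join assets 2 and 2a on the cat column. Note that a code only shares a bin with a locEnd when no other boundary lies between them, so with overlapping or nested intervals this binning can miss matches.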