How to merge two dataframes according to their indexes?

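I'm trying to use Pandas for data analysis. I need to merge two dataframes according to their indexes. However, their indexes are not exactly the same. The rule is that if an index value of df2 is a substring of an index value of df1, the two rows should be merged. For example, df1.index == ['a/aa/aaa', 'b/bb/bbb', 'c/cc/ccc'] and df2.index == ['bb/bbb', 'ccc', 'hello']. Here df1 and df2 have two index values in common under that rule, and the merge should be based on them. What should I do?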

Since you have a known delimiter, you can split on that delimiter, merge on the resulting pieces, and then add the original data back in.

import pandas as pd

# sample data
df1 = pd.DataFrame({'ColumnA': [1,2,3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
df2 = pd.DataFrame({'ColumnB': [4,5,6]}, index=['bb/bbb', 'ccc', 'hello'])

# move the original index into a column named 'index'
# make a real copy of each dataframe to preserve the original data
# reset the index of the copy to keep track of the original row number
df1 = df1.reset_index()
copy_df1 = df1.copy()
copy_df1.index.name = 'row_df1'
copy_df1 = copy_df1.reset_index()

df2 = df2.reset_index()
copy_df2 = df2.copy()
copy_df2.index.name = 'row_df2'
copy_df2 = copy_df2.reset_index()

# split on known delimiter and explode into rows for each substring
copy_df1['index'] = copy_df1['index'].str.split('/')
copy_df1 = copy_df1.explode('index')

copy_df2['index'] = copy_df2['index'].str.split('/')
copy_df2 = copy_df2.explode('index')

# merge based on substrings, drop duplicates in case of multiple substring matches
mrg = copy_df1[['row_df1','index']].merge(copy_df2[['row_df2','index']]).drop(columns='index')
mrg = mrg.drop_duplicates()

# merge back in original details
mrg = mrg.merge(df1, left_on='row_df1', right_index=True)
mrg = mrg.merge(df2, left_on='row_df2', right_index=True, suffixes=('_df1','_df2'))

The final output is:

   row_df1  row_df2 index_df1  ColumnA index_df2  ColumnB
0        1        0  b/bb/bbb        2    bb/bbb        4
2        2        1  c/cc/ccc        3       ccc        5
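
One thing worth checking before relying on the split-and-explode approach: it pairs two rows whenever they share at least one delimiter-separated piece, which is not quite the same as the substring rule in the question. The sketch below uses plain Python and two hypothetical helpers, `shares_token` and `is_substring` (neither is a pandas function), to show where the two rules agree and where they differ. The value 'bb/xyz' is made up for illustration.

```python
def shares_token(full, candidate, sep='/'):
    # What the explode-and-merge approach effectively tests:
    # do the two strings have at least one piece in common?
    return bool(set(full.split(sep)) & set(candidate.split(sep)))


def is_substring(full, candidate):
    # The rule stated in the question: candidate must appear inside full.
    return candidate in full


# Sample value from the question: both rules agree.
print(shares_token('b/bb/bbb', 'bb/bbb'), is_substring('b/bb/bbb', 'bb/bbb'))  # True True

# Made-up value: shares the piece 'bb' but is not a substring.
print(shares_token('b/bb/bbb', 'bb/xyz'), is_substring('b/bb/bbb', 'bb/xyz'))  # True False
```

If your index values can overlap only partially like this, a strict substring test may be the safer choice.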

Starting from your DataFrames:

>>> df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa','b/bb/bbb', 'c/cc/ccc'])
>>> df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])

And turning the index into a column:

>>> df1 = df1.reset_index(drop=False)
>>> df1 = df1.rename(columns={'index': 'value_df1'})
>>> df1
    value_df1   col_a
0   a/aa/aaa    1
1   b/bb/bbb    2
2   c/cc/ccc    3

>>> df2 = df2.reset_index(drop=False)
>>> df2 = df2.rename(columns={'index': 'value_df2'})
>>> df2
    value_df2       col_b
0   bb/bbb          4
1   ccc             5
2   hello           6

We merge the two DataFrames on a constant join column, which pairs every row of df1 with every row of df2:

>>> df1['join'] = 1
>>> df2['join'] = 1
>>> dfFull = df1.merge(df2, on='join').drop('join', axis=1)
>>> dfFull
    value_df1   col_a   value_df2       col_b
0   a/aa/aaa    1       bb/bbb          4
1   a/aa/aaa    1       ccc             5
2   a/aa/aaa    1       hello           6
3   b/bb/bbb    2       bb/bbb          4
4   b/bb/bbb    2       ccc             5
5   b/bb/bbb    2       hello           6
6   c/cc/ccc    3       bb/bbb          4
7   c/cc/ccc    3       ccc             5
8   c/cc/ccc    3       hello           6
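
As a side note, on pandas 1.2 or later the same all-pairs table can be built without the temporary join column by passing `how='cross'` to `merge`. A minimal sketch, assuming the index has already been moved into the `value_df1` / `value_df2` columns as above (`df_full` is just an illustrative name):

```python
import pandas as pd

df1 = pd.DataFrame({'value_df1': ['a/aa/aaa', 'b/bb/bbb', 'c/cc/ccc'],
                    'col_a': [1, 2, 3]})
df2 = pd.DataFrame({'value_df2': ['bb/bbb', 'ccc', 'hello'],
                    'col_b': [4, 5, 6]})

# how='cross' pairs every row of df1 with every row of df2 (requires pandas >= 1.2)
df_full = df1.merge(df2, how='cross')
print(df_full.shape)  # (9, 4)
```

Either way the intermediate table has len(df1) * len(df2) rows, so this is practical mainly for small inputs.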

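Then we use apply to check, for each pair, whether the df2 value occurs inside the df1 value (str.find returns -1 when it does not):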

>>> df2.drop('join', axis=1, inplace=True)
>>> dfFull['match'] = dfFull.apply(lambda x: x['value_df1'].find(x['value_df2']), axis=1).ge(0)
>>> dfFull
    value_df1   col_a   value_df2       col_b   match
0   a/aa/aaa    1       bb/bbb          4       False
1   a/aa/aaa    1       ccc             5       False
2   a/aa/aaa    1       hello           6       False
3   b/bb/bbb    2       bb/bbb          4       True
4   b/bb/bbb    2       ccc             5       False
5   b/bb/bbb    2       hello           6       False
6   c/cc/ccc    3       bb/bbb          4       False
7   c/cc/ccc    3       ccc             5       True
8   c/cc/ccc    3       hello           6       False 

Keeping only the rows where match is True and dropping the match column, we get the expected result:

>>> dfFull[dfFull['match']].drop(['match'], axis=1)
    value_df1   col_a   value_df2   col_b
3   b/bb/bbb    2       bb/bbb      4       
7   c/cc/ccc    3       ccc         5       

This solution was inspired by this post
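
For reuse, the cross-join-and-filter steps could be wrapped in a small function. The following is only a sketch under a few assumptions: `merge_on_substring` is a hypothetical helper name, the column names `value_left` and `value_right` are made up, both indexes are assumed to hold strings, and `how='cross'` requires pandas 1.2 or later. It uses Python's `in` operator in place of `apply` with `str.find`.

```python
import pandas as pd


def merge_on_substring(left, right):
    """Pair rows where a right index value is a substring of a left index value.

    Hypothetical helper; leaves the input dataframes unmodified.
    """
    # Move each index into a named column without touching the inputs.
    l = left.rename_axis('value_left').reset_index()
    r = right.rename_axis('value_right').reset_index()

    # All row pairs (len(left) * len(right) rows).
    pairs = l.merge(r, how='cross')

    # Keep a pair only when the right value occurs inside the left value.
    mask = [sub in full
            for full, sub in zip(pairs['value_left'], pairs['value_right'])]
    return pairs.loc[mask].reset_index(drop=True)


df1 = pd.DataFrame({'col_a': [1, 2, 3]}, index=['a/aa/aaa', 'b/bb/bbb', 'c/cc/ccc'])
df2 = pd.DataFrame({'col_b': [4, 5, 6]}, index=['bb/bbb', 'ccc', 'hello'])

result = merge_on_substring(df1, df2)
print(result)
```

With the sample data this should keep the same two pairs as above: 'b/bb/bbb' with 'bb/bbb', and 'c/cc/ccc' with 'ccc'.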