Pandas: concatenate a MultiIndex DataFrame with another DataFrame that is sorted differently

Suppose you have the following two DataFrames, dfs and df1.

dfs is defined by:
import pandas as pd

dfs = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15]
})

# Year and Provider become the two levels of a MultiIndex
dfs = dfs.groupby(['Year', 'Provider']).sum()

df1 is defined by:
df1 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('aabbc'),
    'Gender': list('mfmfm'),
    'Accepted app': ['990', '1180', '435', '405', '985']
})

I want to merge these two DataFrames to get something like this:

df2 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsdabcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15, 2570, 1020, 2140, 120, 15],
    'Gender': ['m', 'm', 'm', 'Nan', 'Nan', 'f', 'f', 'Nan', 'Nan', 'Nan'],
    'Accepted app': ['990', '435', '985', 'Nan', 'Nan', '1180', '405', 'Nan', 'Nan', 'Nan']
})

I don't know how to keep the MultiIndex of dfs, or how to merge the two.

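You can repeat the rows of dfs once per unique value of Gender with concat, turn the index levels back into columns with DataFrame.reset_index, and then merge with a left join: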

df = (pd.concat([dfs.assign(Gender=c) for c in df1['Gender'].unique()])
        .reset_index()
        .merge(df1, on=['Provider','Year','Gender'], how='left'))
print(df)
   Year Provider  Accepted Gender Accepted app
0  2006        a      2570      m          990
1  2006        b      1020      m          435
2  2006        c      2140      m          985
3  2006        d        15      m          NaN
4  2006        s       120      m          NaN
5  2006        a      2570      f         1180
6  2006        b      1020      f          405
7  2006        c      2140      f          NaN
8  2006        d        15      f          NaN
9  2006        s       120      f          NaN
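To make the chained expression above easier to follow, here is a minimal sketch that splits it into its three steps. The intermediate names (genders, repeated, merged) are only illustrative and are not part of the original answer; the sample data is the same as in the question.

```python
import pandas as pd

dfs = pd.DataFrame({
    'Year': [2006] * 5,
    'Provider': list('abcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15],
}).groupby(['Year', 'Provider']).sum()

df1 = pd.DataFrame({
    'Year': [2006] * 5,
    'Provider': list('aabbc'),
    'Gender': list('mfmfm'),
    'Accepted app': ['990', '1180', '435', '405', '985'],
})

# Step 1: one copy of dfs per distinct Gender value, stacked vertically.
# unique() keeps the order of first appearance, so 'm' comes before 'f'.
genders = df1['Gender'].unique()
repeated = pd.concat([dfs.assign(Gender=g) for g in genders])

# Step 2: move the Year/Provider index levels back into ordinary columns
# so they can serve as merge keys together with Gender.
repeated = repeated.reset_index()

# Step 3: a left join keeps every row of `repeated`; rows with no
# matching (Provider, Year, Gender) in df1 get NaN in 'Accepted app'.
merged = repeated.merge(df1, on=['Provider', 'Year', 'Gender'], how='left')
print(merged)
```

Because the join is a left join on the repeated frame, the row order follows dfs (sorted by groupby) rather than df1, which is why the differing order of df1 does not matter.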

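If you also want Gender to be a missing value in the rows that have no match in df1, pass indicator=True to DataFrame.merge to mark where each row came from, then mask Gender in the left_only rows: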

df = (pd.concat([dfs.assign(Gender=c) for c in df1['Gender'].unique()])
        .reset_index()
        .merge(df1, on=['Provider','Year','Gender'], how='left', indicator=True)
        .assign(Gender=lambda x: x['Gender'].mask(x['_merge'].eq('left_only')))
        .drop('_merge', axis=1))

print(df)
   Year Provider  Accepted Gender Accepted app
0  2006        a      2570      m          990
1  2006        b      1020      m          435
2  2006        c      2140      m          985
3  2006        d        15    NaN          NaN
4  2006        s       120    NaN          NaN
5  2006        a      2570      f         1180
6  2006        b      1020      f          405
7  2006        c      2140    NaN          NaN
8  2006        d        15    NaN          NaN
9  2006        s       120    NaN          NaN
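The question also asks how to keep the MultiIndex of dfs. One possible approach, shown here as a sketch rather than as part of the original answer, is to set Year and Provider as the index again after the merge with DataFrame.set_index. The name indexed is only illustrative.

```python
import pandas as pd

dfs = pd.DataFrame({
    'Year': [2006] * 5,
    'Provider': list('abcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15],
}).groupby(['Year', 'Provider']).sum()

df1 = pd.DataFrame({
    'Year': [2006] * 5,
    'Provider': list('aabbc'),
    'Gender': list('mfmfm'),
    'Accepted app': ['990', '1180', '435', '405', '985'],
})

df = (pd.concat([dfs.assign(Gender=c) for c in df1['Gender'].unique()])
        .reset_index()
        .merge(df1, on=['Provider', 'Year', 'Gender'], how='left', indicator=True)
        .assign(Gender=lambda x: x['Gender'].mask(x['_merge'].eq('left_only')))
        .drop('_merge', axis=1))

# Put Year and Provider back into the index, as they were in dfs.
# Every (Year, Provider) pair now occurs twice (once per Gender value),
# so the resulting MultiIndex is no longer unique.
indexed = df.set_index(['Year', 'Provider'])
print(indexed)
```

Note that the restored index contains duplicate entries, so label-based lookups such as indexed.loc[(2006, 'a')] return two rows instead of one.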