Reindex a multi-level index with missing categories

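I have a DataFrame with two index levels, group and class. I also have a dictionary containing the additional levels that need to be added to both of them. Specifically, I want to add E to the group index, and I want to make sure that g1, g2 and g3 are all present in the class index for every group (so add g3 to group A, g1 to group B, g2 and g3 to group C, g1 and g3 to group D, and g1, g2 and g3 to group E). The total column should be filled with zeros for the rows that are added.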

The original DataFrame is here:

import pandas as pd

df = pd.DataFrame(data={'group': ['A','A','B','B','C','D'],
                        'class': ['g1','g2','g2','g3','g1','g2'],
                        'total': [3,14,12,11,21,9]})

The dictionary with all required categories (its keys map to the df columns) is here:

dic = {'group':['A','B','C','D','E'],
       'class' : ['g1','g2','g3']}

The expected output is here:

expectedOutput = pd.DataFrame(data={'group' : ['A','A','A','B','B','B','C','C','C','D','D','D','E','E','E'],
                        'class': ['g1','g2', 'g3','g1','g2', 'g3','g1','g2', 'g3','g1','g2', 'g3','g1','g2', 'g3'],
                        'total' : [3,14,0, 0,12,11,21,0,0,0,9,0, 0,0,0]})

When I reindex, I am unable to keep the duplicated values, but I need to keep all of them. Any suggestions are welcome, thanks a lot.

Solution with a MultiIndex: create it from the dict with MultiIndex.from_product, then use DataFrame.reindex:

dic = {'group':['A','B','C','D','E'],
       'class' : ['g1','g2','g3']}

mux = pd.MultiIndex.from_product(dic.values(), names=dic)

df = df.set_index(list(dic)).reindex(mux, fill_value=0).reset_index()
print (df)
   group class  total
0      A    g1      3
1      A    g2     14
2      A    g3      0
3      B    g1      0
4      B    g2     12
5      B    g3     11
6      C    g1     21
7      C    g2      0
8      C    g3      0
9      D    g1      0
10     D    g2      9
11     D    g3      0
12     E    g1      0
13     E    g2      0
14     E    g3      0
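
A note on the duplicates mentioned in the question: reindex needs a unique index, so it raises a ValueError as soon as a (group, class) pair is repeated. The sketch below uses a hypothetical frame df_dup (the question's data plus one repeated A/g1 row, not part of the original post) and assumes that summing the repeated totals is acceptable. If every row has to be kept separately, aggregating like this is not appropriate.

```python
import pandas as pd

# Hypothetical input: the question's data plus a second ('A', 'g1') row.
df_dup = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'C', 'D'],
                       'class': ['g1', 'g1', 'g2', 'g2', 'g3', 'g1', 'g2'],
                       'total': [3, 5, 14, 12, 11, 21, 9]})

dic = {'group': ['A', 'B', 'C', 'D', 'E'],
       'class': ['g1', 'g2', 'g3']}

mux = pd.MultiIndex.from_product(list(dic.values()), names=list(dic))

# Reindexing directly fails because the (group, class) index is not unique.
try:
    df_dup.set_index(list(dic)).reindex(mux, fill_value=0)
    reindex_failed = False
except ValueError:
    reindex_failed = True

# Assumption: repeated pairs may be collapsed by summing their totals.
summed = (df_dup.groupby(list(dic))['total'].sum()
                .reindex(mux, fill_value=0)
                .reset_index())
print(summed)
```

After the groupby the index is unique again, so the same from_product / reindex step applies unchanged.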

Or a left join with a DataFrame created by itertools.product:

from  itertools import product

dicDf = pd.DataFrame(product(*dic.values()), columns=dic)

df = dicDf.merge(df, how='left').fillna({'total':0})
print (df)
   group class  total
0      A    g1    3.0
1      A    g2   14.0
2      A    g3    0.0
3      B    g1    0.0
4      B    g2   12.0
5      B    g3   11.0
6      C    g1   21.0
7      C    g2    0.0
8      C    g3    0.0
9      D    g1    0.0
10     D    g2    9.0
11     D    g3    0.0
12     E    g1    0.0
13     E    g2    0.0
14     E    g3    0.0
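
The left join has two properties worth illustrating: it keeps repeated (group, class) rows instead of failing on them, and the NaN values it introduces turn total into floats. The sketch below reuses a hypothetical frame df_dup (the question's data plus one repeated A/g1 row, not part of the original post) and casts total back to integers after filling. The cast assumes every total is a whole number.

```python
from itertools import product

import pandas as pd

# Hypothetical input: the question's data plus a second ('A', 'g1') row.
df_dup = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'C', 'D'],
                       'class': ['g1', 'g1', 'g2', 'g2', 'g3', 'g1', 'g2'],
                       'total': [3, 5, 14, 12, 11, 21, 9]})

dic = {'group': ['A', 'B', 'C', 'D', 'E'],
       'class': ['g1', 'g2', 'g3']}

# All group/class combinations, one row each.
grid = pd.DataFrame(list(product(*dic.values())), columns=list(dic))

# merge joins on the shared columns (group, class); both A/g1 rows survive.
kept = (grid.merge(df_dup, how='left')
            .fillna({'total': 0})
            .astype({'total': int}))   # undo the float upcast caused by NaN
print(kept)
```

The result has 16 rows rather than 15, because the repeated A/g1 pair is matched twice.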

You can use the nice pyjanitor module and its complete method:

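# pip install pyjanitor
import janitor  # importing registers the DataFrame.complete method

# 'D' is already present in df['group'], so only the missing 'E' is appended
(df.complete({'group': list(df['group'].unique()) + ['E']}, 'class')
   .fillna({'total': 0})      # the added rows have NaN in total
   .astype({'total': int})    # restore integers (fillna's downcast argument is deprecated in recent pandas)
)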

Output:

   group class  total
0      A    g1      3
1      A    g2     14
2      A    g3      0
3      B    g1      0
4      B    g2     12
5      B    g3     11
6      C    g1     21
7      C    g2      0
8      C    g3      0
9      D    g1      0
10     D    g2      9
11     D    g3      0
12     E    g1      0
13     E    g2      0
14     E    g3      0