How to transform pivot table with multi level column and axis names to rows?

I have the dataframe shown below, which was created with pd.pivot_table, passing aggfunc='sum':

Result of the pivoted dataframe:
                                          counts           
RACE                   BLACK OR AFRICAN AMERICAN WHITE  All
ETHNIC                                                     
HISPANIC OR LATINO                            11    41   52
NOT HISPANIC OR LATINO                        15    71   86
All                                           26   112  138

You can run the code below to load the above dataframe into the variable df:

df = pd.DataFrame.from_dict(
    {('counts', 'BLACK OR AFRICAN AMERICAN'): {'HISPANIC OR LATINO': 11, 'NOT HISPANIC OR LATINO': 15, 'All': 26},
     ('counts', 'WHITE'): {'HISPANIC OR LATINO': 41, 'NOT HISPANIC OR LATINO': 71, 'All': 112},
     ('counts', 'All'): {'HISPANIC OR LATINO': 52, 'NOT HISPANIC OR LATINO': 86, 'All': 138}}
).rename_axis([None, 'RACE'], axis=1).rename_axis('ETHNIC', axis=0)

I am trying to transform this dataframe as described below:

Expected output:

  level varName                      value counts
0     2  ETHNIC         HISPANIC OR LATINO    52
1     1    RACE  BLACK OR AFRICAN AMERICAN    11
2     1    RACE                      WHITE    41
3     2  ETHNIC     NOT HISPANIC OR LATINO    86
4     1    RACE  BLACK OR AFRICAN AMERICAN    15
5     1    RACE                      WHITE    71

The varName field above holds the axis name of the column or row; level is 1 for the column axis and 2 for the row axis. The rows for the All margin are optional, so it does not matter whether they appear in the resulting dataframe (although, on reflection, it would be better if they were there).

I have looked at the following SO threads, but I did not find them very relevant to my problem.

One way I was able to make progress is to fetch the values manually and build the required dataframe by hand, as shown below:

df.index.names
# output: FrozenList(['ETHNIC'])
df.columns.names
#output: FrozenList([None, 'RACE'])
[y for x,y in df][:-1]
#output: ['BLACK OR AFRICAN AMERICAN', 'WHITE']
[x for x in df.index][:-1]
#output: ['HISPANIC OR LATINO', 'NOT HISPANIC OR LATINO']
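Those pieces can then be stitched together by hand. Below is a minimal self-contained sketch of that assembly (the loop is my own illustration of the idea, not code from the question; it rebuilds df first and skips the All margin rows, as in the expected output):

```python
import pandas as pd

# rebuild the pivoted frame from the question
df = pd.DataFrame.from_dict(
    {('counts', 'BLACK OR AFRICAN AMERICAN'): {'HISPANIC OR LATINO': 11, 'NOT HISPANIC OR LATINO': 15, 'All': 26},
     ('counts', 'WHITE'): {'HISPANIC OR LATINO': 41, 'NOT HISPANIC OR LATINO': 71, 'All': 112},
     ('counts', 'All'): {'HISPANIC OR LATINO': 52, 'NOT HISPANIC OR LATINO': 86, 'All': 138}}
).rename_axis([None, 'RACE'], axis=1).rename_axis('ETHNIC', axis=0)

rows = []
races = [race for _, race in df.columns if race != 'All']   # column labels minus the margin
for ethnic in [e for e in df.index if e != 'All']:          # row labels minus the margin
    # level-2 row: the ETHNIC subtotal comes from the 'All' margin column
    rows.append([2, 'ETHNIC', ethnic, df.loc[ethnic, ('counts', 'All')]])
    # level-1 rows: one per RACE value
    for race in races:
        rows.append([1, 'RACE', race, df.loc[ethnic, ('counts', race)]])

out = pd.DataFrame(rows, columns=['level', 'varName', 'value', 'counts'])
```

This produces the expected six rows in the expected order, since from_dict preserves the insertion order of the index labels.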

Edit:

Here is the data before creating the pivot table:

data = pd.DataFrame.from_dict({'ETHNIC': {0: 'NOT HISPANIC OR LATINO', 1: 'NOT HISPANIC OR LATINO', 2: 'NOT HISPANIC OR LATINO', 3: 'NOT HISPANIC OR LATINO', 4: 'NOT HISPANIC OR LATINO', 5: 'NOT HISPANIC OR LATINO', 6: 'NOT HISPANIC OR LATINO', 7: 'NOT HISPANIC OR LATINO', 8: 'NOT HISPANIC OR LATINO', 9: 'HISPANIC OR LATINO', 10: 'HISPANIC OR LATINO', 11: 'HISPANIC OR LATINO', 12: 'HISPANIC OR LATINO', 13: 'HISPANIC OR LATINO', 14: 'NOT HISPANIC OR LATINO', 15: 'NOT HISPANIC OR LATINO', 16: 'NOT HISPANIC OR LATINO', 17: 'NOT HISPANIC OR LATINO', 18: 'NOT HISPANIC OR LATINO', 19: 'NOT HISPANIC OR LATINO', 20: 'NOT HISPANIC OR LATINO', 21: 'HISPANIC OR LATINO', 22: 'HISPANIC OR LATINO', 23: 'NOT HISPANIC OR LATINO', 24: 'NOT HISPANIC OR LATINO', 25: 'NOT HISPANIC OR LATINO', 26: 'HISPANIC OR LATINO', 27: 'HISPANIC OR LATINO', 28: 'HISPANIC OR LATINO', 29: 'HISPANIC OR LATINO', 30: 'HISPANIC OR LATINO', 31: 'HISPANIC OR LATINO', 32: 'NOT HISPANIC OR LATINO', 33: 'HISPANIC OR LATINO', 34: 'NOT HISPANIC OR LATINO', 35: 'NOT HISPANIC OR LATINO', 36: 'NOT HISPANIC OR LATINO', 37: 'NOT HISPANIC OR LATINO', 38: 'NOT HISPANIC OR LATINO', 39: 'NOT HISPANIC OR LATINO', 40: 'NOT HISPANIC OR LATINO', 41: 'NOT HISPANIC OR LATINO', 42: 'HISPANIC OR LATINO', 43: 'NOT HISPANIC OR LATINO', 44: 'NOT HISPANIC OR LATINO', 45: 'NOT HISPANIC OR LATINO', 46: 'HISPANIC OR LATINO', 47: 'HISPANIC OR LATINO', 48: 'HISPANIC OR LATINO', 49: 'HISPANIC OR LATINO', 50: 'NOT HISPANIC OR LATINO', 51: 'NOT HISPANIC OR LATINO', 52: 'NOT HISPANIC OR LATINO', 53: 'HISPANIC OR LATINO', 54: 'HISPANIC OR LATINO', 55: 'HISPANIC OR LATINO', 56: 'NOT HISPANIC OR LATINO', 57: 'HISPANIC OR LATINO', 58: 'HISPANIC OR LATINO', 59: 'NOT HISPANIC OR LATINO', 60: 'NOT HISPANIC OR LATINO', 61: 'HISPANIC OR LATINO', 62: 'HISPANIC OR LATINO', 63: 'HISPANIC OR LATINO', 64: 'HISPANIC OR LATINO', 65: 'NOT HISPANIC OR LATINO', 66: 'NOT HISPANIC OR LATINO', 67: 'NOT HISPANIC OR LATINO', 68: 'NOT HISPANIC OR LATINO', 69: 
'HISPANIC OR LATINO', 70: 'NOT HISPANIC OR LATINO', 71: 'NOT HISPANIC OR LATINO', 72: 'HISPANIC OR LATINO', 73: 'HISPANIC OR LATINO', 74: 'HISPANIC OR LATINO', 75: 'NOT HISPANIC OR LATINO', 76: 'NOT HISPANIC OR LATINO', 77: 'NOT HISPANIC OR LATINO', 78: 'NOT HISPANIC OR LATINO', 79: 'NOT HISPANIC OR LATINO', 80: 'NOT HISPANIC OR LATINO', 81: 'NOT HISPANIC OR LATINO', 82: 'HISPANIC OR LATINO', 83: 'HISPANIC OR LATINO', 84: 'HISPANIC OR LATINO', 85: 'NOT HISPANIC OR LATINO', 86: 'HISPANIC OR LATINO', 87: 'HISPANIC OR LATINO', 88: 'HISPANIC OR LATINO', 89: 'NOT HISPANIC OR LATINO', 90: 'NOT HISPANIC OR LATINO', 91: 'NOT HISPANIC OR LATINO', 92: 'NOT HISPANIC OR LATINO', 93: 'NOT HISPANIC OR LATINO', 94: 'NOT HISPANIC OR LATINO', 95: 'HISPANIC OR LATINO', 96: 'HISPANIC OR LATINO', 97: 'HISPANIC OR LATINO', 98: 'NOT HISPANIC OR LATINO', 99: 'NOT HISPANIC OR LATINO', 100: 'NOT HISPANIC OR LATINO', 101: 'NOT HISPANIC OR LATINO', 102: 'NOT HISPANIC OR LATINO', 103: 'NOT HISPANIC OR LATINO', 104: 'NOT HISPANIC OR LATINO', 105: 'NOT HISPANIC OR LATINO', 106: 'NOT HISPANIC OR LATINO', 107: 'NOT HISPANIC OR LATINO', 108: 'NOT HISPANIC OR LATINO', 109: 'HISPANIC OR LATINO', 110: 'HISPANIC OR LATINO', 111: 'NOT HISPANIC OR LATINO', 112: 'NOT HISPANIC OR LATINO', 113: 'NOT HISPANIC OR LATINO', 114: 'NOT HISPANIC OR LATINO', 115: 'HISPANIC OR LATINO', 116: 'HISPANIC OR LATINO', 117: 'NOT HISPANIC OR LATINO', 118: 'HISPANIC OR LATINO', 119: 'HISPANIC OR LATINO', 120: 'NOT HISPANIC OR LATINO', 121: 'HISPANIC OR LATINO', 122: 'HISPANIC OR LATINO', 123: 'HISPANIC OR LATINO', 124: 'HISPANIC OR LATINO', 125: 'HISPANIC OR LATINO', 126: 'NOT HISPANIC OR LATINO', 127: 'NOT HISPANIC OR LATINO', 128: 'NOT HISPANIC OR LATINO', 129: 'NOT HISPANIC OR LATINO', 130: 'NOT HISPANIC OR LATINO', 131: 'NOT HISPANIC OR LATINO', 132: 'NOT HISPANIC OR LATINO', 133: 'NOT HISPANIC OR LATINO', 134: 'NOT HISPANIC OR LATINO', 135: 'NOT HISPANIC OR LATINO', 136: 'NOT HISPANIC OR LATINO', 137: 'NOT HISPANIC OR 
LATINO'}, 'RACE': {0: 'WHITE', 1: 'WHITE', 2: 'WHITE', 3: 'WHITE', 4: 'WHITE', 5: 'WHITE', 6: 'WHITE', 7: 'WHITE', 8: 'WHITE', 9: 'BLACK OR AFRICAN AMERICAN', 10: 'BLACK OR AFRICAN AMERICAN', 11: 'BLACK OR AFRICAN AMERICAN', 12: 'BLACK OR AFRICAN AMERICAN', 13: 'BLACK OR AFRICAN AMERICAN', 14: 'WHITE', 15: 'WHITE', 16: 'WHITE', 17: 'WHITE', 18: 'WHITE', 19: 'WHITE', 20: 'BLACK OR AFRICAN AMERICAN', 21: 'WHITE', 22: 'WHITE', 23: 'WHITE', 24: 'BLACK OR AFRICAN AMERICAN', 25: 'BLACK OR AFRICAN AMERICAN', 26: 'WHITE', 27: 'WHITE', 28: 'WHITE', 29: 'WHITE', 30: 'WHITE', 31: 'WHITE', 32: 'WHITE', 33: 'WHITE', 34: 'WHITE', 35: 'WHITE', 36: 'WHITE', 37: 'WHITE', 38: 'WHITE', 39: 'WHITE', 40: 'BLACK OR AFRICAN AMERICAN', 41: 'BLACK OR AFRICAN AMERICAN', 42: 'WHITE', 43: 'WHITE', 44: 'WHITE', 45: 'WHITE', 46: 'WHITE', 47: 'WHITE', 48: 'WHITE', 49: 'WHITE', 50: 'WHITE', 51: 'BLACK OR AFRICAN AMERICAN', 52: 'BLACK OR AFRICAN AMERICAN', 53: 'WHITE', 54: 'WHITE', 55: 'WHITE', 56: 'WHITE', 57: 'WHITE', 58: 'WHITE', 59: 'WHITE', 60: 'WHITE', 61: 'WHITE', 62: 'WHITE', 63: 'WHITE', 64: 'WHITE', 65: 'BLACK OR AFRICAN AMERICAN', 66: 'BLACK OR AFRICAN AMERICAN', 67: 'BLACK OR AFRICAN AMERICAN', 68: 'BLACK OR AFRICAN AMERICAN', 69: 'WHITE', 70: 'WHITE', 71: 'WHITE', 72: 'WHITE', 73: 'WHITE', 74: 'BLACK OR AFRICAN AMERICAN', 75: 'WHITE', 76: 'WHITE', 77: 'WHITE', 78: 'WHITE', 79: 'WHITE', 80: 'BLACK OR AFRICAN AMERICAN', 81: 'BLACK OR AFRICAN AMERICAN', 82: 'BLACK OR AFRICAN AMERICAN', 83: 'BLACK OR AFRICAN AMERICAN', 84: 'BLACK OR AFRICAN AMERICAN', 85: 'BLACK OR AFRICAN AMERICAN', 86: 'WHITE', 87: 'WHITE', 88: 'WHITE', 89: 'WHITE', 90: 'WHITE', 91: 'WHITE', 92: 'WHITE', 93: 'WHITE', 94: 'WHITE', 95: 'WHITE', 96: 'WHITE', 97: 'WHITE', 98: 'WHITE', 99: 'WHITE', 100: 'WHITE', 101: 'WHITE', 102: 'WHITE', 103: 'WHITE', 104: 'WHITE', 105: 'WHITE', 106: 'WHITE', 107: 'WHITE', 108: 'BLACK OR AFRICAN AMERICAN', 109: 'WHITE', 110: 'WHITE', 111: 'WHITE', 112: 'WHITE', 113: 'WHITE', 114: 'WHITE', 
115: 'BLACK OR AFRICAN AMERICAN', 116: 'BLACK OR AFRICAN AMERICAN', 117: 'WHITE', 118: 'WHITE', 119: 'WHITE', 120: 'WHITE', 121: 'WHITE', 122: 'WHITE', 123: 'WHITE', 124: 'WHITE', 125: 'WHITE', 126: 'WHITE', 127: 'WHITE', 128: 'WHITE', 129: 'WHITE', 130: 'WHITE', 131: 'WHITE', 132: 'WHITE', 133: 'WHITE', 134: 'WHITE', 135: 'WHITE', 136: 'WHITE', 137: 'WHITE'}})

Here is the pivoting code:

df = (data.groupby(['ETHNIC', 'RACE'])
      .size()
      .to_frame('counts')
      .reset_index(level=['ETHNIC', 'RACE'])
      .pivot_table(index='ETHNIC', columns='RACE', aggfunc='sum', margins=True, dropna=False)
      )
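As an aside, the same counts table (minus the extra counts column level) can be produced in a single call with pd.crosstab. A small sketch, using a three-row stand-in for data (the stand-in frame is mine, not from the question; as the update further down notes, crosstab benchmarked slower on large inputs):

```python
import pandas as pd

# tiny stand-in for the full `data` frame above (assumption: same two columns)
data = pd.DataFrame({
    'ETHNIC': ['HISPANIC OR LATINO', 'HISPANIC OR LATINO', 'NOT HISPANIC OR LATINO'],
    'RACE':   ['BLACK OR AFRICAN AMERICAN', 'WHITE', 'WHITE'],
})

# one call replaces the groupby/size/reset_index/pivot_table chain
ct = pd.crosstab(data['ETHNIC'], data['RACE'], margins=True, margins_name='All')
```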

PS: Note that the row order in the expected dataframe matters.

Update:

As suggested in the comments, I tried pd.crosstab and found it to be almost 2x slower than pd.pivot_table at creating the same aggregated df (tested on a dataframe with 200K rows).

The one way I can come up with right now (and the only one so far) is:

df_flat = pd.crosstab(data['ETHNIC'], data['RACE'])
l = []
for n, g in df_flat.stack().groupby(level=0):
    # ETHNIC subtotal row (level 2), then one row per RACE value (level 1);
    # groupby(level=0).sum() replaces Series.sum(level=0), removed in pandas 2.0
    l.append(g.groupby(level=0).sum().rename('count').to_frame()
              .assign(level=2, varname=g.index.names[0]))
    l.append(g.droplevel(level=0).rename('count').to_frame()
              .assign(level=1, varname=g.index.names[1]))
df_out = pd.concat(l).reset_index()
df_out

Output:

                       index  count  level varname
0         HISPANIC OR LATINO     52      2  ETHNIC
1  BLACK OR AFRICAN AMERICAN     11      1    RACE
2                      WHITE     41      1    RACE
3     NOT HISPANIC OR LATINO     86      2  ETHNIC
4  BLACK OR AFRICAN AMERICAN     15      1    RACE
5                      WHITE     71      1    RACE

We can also pull those level names, ETHNIC and RACE, from the MultiIndex.

I managed to get it working with a helper function (using the original data):

def agg_data(g):
    df_race = (
        g.groupby('RACE').size().to_frame('count')
        .rename_axis(index='value').reset_index()
        .assign(level=1, varName='RACE')
        [['level', 'varName', 'value', 'count']]
    )
    
    df_ethnic = (
        pd.DataFrame([[2, 'ETHNIC', g.ETHNIC.iloc[0], len(g)]], columns=df_race.columns)
    )
    
    return pd.concat([df_ethnic, df_race])

data.groupby(['ETHNIC']).apply(agg_data).reset_index(drop=True)


   level varName                      value  count
0      2  ETHNIC         HISPANIC OR LATINO     52
1      1    RACE  BLACK OR AFRICAN AMERICAN     11
2      1    RACE                      WHITE     41
3      2  ETHNIC     NOT HISPANIC OR LATINO     86
4      1    RACE  BLACK OR AFRICAN AMERICAN     15
5      1    RACE                      WHITE     71
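One detail worth noting: groupby sorts group keys by default, which is why HISPANIC OR LATINO comes out first here even though NOT HISPANIC OR LATINO appears first in data. A small illustration (the two-row frame is my own toy input):

```python
import pandas as pd

data = pd.DataFrame({'ETHNIC': ['NOT HISPANIC OR LATINO', 'HISPANIC OR LATINO'],
                     'RACE': ['WHITE', 'WHITE']})

sorted_keys = list(data.groupby('ETHNIC').groups)            # alphabetical (default)
seen_keys = list(data.groupby('ETHNIC', sort=False).groups)  # first-appearance order
```

If a different row order were ever required, passing sort=False would preserve the order in which the keys first appear.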

My idea would be:

import numpy as np

df = (
    df.iloc[:-1][['All', *df.columns.difference(['All'])]]
        .stack()
        .reset_index(name='count')
        .rename(columns={'RACE': 'value', 'ETHNIC': 'varName'})
)

m = df['varName'].ne(df['varName'].shift())
df['value'] = np.where(m, df['varName'], df['value'])
df['varName'] = np.where(m, 'ETHNIC', 'RACE')
df['level'] = m + 1

df = df[['level', 'varName', 'value', 'count']]

df:

   level varName                      value  count
0      2  ETHNIC         HISPANIC OR LATINO     52
1      1    RACE  BLACK OR AFRICAN AMERICAN     11
2      1    RACE                      WHITE     41
3      2  ETHNIC     NOT HISPANIC OR LATINO     86
4      1    RACE  BLACK OR AFRICAN AMERICAN     15
5      1    RACE                      WHITE     71

First, drop the bottom margin row and reorder the columns:

df.iloc[:-1][['All', *df.columns.difference(['All'])]]
RACE                    All  BLACK OR AFRICAN AMERICAN  WHITE
ETHNIC                                                       
HISPANIC OR LATINO       52                         11     41
NOT HISPANIC OR LATINO   86                         15     71

Then stack and rename:

(df.iloc[:-1][['All', *df.columns.difference(['All'])]]
    .stack()
    .reset_index(name='count')
    .rename(columns={'RACE': 'value', 'ETHNIC': 'varName'})
)
                  varName                      value  count
0      HISPANIC OR LATINO                        All     52
1      HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     11
2      HISPANIC OR LATINO                      WHITE     41
3  NOT HISPANIC OR LATINO                        All     86
4  NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     15
5  NOT HISPANIC OR LATINO                      WHITE     71

Then the rest is boolean indexing based on varName:

m = df['varName'].ne(df['varName'].shift())
0     True
1    False
2    False
3     True
4    False
5    False
Name: varName, dtype: bool
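This ne/shift comparison is the usual "first row of each run" marker: each row is compared to the one above it, so True appears exactly where the value changes. A standalone illustration on a toy series of my own:

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'b', 'b', 'c'])
m = s.ne(s.shift())  # True where the value differs from the previous row
```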

Move varName into value where the mask is True:

df['value'] = np.where(m, df['varName'], df['value'])
                  varName                      value  count
0      HISPANIC OR LATINO         HISPANIC OR LATINO     52
1      HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     11
2      HISPANIC OR LATINO                      WHITE     41
3  NOT HISPANIC OR LATINO     NOT HISPANIC OR LATINO     86
4  NOT HISPANIC OR LATINO  BLACK OR AFRICAN AMERICAN     15
5  NOT HISPANIC OR LATINO                      WHITE     71

Assign level and varName based on m:

df['varName'] = np.where(m, 'ETHNIC', 'RACE')
df['level'] = m + 1
  varName                      value  count  level
0  ETHNIC         HISPANIC OR LATINO     52      2
1    RACE  BLACK OR AFRICAN AMERICAN     11      1
2    RACE                      WHITE     41      1
3  ETHNIC     NOT HISPANIC OR LATINO     86      2
4    RACE  BLACK OR AFRICAN AMERICAN     15      1
5    RACE                      WHITE     71      1
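The m + 1 trick works because booleans upcast to integers under arithmetic, so True becomes 2 and False becomes 1; a tiny check:

```python
import pandas as pd

m = pd.Series([True, False, False, True])
level = m + 1  # True -> 2, False -> 1
```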

Finally, reorder the columns:

df = df[['level', 'varName', 'value', 'count']]