How to transform pivot table with multi level column and axis names to rows?
I have the DataFrame shown below, which is the result of pivoting a DataFrame with pd.pivot_table, passing aggfunc='sum':
counts
RACE BLACK OR AFRICAN AMERICAN WHITE All
ETHNIC
HISPANIC OR LATINO 11 41 52
NOT HISPANIC OR LATINO 15 71 86
All 26 112 138
You can run the code below to load the above DataFrame into the variable df:
df = pd.DataFrame.from_dict({('counts', 'BLACK OR AFRICAN AMERICAN'): {'HISPANIC OR LATINO': 11, 'NOT HISPANIC OR LATINO': 15, 'All': 26}, ('counts', 'WHITE'): {'HISPANIC OR LATINO': 41, 'NOT HISPANIC OR LATINO': 71, 'All': 112}, ('counts', 'All'): {'HISPANIC OR LATINO': 52, 'NOT HISPANIC OR LATINO': 86, 'All': 138}}).rename_axis((None, 'RACE'), axis=1).rename_axis((['ETHNIC']), axis=0)
I am trying to transform this DataFrame as described below.
Expected output:
level varName value counts
0 2 ETHNIC HISPANIC OR LATINO 52
1 1 RACE BLACK OR AFRICAN AMERICAN 11
2 1 RACE WHITE 41
3 2 ETHNIC NOT HISPANIC OR LATINO 86
4 1 RACE BLACK OR AFRICAN AMERICAN 15
5 1 RACE WHITE 71
In the above, the varName field holds the axis names of the columns and rows, and level is 1 for the column axis and 2 for the row axis. The rows for the All margin are optional, so it doesn't matter whether they appear in the resulting DataFrame (although on second thought it would be nicer to have them).
I have looked at the following SO threads, but I did not find them closely related to my problem.
One way I was able to do the transformation is by fetching the values manually and building the desired DataFrame by hand, like this:
df.index.names
# output: FrozenList(['ETHNIC'])
df.columns.names
# output: FrozenList([None, 'RACE'])
[y for x, y in df][:-1]
# output: ['BLACK OR AFRICAN AMERICAN', 'WHITE']
[x for x in df.index][:-1]
# output: ['HISPANIC OR LATINO', 'NOT HISPANIC OR LATINO']
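For completeness, here is a sketch of stitching those manually fetched pieces into the expected frame (the variable names `rows`, `col_vals`, and `out` are mine):

```python
import pandas as pd

# rebuild the pivoted df from the question
df = pd.DataFrame.from_dict({
    ('counts', 'BLACK OR AFRICAN AMERICAN'): {'HISPANIC OR LATINO': 11, 'NOT HISPANIC OR LATINO': 15, 'All': 26},
    ('counts', 'WHITE'): {'HISPANIC OR LATINO': 41, 'NOT HISPANIC OR LATINO': 71, 'All': 112},
    ('counts', 'All'): {'HISPANIC OR LATINO': 52, 'NOT HISPANIC OR LATINO': 86, 'All': 138},
}).rename_axis((None, 'RACE'), axis=1).rename_axis(['ETHNIC'], axis=0)

col_vals = [y for _, y in df.columns][:-1]   # ['BLACK OR AFRICAN AMERICAN', 'WHITE']
rows = []
for ethnic in list(df.index)[:-1]:           # skip the 'All' margin row
    # row-axis entry: level 2, count taken from the 'All' margin column
    rows.append([2, df.index.names[0], ethnic, int(df.loc[ethnic, ('counts', 'All')])])
    for race in col_vals:                    # column-axis entries: level 1
        rows.append([1, df.columns.names[1], race, int(df.loc[ethnic, ('counts', race)])])

out = pd.DataFrame(rows, columns=['level', 'varName', 'value', 'counts'])
```

This is obviously not vectorized; it is only meant to show that all the needed pieces (axis names, labels, margin counts) are reachable from the pivoted frame itself.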
Edit:
Here is the data before creating the pivot table:
data = pd.DataFrame.from_dict({'ETHNIC': {0: 'NOT HISPANIC OR LATINO', 1: 'NOT HISPANIC OR LATINO', 2: 'NOT HISPANIC OR LATINO', 3: 'NOT HISPANIC OR LATINO', 4: 'NOT HISPANIC OR LATINO', 5: 'NOT HISPANIC OR LATINO', 6: 'NOT HISPANIC OR LATINO', 7: 'NOT HISPANIC OR LATINO', 8: 'NOT HISPANIC OR LATINO', 9: 'HISPANIC OR LATINO', 10: 'HISPANIC OR LATINO', 11: 'HISPANIC OR LATINO', 12: 'HISPANIC OR LATINO', 13: 'HISPANIC OR LATINO', 14: 'NOT HISPANIC OR LATINO', 15: 'NOT HISPANIC OR LATINO', 16: 'NOT HISPANIC OR LATINO', 17: 'NOT HISPANIC OR LATINO', 18: 'NOT HISPANIC OR LATINO', 19: 'NOT HISPANIC OR LATINO', 20: 'NOT HISPANIC OR LATINO', 21: 'HISPANIC OR LATINO', 22: 'HISPANIC OR LATINO', 23: 'NOT HISPANIC OR LATINO', 24: 'NOT HISPANIC OR LATINO', 25: 'NOT HISPANIC OR LATINO', 26: 'HISPANIC OR LATINO', 27: 'HISPANIC OR LATINO', 28: 'HISPANIC OR LATINO', 29: 'HISPANIC OR LATINO', 30: 'HISPANIC OR LATINO', 31: 'HISPANIC OR LATINO', 32: 'NOT HISPANIC OR LATINO', 33: 'HISPANIC OR LATINO', 34: 'NOT HISPANIC OR LATINO', 35: 'NOT HISPANIC OR LATINO', 36: 'NOT HISPANIC OR LATINO', 37: 'NOT HISPANIC OR LATINO', 38: 'NOT HISPANIC OR LATINO', 39: 'NOT HISPANIC OR LATINO', 40: 'NOT HISPANIC OR LATINO', 41: 'NOT HISPANIC OR LATINO', 42: 'HISPANIC OR LATINO', 43: 'NOT HISPANIC OR LATINO', 44: 'NOT HISPANIC OR LATINO', 45: 'NOT HISPANIC OR LATINO', 46: 'HISPANIC OR LATINO', 47: 'HISPANIC OR LATINO', 48: 'HISPANIC OR LATINO', 49: 'HISPANIC OR LATINO', 50: 'NOT HISPANIC OR LATINO', 51: 'NOT HISPANIC OR LATINO', 52: 'NOT HISPANIC OR LATINO', 53: 'HISPANIC OR LATINO', 54: 'HISPANIC OR LATINO', 55: 'HISPANIC OR LATINO', 56: 'NOT HISPANIC OR LATINO', 57: 'HISPANIC OR LATINO', 58: 'HISPANIC OR LATINO', 59: 'NOT HISPANIC OR LATINO', 60: 'NOT HISPANIC OR LATINO', 61: 'HISPANIC OR LATINO', 62: 'HISPANIC OR LATINO', 63: 'HISPANIC OR LATINO', 64: 'HISPANIC OR LATINO', 65: 'NOT HISPANIC OR LATINO', 66: 'NOT HISPANIC OR LATINO', 67: 'NOT HISPANIC OR LATINO', 68: 'NOT HISPANIC OR LATINO', 69: 
'HISPANIC OR LATINO', 70: 'NOT HISPANIC OR LATINO', 71: 'NOT HISPANIC OR LATINO', 72: 'HISPANIC OR LATINO', 73: 'HISPANIC OR LATINO', 74: 'HISPANIC OR LATINO', 75: 'NOT HISPANIC OR LATINO', 76: 'NOT HISPANIC OR LATINO', 77: 'NOT HISPANIC OR LATINO', 78: 'NOT HISPANIC OR LATINO', 79: 'NOT HISPANIC OR LATINO', 80: 'NOT HISPANIC OR LATINO', 81: 'NOT HISPANIC OR LATINO', 82: 'HISPANIC OR LATINO', 83: 'HISPANIC OR LATINO', 84: 'HISPANIC OR LATINO', 85: 'NOT HISPANIC OR LATINO', 86: 'HISPANIC OR LATINO', 87: 'HISPANIC OR LATINO', 88: 'HISPANIC OR LATINO', 89: 'NOT HISPANIC OR LATINO', 90: 'NOT HISPANIC OR LATINO', 91: 'NOT HISPANIC OR LATINO', 92: 'NOT HISPANIC OR LATINO', 93: 'NOT HISPANIC OR LATINO', 94: 'NOT HISPANIC OR LATINO', 95: 'HISPANIC OR LATINO', 96: 'HISPANIC OR LATINO', 97: 'HISPANIC OR LATINO', 98: 'NOT HISPANIC OR LATINO', 99: 'NOT HISPANIC OR LATINO', 100: 'NOT HISPANIC OR LATINO', 101: 'NOT HISPANIC OR LATINO', 102: 'NOT HISPANIC OR LATINO', 103: 'NOT HISPANIC OR LATINO', 104: 'NOT HISPANIC OR LATINO', 105: 'NOT HISPANIC OR LATINO', 106: 'NOT HISPANIC OR LATINO', 107: 'NOT HISPANIC OR LATINO', 108: 'NOT HISPANIC OR LATINO', 109: 'HISPANIC OR LATINO', 110: 'HISPANIC OR LATINO', 111: 'NOT HISPANIC OR LATINO', 112: 'NOT HISPANIC OR LATINO', 113: 'NOT HISPANIC OR LATINO', 114: 'NOT HISPANIC OR LATINO', 115: 'HISPANIC OR LATINO', 116: 'HISPANIC OR LATINO', 117: 'NOT HISPANIC OR LATINO', 118: 'HISPANIC OR LATINO', 119: 'HISPANIC OR LATINO', 120: 'NOT HISPANIC OR LATINO', 121: 'HISPANIC OR LATINO', 122: 'HISPANIC OR LATINO', 123: 'HISPANIC OR LATINO', 124: 'HISPANIC OR LATINO', 125: 'HISPANIC OR LATINO', 126: 'NOT HISPANIC OR LATINO', 127: 'NOT HISPANIC OR LATINO', 128: 'NOT HISPANIC OR LATINO', 129: 'NOT HISPANIC OR LATINO', 130: 'NOT HISPANIC OR LATINO', 131: 'NOT HISPANIC OR LATINO', 132: 'NOT HISPANIC OR LATINO', 133: 'NOT HISPANIC OR LATINO', 134: 'NOT HISPANIC OR LATINO', 135: 'NOT HISPANIC OR LATINO', 136: 'NOT HISPANIC OR LATINO', 137: 'NOT HISPANIC OR 
LATINO'}, 'RACE': {0: 'WHITE', 1: 'WHITE', 2: 'WHITE', 3: 'WHITE', 4: 'WHITE', 5: 'WHITE', 6: 'WHITE', 7: 'WHITE', 8: 'WHITE', 9: 'BLACK OR AFRICAN AMERICAN', 10: 'BLACK OR AFRICAN AMERICAN', 11: 'BLACK OR AFRICAN AMERICAN', 12: 'BLACK OR AFRICAN AMERICAN', 13: 'BLACK OR AFRICAN AMERICAN', 14: 'WHITE', 15: 'WHITE', 16: 'WHITE', 17: 'WHITE', 18: 'WHITE', 19: 'WHITE', 20: 'BLACK OR AFRICAN AMERICAN', 21: 'WHITE', 22: 'WHITE', 23: 'WHITE', 24: 'BLACK OR AFRICAN AMERICAN', 25: 'BLACK OR AFRICAN AMERICAN', 26: 'WHITE', 27: 'WHITE', 28: 'WHITE', 29: 'WHITE', 30: 'WHITE', 31: 'WHITE', 32: 'WHITE', 33: 'WHITE', 34: 'WHITE', 35: 'WHITE', 36: 'WHITE', 37: 'WHITE', 38: 'WHITE', 39: 'WHITE', 40: 'BLACK OR AFRICAN AMERICAN', 41: 'BLACK OR AFRICAN AMERICAN', 42: 'WHITE', 43: 'WHITE', 44: 'WHITE', 45: 'WHITE', 46: 'WHITE', 47: 'WHITE', 48: 'WHITE', 49: 'WHITE', 50: 'WHITE', 51: 'BLACK OR AFRICAN AMERICAN', 52: 'BLACK OR AFRICAN AMERICAN', 53: 'WHITE', 54: 'WHITE', 55: 'WHITE', 56: 'WHITE', 57: 'WHITE', 58: 'WHITE', 59: 'WHITE', 60: 'WHITE', 61: 'WHITE', 62: 'WHITE', 63: 'WHITE', 64: 'WHITE', 65: 'BLACK OR AFRICAN AMERICAN', 66: 'BLACK OR AFRICAN AMERICAN', 67: 'BLACK OR AFRICAN AMERICAN', 68: 'BLACK OR AFRICAN AMERICAN', 69: 'WHITE', 70: 'WHITE', 71: 'WHITE', 72: 'WHITE', 73: 'WHITE', 74: 'BLACK OR AFRICAN AMERICAN', 75: 'WHITE', 76: 'WHITE', 77: 'WHITE', 78: 'WHITE', 79: 'WHITE', 80: 'BLACK OR AFRICAN AMERICAN', 81: 'BLACK OR AFRICAN AMERICAN', 82: 'BLACK OR AFRICAN AMERICAN', 83: 'BLACK OR AFRICAN AMERICAN', 84: 'BLACK OR AFRICAN AMERICAN', 85: 'BLACK OR AFRICAN AMERICAN', 86: 'WHITE', 87: 'WHITE', 88: 'WHITE', 89: 'WHITE', 90: 'WHITE', 91: 'WHITE', 92: 'WHITE', 93: 'WHITE', 94: 'WHITE', 95: 'WHITE', 96: 'WHITE', 97: 'WHITE', 98: 'WHITE', 99: 'WHITE', 100: 'WHITE', 101: 'WHITE', 102: 'WHITE', 103: 'WHITE', 104: 'WHITE', 105: 'WHITE', 106: 'WHITE', 107: 'WHITE', 108: 'BLACK OR AFRICAN AMERICAN', 109: 'WHITE', 110: 'WHITE', 111: 'WHITE', 112: 'WHITE', 113: 'WHITE', 114: 'WHITE', 
115: 'BLACK OR AFRICAN AMERICAN', 116: 'BLACK OR AFRICAN AMERICAN', 117: 'WHITE', 118: 'WHITE', 119: 'WHITE', 120: 'WHITE', 121: 'WHITE', 122: 'WHITE', 123: 'WHITE', 124: 'WHITE', 125: 'WHITE', 126: 'WHITE', 127: 'WHITE', 128: 'WHITE', 129: 'WHITE', 130: 'WHITE', 131: 'WHITE', 132: 'WHITE', 133: 'WHITE', 134: 'WHITE', 135: 'WHITE', 136: 'WHITE', 137: 'WHITE'}})
Here is the pivoting code:
df = (data.groupby(['ETHNIC', 'RACE'])
          .size()
          .to_frame('counts')
          .reset_index(level=['ETHNIC', 'RACE'])
          .pivot_table(index='ETHNIC', columns='RACE', aggfunc='sum',
                       margins=True, dropna=False)
      )
PS: Please note that the row order in the expected DataFrame matters.
Update:
As suggested in the comments, I tried pd.crosstab and found it almost 2x slower than pd.pivot_table at creating the same aggregated df (tested on a DataFrame with 200K rows).
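For what it's worth, the two routes do yield the same counts table once the 'counts' top level is peeled off. A quick equivalence check on a toy sample (the sample and the `via_*` names are my own):

```python
import pandas as pd

# toy sample standing in for the real 200K-row data
data = pd.DataFrame({
    'ETHNIC': ['HISPANIC OR LATINO'] * 2 + ['NOT HISPANIC OR LATINO'] * 3,
    'RACE': ['WHITE', 'BLACK OR AFRICAN AMERICAN',
             'WHITE', 'WHITE', 'BLACK OR AFRICAN AMERICAN'],
})

via_pivot = (data.groupby(['ETHNIC', 'RACE'])
                 .size()
                 .to_frame('counts')
                 .reset_index()
                 .pivot_table(index='ETHNIC', columns='RACE', aggfunc='sum',
                              margins=True, dropna=False))

via_crosstab = pd.crosstab(data['ETHNIC'], data['RACE'], margins=True)

# element-wise equal once the 'counts' top level is dropped
same = bool((via_pivot['counts'] == via_crosstab).all().all())
```

So the choice between the two is purely about speed, not about the shape of the result.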
The one way (and the only way) I can come up with right now is:
df_flat = pd.crosstab(data['ETHNIC'], data['RACE'])
l = []
for n, g in df_flat.stack().groupby(level=0):
    # note: g.sum(level=0) is deprecated/removed in recent pandas,
    # so use groupby(level=0).sum() instead
    l.append(g.groupby(level=0).sum().rename('count').to_frame()
              .assign(level=2, varname=g.index.names[0]))
    l.append(g.droplevel(level=0).rename('count').to_frame()
              .assign(level=1, varname=g.index.names[1]))
df_out = pd.concat(l).reset_index()
df_out
Output:
index count level varname
0 HISPANIC OR LATINO 52 2 ETHNIC
1 BLACK OR AFRICAN AMERICAN 11 1 RACE
2 WHITE 41 1 RACE
3 NOT HISPANIC OR LATINO 86 2 ETHNIC
4 BLACK OR AFRICAN AMERICAN 15 1 RACE
5 WHITE 71 1 RACE
We can also get those level names, ETHNIC and RACE, from the MultiIndex.
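Those names travel with the stacked index, so they never need to be hardcoded. A tiny demonstration on a made-up sample:

```python
import pandas as pd

# made-up sample with the same column names as the question's data
data = pd.DataFrame({
    'ETHNIC': ['HISPANIC OR LATINO', 'NOT HISPANIC OR LATINO', 'NOT HISPANIC OR LATINO'],
    'RACE': ['WHITE', 'WHITE', 'BLACK OR AFRICAN AMERICAN'],
})

# crosstab carries the axis names through, and stack() combines
# them into the names of the resulting MultiIndex
stacked = pd.crosstab(data['ETHNIC'], data['RACE']).stack()
names = list(stacked.index.names)  # ['ETHNIC', 'RACE']
```

This is why the loop above can read `g.index.names[0]` and `g.index.names[1]` instead of spelling out the strings.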
I managed to get this working with a helper function (using the original data):
def agg_data(g):
    df_race = (
        g.groupby('RACE').size().to_frame('count')
        .rename_axis(index='value').reset_index()
        .assign(level=1, varName='RACE')
        [['level', 'varName', 'value', 'count']]
    )
    df_ethnic = pd.DataFrame(
        [[2, 'ETHNIC', g.ETHNIC.iloc[0], len(g)]], columns=df_race.columns
    )
    return pd.concat([df_ethnic, df_race])

data.groupby(['ETHNIC']).apply(agg_data).reset_index(drop=True)
level varName value count
0 2 ETHNIC HISPANIC OR LATINO 52
1 1 RACE BLACK OR AFRICAN AMERICAN 11
2 1 RACE WHITE 41
3 2 ETHNIC NOT HISPANIC OR LATINO 86
4 1 RACE BLACK OR AFRICAN AMERICAN 15
5 1 RACE WHITE 71
My idea is:
df = (
    df.iloc[:-1][['All', *df.columns.difference(['All'])]]
    .stack()
    .reset_index(name='count')
    .rename(columns={'RACE': 'value', 'ETHNIC': 'varName'})
)
m = df['varName'].ne(df['varName'].shift())
df['value'] = np.where(m, df['varName'], df['value'])
df['varName'] = np.where(m, 'ETHNIC', 'RACE')
df['level'] = m + 1
df = df[['level', 'varName', 'value', 'count']]
df:
level varName value count
0 2 ETHNIC HISPANIC OR LATINO 52
1 1 RACE BLACK OR AFRICAN AMERICAN 11
2 1 RACE WHITE 41
3 2 ETHNIC NOT HISPANIC OR LATINO 86
4 1 RACE BLACK OR AFRICAN AMERICAN 15
5 1 RACE WHITE 71
First, drop the bottom margin row and reorder the columns:
df.iloc[:-1][['All', *df.columns.difference(['All'])]]
RACE All BLACK OR AFRICAN AMERICAN WHITE
ETHNIC
HISPANIC OR LATINO 52 11 41
NOT HISPANIC OR LATINO 86 15 71
Then stack and rename:
(df.iloc[:-1][['All', *df.columns.difference(['All'])]]
 .stack()
 .reset_index(name='count')
 .rename(columns={'RACE': 'value', 'ETHNIC': 'varName'}))
varName value count
0 HISPANIC OR LATINO All 52
1 HISPANIC OR LATINO BLACK OR AFRICAN AMERICAN 11
2 HISPANIC OR LATINO WHITE 41
3 NOT HISPANIC OR LATINO All 86
4 NOT HISPANIC OR LATINO BLACK OR AFRICAN AMERICAN 15
5 NOT HISPANIC OR LATINO WHITE 71
The rest is boolean indexing based on varName:
m = df['varName'].ne(df['varName'].shift())
0 True
1 False
2 False
3 True
4 False
5 False
Name: varName, dtype: bool
Move varName into value:
df['value'] = np.where(m, df['varName'], df['value'])
varName value count
0 HISPANIC OR LATINO HISPANIC OR LATINO 52
1 HISPANIC OR LATINO BLACK OR AFRICAN AMERICAN 11
2 HISPANIC OR LATINO WHITE 41
3 NOT HISPANIC OR LATINO NOT HISPANIC OR LATINO 86
4 NOT HISPANIC OR LATINO BLACK OR AFRICAN AMERICAN 15
5 NOT HISPANIC OR LATINO WHITE 71
Assign level and varName based on m:
df['varName'] = np.where(m, 'ETHNIC', 'RACE')
df['level'] = m + 1
varName value count level
0 ETHNIC HISPANIC OR LATINO 52 2
1 RACE BLACK OR AFRICAN AMERICAN 11 1
2 RACE WHITE 41 1
3 ETHNIC NOT HISPANIC OR LATINO 86 2
4 RACE BLACK OR AFRICAN AMERICAN 15 1
5 RACE WHITE 71 1
Finally, reorder the columns:
df = df[['level', 'varName', 'value', 'count']]
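Putting all the steps together, a self-contained version might look like this. I rebuild the question's df and first peel off the 'counts' column level, since the walkthrough above assumes flat columns; `df2` and `out` are names of my choosing:

```python
import numpy as np
import pandas as pd

# rebuild the pivoted DataFrame from the question
df = pd.DataFrame.from_dict({
    ('counts', 'BLACK OR AFRICAN AMERICAN'): {'HISPANIC OR LATINO': 11, 'NOT HISPANIC OR LATINO': 15, 'All': 26},
    ('counts', 'WHITE'): {'HISPANIC OR LATINO': 41, 'NOT HISPANIC OR LATINO': 71, 'All': 112},
    ('counts', 'All'): {'HISPANIC OR LATINO': 52, 'NOT HISPANIC OR LATINO': 86, 'All': 138},
}).rename_axis((None, 'RACE'), axis=1).rename_axis(['ETHNIC'], axis=0)

# drop the top 'counts' level so the columns are a flat Index named 'RACE'
df2 = df['counts']

# margin row off, 'All' column first, then stack and relabel
out = (
    df2.iloc[:-1][['All', *df2.columns.difference(['All'])]]
    .stack()
    .reset_index(name='count')
    .rename(columns={'RACE': 'value', 'ETHNIC': 'varName'})
)

# True on the first row of each ETHNIC block (the former 'All' column)
m = out['varName'].ne(out['varName'].shift())
out['value'] = np.where(m, out['varName'], out['value'])
out['varName'] = np.where(m, 'ETHNIC', 'RACE')
out['level'] = m + 1
out = out[['level', 'varName', 'value', 'count']]
```

The only trick is that putting 'All' first makes the row-axis total the first row of each block, so a single shift-based mask distinguishes the level-2 rows from the level-1 rows.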