How to apply a function to multiple MultiIndex columns in Pandas?
Given a DataFrame with MultiIndex columns:
a ...
E1 ... E3
g1 g2 g3 ... g1 g2 g3
0 0.548814 0.715189 0.602763 ... 0.437587 0.891773 0.963663
1 0.383442 0.791725 0.528895 ... 0.087129 0.020218 0.832620
2 0.778157 0.870012 0.978618 ... 0.118274 0.639921 0.143353
3 0.944669 0.521848 0.414662 ... 0.568434 0.018790 0.617635
4 0.612096 0.616934 0.943748 ... 0.697631 0.060225 0.666767
5 0.670638 0.210383 0.128926 ... 0.438602 0.988374 0.102045
6 0.208877 0.161310 0.653108 ... 0.158970 0.110375 0.656330
7 0.138183 0.196582 0.368725 ... 0.096098 0.976459 0.468651
8 0.976761 0.604846 0.739264 ... 0.296140 0.118728 0.317983
9 0.414263 0.064147 0.692472 ... 0.093941 0.575946 0.929296
[10 rows x 9 columns]
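I want to apply functions (for example ration_type1 and ration_type2; in practice there can be more) to groups of columns selected by the second level (i.e. E1, E2, E3).
For example, suppose we want to compute ration_type1 and ration_type2 for E1 in the second level. Then only the following slice of df is processed: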
a
E1
g1 g2 g3
0 0.548814 0.715189 0.602763
1 0.383442 0.791725 0.528895
.................
8 0.976761 0.604846 0.739264
9 0.414263 0.064147 0.692472
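To generalize over all second-level values, I rely on the list comprehensions below, one for each of ration_type1 and ration_type2:
all_df1 = [ration_type1(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in ['E1', 'E2', 'E3']]
all_df2 = [ration_type2(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in ['E1', 'E2', 'E3']]
The results are then concatenated back onto the original df.
However, I wonder whether there is a more elegant and compact approach than the list comprehension, because in real use there can be many more ratio functions.
The full code is below: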
import numpy as np
import pandas as pd

np.random.seed(0)
arr = np.random.rand(10, 9)
tuples = [('a', 'E1', 'g1'), ('a', 'E1', 'g2'), ('a', 'E1', 'g3'), ('a', 'E2', 'g1'), ('a', 'E2', 'g2'),
          ('a', 'E2', 'g3'), ('a', 'E3', 'g1'), ('a', 'E3', 'g2'), ('a', 'E3', 'g3')]
df = pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples(tuples))
print(df)

def ration_type1(df):
    """
    (g3+g2)/g1
    # Ugly way since have to convert to numpy 1st
    """
    print(df)
    dration = 'ration_type1'
    l1, l2, _ = df.columns.tolist()[0]
    total = df.loc[:, (slice(None), slice(None), 'g2')].to_numpy() + \
            df.loc[:, (slice(None), slice(None), 'g3')].to_numpy()
    arr = total / df.loc[:, (slice(None), slice(None), 'g1')].to_numpy()
    return pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples([(l1, l2, dration)]))

def ration_type2(df):
    """
    (g3+g2+g1)/g1
    # Ugly way since have to convert to numpy 1st
    """
    dration = 'ration_type2'
    l1, l2, _ = df.columns.tolist()[0]
    total = df.loc[:, (slice(None), slice(None), 'g1')].to_numpy() + \
            df.loc[:, (slice(None), slice(None), 'g2')].to_numpy() + \
            df.loc[:, (slice(None), slice(None), 'g3')].to_numpy()
    arr = total / df.loc[:, (slice(None), slice(None), 'g1')].to_numpy()
    return pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples([(l1, l2, dration)]))

# unique() keeps the E1, E2, E3 order; list(set(...)) does not guarantee any order
level1_name = df.columns.get_level_values(1).unique()
all_df1 = [ration_type1(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in level1_name]
all_df2 = [ration_type2(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in level1_name]
df1 = pd.concat(all_df1, axis=1)
df2 = pd.concat(all_df2, axis=1)
df = pd.concat([df, df1, df2], axis=1)
Expected output:
a ...
E1 ... E2 E3
g1 g2 g3 ... ration_type2 ration_type2 ration_type2
0 0.548814 0.715189 0.602763 ... 3.401458 2.962896 5.240151
1 0.383442 0.791725 0.528895 ... 4.444124 2.754497 10.788191
2 0.778157 0.870012 0.978618 ... 3.375653 2.554145 7.622516
3 0.944669 0.521848 0.414662 ... 1.991363 5.650758 2.119612
4 0.612096 0.616934 0.943748 ... 3.549735 2.168255 2.042087
5 0.670638 0.210383 0.128926 ... 1.505949 3.960760 3.486126
6 0.208877 0.161310 0.653108 ... 4.899035 3.806001 5.822965
7 0.138183 0.196582 0.368725 ... 5.091008 2.138921 16.037821
8 0.976761 0.604846 0.739264 ... 2.376088 11.283905 2.474676
9 0.414263 0.064147 0.692472 ... 2.826423 2.391873 17.023361
[10 rows x 15 columns]
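I am considering using apply, as in the following example:
# function that doubles its input
def multiply_by_2(number):
    return 2 * number

# executing the function
df[["Integers", "Float"]] = df[["Integers", "Float"]].apply(multiply_by_2)
However, because my example involves MultiIndex columns, I am struggling to adapt this (my knowledge here is limited).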
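This is not so easy with a MultiIndex. One solution is to rename the g values in the last level to ration_type1 or ration_type2, so that the filtered MultiIndex DataFrames have matching column labels and can be divided: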
idx = pd.IndexSlice

# ration_type1: (g2 + g3) / g1
c = {'g1': 'ration_type1', 'g2': 'ration_type1', 'g3': 'ration_type1'}
# groupby(..., axis=1) is deprecated in recent pandas versions, so group the transposed frame instead
df1 = df.loc[:, idx[:, :, ['g3', 'g2']]].rename(columns=c).T.groupby(level=[0, 1, 2]).sum().T
df11 = df1.div(df.xs('g1', level=2, axis=1, drop_level=False).rename(columns=c))

# ration_type2: (g1 + g2 + g3) / g1
c1 = {'g1': 'ration_type2', 'g2': 'ration_type2', 'g3': 'ration_type2'}
df2 = df.rename(columns=c1).T.groupby(level=[0, 1, 2]).sum().T
df22 = df2.div(df.xs('g1', level=2, axis=1, drop_level=False).rename(columns=c1))

df = pd.concat([df, df11, df22], axis=1)
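As a possible variation on the same idea, the rename step can be avoided by taking one cross-section per g value with xs: the three resulting frames share the same (a, E) columns, so ordinary arithmetic aligns them without converting to NumPy. This is only a sketch under the assumptions of the question's sample data; the names ratios and new_cols are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
tuples = [("a", e, g) for e in ("E1", "E2", "E3") for g in ("g1", "g2", "g3")]
df = pd.DataFrame(np.random.rand(10, 9), columns=pd.MultiIndex.from_tuples(tuples))

# Each cross-section drops the last level, leaving columns (a, E1), (a, E2), (a, E3),
# so the three frames align column by column.
g1 = df.xs("g1", level=2, axis=1)
g2 = df.xs("g2", level=2, axis=1)
g3 = df.xs("g3", level=2, axis=1)

# Illustrative mapping of ratio name -> computed frame (add more entries as needed).
ratios = {
    "ration_type1": (g2 + g3) / g1,
    "ration_type2": (g1 + g2 + g3) / g1,
}

# Concatenating a dict puts its keys on the outermost column level: (name, a, E).
new_cols = pd.concat(ratios, axis=1)
# Move the ratio name to the innermost level so it lines up with (a, E, g).
new_cols = new_cols.reorder_levels([1, 2, 0], axis=1)

result = pd.concat([df, new_cols], axis=1)
print(result.shape)
```

Adding another ratio then only requires one more entry in the dictionary, rather than another rename and groupby pair.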
The simplest approach is to reshape first:
df1 = df.stack([0, 1])
df1['ration_type1'] = df1[['g2', 'g3']].sum(axis=1).div(df1['g1'])
# sum only g1, g2, g3; a plain df1.sum(axis=1) would also add the new ration_type1 column
df1['ration_type2'] = df1[['g1', 'g2', 'g3']].sum(axis=1).div(df1['g1'])
print(df1)
              g1        g2        g3  ration_type1  ration_type2
0 a E1  0.548814  0.715189  0.602763      2.401458      3.401458
    E2  0.544883  0.423655  0.645894      1.962896      2.962896
    E3  0.437587  0.891773  0.963663      4.240151      5.240151
1 a E1  0.383442  0.791725  0.528895      3.444124      4.444124
    E2  0.568045  0.925597  0.071036      1.754497      2.754497
    E3  0.087129  0.020218  0.832620      9.788191     10.788191
2 a E1  0.778157  0.870012  0.978618      2.375653      3.375653
    E2  0.799159  0.461479  0.780529      1.554145      2.554145
    E3  0.118274  0.639921  0.143353      6.622516      7.622516
3 a E1  0.944669  0.521848  0.414662      0.991363      1.991363
    E2  0.264556  0.774234  0.456150      4.650758      5.650758
    E3  0.568434  0.018790  0.617635      1.119612      2.119612
4 a E1  0.612096  0.616934  0.943748      2.549735      3.549735
    E2  0.681820  0.359508  0.437032      1.168255      2.168255
    E3  0.697631  0.060225  0.666767      1.042087      2.042087
5 a E1  0.670638  0.210383  0.128926      0.505949      1.505949
    E2  0.315428  0.363711  0.570197      2.960760      3.960760
    E3  0.438602  0.988374  0.102045      2.486126      3.486126
6 a E1  0.208877  0.161310  0.653108      3.899035      4.899035
    E2  0.253292  0.466311  0.244426      2.806001      3.806001
    E3  0.158970  0.110375  0.656330      4.822965      5.822965
7 a E1  0.138183  0.196582  0.368725      4.091008      5.091008
    E2  0.820993  0.097101  0.837945      1.138921      2.138921
    E3  0.096098  0.976459  0.468651     15.037821     16.037821
8 a E1  0.976761  0.604846  0.739264      1.376088      2.376088
    E2  0.039188  0.282807  0.120197     10.283905     11.283905
    E3  0.296140  0.118728  0.317983      1.474676      2.474676
9 a E1  0.414263  0.064147  0.692472      1.826423      2.826423
    E2  0.566601  0.265389  0.523248      1.391873      2.391873
    E3  0.093941  0.575946  0.929296     16.023361     17.023361
Finally, reshape back to the original MultiIndex layout:
df = df1.unstack([1,2]).reorder_levels([1,2,0], axis=1)
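Since the question mentions that there may be many more ratio functions, the reshape approach could be extended with a small registry of functions, each taking the long-format frame with g1, g2, g3 columns. The following is a sketch only: ratio_funcs, long_df and wide are hypothetical names introduced here, and the data setup repeats the question's sample so the block runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
tuples = [("a", e, g) for e in ("E1", "E2", "E3") for g in ("g1", "g2", "g3")]
df = pd.DataFrame(np.random.rand(10, 9), columns=pd.MultiIndex.from_tuples(tuples))

# Hypothetical registry: ratio name -> function of the long-format frame.
ratio_funcs = {
    "ration_type1": lambda t: (t["g2"] + t["g3"]) / t["g1"],
    "ration_type2": lambda t: (t["g1"] + t["g2"] + t["g3"]) / t["g1"],
}

# Long format: index is (row, a, E), columns are g1, g2, g3.
long_df = df.stack([0, 1])

# Every function is evaluated on the frame that holds only g1, g2, g3,
# so an earlier ratio column can never leak into a later computation.
long_df = long_df.assign(**{name: func(long_df) for name, func in ratio_funcs.items()})

# Back to wide format, with the column levels ordered as (a, E, g or ratio name).
wide = long_df.unstack([1, 2]).reorder_levels([1, 2, 0], axis=1)
print(wide.shape)
```

A new ratio is then a single extra entry in ratio_funcs; the stack and unstack steps stay unchanged.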