How to group by and average under multiple conditions with Python Pandas
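The objective is to compute the average of subsets of columns in a multi-index DataFrame under multiple conditions.
The first condition is that the average is taken per group, where the groups come from the first level of the MultiIndex.
The second condition is that the channels to average together are given by the values of dict_ref below: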
dict_ref = dict ( occ=['F2', 'F4'], gr=['Fp1', 'Fpx'] )
For example, for the key occ, take the average of F2 and F4.
The code below does the job:
import re
import numpy as np
import numpy.random
import pandas as pd
numpy.random.seed(0)
dict_ref = dict ( occ=['F2', 'F4'], gr=['Fp1', 'Fpx'] )
_names=['pow_fr','pow_fr','pow_fr','pow_fr','pow_fr','pow_fr','pow_fr','pow_fr',
'hjor_com','hjor_com','hjor_com','hjor_com']
_idx=['Fp1_band0','Fp1_band1','Fpx_band0','Fpx_band1','F2_band0','F2_band1','F4_band0','F4_band1',
'Fp1','Fpx','F2','F4']
X=np.random.rand(4,len(_names))
columns = pd.MultiIndex.from_arrays([_names, _idx])
df=pd.DataFrame(data=X, columns=columns)
# Split the second column level on "_" so that "Fp1_band0" becomes ("Fp1", "band0")
remove_nan = [(e[0], *re.split('_', e[1])) for e in df.columns]
# Pad the shorter tuples with '' so that every tuple has the same length
remove_nan = [t + ('',) * (len(max(remove_nan, key=len)) - len(t)) for t in remove_nan]
df.columns = pd.MultiIndex.from_tuples(remove_nan)
df = df.T.reset_index().rename(columns={"level_0": "group_feature",
                                        "level_1": "ch", "level_2": "feature", "level_3": "region"})
all_df = []
for nref in dict_ref:
    # numeric_only=True keeps the string column ch out of the mean on recent pandas versions
    df_ch = df[df.ch.isin(dict_ref[nref])].groupby(["group_feature", "feature"]).mean(numeric_only=True).reset_index()
    df_ch['ch'] = nref
    all_df.append(df_ch)
df1 = pd.concat([df, *all_df]).pivot_table(index=['group_feature', 'ch', 'feature']).transpose()
df1.columns = [(gf[0], f'{gf[1]}' if not gf[-1] else f'{gf[1]}_{gf[-1]}') for gf in df1.columns.values.tolist()]
However, I wonder whether there is a way to avoid the for-loop:
all_df = []
for nref in dict_ref:
    df_ch = df[df.ch.isin(dict_ref[nref])].groupby(["group_feature", "feature"]).mean(numeric_only=True).reset_index()
    df_ch['ch'] = nref
    all_df.append(df_ch)
This is less important than the question above, but it would be a bonus if the following lines could be avoided entirely:
remove_nan = [(e[0], *re.split('_', e[1])) for e in df.columns]
remove_nan = [t + ('',) * (len(max(remove_nan, key=len)) - len(t)) for t in remove_nan]
df.columns = pd.MultiIndex.from_tuples(remove_nan)
df = df.T.reset_index().rename(columns={"level_0": "group_feature",
                                        "level_1": "ch", "level_2": "feature", "level_3": "region"})
Expected output:
(hjor_com, F2) (hjor_com, F4) ... (pow_fr, occ_band0) (pow_fr, occ_band1)
0 0.791725 0.528895 ... 0.430621 0.768834
1 0.461479 0.780529 ... 0.399188 0.851316
2 0.018790 0.617635 ... 0.393202 0.594448
3 0.210383 0.128926 ... 0.528570 0.248629
[4 rows x 18 columns]
You can invert dict_ref so that each item in the value lists becomes a key, apply the replacement, and group by the new ch:
mapping = {
v: key for key, value in dict_ref.items() for v in value
}
all_df = df.replace({"ch": mapping}).groupby(["group_feature", "feature", "ch"]).mean().reset_index()
df1 = pd.concat([df, all_df])...
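As a rough check of the idea above, here is a minimal, self-contained sketch that compares the for-loop with the inverted-dictionary version. It assumes df is already in the long format from the question (columns group_feature, ch, feature plus one numeric column per sample); the frame is rebuilt directly from random values here, so the numbers will not match the expected output shown in the question. The names channels, keys, value_cols, loop_df and vec_df are illustrative only.

```python
import numpy as np
import pandas as pd

dict_ref = dict(occ=["F2", "F4"], gr=["Fp1", "Fpx"])
channels = ["Fp1", "Fpx", "F2", "F4"]

# Assumed long-format frame: one row per (group_feature, ch, feature).
keys = [("pow_fr", ch, band) for ch in channels for band in ("band0", "band1")]
keys += [("hjor_com", ch, "") for ch in channels]
value_cols = [0, 1, 2, 3]
np.random.seed(0)
df = pd.concat(
    [
        pd.DataFrame(keys, columns=["group_feature", "ch", "feature"]),
        pd.DataFrame(np.random.rand(len(keys), 4), columns=value_cols),
    ],
    axis=1,
)

# Loop version, as in the question: one groupby per key of dict_ref.
loop_parts = []
for nref, chans in dict_ref.items():
    part = (
        df[df["ch"].isin(chans)]
        .groupby(["group_feature", "feature"])[value_cols]
        .mean()
        .reset_index()
    )
    part["ch"] = nref
    loop_parts.append(part)
loop_df = pd.concat(loop_parts, ignore_index=True)

# Inverted mapping: channel name -> group name, e.g. "F2" -> "occ".
mapping = {ch: key for key, chans in dict_ref.items() for ch in chans}

# Single groupby: relabel the channels first, then group on the new ch as well.
vec_df = (
    df.replace({"ch": mapping})
    .groupby(["group_feature", "feature", "ch"])[value_cols]
    .mean()
    .reset_index()
)

# Bring both results into the same row and column order for comparison.
key_cols = ["group_feature", "ch", "feature"]
a = loop_df.sort_values(key_cols).reset_index(drop=True)[key_cols + value_cols]
b = vec_df.sort_values(key_cols).reset_index(drop=True)[key_cols + value_cols]
print(np.allclose(a[value_cols].to_numpy(), b[value_cols].to_numpy()))
```

One caveat: replace leaves any channel that is not listed in dict_ref unchanged, so such a channel would form its own single-row group. In this example every channel belongs to a group, so the two versions produce the same rows.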
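For the bonus part of the question (avoiding the regex split and the manual padding), one possible sketch is to name the column levels when the MultiIndex is built and to split the second level with Series.str.partition, which returns an empty string when there is no "_". This assumes each label contains at most one "_"; with more underscores the result would differ from re.split. The names wide, long_df, parts and the level name "name" are illustrative, not from the original code.

```python
import numpy as np
import pandas as pd

_names = ["pow_fr"] * 8 + ["hjor_com"] * 4
_idx = ["Fp1_band0", "Fp1_band1", "Fpx_band0", "Fpx_band1",
        "F2_band0", "F2_band1", "F4_band0", "F4_band1",
        "Fp1", "Fpx", "F2", "F4"]

np.random.seed(0)
X = np.random.rand(4, len(_names))

# Naming the levels up front means reset_index() needs no rename afterwards.
columns = pd.MultiIndex.from_arrays([_names, _idx], names=["group_feature", "name"])
wide = pd.DataFrame(data=X, columns=columns)

# Long format: one row per original column, one numeric column per sample.
long_df = wide.T.reset_index()

# partition("_") yields three columns: text before "_", the separator, text after.
# Labels without "_" (e.g. "Fp1") get '' as the feature, so no padding is needed.
parts = long_df["name"].str.partition("_")
long_df = long_df.assign(ch=parts[0], feature=parts[2]).drop(columns="name")

print(long_df[["group_feature", "ch", "feature"]])
```

The resulting long_df has the same group_feature, ch and feature columns that the loop and the groupby above rely on, only in a different column order.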