Pandas模糊组汇总统计
Pandas fuzzy group summary statistics
我有一个从 CSV 定义的数据框,我想计算基本的汇总统计数据,例如所有模型的训练部分的均值、方差、...。
插入型号并以此为依据进行分组会很好 - 但似乎不是一个好的解决方案。
如何获得每个模型的汇总统计数据(仅用于训练),因为 group_by modelName 因计数器而不起作用。
df.groupby(['modelName', 'typeOfRun'])['kappa'].mean()
或
df[df.typeOfRun != 'validation'].describe()
没有产生预期的结果。
AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation
IIUC 你可以使用 DataFrameGroupBy.describe
:
print (df.groupby(['modelName', 'typeOfRun']).describe())
f1_R kappa
modelName typeOfRun
00_testing-validation validation count 1.000000 1.000000
mean 0.832731 0.097878
std NaN NaN
min 0.832731 0.097878
25% 0.832731 0.097878
50% 0.832731 0.097878
75% 0.832731 0.097878
max 0.832731 0.097878
00_testing_1-training training count 1.000000 1.000000
mean 0.831213 0.099879
std NaN NaN
min 0.831213 0.099879
25% 0.831213 0.099879
50% 0.831213 0.099879
75% 0.831213 0.099879
max 0.831213 0.099879
00_testing_2-training training count 1.000000 1.000000
mean 0.839183 0.101592
std NaN NaN
...
...
您可以 groupby
由 split
创建的 Series
并由 str[0]
选择列表的第一项:
print (df.modelName.str.split('_').str[0])
0 00
1 00
2 00
3 00
4 01
5 01
6 01
7 01
Name: modelName, dtype: object
print (df.groupby([df.modelName.str.split('_').str[0]]).describe())
AUC_R Accuracy Error;rate False;negative;rate \
modelName
00 count 4.000000 4.000000 4.000000 4.000000
mean 0.768913 0.722449 0.277551 0.181730
std 0.004149 0.004924 0.004924 0.011270
min 0.765376 0.718242 0.270494 0.165198
25% 0.766368 0.719753 0.276280 0.179275
50% 0.767757 0.721024 0.278976 0.186147
75% 0.770302 0.723720 0.280247 0.188601
max 0.774762 0.729506 0.281758 0.189427
01 count 4.000000 4.000000 4.000000 4.000000
mean 0.771151 0.733358 0.266642 0.188704
std 0.004452 0.003198 0.003198 0.011488
min 0.767635 0.730054 0.262587 0.171806
25% 0.768771 0.731325 0.264984 0.186674
50% 0.769662 0.732983 0.267017 0.192937
75% 0.772042 0.735016 0.268675 0.194968
max 0.777643 0.737413 0.269946 0.197137
...
...
我有一个从 CSV 定义的数据框,我想计算基本的汇总统计数据,例如所有模型的训练部分的均值、方差、...。
插入型号并以此为依据进行分组会很好 - 但似乎不是一个好的解决方案。 如何获得每个模型的汇总统计数据(仅用于训练),因为 group_by modelName 因计数器而不起作用。
df.groupby(['modelName', 'typeOfRun'])['kappa'].mean()
或
df[df.typeOfRun != 'validation'].describe()
没有产生预期的结果。
AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation
IIUC 你可以使用 DataFrameGroupBy.describe
:
print (df.groupby(['modelName', 'typeOfRun']).describe())
f1_R kappa
modelName typeOfRun
00_testing-validation validation count 1.000000 1.000000
mean 0.832731 0.097878
std NaN NaN
min 0.832731 0.097878
25% 0.832731 0.097878
50% 0.832731 0.097878
75% 0.832731 0.097878
max 0.832731 0.097878
00_testing_1-training training count 1.000000 1.000000
mean 0.831213 0.099879
std NaN NaN
min 0.831213 0.099879
25% 0.831213 0.099879
50% 0.831213 0.099879
75% 0.831213 0.099879
max 0.831213 0.099879
00_testing_2-training training count 1.000000 1.000000
mean 0.839183 0.101592
std NaN NaN
...
...
您可以 groupby
由 split
创建的 Series
并由 str[0]
选择列表的第一项:
print (df.modelName.str.split('_').str[0])
0 00
1 00
2 00
3 00
4 01
5 01
6 01
7 01
Name: modelName, dtype: object
print (df.groupby([df.modelName.str.split('_').str[0]]).describe())
AUC_R Accuracy Error;rate False;negative;rate \
modelName
00 count 4.000000 4.000000 4.000000 4.000000
mean 0.768913 0.722449 0.277551 0.181730
std 0.004149 0.004924 0.004924 0.011270
min 0.765376 0.718242 0.270494 0.165198
25% 0.766368 0.719753 0.276280 0.179275
50% 0.767757 0.721024 0.278976 0.186147
75% 0.770302 0.723720 0.280247 0.188601
max 0.774762 0.729506 0.281758 0.189427
01 count 4.000000 4.000000 4.000000 4.000000
mean 0.771151 0.733358 0.266642 0.188704
std 0.004452 0.003198 0.003198 0.011488
min 0.767635 0.730054 0.262587 0.171806
25% 0.768771 0.731325 0.264984 0.186674
50% 0.769662 0.732983 0.267017 0.192937
75% 0.772042 0.735016 0.268675 0.194968
max 0.777643 0.737413 0.269946 0.197137
...
...