How to prepare a one-hot encoding in scikit-learn for a multiclass logistic regression?
I am trying to use one-hot encoding in scikit-learn to classify 4 classes from the following DataFrame:
K T_STAR REGIME
15 90.929 0.95524 BoilingInducedBreakup
9 117.483 0.89386 Splash
16 97.764 1.17972 BoilingInducedBreakup
13 76.917 0.91399 BoilingInducedBreakup
6 44.889 0.95725 BoilingInducedBreakup
20 151.662 0.56287 Splash
12 67.155 1.22842 ReboundWithBreakup
7 114.747 0.47618 Splash
17 121.731 0.52956 Splash
12 29.397 0.88702 Deposition
14 31.733 0.69154 Deposition
13 119.433 0.39422 Splash
21 97.913 1.21309 ReboundWithBreakup
10 117.544 0.18538 Splash
27 76.957 0.52879 Deposition
22 155.842 0.17559 Splash
3 25.620 0.18680 Deposition
30 151.773 1.23027 ReboundWithBreakup
34 91.146 0.90138 Deposition
19 58.095 0.46110 Deposition
14 85.596 0.97520 BoilingInducedBreakup
41 97.783 0.16985 Deposition
0 16.683 0.99355 Deposition
28 122.022 1.22977 ReboundWithBreakup
0 25.570 1.24686 ReboundWithBreakup
3 113.315 0.48886 Splash
7 31.873 1.30497 ReboundWithBreakup
0 108.488 0.73423 Splash
2 25.725 1.29953 ReboundWithBreakup
37 97.695 0.50930 Deposition
Here is the sample in CSV format:
,K,T_STAR,REGIME
15,90.929,0.95524,BoilingInducedBreakup
9,117.483,0.89386,Splash
16,97.764,1.17972,BoilingInducedBreakup
13,76.917,0.91399,BoilingInducedBreakup
6,44.889,0.95725,BoilingInducedBreakup
20,151.662,0.56287,Splash
12,67.155,1.22842,ReboundWithBreakup
7,114.747,0.47618,Splash
17,121.731,0.52956,Splash
12,29.397,0.88702,Deposition
14,31.733,0.69154,Deposition
13,119.433,0.39422,Splash
21,97.913,1.21309,ReboundWithBreakup
10,117.544,0.18538,Splash
27,76.957,0.52879,Deposition
22,155.842,0.17559,Splash
3,25.62,0.1868,Deposition
30,151.773,1.23027,ReboundWithBreakup
34,91.146,0.90138,Deposition
19,58.095,0.4611,Deposition
14,85.596,0.9752,BoilingInducedBreakup
41,97.783,0.16985,Deposition
0,16.683,0.99355,Deposition
28,122.022,1.22977,ReboundWithBreakup
0,25.57,1.24686,ReboundWithBreakup
3,113.315,0.48886,Splash
7,31.873,1.30497,ReboundWithBreakup
0,108.488,0.73423,Splash
2,25.725,1.29953,ReboundWithBreakup
37,97.695,0.5093,Deposition
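The feature vector is two-dimensional, (K, T_STAR), and the REGIME values are the classes, which are not ordered in any way.
This is what I have done so far for one-hot encoding and scaling: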
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
num_attribs = ["K", "T_STAR"]
cat_attribs = ["REGIME"]
preproc_pipeline = ColumnTransformer([("num", MinMaxScaler(), num_attribs),
("cat", OneHotEncoder(), cat_attribs)])
regimes_df_prepared = preproc_pipeline.fit_transform(regimes_df)
However, when I print a few rows of regimes_df_prepared, I get
array([[0.73836403, 0.19766192, 0. , 0. , 0. ,
1. ],
[0.43284301, 0.65556065, 1. , 0. , 0. ,
0. ],
[0.97076007, 0.93419198, 0. , 0. , 1. ,
0. ],
[0.96996242, 0.34623652, 0. , 0. , 0. ,
1. ],
[0.10915571, 1. , 0. , 0. , 1. ,
0. ]])
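So the one-hot encoding seems to have worked, but the problem is that the feature vectors are packed into this array together with the encoding.
If I try to train a model like this: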
from sklearn.linear_model import LogisticRegression
logreg_ovr = LogisticRegression(solver='lbfgs', max_iter=10000, multi_class='ovr')
logreg_ovr.fit(regimes_df_prepared, regimes_df["REGIME"])
print("Model training score : %.3f" % logreg_ovr.score(regimes_df_prepared, regimes_df["REGIME"]))
The score is 1.0, which cannot be right (overfitting?).
Now I want the model to predict the class for a (K, T_STAR) pair
logreg_ovr.predict([[40,0.6]])
and I get an error
ValueError: X has 2 features per sample; expecting 6
As suspected, the model treats the entire row of regimes_df_prepared as the feature vector. How can I avoid this?
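The target labels should not be one-hot encoded, and they should not be part of the feature matrix; sklearn has LabelEncoder for encoding them. In your case, working code for the data preprocessing would look something like: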
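from sklearn.preprocessing import LabelEncoder
# Features are only the numeric columns; REGIME is the target
X, y = regimes_df[num_attribs].values, regimes_df['REGIME'].values
y = LabelEncoder().fit_transform(y)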
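To make the separation of features and target concrete, here is a minimal sketch that also keeps the scaling step from the question. It assumes a pipeline built with `make_pipeline` (my choice, not part of the original answer), uses only eight rows copied from the CSV sample, and introduces the names `csv_text`, `label_encoder`, and `model` purely for illustration.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Eight rows copied from the question's CSV sample (two per regime).
csv_text = """,K,T_STAR,REGIME
15,90.929,0.95524,BoilingInducedBreakup
16,97.764,1.17972,BoilingInducedBreakup
9,117.483,0.89386,Splash
20,151.662,0.56287,Splash
12,67.155,1.22842,ReboundWithBreakup
21,97.913,1.21309,ReboundWithBreakup
12,29.397,0.88702,Deposition
14,31.733,0.69154,Deposition
"""
regimes_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Features and target are kept apart: REGIME never enters X.
X = regimes_df[["K", "T_STAR"]].values
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(regimes_df["REGIME"].values)

# The scaler is fitted on the two feature columns only, inside the pipeline.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))
model.fit(X, y)

# A new (K, T_STAR) pair now has the two features the model expects.
predicted_code = model.predict([[40, 0.6]])
predicted_regime = label_encoder.inverse_transform(predicted_code)
print(predicted_regime)
```

Because the model is trained on two columns, `predict([[40, 0.6]])` no longer raises the `ValueError` about 6 expected features, and `inverse_transform` maps the integer prediction back to a regime name.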
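I also notice that you are computing the score on the same data that was used to train the model, which naturally gives an overly optimistic estimate and hides overfitting. Use something like train_test_split or cross_val_score to evaluate the model's performance properly.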
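As a rough sketch of both evaluation options on the 30 sample rows from the question: the 70/30 split, `random_state=0`, `cv=3`, and the pipeline itself are assumptions chosen for illustration, and with so few rows the resulting scores will be noisy.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# The 30 rows from the question's CSV sample.
csv_text = """,K,T_STAR,REGIME
15,90.929,0.95524,BoilingInducedBreakup
9,117.483,0.89386,Splash
16,97.764,1.17972,BoilingInducedBreakup
13,76.917,0.91399,BoilingInducedBreakup
6,44.889,0.95725,BoilingInducedBreakup
20,151.662,0.56287,Splash
12,67.155,1.22842,ReboundWithBreakup
7,114.747,0.47618,Splash
17,121.731,0.52956,Splash
12,29.397,0.88702,Deposition
14,31.733,0.69154,Deposition
13,119.433,0.39422,Splash
21,97.913,1.21309,ReboundWithBreakup
10,117.544,0.18538,Splash
27,76.957,0.52879,Deposition
22,155.842,0.17559,Splash
3,25.62,0.1868,Deposition
30,151.773,1.23027,ReboundWithBreakup
34,91.146,0.90138,Deposition
19,58.095,0.4611,Deposition
14,85.596,0.9752,BoilingInducedBreakup
41,97.783,0.16985,Deposition
0,16.683,0.99355,Deposition
28,122.022,1.22977,ReboundWithBreakup
0,25.57,1.24686,ReboundWithBreakup
3,113.315,0.48886,Splash
7,31.873,1.30497,ReboundWithBreakup
0,108.488,0.73423,Splash
2,25.725,1.29953,ReboundWithBreakup
37,97.695,0.5093,Deposition
"""
regimes_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
X = regimes_df[["K", "T_STAR"]].values
y = regimes_df["REGIME"].values  # string class labels are accepted as-is

model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))

# Hold-out evaluation: fit on about 70% of the rows, score on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Held-out score: %.3f" % test_score)

# 3-fold cross-validation; the scaler is refitted on each training fold.
cv_scores = cross_val_score(model, X, y, cv=3)
print("CV scores:", cv_scores)
```

Putting the scaler inside the pipeline means it only ever sees the training portion of each split, so the held-out rows do not influence the scaling.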