Target transformation and feature selection in scikit-learn
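I am using RFECV for feature selection in scikit-learn. I would like to compare the result of a simple linear model (X, y) with that of a log-transformed model (using X, log(y)).

Simple model:
RFECV and cross_val_score give the same result (we need to compare the average cross-validation score across all folds with the RFECV score for all features: 0.66 = 0.66, no problem, the results are reliable).

Log model:
The problem: RFECV does not seem to provide a way to transform y. The scores in this case are 0.55 vs 0.53. That is expected, though, because I had to apply np.log manually to fit the data: log_selector = log_selector.fit(X, np.log(y)). This r2 score is for y = log(y), with no inverse_func, whereas what we need is a way to fit the model on log(y_train) and calculate the score using exp(y_test). Alternatively, if I try to use TransformedTargetRegressor, I get the error shown in the code: The classifier does not expose "coef_" or "feature_importances_" attributes

How do I fix the problem and make sure the feature selection process is reliable?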
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.compose import TransformedTargetRegressor
import numpy as np
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = linear_model.LinearRegression()
log_estimator = TransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                           func=np.log,
                                           inverse_func=np.exp)
selector = RFECV(estimator, step=1, cv=5, scoring='r2')
selector = selector.fit(X, y)
###
# log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
# log_selector = log_selector.fit(X, y)
# RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
###
log_selector = RFECV(estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, np.log(y))
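print("**Simple Model**")
# Note: grid_scores_ was removed in scikit-learn 1.2; on newer versions use cv_results_["mean_test_score"] instead.
print("RFECV, r2 scores: ", np.round(selector.grid_scores_,2))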
scores = cross_val_score(estimator, X, y, cv=5)
print("cross_val, mean r2 score: ", round(np.mean(scores),2), ", same as RFECV score with all features")
print("no of feat: ", selector.n_features_ )
print("**Log Model**")
log_scores = cross_val_score(log_estimator, X, y, cv=5)
print("RFECV, r2 scores: ", np.round(log_selector.grid_scores_,2))
print("cross_val, mean r2 score: ", round(np.mean(log_scores),2))
print("no of feat: ", log_selector.n_features_ )
Output:
**Simple Model**
RFECV, r2 scores: [0.45 0.6 0.63 0.68 0.68 0.69 0.68 0.67 0.66 0.66]
cross_val, mean r2 score: 0.66 , same as RFECV score with all features
no of feat: 6
**Log Model**
RFECV, r2 scores: [0.39 0.5 0.59 0.56 0.55 0.54 0.53 0.53 0.53 0.53]
cross_val, mean r2 score: 0.55
no of feat: 3
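The mismatch described in the question, an R² computed against log(y) versus one computed on the original scale, can be illustrated with a few made-up numbers. The sketch below is plain Python so the arithmetic stays visible. The target values, the log-scale predictions, and the helper r2 are all hypothetical and only show that the two scores are not interchangeable.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out targets and predictions made on the log scale
# (invented for illustration, not taken from the question's data).
y_test = [2.0, 5.0, 9.0, 20.0, 40.0]
log_pred = [0.9, 1.5, 2.3, 2.8, 3.9]

# Score as in the manual approach: predictions compared with log(y_test).
r2_log = r2([math.log(v) for v in y_test], log_pred)

# Score on the original scale: exp(prediction) compared with y_test.
r2_orig = r2(y_test, [math.exp(p) for p in log_pred])

print("r2 on log scale:     ", round(r2_log, 3))
print("r2 on original scale:", round(r2_orig, 3))
```

The same predictions get two different scores, which is why an RFECV fitted on np.log(y) cannot be compared directly with a cross_val_score that applies inverse_func.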
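One workaround for this problem is to make sure the coef_ attribute is exposed to the feature selection module RFECV. So basically you need to extend TransformedTargetRegressor and make sure it exposes the coef_ attribute. I created a subclass of TransformedTargetRegressor that also exposes coef_, as shown below.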
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.compose import TransformedTargetRegressor
import numpy as np
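class myestimator(TransformedTargetRegressor):
    def __init__(self, **kwargs):
        # The keyword arguments are accepted but ignored: the regressor and
        # the log/exp transforms are fixed here.
        super().__init__(regressor=LinearRegression(), func=np.log, inverse_func=np.exp)

    def fit(self, X, y, **kwargs):
        super().fit(X, y, **kwargs)
        # Expose the fitted regressor's coefficients so RFECV can rank features.
        self.coef_ = self.regressor_.coef_
        return self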
You can then use myestimator in your code, as shown below:
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = linear_model.LinearRegression()
log_estimator = myestimator(regressor=LinearRegression(),func=np.log,inverse_func=np.exp)
selector = RFECV(estimator, step=1, cv=5, scoring='r2')
selector = selector.fit(X, y)
log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, y)
I ran your sample code and show the results below.
SAMPLE OUTPUT
print("**Simple Model**")
print("RFECV, r2 scores: ", np.round(selector.grid_scores_,2))
scores = cross_val_score(estimator, X, y, cv=5)
print("cross_val, mean r2 score: ", round(np.mean(scores),2), ", same as RFECV score with all features")
print("no of feat: ", selector.n_features_ )
print("**Log Model**")
log_scores = cross_val_score(log_estimator, X, y, cv=5)
print("RFECV, r2 scores: ", np.round(log_selector.grid_scores_,2))
print("cross_val, mean r2 score: ", round(np.mean(log_scores),2))
print("no of feat: ", log_selector.n_features_ )
**Simple Model**
RFECV, r2 scores: [0.45 0.6 0.63 0.68 0.68 0.69 0.68 0.67 0.66 0.66]
cross_val, mean r2 score: 0.66 , same as RFECV score with all features
no of feat: 6
**Log Model**
RFECV, r2 scores: [0.41 0.51 0.59 0.59 0.58 0.56 0.54 0.53 0.55 0.55]
cross_val, mean r2 score: 0.55
no of feat: 4
Hope this helps!
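As a sanity check on original-scale scores like these, the "fit on log(y_train), score after exp" procedure the question asks for can be written out by hand and compared with what TransformedTargetRegressor does inside cross_val_score. This is only a sketch: it assumes that cv=5 resolves to an unshuffled KFold for a regressor, which is how scikit-learn behaves as far as I know, and the variable names manual_scores and wrapped_scores are mine.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Manual loop: fit on log(y_train), then score exp(prediction) against y_test.
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], np.log(y[train_idx]))
    y_pred = np.exp(model.predict(X[test_idx]))
    manual_scores.append(r2_score(y[test_idx], y_pred))

# The same procedure through TransformedTargetRegressor and cross_val_score.
log_estimator = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
wrapped_scores = cross_val_score(log_estimator, X, y, cv=5, scoring="r2")

print("manual r2 per fold: ", np.round(manual_scores, 2))
print("wrapped r2 per fold:", np.round(wrapped_scores, 2))
```

If the two lines agree fold by fold, the wrapped estimator is scoring on the original scale, which is the behaviour wanted inside RFECV as well.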
All you need to do is add these attributes to TransformedTargetRegressor:
class MyTransformedTargetRegressor(TransformedTargetRegressor):
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_

    @property
    def coef_(self):
        return self.regressor_.coef_
Then use it in your code:
log_estimator = MyTransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                             func=np.log,
                                             inverse_func=np.exp)
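Depending on the scikit-learn version, a subclass may not be needed at all. To my knowledge, releases from 0.24 onward give RFECV an importance_getter parameter that can point at an attribute of a nested estimator. The sketch below assumes such a version is installed and reuses the data and settings from the question.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

log_estimator = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)

# Assumption: scikit-learn >= 0.24, where RFECV accepts importance_getter.
# The string is an attribute path resolved on the fitted estimator, so
# "regressor_.coef_" reads the coefficients of the wrapped LinearRegression.
log_selector = RFECV(
    log_estimator,
    step=1,
    cv=5,
    scoring="r2",
    importance_getter="regressor_.coef_",
)
log_selector.fit(X, y)

print("no of feat: ", log_selector.n_features_)
print("selected mask: ", log_selector.support_)
```

Because scoring goes through the wrapper's predict, the r2 values used for selection are computed on the original scale of y, and features are still ranked by the coefficients fitted on log(y).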