Sklearn Pipeline - trying to count the number of times an estimator is called
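I'm trying to count the number of times LogisticRegression is called in this pipeline, so I extended the class and overrode .fit(). It was supposed to be simple, but it produces this strange error:
TypeError: float() argument must be a string or a number, not 'MyLogistic'
where MyLogistic is the new class. If you copy and paste the code, you should be able to reproduce the whole thing.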
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold)
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np

class MyLogistic(LogisticRegression):
    __call_counter = 0
    def fit(X, y, sample_weight=None):
        print("MyLogistic fit is called.")
        MyLogistic._MyLogistic__call_counter += 1
        # fit() returns self.
        return super().fit(X, y, sample_weight)

# If I use this "extension", everything works fine.
#class MyLogistic(LogisticRegression):
#    pass

initial_logistic = MyLogistic(solver="liblinear", random_state = np.random.RandomState(18))
final_logistic = LogisticRegression(solver="liblinear", random_state = np.random.RandomState(20))
# prefit = False by default
select_best = SelectFromModel(estimator = initial_logistic, threshold = -np.inf)

select_k_best_pipeline = Pipeline(steps=[
    ('first_scaler', StandardScaler(with_mean = False)),
    # initial_logistic will be called from select_best, prefit = false by default.
    ('select_k_best', select_best),
    ('final_logit', final_logistic)
])

select_best_grid = {'select_k_best__estimator__C' : [0.02, 0.03],
                    'select_k_best__max_features': [1, 2],
                    'final_logit__C' : [0.01, 0.5, 1.0]}

skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 17)

logit_best_searcher = GridSearchCV(estimator = select_k_best_pipeline, param_grid = select_best_grid, cv = skf,
                                   scoring = "roc_auc", n_jobs = 6, verbose = 4)

X, y = load_iris(return_X_y=True)
logit_best_searcher.fit(X, y > 0)
print("Best hyperparams: ", logit_best_searcher.best_params_)
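You simply forgot to put self as the first parameter in the signature of fit. As a result, the instance itself is bound to X (that is, X=self), and when scikit-learn validates the input X it eventually tries to convert it to float, hence the error message.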
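As a minimal sketch of the fix, the override below adds `self` and forwards the remaining arguments unchanged. The attribute name `call_counter` is an illustrative choice (the question uses a name-mangled `__call_counter`), and the single `fit` call on the iris data is only there to show that the override now runs without the `TypeError`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


class MyLogistic(LogisticRegression):
    # Class-level counter, shared by all instances within one Python process.
    # The name `call_counter` is illustrative, not taken from the question.
    call_counter = 0

    def fit(self, X, y, sample_weight=None):
        # `self` must come first; without it the instance is bound to X.
        MyLogistic.call_counter += 1
        # fit() returns self, so the result can be chained as usual.
        return super().fit(X, y, sample_weight=sample_weight)


X, y = load_iris(return_X_y=True)
model = MyLogistic(solver="liblinear").fit(X, y > 0)
print(MyLogistic.call_counter)
```

Because the subclass does not define its own `__init__`, it keeps the constructor signature of `LogisticRegression`, so it can still be cloned inside `SelectFromModel` and `GridSearchCV`.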
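There is still something odd with the parallelization: with n_jobs=6 I get a counter equal to 1. Setting n_jobs=1 instead, I get the correct counter of 37 (2x2x3 hyperparameter candidates on 3 folds, +1 for the final refit).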
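A plausible explanation for the counter of 1, which the answer itself does not confirm, is that with n_jobs > 1 the cross-validation fits run in separate worker processes, each with its own copy of the class attribute, so the parent process only sees the final refit. The sketch below sidesteps this by running the search with n_jobs=1 and comparing the counter against the expected number of fits. The names `CountingLogistic` and `fit_calls` are illustrative, and integer seeds replace the `RandomState` objects from the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class CountingLogistic(LogisticRegression):
    # Per-process counter; only reliable when all fits run in one process.
    fit_calls = 0

    def fit(self, X, y, sample_weight=None):
        CountingLogistic.fit_calls += 1
        return super().fit(X, y, sample_weight=sample_weight)


pipeline = Pipeline(steps=[
    ("first_scaler", StandardScaler(with_mean=False)),
    # SelectFromModel clones and fits the counting estimator (prefit=False).
    ("select_k_best", SelectFromModel(
        estimator=CountingLogistic(solver="liblinear", random_state=18),
        threshold=-np.inf)),
    ("final_logit", LogisticRegression(solver="liblinear", random_state=20)),
])

param_grid = {
    "select_k_best__estimator__C": [0.02, 0.03],
    "select_k_best__max_features": [1, 2],
    "final_logit__C": [0.01, 0.5, 1.0],
}

searcher = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=17),
    scoring="roc_auc",
    n_jobs=1,  # keep every fit in the parent process so the counter is shared
)

X, y = load_iris(return_X_y=True)
searcher.fit(X, y > 0)

# One fit per candidate per fold, plus one refit on the full data.
n_candidates = len(searcher.cv_results_["params"])
expected = n_candidates * searcher.n_splits_ + 1
print(CountingLogistic.fit_calls, expected)
```

If parallel fitting is required, the count can also be derived without instrumentation from `cv_results_` and `n_splits_`, as the last lines do, since the pipeline fits the selector's estimator exactly once per pipeline fit.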