RandomizedSearchCV sampling distribution
According to the RandomizedSearchCV documentation (emphasis mine):
param_distributions: dict or list of dicts
Dictionary with parameters names (str) as keys and distributions or
lists of parameters to try. Distributions must provide a rvs method
for sampling (such as those from scipy.stats.distributions). If a list
is given, it is sampled uniformly. If a list of dicts is given, first
a dict is sampled uniformly, and then a parameter is sampled using
that dict as above.
If my understanding of the above is correct, then given n_iter = 10, both algorithms in the example below (XGBClassifier and LogisticRegression) should be sampled with high probability (>99%).
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
param_grid = [
    {'scaler': [StandardScaler()],
     'feature_selection': [RFE(estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss'))],
     'feature_selection__n_features_to_select': [3],
     'classification': [XGBClassifier(use_label_encoder=False, eval_metric='logloss')],
     'classification__n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
     'classification__max_depth': [2, 5, 10],
     },
    {'scaler': [StandardScaler()],
     'feature_selection': [RFE(estimator=LogisticRegression())],
     'feature_selection__n_features_to_select': [3],
     'classification': [LogisticRegression()],
     'classification__C': [0.1],
     },
]

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('feature_selection', RFE(estimator=LogisticRegression())),
                       ('classification', LogisticRegression())])
classifier = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid,
                                scoring='neg_brier_score', n_jobs=-1, verbose=10)
data = load_breast_cancer()
X = data.data
y = data.target.ravel()
classifier.fit(X, y)
Yet every time I run this, XGBClassifier is selected 10/10 times. I would expect at least one candidate to come from LogisticRegression, since each dict should be sampled with 50-50 probability.

If the search space is made more balanced between the two algorithms ('classification__n_estimators': [100]), sampling works as expected.

Can someone clarify what is happening here?
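The sampling behavior can be inspected directly with ParameterSampler, without fitting anything. The sketch below mirrors the shape of the question (a 30-point grid in one dict vs. a single point in the other) using made-up parameter names, and counts which dict each of the 10 draws comes from:

```python
from collections import Counter
from sklearn.model_selection import ParameterSampler

# Two dicts mirroring the question: a 30-point grid vs. a single point.
param_distributions = [
    {'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
     'max_depth': [2, 5, 10]},   # 10 * 3 = 30 combinations
    {'C': [0.1]},                # 1 combination
]

samples = list(ParameterSampler(param_distributions, n_iter=10, random_state=0))
counts = Counter('large dict' if 'n_estimators' in s else 'small dict'
                 for s in samples)
print(counts)
```

If dicts were sampled uniformly, each of the 10 draws would land in the small dict with probability 0.5; instead the large dict tends to take nearly all draws.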
Yes, this is incorrect behavior. There is an issue filed: when all the entries are lists (none are scipy distributions), the current code chooses points from a ParameterGrid, which means it will disproportionately choose points from the dict in the list with the larger grid.

Until a fix is merged, you can work around this by using a scipy distribution for a parameter you don't care about, say verbose?
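A sketch of that workaround, using the same toy dicts as above: adding any scipy distribution to one of the dicts takes ParameterSampler off the all-lists code path, so every draw first picks a dict uniformly at random. The parameter name below is a stand-in (though LogisticRegression does accept verbose):

```python
from collections import Counter
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# randint(0, 1).rvs() always returns 0 (high is exclusive), so the dummy
# entry changes nothing that is fit, but its presence stops ParameterSampler
# from enumerating a full ParameterGrid over the lists.
param_distributions = [
    {'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
     'max_depth': [2, 5, 10]},
    {'C': [0.1],
     'verbose': randint(0, 1)},  # hypothetical "don't care" parameter
]

samples = list(ParameterSampler(param_distributions, n_iter=50, random_state=0))
counts = Counter('large dict' if 'n_estimators' in s else 'small dict'
                 for s in samples)
print(counts)  # both dicts are now drawn with equal probability
```

With 50 draws at a uniform 50-50 split, both dicts appear essentially always, regardless of how lopsided their grid sizes are.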