How can I tune the parameters in a Random Forest Classifier inside a pipeline?

I am trying to apply RandomForestClassifier() inside a pipeline and tune its parameters. This is the dataset I am using: https://www.kaggle.com/gbonesso/enem-2016

Here is the code:

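import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler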

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
rf = RandomForestClassifier()

features = [
    "NU_IDADE",
    "TP_ESTADO_CIVIL",
    "NU_NOTA_CN",
    "NU_NOTA_CH",
    "NU_NOTA_LC",
    "NU_NOTA_MT",
    "NU_NOTA_COMP1",
    "NU_NOTA_COMP2",
    "NU_NOTA_COMP3",
    "NU_NOTA_COMP4",
    "NU_NOTA_COMP5",
    "NU_NOTA_REDACAO",
]

X = enem[features]
y = enem[["IN_TREINEIRO"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

pipeline = make_pipeline(imputer, scaler, rf)

pipe_params = {
    "randomforestregressor__n_estimators": [100, 500, 1000],
    "randomforestregressor__max_depth": [1, 5, 10, 25],
    "randomforestregressor__max_features": [*np.arange(0.1, 1.1, 0.1)],
}

gridsearch = GridSearchCV(
    pipeline, param_grid=pipe_params, cv=3, n_jobs=-1, verbose=1000
)

gridsearch.fit(X_train, y_train)

It seems to run for some of the parameters, but then I get this error message:

ValueError: Invalid parameter randomforestregressor for estimator Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
            ('standardscaler', StandardScaler()),
            ('randomforestclassifier', RandomForestClassifier())]). Check the list of available parameters with `estimator.get_params().keys()`.

Another problem is that I can't seem to get the CV results. I tried running the following code:

results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values("rank_test_score").head()
score = pipeline.score(X_test, y_test)
score

But I got this error:

AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'

Any ideas on how to fix these errors?

Your problem is most likely in this dictionary:

pipe_params = {
    "randomforestregressor__n_estimators": [100, 500, 1000],
    "randomforestregressor__max_depth": [1, 5, 10, 25],
    "randomforestregressor__max_features": [*np.arange(0.1, 1.1, 0.1)],
}

Your pipeline has no step named randomforestregressor, as the error message indicates. Since you are using RandomForestClassifier, the keys should be:

pipe_params = {
    "randomforestclassifier__n_estimators": [100, 500, 1000],
    "randomforestclassifier__max_depth": [1, 5, 10, 25],
    "randomforestclassifier__max_features": [*np.arange(0.1, 1.1, 0.1)],
}

If you follow the suggestion in the error message and run pipeline.get_params().keys(), you will see the parameter names that are available for your pipeline.
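As a quick check, the sketch below rebuilds the same three-step pipeline with make_pipeline and lists the names a parameter grid may use. It assumes only that make_pipeline names each step after its lowercased class name; the variables step_names and rf_params are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), RandomForestClassifier()
)

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Valid grid keys have the form "<step name>__<parameter name>"
rf_params = sorted(
    key for key in pipeline.get_params()
    if key.startswith("randomforestclassifier__")
)
print(rf_params)
```

Any key in the grid that does not appear in this listing will be rejected when the search is fitted.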

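Nick's answer is absolutely correct and does solve your problem. In your case, you could also build the pipeline with the Pipeline class instead of make_pipeline, which lets you choose the step names yourself. I find it more readable and concise: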

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier())
])

Then refer to the model parameters by prefixing them with the name of your classifier step:

param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}
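The prefix rule can be tried out directly with set_params, which uses the same "step name, double underscore, parameter" routing as the grid keys. This is a minimal sketch: the variables depth and rejected are illustrative, and the second call deliberately uses a prefix that matches no step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier()),
])

# "clf__max_depth" routes max_depth to the step named "clf"
pipe.set_params(clf__max_depth=5)
depth = pipe.named_steps["clf"].max_depth

# A prefix that matches no step name is rejected with a ValueError
try:
    pipe.set_params(randomforestregressor__max_depth=5)
    rejected = False
except ValueError:
    rejected = True

print(depth, rejected)
```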

Here is a complete example based on the iris dataset:

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import datasets
import numpy as np


# Data preparation
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42
)

# Build a pipeline object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier())
])

# Declare a hyperparameter grid
param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}

# Perform grid search, fit it, and print score
gs = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, verbose=1000)
gs.fit(x_train, y_train)
print(gs.score(x_test, y_test))
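As for the second error in the question: cv_results_ is only created once fit() completes, so it was missing because the fit had stopped with the ValueError. The sketch below uses a deliberately small grid on the iris data (illustrative values, not tuned recommendations) to show how to read the results after a successful fit. It also scores the held-out data through the fitted search object, since GridSearchCV fits clones and leaves the pipeline you passed in unfitted.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Deliberately tiny grid so the sketch runs quickly
param_grid = {"clf__n_estimators": [10, 50], "clf__max_depth": [2, 5]}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=3)
gs.fit(X_train, y_train)

# cv_results_ is available only after fit() has finished without error
results = pd.DataFrame(gs.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "rank_test_score"]].head())

# Score with the fitted search object, not the original unfitted pipeline
print(gs.best_params_)
test_score = gs.score(X_test, y_test)
print(test_score)
```

With the default refit=True, gs.score delegates to the best pipeline refitted on the whole training set, which is also available as gs.best_estimator_.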