How to use GridSearchCV with a MultiOutputClassifier(MLPClassifier) Pipeline

Question

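This is my first attempt at scikit-learn, on a multi-output, multi-class text classification problem. To that end, I am trying to use GridSearchCV to optimize the parameters of MLPClassifier.

I'll admit I'm shooting in the dark here, with no prior experience. Please let me know if this makes sense.

Here is what I have so far: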

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

df = pd.read_csv('data.csv')

df.fillna('', inplace=True) #Replaces NaNs with "" in the DataFrame (which would be considered a viable choice in this multi-classification model)

x_features = df['input_text']
y_labels = df[['output_text_label_1', 'output_text_label_2']]

x_train, x_test, y_train, y_test = train_test_split(x_features, y_labels, test_size=0.3, random_state=7)

pipe = Pipeline(steps=[('cv', CountVectorizer()),
                       ('mlpc', MultiOutputClassifier(MLPClassifier()))])

pipe.fit(x_train, y_train)

pipe.score(x_test, y_test)

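pipe.score gives a score of ~0.837, which seems to suggest that the code above is doing something. Running pipe.predict() on a few test strings seems to produce reasonably adequate output.

However, even after looking at many examples, I don't understand how to implement GridSearchCV for this Pipeline. (I'd also like to know which parameters to search.)

I doubt it makes sense to post my attempts at GridSearchCV, as they have been many and varied, and none has succeeded. But a short example for the sake of a Stack Overflow answer might be: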

grid = [
        {
        'activation' : ['identity', 'logistic', 'tanh', 'relu'],
        'solver' : ['lbfgs', 'sgd', 'adam'],
        'hidden_layer_sizes': [(100,),(200,)]
        }
       ]

grid_search = GridSearchCV(pipe, grid, scoring='accuracy', n_jobs=-1)

grid_search.fit(x_train, y_train)

This gives the error:

ValueError: Invalid parameter activation for estimator Pipeline(steps=[('cv', CountVectorizer()), ('mlpc', MultiOutputClassifier(estimator=MLPClassifier()))]). Check the list of available parameters with estimator.get_params().keys().

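I'm not sure what causes this, or how to use estimator.get_params().keys() to find out which parameters are at fault.

Perhaps my use of 'cv', CountVectorizer() or 'mlpc', MultiOutputClassifier(estimator=MLPClassifier()) has something to do with the grid parameters.

I believe I need CountVectorizer() here because my inputs (and the desired label outputs) are all strings.

I would greatly appreciate an example of how GridSearchCV should be used with a Pipeline, presumably making proper use of CountVectorizer() and MLPClassifier, and of which grid parameters might be worth searching.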

Answer 1

TL;DR Try something like this:

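from sklearn.preprocessing import StandardScaler

mlpc = MLPClassifier(solver='adam',
                     learning_rate_init=0.01,
                     max_iter=300,
                     activation='relu',
                     early_stopping=True)
# CountVectorizer returns a sparse matrix, which StandardScaler cannot center,
# hence with_mean=False (scale only).
pipe = Pipeline(steps=[('cv', CountVectorizer(ngram_range=(1, 1))),
                       ('scale', StandardScaler(with_mean=False)),
                       ('mlpc', MultiOutputClassifier(mlpc))])
# Grid keys are nested: <step>__<param>, and <step>__estimator__<param>
# for the MLPClassifier wrapped by MultiOutputClassifier.
search_space = {
    'cv__max_df': (0.9, 0.95, 0.99),
    'cv__min_df': (0.01, 0.05, 0.1),
    'mlpc__estimator__alpha': 10.0 ** -np.arange(1, 5),
    'mlpc__estimator__hidden_layer_sizes': ((64, 32), (128, 64),
                                            (64, 32, 16), (128, 64, 32)),
    'mlpc__estimator__tol': (1e-3, 5e-3, 1e-4),
}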
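
To show how such a search space plugs into GridSearchCV, here is a minimal sketch on made-up data. The toy texts and labels (x_toy, y_toy) and the tiny grid are assumptions for illustration only, not values from the question. Note that scoring is left at its default, so the pipeline's own score method is used; as far as I know, scoring='accuracy' does not support targets with several multi-class columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical toy data: each text is "<color> <thing>" and the two string
# labels are simply the color and the thing. Swap in your own columns.
colors = ["red", "green", "blue"]
things = ["apple", "car", "bird"]
pairs = [(c, t) for c in colors for t in things] * 6      # 54 samples
x_toy = [f"{c} {t}" for c, t in pairs]
y_toy = np.array([[c, t] for c, t in pairs])              # shape (54, 2)

pipe = Pipeline(steps=[
    ("cv", CountVectorizer()),
    # with_mean=False because the vectorizer output is sparse
    ("scale", StandardScaler(with_mean=False)),
    ("mlpc", MultiOutputClassifier(MLPClassifier(max_iter=300, random_state=0))),
])

# Deliberately small grid so the sketch runs quickly; keys use the
# <step>__<param> and <step>__estimator__<param> naming.
param_grid = {
    "cv__min_df": (1, 2),
    "mlpc__estimator__alpha": (1e-4, 1e-2),
    "mlpc__estimator__hidden_layer_sizes": ((16,), (16, 8)),
}

# No scoring argument: GridSearchCV falls back to pipe.score, i.e. the
# fraction of rows for which every output is predicted correctly.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(x_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
print(search.predict(["red car"]))
```

On real data you would pass x_train and y_train instead of the toy arrays, and probably add n_jobs=-1 to run the fits in parallel.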

Discussion:

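[Edit] For multi-output binary classification only: MLPClassifier supports multi-output classification natively, and when the outputs are interrelated I would not recommend MultiOutputClassifier, because it trains separate MLPClassifier instances without taking the relationship between the outputs into account. Training a single MLPClassifier is faster, cheaper, and usually more accurate.
The ValueError was caused by incorrect parameter grid names: inside a Pipeline, each name must be prefixed with the step name, and with estimator__ for the wrapped model (e.g. mlpc__estimator__activation rather than activation). See Nested parameters.
With a modest workstation and/or a large training set, set solver='adam' to use a cheaper first-order method rather than the second-order 'lbfgs'. Alternatively, try solver='sgd', which is even cheaper to compute, but then tune momentum as well. I expect your data to be sparse and on different scales after CountVectorizer, and momentum or solver='adam' is one way of coping with widely varying gradients.
Insert a standardization transformer after CountVectorizer (I guess StandardScaler would do best), because MLPs are sensitive to feature scaling. Since CountVectorizer returns a sparse matrix, StandardScaler has to be created with with_mean=False. That said, solver='adam' will probably cope well with an unbalanced bag of words. Still, I believe standardizing the data does no harm.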
I think tuning activation is needless. Set activation='relu'.
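Use early_stopping=True, specify a sufficiently large max_iter, and tune tol to prevent overfitting.
Definitely tune learning_rate_init with solver='sgd'; with solver='adam', I assume a higher learning rate is fine, since adam does not need exhaustive learning-rate tuning.
Prefer deeper networks over wider ones (e.g., hidden_layer_sizes=(128, 64, 32) over hidden_layer_sizes=(256, 192)).
Always tune alpha.
The optimal hidden_layer_sizes may depend on the document-term dimensionality.
Try a larger batch_size, but keep the computational cost in mind.
If you want to optimize CountVectorizer, tune max_df and min_df rather than ngram_range; I believe an MLP with at least two layers will handle the relationships between unigrams in its hidden layers, without explicit n-grams.
Optimize the hyperparameters in the code example above first. Note, however, that the remaining hyperparameters also affect computational performance and predictive power.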
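
On the question of how to use estimator.get_params().keys(): the sketch below lists the parameter names the question's pipeline actually accepts and then renames the keys of the failing grid accordingly. The variable names (valid_names, question_grid, fixed_grid) are my own, chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Same pipeline as in the question.
pipe = Pipeline(steps=[("cv", CountVectorizer()),
                       ("mlpc", MultiOutputClassifier(MLPClassifier()))])

# Every name GridSearchCV will accept for this pipeline.
valid_names = sorted(pipe.get_params().keys())

# The MLPClassifier parameters sit two levels down:
# step "mlpc" -> MultiOutputClassifier's "estimator" -> parameter.
mlp_names = [n for n in valid_names if n.startswith("mlpc__estimator__")]
print(mlp_names)

# The grid from the question, with bare (and therefore invalid) names.
question_grid = {
    "activation": ["identity", "logistic", "tanh", "relu"],
    "solver": ["lbfgs", "sgd", "adam"],
    "hidden_layer_sizes": [(100,), (200,)],
}

# Prefix each key so that it addresses the nested MLPClassifier.
fixed_grid = {f"mlpc__estimator__{name}": values
              for name, values in question_grid.items()}

# Any key listed here would trigger the "Invalid parameter" ValueError.
unknown = [name for name in fixed_grid if name not in valid_names]
print(unknown)
```

The bare name activation is absent from valid_names, which is exactly why the original grid was rejected, while every key of fixed_grid is present.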

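Disclaimer: most of these comments rest on my (unsubstantiated) assumptions about your data and apply only to scikit-learn's MLP. See the docs to learn more about neural networks and to experiment with other tricks. Remember, there is no free lunch.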

