Training 8 different classifiers with cross-validation gives the same accuracy on the same file?

I have the script below, which is supposed to train several different models with cross-validation and then compute the mean accuracy of each one, so that I can pick the best model for a classification task. However, I get exactly the same result for every classifier.

The results look like this:

---Filename in processed................ corpusAmazon_train
etiquette  : [0 1]
Embeddings bert model used.................... :  sm
Model name: Model_LSVC_ovr
------------cross val predict used---------------- 

accuracy with cross_val_predict : 0.6582974014576258
corpusAmazon_train file terminated--- 

---------------cross val score used ----------------------- 

[0.66348722 0.66234262 0.63334605 0.66959176 0.66081648 0.6463182
 0.66730256 0.65572519 0.65648855 0.66755725]
0.66 accuracy with a standard deviation of 0.01 

Model name: Model_G_NB
------------cross val predict used---------------- 

accuracy with cross_val_predict : 0.6582974014576258
corpusAmazon_train file terminated--- 

---------------cross val score used ----------------------- 

[0.66348722 0.66234262 0.63334605 0.66959176 0.66081648 0.6463182
 0.66730256 0.65572519 0.65648855 0.66755725]
0.66 accuracy with a standard deviation of 0.01 

Model name: Model_LR
------------cross val predict used---------------- 

accuracy with cross_val_predict : 0.6582974014576258
corpusAmazon_train file terminated--- 

---------------cross val score used ----------------------- 

[0.66348722 0.66234262 0.63334605 0.66959176 0.66081648 0.6463182
 0.66730256 0.65572519 0.65648855 0.66755725]
0.66 accuracy with a standard deviation of 0.01 

The cross-validation code:

from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

models_list = {'Model_LSVC_ovr': model1, 'Model_G_NB': model2, 'Model_LR': model3, 'Model_RF': model4, 'Model_KN': model5, 'Model_MLP': model6, 'Model_LDA': model7, 'Model_XGB': model8}

# cross_validation
def cross_validation(features, ylabels, models_list, n, lge_model):

    cv_splitter = KFold(n_splits=10, shuffle=True, random_state=42)
    features, s = get_flaubert_layer(features, lge_model)
    for model_name, model in models_list.items():
        print("Model name: {}".format(model_name))
        print("------------cross val predict used----------------", "\n")
        y_pred = cross_val_predict(model, features, ylabels, cv=cv_splitter, verbose=1)
        accuracy_score_predict = accuracy_score(ylabels, y_pred)
        print("accuracy with cross_val_predict :", accuracy_score_predict)

        print("---------------cross val score used -----------------------", "\n")
        scores = cross_val_score(model, features, ylabels, scoring='accuracy', cv=cv_splitter)
        # print the per-fold scores, then their mean and standard deviation
        print(scores)
        accuracy_score_mean, accuracy_score_std = scores.mean(), scores.std()
        print("%0.2f accuracy with a standard deviation of %0.2f" % (accuracy_score_mean, accuracy_score_std), "\n")

Even with cross_val_score, the accuracy is identical for every model. Any idea what is going on? Could it be the random_state I use in the cross_validation function?

Model definition code:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

def classifiers_b():

    model1 = LinearSVC()
    model2 = GaussianNB()  # MultinomialNB() not usable here: X cannot contain negative values
    model3 = LogisticRegression()
    model4 = RandomForestClassifier()
    model5 = KNeighborsClassifier()
    model6 = MLPClassifier(hidden_layer_sizes=(50, 100, 50), max_iter=500, activation='relu', solver='adam',
                           random_state=1)
    model8 = XGBClassifier(eval_metric="logloss")
    model7 = LinearDiscriminantAnalysis()

    models_list = {'Model_LSVC_ovr': model1, 'Model_G_NB': model2, 'Model_LR': model3, 'Model_RF': model4, 'Model_KN': model5, 'Model_MLP': model6, 'Model_LDA': model7, 'Model_XGB': model8}

    return models_list

I suggest using a pipeline for each model. It looks like you are performing CV on the same model in each iteration. You can check the docs here for more information on how to use them, and then run CV on each model pipeline.
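A minimal sketch of that idea, using toy data in place of the BERT embeddings; the StandardScaler step and the two-model dict are only illustrative, not taken from the original script:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy data standing in for the embedding matrix (features) and labels (ylabels).
features, ylabels = make_classification(n_samples=500, n_features=20, random_state=0)

# One fresh Pipeline per classifier: preprocessing and estimator are kept together,
# so every model is fitted independently inside each CV fold.
pipelines = {
    'Model_LSVC_ovr': Pipeline([('scaler', StandardScaler()), ('clf', LinearSVC())]),
    'Model_LR': Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())]),
}

cv_splitter = KFold(n_splits=10, shuffle=True, random_state=42)
for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, features, ylabels, scoring='accuracy', cv=cv_splitter)
    print("%s: %0.2f accuracy with a standard deviation of %0.2f" % (name, scores.mean(), scores.std()))

Note that cross_val_score clones the pipeline for every fold, so any preprocessing is re-fitted per fold and each classifier is trained separately; with the same KFold splits the scores stay comparable across models while each model is still evaluated on its own merits.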