Variability/randomness of Support Vector Machine model scores in Python's scikit-learn

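I am testing several ML classification models, in this case Support Vector Machines. I have a basic understanding of the SVM algorithm and how it works.

I am using scikit-learn's built-in breast cancer dataset.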

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

Using the code below:

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, 
                                                    stratify=cancer.target, random_state=42)
clf2 = LinearSVC(C=0.01).fit(X_train, y_train)
clf3 = LinearSVC(C=0.1).fit(X_train, y_train)
clf4 = LinearSVC(C=1).fit(X_train, y_train)
clf5 = LinearSVC(C=10).fit(X_train, y_train)
clf6 = LinearSVC(C=100).fit(X_train, y_train)

When printing the scores:

print("Model training score with C=0.01:\n{:.3f}".format(clf2.score(X_train, y_train)))
print("Model testing score with C=0.01:\n{:.3f}".format(clf2.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=0.1:\n{:.3f}".format(clf3.score(X_train, y_train)))
print("Model testing score with C=0.1:\n{:.3f}".format(clf3.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=1:\n{:.3f}".format(clf4.score(X_train, y_train)))
print("Model testing score with C=1:\n{:.3f}".format(clf4.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=10:\n{:.3f}".format(clf5.score(X_train, y_train)))
print("Model testing score with C=10:\n{:.3f}".format(clf5.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=100:\n{:.3f}".format(clf6.score(X_train, y_train)))
print("Model testing score with C=100:\n{:.3f}".format(clf6.score(X_test, y_test)))

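When I run this code, I get certain scores for the different values of the regularization parameter C. When I run the .fit lines again (i.e. train the models again), the scores come out completely different. Sometimes they are even far apart (e.g. 72% vs. 90% for the same value of C).

Where does this variability come from? I assumed that, since I am using the same random_state parameter, it would always find the same support vectors and therefore give me the same results, but since the scores change when I train the models again, that is not the case. In logistic regression, for example, the scores are always consistent no matter how many times I run the fit code again.

An explanation of this variability in the accuracy scores would be very helpful!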

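Yes, this is expected. You need to set the random_state of LinearSVC to a specific seed so that you can reproduce the results. The random_state=42 in your code is passed only to train_test_split, so it fixes the split but not the classifier.

Otherwise, LinearSVC uses its default, random_state=None, so a different random seed is used every time you fit a model, and this is why you get this variability.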
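As a minimal sketch of this point (not part of the original answer), the snippet below fits the same model twice with the same seed and checks that the learned coefficients are identical. The helper name `fit_once` is hypothetical. Setting `dual=True` explicitly is an assumption on my part: according to the scikit-learn documentation, `random_state` only affects the dual coordinate descent solver, and the default for `dual` differs between library versions. The convergence warning is silenced only to keep the output short.

```python
import warnings

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42
)


def fit_once(seed):
    """Fit one LinearSVC and return (coefficients, test accuracy)."""
    with warnings.catch_warnings():
        # On unscaled data the solver may stop at max_iter before converging;
        # the warning is ignored here only to keep the output readable.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        # dual=True is an explicit assumption: random_state is only used
        # by the dual coordinate descent solver.
        clf = LinearSVC(C=1, dual=True, random_state=seed).fit(X_train, y_train)
    return clf.coef_.copy(), clf.score(X_test, y_test)


# Two independent fits with the same seed.
coef_a, score_a = fit_once(seed=42)
coef_b, score_b = fit_once(seed=42)

print("identical coefficients:", np.array_equal(coef_a, coef_b))
print("test scores: {:.3f} and {:.3f}".format(score_a, score_b))
```

Passing `seed=None` to the same helper would leave the solver unseeded, which is the situation described in the question.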


Use:

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, 
                                                    stratify=cancer.target, random_state=42)
clf2 = LinearSVC(C=0.01, random_state=42).fit(X_train, y_train)
clf3 = LinearSVC(C=0.1, random_state=42).fit(X_train, y_train)
clf4 = LinearSVC(C=1,   random_state=42).fit(X_train, y_train)
clf5 = LinearSVC(C=10,  random_state=42).fit(X_train, y_train)
clf6 = LinearSVC(C=100, random_state=42).fit(X_train, y_train)
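
The five fits above could also be written as a loop, which keeps the seed in one place. This is only a sketch: the function name `scores_for_c_values` and its `seed` parameter are hypothetical, and `dual=True` is again set explicitly as an assumption so that `random_state` is actually used by the solver.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42
)


def scores_for_c_values(c_values, seed=42):
    """Return {C: (training accuracy, test accuracy)} for each value of C."""
    results = {}
    for c in c_values:
        with warnings.catch_warnings():
            # Ignore non-convergence warnings to keep the output short.
            warnings.simplefilter("ignore", category=ConvergenceWarning)
            clf = LinearSVC(C=c, dual=True, random_state=seed)
            clf.fit(X_train, y_train)
        results[c] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    return results


c_grid = [0.01, 0.1, 1, 10, 100]
first_run = scores_for_c_values(c_grid)
second_run = scores_for_c_values(c_grid)

for c, (train_score, test_score) in first_run.items():
    print("C={}: train {:.3f}, test {:.3f}".format(c, train_score, test_score))

# With a fixed seed, repeated runs should give the same scores.
print("runs match:", first_run == second_run)
```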