Is there a way to extend the training set initially passed to an SVC?

Question

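I want to simulate active learning in Python. I have an initial training set and a pool of unlabeled potential training data. Now I want to iteratively pick one element from the pool, add it to the training set passed to the SVC, and retrain the SVC on the new set. I am not sure how to do this properly. I could do (pseudocode):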

for i in range(100):
    linearSVC = svm.SVC(kernel='linear', probability=True)
    linearSVC.fit(X_train, y_train)
    addElementToXtrainSetAndYtrainSet()

Or:

linearSVC = svm.SVC(kernel='linear', probability=True)
for i in range(100):
    linearSVC.fit(X_train, y_train)
    addElementToXtrainSetAndYtrainSet()

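The first one definitely works for me: each iteration trains a new SVC on the iteratively enlarged training data. But reinitializing the SVC over and over feels wrong.

As for the second approach, I am not sure whether the SVC is retrained from scratch or keeps its state from the previous iteration and retrains on top of it. I don't want the latter. If that is what happens, though, I would expect there to be an option for adding one more element to the old state without passing the whole training data again.

But I don't know how .fit behaves under the hood, and I couldn't find such an option either. Is there a "nice" way to solve my problem?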

Answer 1

What you want to accomplish is basically the principle of stochastic gradient descent. I therefore suggest using scikit-learn's SGDClassifier. From its documentation:

This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). SGD allows minibatch (online/out-of-core) learning via the partial_fit method.

Depending on the specified loss function, a different model is fitted. It defaults to hinge loss, which is equivalent to a linear SVM.
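To make the idea concrete, here is a minimal pure-Python sketch of a single SGD update on L2-regularized hinge loss. It is a simplification for illustration, not scikit-learn's actual implementation: the function name `hinge_sgd_step`, the constant learning rate `lr`, and the penalty strength `alpha` are all assumptions (SGDClassifier uses a decreasing learning-rate schedule by default).

```python
def hinge_sgd_step(w, b, x, y, lr=0.1, alpha=0.0001):
    """One SGD step on L2-regularized hinge loss for a single sample.

    w: list of weights, b: bias, x: list of features, y: label in {-1, +1}.
    """
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    # The L2 penalty shrinks the weights slightly on every step.
    w = [wi * (1.0 - lr * alpha) for wi in w]
    if margin < 1.0:
        # The sample is misclassified or inside the margin: move toward it.
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b


# Feed three made-up samples to the model, one at a time.
w, b = [0.0, 0.0], 0.0
samples = [([2.0, 1.0], 1), ([-1.0, -2.0], -1), ([1.5, 0.5], 1)]
for x, y in samples:
    w, b = hinge_sgd_step(w, b, x, y)
```

Each call only nudges the existing weights, which is why a new sample can be incorporated without revisiting the earlier ones.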

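In your case, you would train an SGDClassifier with hinge loss on the initial training data using fit(), then update the model with elements of the potential training data, one at a time, using partial_fit():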

from sklearn.linear_model import SGDClassifier

# hinge is the default loss anyway; it is spelled out here for clarity
linearSVC = SGDClassifier(loss='hinge')

# Fit on the initial training set
linearSVC.fit(X_train, y_train)

# Update the model one sample at a time.
# partial_fit expects a 2D X and a 1D y, so slice rather than index.
for i in range(100):
    linearSVC.partial_fit(X_pool[i:i + 1], y_pool[i:i + 1])

This will work as expected. For reference, you can also look at this answer, which clarifies that

[...] when fitting new data to your model, partial_fit will only correct the model one step towards the new data [...]
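For an active-learning simulation, the sample to add is usually chosen by a query strategy rather than taken in pool order. The following is a sketch under stated assumptions: the data comes from `make_classification` as a stand-in for your own sets, and the strategy is plain uncertainty sampling (the pool sample closest to the decision boundary), which is just one common choice and not something the question prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (assumption): 20 labeled samples plus a pool of 180.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, y_train = X[:20], y[:20]
X_pool, y_pool = X[20:], y[20:]

clf = SGDClassifier(loss="hinge", random_state=0)
clf.fit(X_train, y_train)

for _ in range(100):
    # Uncertainty sampling: pick the pool sample closest to the decision boundary.
    idx = int(np.argmin(np.abs(clf.decision_function(X_pool))))
    # partial_fit expects a 2D X and a 1D y, so slice instead of indexing.
    clf.partial_fit(X_pool[idx:idx + 1], y_pool[idx:idx + 1])
    # Remove the queried sample so it is not selected again.
    X_pool = np.delete(X_pool, idx, axis=0)
    y_pool = np.delete(y_pool, idx)
```

Because each partial_fit call takes only one step toward the new sample, the result will generally differ from refitting a fresh model on the full enlarged training set.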

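One last point, since you passed the probability=True parameter in your example: note that this classifier supports the predict_proba() function only for log loss and modified Huber loss. You may therefore not be able to predict class probabilities.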
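A quick way to check this is to compare two SGDClassifier instances that differ only in their loss. The sketch below uses a tiny made-up dataset and picks `modified_huber` as the probability-capable loss; whether that loss suits your problem is a separate question.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Tiny made-up dataset, for illustration only.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

hinge_clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)
huber_clf = SGDClassifier(loss="modified_huber", random_state=0).fit(X, y)

# predict_proba is only exposed when the loss supports it.
hinge_has_proba = hasattr(hinge_clf, "predict_proba")
huber_has_proba = hasattr(huber_clf, "predict_proba")

# One row per sample, one column per class.
proba = huber_clf.predict_proba(X)
```

With hinge loss you can still rank samples by confidence using decision_function(), which returns signed distances to the decision boundary rather than probabilities.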

scikit-learn

svm