Predict classes of test data using k-fold cross-validation in sklearn
I am working on a data-mining project and am using the sklearn package in Python to classify my data.
To train my data and assess the quality of the predicted values, I am using the sklearn.cross_validation.cross_val_predict function.
However, when I try to run my model on the test data, it asks for the ground-truth classes, which are not available.
I have seen a (possible) workaround that uses the sklearn.grid_search.GridSearchCV function, but I am reluctant to use that approach for a fixed set of parameters.
Going through the sklearn.cross_validation documentation, I came across the cross_val_score function. Since I am quite new to the world of classification problems, I am not sure whether this function can solve my problem.
Any help would be great!
Thanks!
EDIT:
Hi! I get the impression that my original query was rather vague. I will try to elaborate on what I am doing. Here it is:
I have generated 3 numpy.ndarrays X, X_test and y with nrows = 10158, 22513 and 10158, corresponding to my training data, test data and the class labels of the training data, respectively.
Thereafter, I run the code below:
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_predict

clf = SVC()
# cross_val_predict returns cross-validated predictions for every sample in X
testPred = cross_val_predict(clf, X, y, cv=2)
This works fine, and I can then use testPred and y as mentioned in the tutorials.
However, I wish to predict the classes of X_test. The error message is self-explanatory; it says:
ValueError: Found arrays with inconsistent numbers of samples: [10158 22513]
The workaround I am currently using (I do not know whether it is a workaround or the only way) is:
from sklearn import grid_search

# thereafter I create the parameter grid (grid) and an appropriate scoring function (scorer)
model = grid_search.GridSearchCV(estimator=clf, param_grid=grid, scoring=scorer,
                                 refit=True, cv=2, n_jobs=-1)
model.fit(X, y)
model.best_estimator_.fit(X, y)
testPred = model.best_estimator_.predict(X_test)
This technique works fine for the time being; however, I would sleep much better if I did not have to use the GridSearchCV function.
IIUC, you're conflating different things.
Suppose you have a classifier with a given scheme. You can then train it on some data and predict labels for (usually other) data. This is quite simple, and looks like this.
First we build the predictor and fit it.
from sklearn import svm, grid_search, datasets
from sklearn.cross_validation import train_test_split
iris = datasets.load_iris()
clf = svm.SVC()
train_x, test_x, train_y, test_y = train_test_split(iris.data, iris.target)
>> clf.fit(train_x, train_y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
Now that it is completely constructed, you can use it to predict.
>> clf.predict(test_x)
array([1, 0, 0, 2, 0, 1, 1, 1, 0, 2, 2, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 2, 0,
       1, 0, 2, 0, 2, 1, 2, 1, 2, 2, 2, 1, 0, 0, 0])
It's as simple as that.
What has happened here?
The classifier has a completely specified scheme - it just needs to tune its parameters
The classifier tunes its parameters given the train data
The classifier is ready to predict
In many cases, the classifier has a scheme that it needs to tune using parameters, but it also has meta-parameters. An example is the degree argument for your classifier.
How should you tune them? There are a number of ways.
Don't; just stick with the defaults (that's what my example did)
Use some form of cross-validation (e.g., grid search; see the sketch after this list)
Use some measure of complexity, e.g., AIC, BIC, etc.
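To make the grid-search option concrete, here is a minimal sketch of tuning the SVC on the iris data from above. The particular grid values for C and gamma are illustrative choices of mine, not something prescribed by the problem:

from sklearn import svm, grid_search, datasets
from sklearn.cross_validation import train_test_split

iris = datasets.load_iris()
train_x, test_x, train_y, test_y = train_test_split(iris.data, iris.target)

# Candidate meta-parameter values (illustrative choices, not canonical)
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]}

# Cross-validation on the training data picks the best combination;
# refit=True then retrains that winner on all of the training data
model = grid_search.GridSearchCV(svm.SVC(), param_grid, cv=5, refit=True)
model.fit(train_x, train_y)

print(model.best_params_)     # the chosen meta-parameters
print(model.predict(test_x))  # predictions from the tuned classifier

Note that model.predict already delegates to the refitted best estimator, so a separate best_estimator_.fit call (as in the question's workaround) is not needed.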
So it's important not to mix these things up. Cross-validation is not some trick to get a predictor for the test data; the predictor with the default arguments can already do that. Cross-validation is for tuning meta-parameters. Once you choose them, you tune the parameters, and then you have a different predictor.
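To tie this back to the cross_val_score function mentioned in the question: it does not produce predictions at all; it only reports how well a given, fixed classifier setting performs under cross-validation, which is useful for comparing meta-parameter choices by hand. A minimal sketch, again on iris:

from sklearn import svm, datasets
from sklearn.cross_validation import cross_val_score

iris = datasets.load_iris()

# Score one fixed setting via 5-fold cross-validation; this evaluates
# the setting on the training data, it does not predict any test set
scores = cross_val_score(svm.SVC(C=1.0), iris.data, iris.target, cv=5)
print(scores.mean())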