Leave one out cross validation using Sklearn
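I am trying to test my classifier with cross-validation using Sklearn.
I have 3 classes and a total of 50 samples.
- Class 1: 5 samples
- Class 2: 15 samples
- Class 3: 30 samples
The following runs as expected, presumably performing 5-fold cross-validation.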
result = cross_validation.cross_val_score(classifier, X, y, cv=5)
I am trying to do leave-one-out by using cv=50 folds, so I do the following:
result = cross_validation.cross_val_score(classifier, X, y, cv=50)
Surprisingly, however, it gives the following warnings and error:
/Library/Python/2.7/site-packages/sklearn/cross_validation.py:413: Warning: The least populated class in y has only 5 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=50.
% (min_labels, self.n_folds)), Warning)
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_methods.py:55: RuntimeWarning: Mean of empty slice.
warnings.warn("Mean of empty slice.", RuntimeWarning)
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/_methods.py:67: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
File "b.py", line 96, in <module>
scores1 = cross_validation.cross_val_score(classifier, X, y, cv=50)
File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1151, in cross_val_score
for train, test in cv)
File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
self.dispatch(function, args, kwargs)
File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
self.results = func(*args, **kwargs)
File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1240, in _fit_and_score
test_score = _score(estimator, X_test, y_test, scorer)
File "/Library/Python/2.7/site-packages/sklearn/cross_validation.py", line 1296, in _score
score = scorer(estimator, X_test, y_test)
File "/Library/Python/2.7/site-packages/sklearn/metrics/scorer.py", line 176, in _passthrough_scorer
return estimator.score(*args, **kwargs)
File "/Library/Python/2.7/site-packages/sklearn/base.py", line 291, in score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
File "/Library/Python/2.7/site-packages/sklearn/neighbors/classification.py", line 147, in predict
neigh_dist, neigh_ind = self.kneighbors(X)
File "/Library/Python/2.7/site-packages/sklearn/neighbors/base.py", line 332, in kneighbors
return_distance=return_distance)
File "binary_tree.pxi", line 1307, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn/neighbors/kd_tree.c:10506)
File "binary_tree.pxi", line 226, in sklearn.neighbors.kd_tree.get_memview_DTYPE_2D (sklearn/neighbors/kd_tree.c:2715)
File "stringsource", line 247, in View.MemoryView.array_cwrapper (sklearn/neighbors/kd_tree.c:24789)
File "stringsource", line 147, in View.MemoryView.array.__cinit__ (sklearn/neighbors/kd_tree.c:23664)
ValueError: Invalid shape in axis 0: 0.
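Another odd thing is that with cv=5 I get no warning, but with cv=50 I get the warning above. I assumed that a larger cv, although computationally more expensive, should give a more accurate result. Is there a gap in my reasoning? Why do I get the warnings and the error?
What is the proper way to do leave-one-out cross-validation in this case?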
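By default, cv=5 for classification does stratified 5-fold cross-validation.
That means it tries to keep the proportion of samples from each class the same in every fold. This can cause trouble when the number of folds equals the number of samples: here the smallest class has only 5 members, so it cannot be spread across 50 folds, which is what the warning is complaining about.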
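One way to avoid the stratification problem is to pass an explicit leave-one-out splitter instead of an integer. The following is a minimal sketch, not the asker's actual code. It assumes a recent scikit-learn where the splitters live in `sklearn.model_selection` (the `sklearn.cross_validation` module in the traceback is the older location), it uses synthetic data with the same class sizes as the question, and it picks `KNeighborsClassifier` only because the traceback passes through `sklearn/neighbors`.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the question's data: 3 classes with 5 / 15 / 30 samples.
rng = np.random.RandomState(0)
y = np.repeat([0, 1, 2], [5, 15, 30])
X = rng.normal(size=(50, 4)) + y[:, None]  # shift each class so it is learnable

# Hypothetical classifier choice; any scikit-learn classifier would do here.
classifier = KNeighborsClassifier(n_neighbors=3)

# LeaveOneOut does not stratify, so class sizes do not limit the number of folds.
loo = LeaveOneOut()
scores = cross_val_score(classifier, X, y, cv=loo)

# Each fold tests exactly one sample, so every per-fold accuracy is 0.0 or 1.0;
# the mean over all folds is the leave-one-out accuracy estimate.
n_folds = len(scores)
loo_accuracy = scores.mean()
print(n_folds, loo_accuracy)
```

Because every test fold holds a single sample, the individual scores are not informative on their own; only their mean is.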
Which version are you on?
The error message is certainly not very helpful.
By the way, in general I would recommend using StratifiedShuffleSplit for such a small data set.
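A rough sketch of the StratifiedShuffleSplit suggestion follows, under the same assumptions as before: a recent scikit-learn with `sklearn.model_selection`, synthetic data mirroring the 5 / 15 / 30 class sizes, and a `KNeighborsClassifier` chosen for illustration. The values `n_splits=20` and `test_size=0.2` are arbitrary picks, not recommendations from the answer.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the question's data: 3 classes with 5 / 15 / 30 samples.
rng = np.random.RandomState(0)
y = np.repeat([0, 1, 2], [5, 15, 30])
X = rng.normal(size=(50, 4)) + y[:, None]

# Hypothetical classifier choice for illustration.
classifier = KNeighborsClassifier(n_neighbors=3)

# 20 random train/test splits; each test set holds 20% of the data (10 samples)
# and preserves the class proportions, i.e. 1 / 3 / 6 samples per class.
sss = StratifiedShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
scores = cross_val_score(classifier, X, y, cv=sss)

# Check that the smallest class is represented in every test split.
classes_per_split = [len(np.unique(y[test])) for _, test in sss.split(X, y)]
print(scores.mean(), scores.std())
```

Unlike leave-one-out, the number of splits here is independent of the number of samples, and every test set contains all three classes.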
[edit]: The current version gives a warning where it should raise an error:
sklearn/cross_validation.py:399: Warning: The least populated class in y has only 13 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=68.
% (min_labels, self.n_folds)), Warning)