ShuffleSplit equivalent in scikit

Question

What is the equivalent of sklearn.model_selection.ShuffleSplit that guarantees that all the folds are different?

Answer 1

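While KFold guarantees that the test indices never overlap, it ties the number of times the model is evaluated to the percentage of samples used for the test set (i.e. n_splits and test_size). So if you want to test on 10% of the data, you have to train and evaluate your model 10 times. You cannot, for example, repeat only 3 times to save time.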
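The trade-off described above can be checked with a short sketch. It assumes 150 dummy samples and fixed seeds (both chosen here only for illustration) and counts how often a sample index reappears across test folds for each splitter.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.arange(150).reshape(-1, 1)  # 150 dummy samples (assumed size)

# KFold: a 10% test set forces exactly 10 splits
kf = KFold(n_splits=10, shuffle=True, random_state=0)
kf_tests = [test for _, test in kf.split(X)]

# ShuffleSplit: any number of splits, but test sets are drawn independently
ss = ShuffleSplit(n_splits=3, test_size=0.1, random_state=0)
ss_tests = [test for _, test in ss.split(X)]

def count_repeats(folds):
    """Count repeated occurrences of sample indices across test folds."""
    all_idx = np.concatenate(folds)
    return len(all_idx) - len(np.unique(all_idx))

kf_repeats = count_repeats(kf_tests)
ss_repeats = count_repeats(ss_tests)
print("KFold:", len(kf_tests), "folds,", kf_repeats, "repeated indices")
print("ShuffleSplit:", len(ss_tests), "folds,", ss_repeats, "repeated indices")
```

KFold always reports zero repeated indices, whereas ShuffleSplit gives no such guarantee: whether its folds overlap depends on the random draw.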

To get the best of both worlds, one possible solution is to subclass sklearn's KFold:

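import itertools
from sklearn.model_selection import KFold

class DSS(KFold):
    """KFold variant that yields only the first n_repeat (disjoint) test folds."""
    def __init__(self, n_repeat=5, test_size=.25, *, shuffle=True,
                 random_state=None):
        # round() instead of int() so float error in 1/test_size cannot drop a fold
        super().__init__(n_splits=round(1 / test_size), shuffle=shuffle,
                         random_state=random_state)
        self.n_repeat = n_repeat

    def get_n_splits(self, X=None, y=None, groups=None):
        # Report the number of folds that split() actually yields
        return min(self.n_repeat, self.n_splits)

    def split(self, X, y=None, groups=None):
        gen_idx = super().split(X, y, groups)
        # Only keep the first n_repeat pairs of index arrays
        return itertools.islice(gen_idx, self.n_repeat)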

Example usage with the Iris dataset:

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
cv = DSS(n_repeat=3, test_size=.1, shuffle=True)
for _, test_idx in cv.split(X, y):
    print(test_idx)

Output (the exact indices vary between runs because no random_state is set):

[ 18  20  25  27  48  50  67  95 110 113 124 125 137 145 147]
[ 29  36  56  58  60  63  68  77 100 106 117 121 129 134 141]
[  4  28  40  42  76  86  90  94  98 102 115 139 142 143 144]

You can of course use this cv in cross_val_score or GridSearchCV just like a regular KFold.
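For a one-off evaluation, a similar effect may be achievable without a subclass, since cross_val_score also accepts an iterable of (train, test) index pairs as cv. The sketch below is only an illustration of that idea: the LogisticRegression estimator, the seed, and the choice of 3 folds out of 10 are assumptions made here, not part of the answer above.

```python
import itertools

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 10 disjoint folds of 10% each, of which only the first 3 are kept
kf = KFold(n_splits=10, shuffle=True, random_state=0)
splits = list(itertools.islice(kf.split(X), 3))

# cv may be an iterable of (train_indices, test_indices) pairs
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=splits)
print(scores)
```

The subclass remains the more convenient option when the same splitter has to be reused or passed around as an object.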

python

sampling

scikit-learn