StratifiedKFold: size of the train and validation sets in each split

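I am using StratifiedKFold, but I am not sure what the train and test sizes returned by kfold.split are in the code below. Assuming print(array.shape) returns (12904, 47), i.e. 12904 rows and 47 columns, what are the train and test sizes?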

from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold

# X and Y are assumed to be NumPy arrays and model an estimator, all defined elsewhere
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

for train, validation in kfold.split(X, Y):
    # Fit the model on the training indices of this split
    model.fit(X[train], Y[train])
    # Predict class labels for the training set
    predicted = model.predict(X[train])

    predicted_report = classification_report(Y[train], predicted)
    print(predicted_report)
    # accuracy: (tp + tn) / (p + n)
    accuracy = accuracy_score(Y[train], predicted)

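As already hinted in the comments, in each split your training set will be (n_splits-1)/n_splits of the initial data and your validation set will be 1/n_splits of it, i.e. 4/5 and 1/5 respectively here. For 12904 rows that is roughly 10323 training rows and 2581 validation rows per split; since 12904 is not divisible by 5, the exact counts can differ by a row or so between splits.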
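The arithmetic can be sketched in a few lines of plain Python. The helper name approx_fold_sizes is made up for this illustration, and it assumes the remainder rows are handed to the first folds one at a time, which is how an even split is usually done. StratifiedKFold balances each class separately, so its actual counts may deviate by a row or so depending on the class distribution.

```python
def approx_fold_sizes(n_samples, n_splits):
    """Return approximate (train_size, validation_size) pairs, one per split.

    Assumption: the n_samples % n_splits leftover rows go to the first folds,
    one extra row each. This is an estimate, not StratifiedKFold's exact logic.
    """
    base, extra = divmod(n_samples, n_splits)
    val_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    return [(n_samples - v, v) for v in val_sizes]


for train_size, val_size in approx_fold_sizes(12904, 5):
    print(train_size, val_size)
# 10323 2581
# 10323 2581
# 10323 2581
# 10323 2581
# 10324 2580
```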

Here is a simple reproducible demonstration using the iris data and n_splits=5, as in your case:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
print(X.shape) # initial dataset size
# (150, 4)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)

The result is:

(120, 4) (30, 4)
(120, 4) (30, 4)
(120, 4) (30, 4)
(120, 4) (30, 4)
(120, 4) (30, 4)

So, to check this on your own data, simply add the print statement above inside your for loop.
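If the real data is not at hand, the same check can be tried on stand-in data of the same shape. In the sketch below, the random features and the two-class labels are assumptions made only for illustration (the question does not say how many classes Y has); only the shape (12904, 47) and the StratifiedKFold settings come from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the asker's data: same shape, made-up values.
rng = np.random.default_rng(0)
X = rng.normal(size=(12904, 47))
y = rng.integers(0, 2, size=12904)  # assumed binary labels, for illustration only

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

# Collect (train_size, validation_size) for every split.
sizes = [(len(train), len(validation)) for train, validation in kfold.split(X, y)]

for train_size, val_size in sizes:
    # The two parts of each split always add up to the full dataset.
    print(train_size, val_size, train_size + val_size)
```

Each printed row should show close to 4/5 of the rows in the training part and close to 1/5 in the validation part, with the third column always equal to 12904.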