StratifiedKFold split train and validation set size
I am using StratifiedKFold, but I am not sure what train and test sizes kfold.split returns in the code below. Assuming print(array.shape) returns (12904, 47), i.e. 12904 rows and 47 columns, what are the train and test sizes?
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold

# X, Y (NumPy arrays) and model are defined earlier
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, Y):
    # Fit the model on the training indices of this fold
    model.fit(X[train], Y[train])
    # Predict class labels for the training set
    predicted = model.predict(X[train])
    predicted_report = classification_report(Y[train], predicted)
    print(predicted_report)
    # accuracy: (tp + tn) / (p + n)
    accuracy = accuracy_score(Y[train], predicted)
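As already hinted in the comments, your training set size will be (n_splits-1)/n_splits of the initial data size, and your validation set size will be 1/n_splits of it, i.e. 4/5 and 1/5 respectively here.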
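Applied to the 12904-row array from the question, those fractions come to roughly 10323 training rows and 2581 validation rows per fold; since 12904 is not divisible by 5, individual folds can differ by a row or so. A minimal sketch to check this, assuming random placeholder data with binary labels (the real X and Y are not shown in the question, so the values and the number of classes are made up):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data with the shape from the question. The values and the
# binary labels are assumptions; only the row count matters for fold sizes.
rng = np.random.default_rng(0)
X = rng.normal(size=(12904, 47))
Y = rng.integers(0, 2, size=12904)

n_splits = 5
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=8)

# Collect (training size, validation size) for every fold
fold_sizes = [(len(train), len(validation)) for train, validation in kfold.split(X, Y)]
for train_size, validation_size in fold_sizes:
    print(train_size, validation_size)

# Sizes implied by the fractions (n_splits-1)/n_splits and 1/n_splits
print(len(X) * (n_splits - 1) / n_splits, len(X) / n_splits)
```

Every row lands in exactly one validation fold, so the five validation sizes add up to 12904.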
Here is a simple reproducible demonstration using the iris data and n_splits=5, as in your case:
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
print(X.shape) # initial dataset size
# (150, 4)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)
for train, validation in kfold.split(X, y):
    print(X[train].shape, X[validation].shape)
The result is:
(120, 4) (30, 4)
(120, 4) (30, 4)
(120, 4) (30, 4)
(120, 4) (30, 4)
(120, 4) (30, 4)
So, to check your own data, just add the print statement above inside your for loop.
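If you also want to see how the classes are distributed inside each fold, the same loop can count labels per validation fold. This is only a sketch on the iris data again; the variable name validation_class_counts is made up for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=8)

validation_class_counts = []
for fold, (train, validation) in enumerate(kfold.split(X, y)):
    # Number of samples of each of the 3 iris classes in this validation fold
    counts = np.bincount(y[validation], minlength=3)
    validation_class_counts.append(counts.tolist())
    print(fold, len(train), len(validation), counts)
```

Iris has 50 samples per class, so each validation fold of 30 rows should hold 10 samples of each class.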