How can I split data into 3 or more parts with sklearn?
I want to split my data into stratified train, test, and validation sets, but sklearn only provides cross_validation.train_test_split, which splits into just 2 parts.
How should I go about this?
If you want a stratified train/test split, you can use StratifiedKFold in sklearn.
Assuming X is your features and y is your labels, based on the example here:
from sklearn.model_selection import StratifiedKFold

# X and y are assumed to be NumPy arrays so they can be indexed by position
skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
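To make the stratification concrete, here is a minimal sketch on made-up data (the toy arrays, `shuffle=True`, `random_state=0`, and the `fold_class_counts` list are illustrative assumptions, not part of the original answer). It counts the classes in each test fold to show that every fold keeps the overall 2:1 class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 12 samples, two classes in a 2:1 ratio.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_class_counts = []
for train_index, test_index in skf.split(X, y):
    # Count how many samples of each class land in this test fold.
    fold_class_counts.append(np.bincount(y[test_index], minlength=2).tolist())

print(fold_class_counts)
```

Each test fold here should hold two samples of class 0 and one of class 1. Note that StratifiedKFold produces k train/test pairs for cross-validation rather than a single three-way split.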
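Update: to split the data into 3 parts of different sizes, you can use numpy.split(). It cuts the arrays in their existing order, so it neither shuffles nor stratifies; shuffle the rows first if they are sorted. For example: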
import numpy as np
# Cut points at 70% and 80% of the rows: 70% train, 10% test, 20% validate
X_train, X_test, X_validate = np.split(X, [int(.7*len(X)), int(.8*len(X))])
y_train, y_test, y_validate = np.split(y, [int(.7*len(y)), int(.8*len(y))])
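Because numpy.split() slices the arrays in order, data sorted by label would end up with whole classes missing from some parts. A minimal sketch of shuffling first, assuming made-up data and a fixed seed (the names `order`, `cuts`, `X_shuffled`, and `y_shuffled` are illustrative only):

```python
import numpy as np

# Hypothetical toy data: 100 samples, labels sorted by class on purpose.
X = np.arange(200).reshape(100, 2)
y = np.repeat([0, 1], 50)

# Draw one permutation and apply it to both arrays so rows stay aligned.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(X))
X_shuffled, y_shuffled = X[order], y[order]

# Cut points at 70% and 80%: 70% train, 10% test, 20% validate.
cuts = [int(0.7 * len(X)), int(0.8 * len(X))]
X_train, X_test, X_validate = np.split(X_shuffled, cuts)
y_train, y_test, y_validate = np.split(y_shuffled, cuts)

print(len(X_train), len(X_test), len(X_validate))
```

This gives a random split but still not a stratified one, so the class proportions in each part can drift from the overall proportions, especially on small datasets.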
You can also call train_test_split more than once to achieve this. The second time, run it on the training output of the first call to train_test_split.
from sklearn.model_selection import train_test_split

def train_test_validate_stratified_split(features, targets, test_size=0.2, validate_size=0.1):
    # Get test sets
    features_train, features_test, targets_train, targets_test = train_test_split(
        features,
        targets,
        stratify=targets,
        test_size=test_size
    )
    # Run train_test_split again to get train and validate sets
    post_split_validate_size = validate_size / (1 - test_size)
    features_train, features_validate, targets_train, targets_validate = train_test_split(
        features_train,
        targets_train,
        stratify=targets_train,
        test_size=post_split_validate_size
    )
    return features_train, features_test, features_validate, targets_train, targets_test, targets_validate
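The rescaling step validate_size / (1 - test_size) is what keeps the validation set at 10% of the whole dataset: after 20% is held out for testing, 0.1 / 0.8 = 0.125 of the remainder equals 10% of the total. The sketch below repeats the same two calls inline on made-up data to check the resulting sizes and class ratios (the toy arrays, `random_state=0`, and the `features_rest` / `targets_rest` names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples, 80% class 0 and 20% class 1.
features = np.arange(2000).reshape(1000, 2)
targets = np.array([0] * 800 + [1] * 200)

test_size, validate_size = 0.2, 0.1

# First call: hold out 20% of all samples as the test set.
features_rest, features_test, targets_rest, targets_test = train_test_split(
    features, targets, stratify=targets, test_size=test_size, random_state=0
)

# Second call: 0.1 / (1 - 0.2) = 0.125 of the remaining 80% is 10% of the total.
features_train, features_validate, targets_train, targets_validate = train_test_split(
    features_rest, targets_rest, stratify=targets_rest,
    test_size=validate_size / (1 - test_size), random_state=0
)

# Report the size and the share of class 1 in each part.
for name, part in [("train", targets_train), ("test", targets_test), ("validate", targets_validate)]:
    print(name, len(part), part.mean())
```

With these numbers the parts should come out as 700, 200, and 100 samples, each with a 20% share of class 1.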