What should be passed as input when calling train_test_split twice in Python 3.6?
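Basically, I want to split my dataset into training, test, and validation sets, so I call train_test_split twice. My dataset has roughly 10 million rows.
The first split divides the data 70/30, which is about 7 million training rows and 3 million test rows. To get the validation set, I'm not sure whether the second call to train_test_split should take the test portion or the training portion as its input. Any advice? Thanks in advance.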
from sklearn.model_selection import train_test_split
X = features
y = target
# Goal: 70% training, 15% test, 15% validation.
# First split: features and labels into 70% train / 30% held out.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Second split: halve the 30% held-out part into test and validation (15% / 15%).
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
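Don't make the test set too small. A 20% test set is fine. It would be better to split the training data further into training and validation sets (80%/20% is a fair split). With that in mind, you should change your code like this: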
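# First split: hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Second split: take the validation set out of the training data, leaving the test set untouched.
# test_size=0.25 of the remaining 80% is 20% of the full dataset, so the result is 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)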
Splitting a dataset this way is common practice.
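To check the resulting proportions, here is a minimal sketch that wraps the two calls in a helper and runs it on synthetic data. The helper name `three_way_split` and its parameters `test_frac`, `val_frac`, and `seed` are hypothetical, not part of scikit-learn. The sketch passes integer row counts as `test_size`, which `train_test_split` interprets as absolute sample counts, so the sizes do not depend on floating-point rounding.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, test_frac=0.2, val_frac=0.2, seed=0):
    """Split X, y into train/validation/test.

    Hypothetical helper: test_frac and val_frac are fractions of the
    FULL dataset, not of the remainder after the first split.
    """
    n = len(X)
    n_test = int(round(test_frac * n))
    n_val = int(round(val_frac * n))

    # First split: hold out the test set (an int test_size is a row count).
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=seed
    )
    # Second split: carve the validation set out of the remaining rows,
    # so the test set is never touched again.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in for the real features/target (assumption: 1000 rows).
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Either portion can be split a second time, as long as the three final sets do not overlap. Splitting the training portion keeps the test set fixed from the first call onward, which is the approach the answer recommends.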