Select a random subset of data
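I was given a dataset that has already been split into training and validation (test) data. I need to split the training data further into a proper training set and a calibration set, and I don't want to touch my current validation (test) set. I don't have access to the original dataset.
I'd like to do this randomly, so that each time I run my script I get a different training and calibration set. I know about the .sample() function, but my training dataset has 44000 rows.
Original dataset: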
training = dataset.loc[dataset['split']== 'train']
print("Training Created")
#print(training.head())
validation = dataset.loc[dataset['split']== 'valid']
print("Validation Created")
#print(validation.head())
What I need is something like this:
# proper training set
x_train = breast_cancer.values[:-100, :-1]
y_train = breast_cancer.values[:-100, -1]
# calibration set
x_cal = breast_cancer.values[-100:-1, :-1]
y_cal = breast_cancer.values[-100:-1, -1]
# (x_k+1, y_k+1)
x_test = breast_cancer.values[-1, :-1]
y_test = breast_cancer.values[-1, -1]
I'm not sure how to go about the second split.
Example of the dataset:
Object  | Variable | Split
Cancer1 | 55       | Train
Cancer5 | 45       | Train
Cancer2 | 56       | Valid
Cancer3 | 68       | Valid
Cancer4 | 75       | Valid
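It looks like you already have a column that assigns each row to the train or validation set. The usual approach is sklearn.model_selection.train_test_split. So, to further split your training data into training and "calibration" sets, just use it on the training set (note that you have to split it into X and y first):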
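from sklearn.model_selection import train_test_split

# initial split into train/test (match the column name and labels to your data)
train = df.loc[df['Split'] == 'train']
test = df.loc[df['Split'] == 'valid']
# split the test set into features and target
# (.iloc selects by position; this assumes the target is the last column)
X_test = test.iloc[:, :-1]
y_test = test.iloc[:, -1]
# same with the train set
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1]
# split the training data into proper training and calibration sets;
# rows are shuffled by default, so each run gives a different split
X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train)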
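To control how many rows end up in the calibration set, train_test_split accepts test_size either as a fraction or as an absolute row count. The following is only a minimal sketch on made-up data: the column names feature_a, feature_b and target, and the sizes used, are assumptions for illustration and not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: two features, a target, a split column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "target": rng.integers(0, 2, size=200),
    "split": ["train"] * 150 + ["valid"] * 50,
})

train = df.loc[df["split"] == "train"]
test = df.loc[df["split"] == "valid"]  # existing validation set, left untouched

# Select features and target by name (assumed column names).
X = train[["feature_a", "feature_b"]]
y = train["target"]

# An integer test_size is an absolute number of calibration rows.
# random_state is left unset, so every run produces a different split.
X_train, X_calib, y_train, y_calib = train_test_split(X, y, test_size=30)
```

Passing random_state=some_integer would make the split reproducible, which is the opposite of what the question asks for, so it is left out here.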
1. Separate the test set from the whole dataset.
2. Then split the remaining data into training and calibration sets.
from sklearn.model_selection import train_test_split
# define the test set
X_test = breast_cancer.values[-1, :-1]
y_test = breast_cancer.values[-1, -1]
# Get the remaining dataset
X = breast_cancer.values[:-1, :-1]
y = breast_cancer.values[:-1, -1]
# Split the remaining dataset into train and calibration sets.
X_train, X_calib, y_train, y_calib = train_test_split(X, y)
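Since the question mentions .sample(), the same split can also be sketched with pandas alone; 44000 rows is not a problem for it. This is an illustration under assumptions: the DataFrame below is synthetic, the 20% calibration fraction is arbitrary, and the approach relies on the index being unique.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 44000-row training set (assumed columns).
rng = np.random.default_rng(0)
training = pd.DataFrame({
    "feature": rng.normal(size=44000),
    "target": rng.integers(0, 2, size=44000),
})

# Draw 20% of the rows at random for calibration.
# No random_state is passed, so the selection differs on every run.
calibration = training.sample(frac=0.2)

# Everything that was not drawn stays in the proper training set.
# This requires a unique index, which the default RangeIndex provides.
proper_training = training.drop(calibration.index)
```

If the training rows come from filtering a larger DataFrame, their index is still unique as long as the original one was, so the drop step works unchanged.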