Select a random subset of data

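I was given a dataset that had already been split into training and validation (test) data. I need to split the training data further into a proper training set and a calibration set, without touching my current validation (test) set. I do not have access to the original dataset.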

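I want the split to be random, so that every time I run my script I get a different training set and calibration set. I know about the .sample() function, but my training dataset has 44,000 rows.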

Original dataset:

training = dataset.loc[dataset['split']== 'train']
print("Training Created")
#print(training.head())

validation = dataset.loc[dataset['split']== 'valid']
print("Validation Created")
#print(validation.head())

What I need is something like this:

# proper training set
x_train = breast_cancer.values[:-100, :-1]
y_train = breast_cancer.values[:-100, -1]
# calibration set
x_cal = breast_cancer.values[-100:-1, :-1]
y_cal = breast_cancer.values[-100:-1, -1]
# (x_k+1, y_k+1)
x_test = breast_cancer.values[-1, :-1]
y_test = breast_cancer.values[-1, -1]

I'm not sure how to handle the second split.

Example of the dataset:

Object  | Variable | Split
Cancer1     55     Train
Cancer5     45     Train
Cancer2     56     Valid
Cancer3     68     Valid
Cancer4     75     Valid

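It looks like you already have a column that assigns each row to the train or validation set. The usual approach is to use sklearn.model_selection.train_test_split. So, to split your training data further into training and "calibration" sets, just apply it to the training set (note that you have to split into X and y first):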

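from sklearn.model_selection import train_test_split

# initial split into train/test (match the column name and labels to your data)
train = df.loc[df['Split'] == 'train']
test = df.loc[df['Split'] == 'valid']

# split the test set into features and target
# (assumes the target is the last column; .iloc is needed for positional slicing)
X_test = test.iloc[:, :-1]
y_test = test.iloc[:, -1]

# same with the train set
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1]

# split into train and calibration sets (rows are shuffled at random by default)
X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train)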
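
As a self-contained sketch of the same idea, the snippet below builds a small synthetic frame and splits its training rows into a proper training set and a calibration set. The column names `feature` and `target`, the row counts, and the 25% calibration fraction are assumptions made for illustration, not taken from the asker's data. Leaving `random_state` unset should give a different split on every run, which is what the question asks for.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: column names and sizes are illustrative assumptions.
rng = np.random.default_rng()
df = pd.DataFrame({
    "feature": rng.normal(size=200),
    "target": rng.integers(0, 2, size=200),
    "split": ["train"] * 160 + ["valid"] * 40,
})

# Only the training rows are split; the validation rows are never touched.
train = df.loc[df["split"] == "train"]

# Select features and target by name rather than by position.
X = train[["feature"]]
y = train["target"]

# test_size=0.25 holds out a quarter of the training rows for calibration.
# random_state is left unset, so each run shuffles differently.
X_train, X_calib, y_train, y_calib = train_test_split(X, y, test_size=0.25)
```

Passing an integer `random_state` instead would make the split reproducible, which can help when debugging.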

1. Separate the test set from the whole dataset.

2. Then take the remaining data and split it into training and calibration sets.

from sklearn.model_selection import train_test_split

# define the test set
X_test = breast_cancer.values[-1, :-1]
y_test = breast_cancer.values[-1, -1]

# Get the remaining dataset 
X = breast_cancer.values[:-1, :-1]
y = breast_cancer.values[:-1, -1]

# Split the remaining dataset into train and calibration sets.
X_train, X_calib, y_train, y_calib = train_test_split(X, y)
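
Since the question mentions `.sample()`, here is a pandas-only sketch of the same split; 44,000 rows should not be an obstacle for it. The frame is synthetic, and the column names and `frac=0.2` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame, sized like the one in the question.
training = pd.DataFrame({
    "feature": np.arange(44000),
    "split": "train",
})

# Draw 20% of the rows at random for calibration.
# No random_state is given, so the selection varies from run to run.
calibration = training.sample(frac=0.2)

# Every row that was not drawn stays in the proper training set.
proper_training = training.drop(calibration.index)
```

This relies on the index being unique, because `drop` removes rows by index label; calling `reset_index(drop=True)` on the training frame first is one way to guarantee that.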