How does this split of train and evaluation data ensure there is no overlap?
I am reading the sentiment classification tutorial from TensorFlow:
https://www.tensorflow.org/tutorials/keras/text_classification
It splits the data into training and validation sets with the following code:
import tensorflow as tf

batch_size = 32
seed = 42

# Training subset: 80% of the files under aclImdb/train
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

# Validation subset: the remaining 20%, using the same seed
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
Shouldn't a single call to text_dataset_from_directory produce both sets? If it is called twice, is it guaranteed that the two splits do not overlap?
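You need to either set a seed or set shuffle=False to ensure there is no overlap between the two subsets. Here is what happens under the hood:
When a subset (training or validation) is requested, the seed and shuffle arguments are checked (Source)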
if validation_split and shuffle and seed is None:
    raise ValueError(
        'If using `validation_split` and shuffling the data, you must provide '
        'a `seed` argument, to make sure that there is no overlap between the '
        'training and validation subset.')
Then only the requested portion of the data is kept. (Source)
num_val_samples = int(validation_split * len(samples))
if subset == 'training':
    print('Using %d files for training.' % (len(samples) - num_val_samples,))
    samples = samples[:-num_val_samples]
    labels = labels[:-num_val_samples]
elif subset == 'validation':
    print('Using %d files for validation.' % (num_val_samples,))
    samples = samples[-num_val_samples:]
    labels = labels[-num_val_samples:]
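In the last snippet, the samples and labels are restricted to either the training or the validation set. Because you specified a seed, the files are shuffled into the same order on both calls. The training subset takes everything except the last num_val_samples entries of that ordering and the validation subset takes exactly those last entries, so the two slices cannot overlap.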
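To see why the same seed gives disjoint subsets, here is a minimal sketch that mimics the shuffle-then-slice logic using only the Python standard library. The helper `split_subset` and the file names are hypothetical and are not part of Keras; the real function shuffles the file paths with its own random generator, but the principle is assumed to be the same.

```python
import random


def split_subset(samples, validation_split, subset, seed):
    """Hypothetical helper: shuffle with a fixed seed, then slice."""
    samples = list(samples)
    # The same seed produces the same permutation on every call.
    random.Random(seed).shuffle(samples)
    num_val_samples = int(validation_split * len(samples))
    if subset == 'training':
        # Everything except the last num_val_samples entries.
        return samples[:-num_val_samples]
    if subset == 'validation':
        # Only the last num_val_samples entries.
        return samples[-num_val_samples:]
    raise ValueError("subset must be 'training' or 'validation'")


# Made-up file names standing in for the review files on disk.
files = ['review_%d.txt' % i for i in range(10)]

# Two separate calls, as in the tutorial, with the same seed.
train = split_subset(files, 0.2, 'training', seed=42)
val = split_subset(files, 0.2, 'validation', seed=42)

overlap = set(train) & set(val)
print(len(train), len(val), overlap)  # 8 2 set()
```

If the two calls used different seeds, each would slice a different permutation, and nothing would prevent the same file from landing in both subsets. That is the situation the ValueError check guards against.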