How does this split of train and evaluation data ensure there is no overlap?

I am reading the sentiment classification tutorial from TensorFlow:

https://www.tensorflow.org/tutorials/keras/text_classification

It splits the data into training and validation sets with the following code:

import tensorflow as tf

batch_size = 32
seed = 42

# Training subset: 80% of the files under aclImdb/train
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

# Validation subset: the remaining 20%, using the same seed
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

Shouldn't a single call to text_dataset_from_directory produce both sets? And if it is called twice, is it guaranteed that the two splits do not overlap?

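You need to either set a seed or set shuffle=False to make sure the two sets do not overlap. Here is what happens under the hood:

When a subset (training or validation) is requested, the seed and shuffle arguments are checked (Source)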

if validation_split and shuffle and seed is None:
    raise ValueError(
        'If using `validation_split` and shuffling the data, you must provide '
        'a `seed` argument, to make sure that there is no overlap between the '
        'training and validation subset.')

Then the data is split, with the last portion held out for validation. (Source)

num_val_samples = int(validation_split * len(samples))
if subset == 'training':
    print('Using %d files for training.' % (len(samples) - num_val_samples,))
    samples = samples[:-num_val_samples]
    labels = labels[:-num_val_samples]
elif subset == 'validation':
    print('Using %d files for validation.' % (num_val_samples,))
    samples = samples[-num_val_samples:]
    labels = labels[-num_val_samples:]

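In that last snippet, the samples and labels are restricted to either the training or the validation subset. Because you specified a seed, the files are shuffled into the same order in both calls. The training call keeps everything except the last num_val_samples entries and the validation call keeps exactly those last num_val_samples entries, so the two slices of the same ordering cannot overlap.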
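To make the argument concrete, here is a minimal sketch of the shuffle-then-slice idea in plain Python. It is a simplified model, not the Keras implementation: the helper split_subset and the file names are hypothetical, and it uses the standard library random module instead of whatever random generator Keras uses internally.

```python
import random


def split_subset(samples, validation_split, subset, seed):
    """Hypothetical helper that mimics shuffle-then-slice (not the Keras API)."""
    shuffled = list(samples)
    # The same seed produces the same permutation on every call.
    random.Random(seed).shuffle(shuffled)
    num_val_samples = int(validation_split * len(shuffled))
    if subset == 'training':
        # Everything except the last num_val_samples entries.
        return shuffled[:-num_val_samples]
    if subset == 'validation':
        # Only the last num_val_samples entries.
        return shuffled[-num_val_samples:]
    raise ValueError("subset must be 'training' or 'validation'")


# Made-up file names standing in for the review files.
files = ['review_%d.txt' % i for i in range(10)]

train = split_subset(files, 0.2, 'training', seed=42)
val = split_subset(files, 0.2, 'validation', seed=42)

# Both calls saw the same ordering, so the slices are complementary.
overlap = set(train) & set(val)
print(len(train), len(val), overlap)
```

With the same seed in both calls the intersection is empty and the two subsets together cover every file. If the two calls used different seeds, each would slice a different ordering, and the same file could land in both subsets, which is why the library insists on a seed when shuffling.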