How can I slice a TF dataset so that there's 500 negative examples and 500 positive examples? (IMDB dataset)

Question

I have the following datasets:

import tensorflow as tf

train = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=64, validation_split=0.2,
    subset='training', seed=123)
test = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=64, validation_split=0.2,
    subset='validation', seed=123)

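I'm trying to run BERT on this dataset, but I only want 1000 examples from it (500 positive and 500 negative). Is there a quick and clean way to do this? I'm new to TF datasets, so I'm not sure how to manipulate them...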

Answer 1

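Since your datasets are of type tf.data.Dataset, this is fairly easy. First filter the positive and the negative examples out of the training and validation datasets, then take 500 of each.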

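I would do something like the following, using the IMDB dataset from the tfds package. The same idea applies to your example; I just don't know exactly how your dataset is structured, so I'm assuming it is the same.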

# Import the tensorflow_datasets package.
import tensorflow_datasets as tfds

# Load the IMDB dataset from tfds; you can substitute your own dataset here.
dataset, info = tfds.load('imdb_reviews/plain_text', with_info=True,
                          as_supervised=True, shuffle_files=True)

# Each element is a tuple (x, y): x is the review text and y is the label.
# 1 means positive and 0 means negative.
updated_train_pos = dataset['train'].filter(lambda x, y: y == 1).take(500)
updated_train_neg = dataset['train'].filter(lambda x, y: y == 0).take(500)
train = updated_train_pos.concatenate(updated_train_neg)
# Reshuffle so that each batch can contain both positive and negative samples.
train = train.shuffle(1000, reshuffle_each_iteration=True)

Follow the same steps to prepare the validation dataset.
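
One difference that may matter for the datasets in the question: text_dataset_from_directory yields batches (batch_size=64 there), so a filter like y == 1 would see a whole vector of labels rather than a single one. Below is a minimal sketch of the same filter-and-take idea with an unbatch() step added first. The helper name balanced_subset and the toy data are hypothetical, made up for illustration. It also assumes integer labels where 1 is positive and 0 is negative, which should hold if aclImdb/train contains only the neg and pos folders (class indices follow alphabetical folder order), but please check class_names on your own dataset.

```python
import tensorflow as tf


def balanced_subset(ds, per_class=500, batch_size=64, seed=123):
    """Hypothetical helper: keep `per_class` examples of label 0 and of label 1.

    Assumes `ds` yields (text_batch, label_batch) pairs with integer labels.
    """
    # Undo the batching so the filter sees one (text, label) pair at a time.
    examples = ds.unbatch()
    pos = examples.filter(lambda x, y: tf.equal(y, 1)).take(per_class)
    neg = examples.filter(lambda x, y: tf.equal(y, 0)).take(per_class)
    return (pos.concatenate(neg)
            # cache() pins the selected examples after the first full pass,
            # in case the source dataset reshuffles between epochs.
            .cache()
            # The shuffle buffer covers the whole subset so both classes mix.
            .shuffle(2 * per_class, seed=seed)
            .batch(batch_size))


# Toy stand-in for the batched IMDB dataset: 40 labelled strings, batches of 8.
texts = tf.constant([f"review {i}" for i in range(40)])
labels = tf.constant([i % 2 for i in range(40)], dtype=tf.int32)
toy = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(8)

small = balanced_subset(toy, per_class=5, batch_size=4)
picked = [int(y) for _, y in small.unbatch()]
print(len(picked), sum(picked))  # total examples kept, number of positives
```

With the real data this would be called as balanced_subset(train) and balanced_subset(test). The cache() call keeps the 1000 selected reviews in memory, which seems reasonable at this size; drop it if you would rather draw a fresh subset on each pass.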
