In PyTorch, how can I shuffle a DataLoader?

Question

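I have a dataset of 10000 samples in which the classes are stored in order. I first load the data into an ImageFolder and then into a DataLoader, and I want to split this dataset into train, validation, and test sets. I know the DataLoader class has a shuffle parameter, but that does not help me, because it only shuffles the data when the loader is iterated. I also know about RandomSampler, but with it I can only draw n samples at random from the dataset, and I have no control over which ones are drawn, so the same sample might end up in the train, test, and validation sets at the same time.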

Is there a way to shuffle the data inside a DataLoader? Shuffling is the only thing I need; after that I can take subsets of the data myself.

Answer 1

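The Subset dataset class takes a list of indices (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset). You can probably use it to get the functionality shown below. Essentially, you can get away with shuffling the indices and then picking subsets of the dataset.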

import numpy as np
from torch.utils.data import Subset

# `dataset` is assumed to be the variable holding the whole dataset
N = len(dataset)

# generate a shuffled array of the indices 0..N-1
indices = np.random.permutation(N)
# there are many ways to do this, e.g. np.random.choice(N, N, replace=False)

# select train/val/test; this demo uses a 70/15/15 split
train_indices = indices[:int(0.7 * N)]
val_indices = indices[int(0.7 * N):int(0.85 * N)]
test_indices = indices[int(0.85 * N):]

# the slices do not overlap, so no sample lands in more than one split
train_dataset = Subset(dataset, train_indices)
val_dataset = Subset(dataset, val_indices)
test_dataset = Subset(dataset, test_indices)
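
As a possible shortcut for the manual index split above, torch.utils.data.random_split does the same shuffle-then-slice in one call and returns Subset objects. The sketch below is only an illustration under some assumptions: a small TensorDataset stands in for the ImageFolder dataset from the question, and the seed, the 70/15/15 sizes, and the batch size are arbitrary choices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Toy stand-in (assumption) for the ImageFolder dataset:
# 100 samples whose class labels are stored in order.
features = torch.arange(100, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(100) // 10  # 10 classes, 10 consecutive samples each
dataset = TensorDataset(features, labels)

# Seeded generator so the split is the same on every run.
generator = torch.Generator().manual_seed(0)

# random_split permutes the indices once and returns disjoint Subset objects.
train_dataset, val_dataset, test_dataset = random_split(
    dataset, [70, 15, 15], generator=generator
)

# shuffle=True reorders the training subset on every pass over the loader;
# the membership of each split stays fixed.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Collect the indices of each split to check that they do not overlap.
train_idx = set(train_dataset.indices)
val_idx = set(val_dataset.indices)
test_idx = set(test_dataset.indices)
```

Because the split is made once at the dataset level, the shuffle flag on the loaders only affects the order within each split, so a sample cannot leak from one set into another.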

