ValueError: sampler option is mutually exclusive with shuffle pytorch

Question

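I am working on a face recognition project using pytorch and mtcnn. After training on my training dataset, I now want to make predictions on the test dataset.

This is the code I used for training: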

optimizer = optim.Adam(resnet.parameters(), lr=0.001)
scheduler = MultiStepLR(optimizer, [5, 10])

trans = transforms.Compose([
   np.float32,
   transforms.ToTensor(),
   fixed_image_standardization
])
dataset = datasets.ImageFolder(data_dir, transform=trans)
img_inds = np.arange(len(dataset))
np.random.shuffle(img_inds)
train_inds = img_inds[:int(0.8 * len(img_inds))]
val_inds = img_inds[int(0.8 * len(img_inds)):]

train_loader = DataLoader(
   dataset,
   num_workers=workers,
   batch_size=batch_size,
   sampler=SubsetRandomSampler(train_inds)
)
val_loader = DataLoader(
   dataset,
   shuffle=True,
   num_workers=workers,
   batch_size=batch_size,
   sampler=SubsetRandomSampler(val_inds)
)

If I remove sampler=SubsetRandomSampler(val_inds) and pass val_inds instead, I get this error:

val_inds ^ SyntaxError: positional argument follows keyword argument

I want to make predictions in pytorch, selecting samples at random from the test dataset. That is why I thought I should use shuffle=True. I am following the facenet-pytorch repo.

Answer 1

I am not sure what format your test data is in, but to select a random sample from your dataset you can use random.choice from the random module.
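
A minimal sketch of that suggestion, assuming the test set is any indexable collection (the file names below are hypothetical placeholders, not taken from the question):

```python
import random

# Hypothetical test set: any sequence works (file paths, indices, ...).
test_samples = ["img_{}.jpg".format(i) for i in range(10)]

# Pick a single random sample.
one = random.choice(test_samples)

# Pick three distinct random samples (no repeats).
several = random.sample(test_samples, 3)

print(one, several)
```

random.sample is used for the multi-sample case because repeated calls to random.choice could return the same element twice.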

Answer 2

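You can use a DataLoader with shuffle=True, but only when no sampler is passed (sampler=None, the default). With this flag set, samples are drawn from the dataset in random order (doc).

Edit 1

I agree with @SzymonMaszke: with SubsetRandomSampler there is no need for shuffle, because your data is already picked at random.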

Answer 3

TL;DR: remove shuffle=True in this case, since SubsetRandomSampler already shuffles the data.
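
As a rough sketch of that fix, assuming a small TensorDataset standing in for the ImageFolder dataset (the sizes and index values are illustrative only), passing both arguments reproduces the error while the sampler alone works:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Hypothetical stand-in for the ImageFolder dataset: 20 samples, 4 features each.
dataset = TensorDataset(torch.randn(20, 4), torch.randint(0, 2, (20,)))
val_inds = list(range(16, 20))

# Passing both sampler and shuffle=True raises the ValueError from the title.
try:
    DataLoader(dataset, shuffle=True, batch_size=2,
               sampler=SubsetRandomSampler(val_inds))
    raised = False
except ValueError as err:
    raised = True
    print(err)

# The fix: keep the sampler and leave shuffle at its default.
val_loader = DataLoader(dataset, batch_size=2,
                        sampler=SubsetRandomSampler(val_inds))

# Only the four validation samples are visited.
n_val = sum(len(labels) for _, labels in val_loader)
print(n_val)
```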

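What torch.utils.data.SubsetRandomSampler does (please consult the documentation when in doubt) is take a list of indices and return a random permutation of them.

In your case you have indices corresponding to training (these are the indices of elements in the training dataset) and to validation.

Let's assume they look like this: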

train_indices = [0, 2, 3, 4, 5, 6, 9, 10, 12, 13, 15]
val_indices = [1, 7, 8, 11, 14]

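On each pass, SubsetRandomSampler returns the numbers from one of these lists in random order, and once all of them have been returned they are randomized again (__iter__ is called again).

So for val_indices, SubsetRandomSampler might return something like this (and analogously for train_indices):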

val_indices = [1, 8, 11, 7, 14]  # Epoch 1
val_indices = [11, 7, 8, 14, 1]  # Epoch 2
val_indices = [7, 1, 14, 8, 11]  # Epoch 3

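Each of these numbers is an index into the original dataset. Note that validation is shuffled this way, and so is train, without using shuffle=True. The indices do not overlap, so the data is split correctly.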
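
This behavior can be checked directly by iterating over the sampler a few times. The sketch below reuses the example index lists from above; the actual orderings printed will differ from run to run:

```python
from torch.utils.data import SubsetRandomSampler

train_indices = [0, 2, 3, 4, 5, 6, 9, 10, 12, 13, 15]
val_indices = [1, 7, 8, 11, 14]

val_sampler = SubsetRandomSampler(val_indices)

# Each pass over the sampler calls __iter__ again and yields a fresh permutation.
epochs = [list(val_sampler) for _ in range(3)]
for epoch, order in enumerate(epochs, start=1):
    print("Epoch", epoch, [int(i) for i in order])

# The two index lists share no elements, so the split is clean.
overlap = set(train_indices) & set(val_indices)
print(overlap)
```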

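Additional info

shuffle=True

shuffle uses torch.utils.data.RandomSampler under the hood (see the source code). This in turn is equivalent to using torch.utils.data.SubsetRandomSampler with all indices specified (np.arange(len(dataset))).
You do not have to pre-shuffle with np.random.shuffle(img_inds) to get a random batch order, because the indices are shuffled on every pass anyway. (The pre-shuffle does still decide which samples land in the training subset and which in the validation subset.)
Do not use numpy where torch provides the same functionality. There is torch.arange, and mixing the two libraries is almost never necessary.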
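
The points above can be illustrated with a small sketch. It assumes a toy TensorDataset in place of the real data and uses torch.randperm as one torch-only way to draw a random 80/20 split; neither choice comes from the original code:

```python
import torch
from torch.utils.data import (DataLoader, RandomSampler,
                              SequentialSampler, TensorDataset)

# Hypothetical toy dataset with 10 samples.
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))

# shuffle=True makes the DataLoader build a RandomSampler internally;
# without it, a SequentialSampler is used.
shuffled = DataLoader(dataset, batch_size=2, shuffle=True)
ordered = DataLoader(dataset, batch_size=2)
is_random = isinstance(shuffled.sampler, RandomSampler)
is_sequential = isinstance(ordered.sampler, SequentialSampler)
print(is_random, is_sequential)

# Torch-only random 80/20 split of the indices (no numpy needed).
perm = torch.randperm(len(dataset))
split = int(0.8 * len(dataset))
train_inds = perm[:split].tolist()
val_inds = perm[split:].tolist()
print(train_inds, val_inds)
```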

Inference

Single image

Just pass it through your network and take the output, for example:

module.eval()
with torch.no_grad():
    # ImageFolder items are (image, label) pairs; add a batch dimension
    image, _ = dataset[5380]
    output = module(image.unsqueeze(0))

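The first line puts the model into evaluation mode (which changes the behavior of some layers), and the context manager turns off gradient tracking (predictions do not need it). These two are almost always used when checking a neural network's output.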
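
To make the effect of both steps visible, here is a small sketch with a hypothetical toy model containing a Dropout layer (the model and input are placeholders, not the network from the question):

```python
import torch
from torch import nn

# Hypothetical tiny model; Dropout behaves differently in train and eval mode.
module = nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.5), nn.Linear(8, 3))
image = torch.randn(4)  # stand-in for a single preprocessed sample

module.eval()  # Dropout becomes a no-op, so outputs are deterministic
with torch.no_grad():  # no computation graph is recorded
    # unsqueeze(0) adds the batch dimension the model expects
    out1 = module(image.unsqueeze(0))
    out2 = module(image.unsqueeze(0))

print(out1.shape, out1.requires_grad, torch.equal(out1, out2))
```

In training mode the two outputs would generally differ, because Dropout would zero a different random subset of activations on each call.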

Checking the validation dataset

Something along these lines; note that the same ideas apply as for a single image:

module.eval()

total_batches = 0
batch_accuracy = 0
for images, labels in val_loader:
    total_batches += 1
    with torch.no_grad():
        output = module(images)
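        # Assumes the model outputs logits without activation (binary case)
        # If it outputs activations, you may have to use argmax or > 0.5 for the binary case
        # Cast to float first: torch.mean does not accept boolean tensors
        # .item() gets a Python float from the torch.Tensor
        batch_accuracy += (labels == (output > 0.0)).float().mean().item()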

print("Overall accuracy: {}".format(batch_accuracy / total_batches))
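
For a multi-class model such as a face classifier, a variant using argmax may be closer to what is needed. The sketch below assumes the model returns one logit per class and uses a hypothetical linear model and random data in place of the real network and val_loader. It also counts correct predictions per sample rather than averaging per-batch accuracies, so a smaller final batch does not skew the result:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-ins for the trained network and the validation loader.
module = nn.Linear(4, 3)
val_data = TensorDataset(torch.randn(10, 4), torch.randint(0, 3, (10,)))
val_loader = DataLoader(val_data, batch_size=4)

module.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in val_loader:
        logits = module(images)
        # The predicted class is the index of the largest logit.
        predictions = logits.argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)

accuracy = correct / total
print("Overall accuracy: {:.3f}".format(accuracy))
```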

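Other cases

Please work through some beginner guides or tutorials and get familiar with these concepts, because StackOverflow is not the place to redo that work (it is meant for fairly specific, small questions). Thanks.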
