Customizing the batch with specific elements

Question

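I am new to PyTorch. Strangely, I cannot find anything related to this, although it seems rather simple.

I want to build my batches from specific examples, for instance so that all examples in a batch have the same label, or so that each batch is filled with examples from just 2 classes.

How would I do that? It seems to me that the right place is the data loader rather than the dataset, since the data loader is responsible for the batches and the dataset is not.

Is there a simple minimal example?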

Answer 1

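TL;DR

By default, DataLoader is given only a sampler and no batch sampler; it groups the sampled indices into batches itself, according to batch_size.
You can supply either a custom sampler or a custom batch sampler. The two are mutually exclusive: if you pass batch_sampler, leave sampler, batch_size, shuffle and drop_last at their defaults.
A sampler only yields a sequence of dataset indices, not actual batches (batching is handled by the data loader, depending on batch_size).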
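
To make the last point concrete, here is a minimal, torch-free sketch of how a stream of indices coming from a sampler can be grouped into batches of batch_size. The helper name chunk_indices is hypothetical; it only mirrors what I understand torch.utils.data.BatchSampler to do and is not PyTorch's own code.

```python
from typing import Iterable, Iterator, List


def chunk_indices(sampler: Iterable[int], batch_size: int,
                  drop_last: bool = False) -> Iterator[List[int]]:
    """Group a stream of dataset indices into lists of length batch_size.

    Hypothetical helper, for illustration only.
    """
    batch: List[int] = []
    for idx in sampler:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the shorter final batch unless the caller asked to drop it.
    if batch and not drop_last:
        yield batch


sequential = range(10)  # what a sequential sampler would yield for 10 items
print(list(chunk_indices(sequential, batch_size=4)))
print(list(chunk_indices(sequential, batch_size=4, drop_last=True)))
```

The sampler decides the order, and the chunking step decides where one batch ends and the next begins.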

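To answer your initial question: it does not seem possible to use a sampler with an iterable-style dataset, cf. Github issue (still open). Also, read the following note on pytorch/dataloader.py.

Samplers (for map-style datasets):

Apart from that, if you switch to a map-style dataset, here are some details on how samplers and batch samplers work. You can access the dataset's underlying data by index, just as you would with a list (since torch.utils.data.Dataset implements __getitem__). In other words, your dataset elements are all dataset[i], for i in [0, len(dataset) - 1].

Here is a toy dataset: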

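from torch.utils.data import Dataset

class DS(Dataset):
    def __getitem__(self, index):
        # Element i is 10 * i, so indices 0..9 map to 0, 10, ..., 90.
        return index * 10

    def __len__(self):
        return 10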

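In the general use case, you just give torch.utils.data.DataLoader the arguments batch_size and shuffle. By default, shuffle is False, which means a torch.utils.data.SequentialSampler is used. Otherwise (if shuffle is True), a torch.utils.data.RandomSampler is used. The sampler defines how the data loader accesses the dataset (in which order it visits the indices).

The dataset above (DS) has 10 elements. The indices are 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. They map to the elements 0, 10, 20, 30, 40, 50, 60, 70, 80 and 90. So with a batch size of 2:

SequentialSampler: DataLoader(ds, batch_size=2) (implicitly shuffle=False) is identical to DataLoader(ds, batch_size=2, sampler=SequentialSampler(ds)). The data loader will deliver tensor([0, 10]), tensor([20, 30]), tensor([40, 50]), tensor([60, 70]) and tensor([80, 90]).
RandomSampler: DataLoader(ds, batch_size=2, shuffle=True) is identical to DataLoader(ds, batch_size=2, sampler=RandomSampler(ds)). The data loader samples randomly on each pass. For instance: tensor([50, 40]), tensor([90, 80]), tensor([0, 60]), tensor([10, 20]) and tensor([30, 70]). But the order will be different if you iterate through the data loader a second time!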
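
The difference between the two samplers can be sketched without PyTorch. The classes below are hypothetical stand-ins, not the real SequentialSampler and RandomSampler; the point is only that a random sampler draws a fresh permutation every time it is iterated. The seed argument is my own addition for reproducibility.

```python
import random
from typing import Iterator, Optional


class ToySequentialSampler:
    """Yields 0, 1, ..., n - 1 in order (stand-in for a sequential sampler)."""

    def __init__(self, n: int) -> None:
        self.n = n

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.n))


class ToyRandomSampler:
    """Yields a new permutation of 0..n-1 on every pass."""

    def __init__(self, n: int, seed: Optional[int] = None) -> None:
        self.n = n
        self.rng = random.Random(seed)  # seed only matters for reproducibility

    def __iter__(self) -> Iterator[int]:
        order = list(range(self.n))
        self.rng.shuffle(order)
        return iter(order)


seq = ToySequentialSampler(10)
rnd = ToyRandomSampler(10, seed=0)
print(list(seq))  # same order on every pass
print(list(rnd))  # some permutation of 0..9
print(list(rnd))  # generally a different permutation on the next pass
```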

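Batch sampler

Providing batch_sampler takes the place of batch_size, shuffle, sampler and drop_last: it is mutually exclusive with them, so leave those arguments at their defaults. It is meant to define exactly which elements go into each batch. For instance: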

>>> DataLoader(ds, batch_sampler=[[1,2,3], [6,5,4], [7,8], [0,9]])

will yield tensor([10, 20, 30]), tensor([60, 50, 40]), tensor([70, 80]) and tensor([ 0, 90]).
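
Conceptually, with a batch sampler the loader just looks up every index of each index list and groups the results. A rough torch-free sketch, assuming the toy dataset maps index i to 10 * i; ToyDS and fetch_batches are hypothetical names, and the real DataLoader additionally collates each batch into a tensor.

```python
from typing import Iterable, List, Sequence


class ToyDS:
    """Plain-Python stand-in for the toy dataset: item i is 10 * i."""

    def __getitem__(self, index: int) -> int:
        return index * 10

    def __len__(self) -> int:
        return 10


def fetch_batches(dataset, batch_sampler: Iterable[Sequence[int]]) -> List[List[int]]:
    """Look up each index list produced by batch_sampler, one batch per list."""
    return [[dataset[i] for i in batch] for batch in batch_sampler]


print(fetch_batches(ToyDS(), [[1, 2, 3], [6, 5, 4], [7, 8], [0, 9]]))
# [[10, 20, 30], [60, 50, 40], [70, 80], [0, 90]]
```

Batches may have different sizes and any order, since the batch sampler alone decides their contents.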

Batch sampling on the class

Let's say I just want to have 2 elements (different or not) of each class in my batch and have to exclude more examples of each class. So ensuring that not 3 examples are inside of the batch.

Let's say you have a dataset containing four classes. Here is how I would do it. First, keep track of the dataset indices for each class.

class DS(Dataset):
    def __init__(self, data):
        super(DS, self).__init__()
        self.data = data

        # One list of dataset indices per class:
        # 0: positive odd, 1: positive even, 2: negative odd, 3: negative even
        self.indices = [[] for _ in range(4)]
        for i, x in enumerate(data):
            if x > 0 and x % 2: self.indices[0].append(i)
            if x > 0 and not x % 2: self.indices[1].append(i)
            if x < 0 and x % 2: self.indices[2].append(i)
            if x < 0 and not x % 2: self.indices[3].append(i)

    def classes(self):
        return self.indices

    def __getitem__(self, index):
        return self.data[index]

For example:

>>> ds = DS([1, 6, 7, -5, 10, -6, 8, 6, 1, -3, 9, -21, -13, 11, -2, -4, -21, 4])

will give:

>>> ds.classes()
[[0, 2, 8, 10, 13], [1, 4, 6, 7, 17], [3, 9, 11, 12, 16], [5, 14, 15]]

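Then, for the batch sampler, the easiest way is to create a list of the available class indices, with as many class indices as there are dataset elements.

In the dataset defined above, we have 5 items from class 0, 5 from class 1, 5 from class 2 and 3 from class 3. So we want to construct [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]. We will shuffle it. Then, from this list and the dataset's class contents (ds.classes()), we will be able to construct the batches.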
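
As a small illustration of that first step, the class-index list can be derived directly from the per-class index lists. This is only a sketch; class_index_list is a hypothetical helper name, and the classes value is copied from the ds.classes() output shown earlier.

```python
import random
from typing import List


def class_index_list(classes: List[List[int]]) -> List[int]:
    """Repeat each class id once per dataset element belonging to it."""
    return [class_id for class_id, members in enumerate(classes) for _ in members]


# Per-class dataset indices from the example above.
classes = [[0, 2, 8, 10, 13], [1, 4, 6, 7, 17], [3, 9, 11, 12, 16], [5, 14, 15]]
ids = class_index_list(classes)
print(ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]

random.shuffle(ids)  # in-place shuffle; the order now differs from run to run
```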

import copy
import random

class Sampler():
    def __init__(self, classes):
        self.classes = classes

    def __iter__(self):
        classes = copy.deepcopy(self.classes)
        # One class id per dataset element, e.g. [0, 0, ..., 3, 3, 3]
        indices = [i for i, klass in enumerate(classes) for _ in klass]
        random.shuffle(indices)
        # Pair up consecutive class ids: (c0, c1), (c2, c3), ...
        grouped = zip(*[iter(indices)]*2)

        res = []
        for a, b in grouped:
            res.append((classes[a].pop(), classes[b].pop()))
        return iter(res)

Note: deep copying the list is required since we pop elements from it.

A possible output of this sampler would be:

[(15, 14), (16, 17), (7, 12), (11, 6), (13, 10), (5, 4), (9, 8), (2, 0), (3, 1)]
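
A quick sanity check on such an output is that every dataset index appears exactly once across the batches. Note that pairing consecutive items with zip drops a leftover element when the number of items is odd; here there are 18 items, so nothing is lost. The helper name below is hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence


def uses_each_index_once(batches: Iterable[Sequence[int]], n_items: int) -> bool:
    """True if the batches contain every index in 0..n_items-1 exactly once."""
    counts = Counter(i for batch in batches for i in batch)
    return set(counts) == set(range(n_items)) and all(c == 1 for c in counts.values())


# The example output shown above, for a dataset of 18 items.
batches = [(15, 14), (16, 17), (7, 12), (11, 6), (13, 10),
           (5, 4), (9, 8), (2, 0), (3, 1)]
print(uses_each_index_once(batches, 18))  # True
```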

At this point we can simply use torch.utils.data.DataLoader:

>>> dl = DataLoader(ds, batch_sampler=Sampler(ds.classes()))

This could yield something like:

[tensor([ 4, -4]), tensor([-21, 11]), tensor([-13, 6]), tensor([9, 1]), tensor([ 8, -21]), tensor([-3, 10]), tensor([ 6, -2]), tensor([-5, 7]), tensor([-6, 1])]

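An easier approach

Here is another, easier approach that is not guaranteed to return all elements from the dataset, although on average it will...

For each batch, first sample class_per_batch classes, then sample batch_size elements from these selected classes (by first sampling a class from that class subset, then sampling a data point from that class).
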
import random

class Sampler():
    def __init__(self, classes, class_per_batch, batch_size):
        self.classes = classes
        self.n_batches = sum([len(x) for x in classes]) // batch_size
        self.class_per_batch = class_per_batch
        self.batch_size = batch_size

    def __iter__(self):
        # As written, the class subset is drawn once per pass over the sampler;
        # move this line inside the loop below to redraw it for every batch.
        classes = random.sample(range(len(self.classes)), self.class_per_batch)

        batches = []
        for _ in range(self.n_batches):
            batch = []
            for i in range(self.batch_size):
                klass = random.choice(classes)
                batch.append(random.choice(self.classes[klass]))
            batches.append(batch)
        return iter(batches)

You can try it this way:

>>> s = Sampler(ds.classes(), class_per_batch=2, batch_size=4)
>>> list(s)
[[16, 0, 0, 9], [10, 8, 11, 2], [16, 9, 16, 8], [2, 9, 2, 3]]
>>> dl = DataLoader(ds, batch_sampler=s)
>>> list(iter(dl))
[tensor([ -5, -6, -21, -13]), tensor([ -4, -4, -13, -13]), tensor([ -3, -21, -2, -5]), tensor([-3, -5, -4, -6])]
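
The description above talks about sampling a class subset for each batch. If that is the behavior you want, one possible variant is sketched below. PerBatchClassSampler is a hypothetical name and the seed argument is my own addition for reproducibility. Like the sampler above, it samples with replacement, so elements can repeat and some may never be drawn.

```python
import random
from typing import Iterator, List, Optional


class PerBatchClassSampler:
    """Batch sampler sketch: pick class_per_batch classes anew for each batch."""

    def __init__(self, classes: List[List[int]], class_per_batch: int,
                 batch_size: int, seed: Optional[int] = None) -> None:
        self.classes = classes
        self.class_per_batch = class_per_batch
        self.batch_size = batch_size
        self.n_batches = sum(len(c) for c in classes) // batch_size
        self.rng = random.Random(seed)

    def __len__(self) -> int:
        return self.n_batches

    def __iter__(self) -> Iterator[List[int]]:
        for _ in range(self.n_batches):
            # Redraw the allowed classes for this batch only.
            chosen = self.rng.sample(range(len(self.classes)), self.class_per_batch)
            # Pick a class, then a dataset index within it (with replacement).
            yield [self.rng.choice(self.classes[self.rng.choice(chosen)])
                   for _ in range(self.batch_size)]


classes = [[0, 2, 8, 10, 13], [1, 4, 6, 7, 17], [3, 9, 11, 12, 16], [5, 14, 15]]
for batch in PerBatchClassSampler(classes, class_per_batch=2, batch_size=4, seed=0):
    print(batch)
```

An instance should be usable in the same way as the sampler above, i.e. passed as batch_sampler to the DataLoader.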
