Stack expects each tensor to be equal size

Question

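I am following the PyTorch tutorial on speech command recognition and trying to implement my own recognizer for 22 German sentences. The tutorial pads the audio tensors, but for the labels it only uses torch.stack. As a result, I get this error when I start training the network: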

RuntimeError: stack expects each tensor to be equal size, but got [456] at entry 0 and [470] at entry 1.

I understand what the message means, but since I am new to PyTorch I can't implement padding for the sentences from scratch. I'd be glad for any hints and tips.
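A minimal reproduction, assuming two hypothetical label tensors with the lengths from the error message (456 and 470): torch.stack only joins tensors of identical shape, so labels of different lengths make it raise a RuntimeError.

```python
import torch

# Hypothetical label tensors with the lengths reported in the error message
label_a = torch.zeros(456, dtype=torch.long)
label_b = torch.zeros(470, dtype=torch.long)

try:
    torch.stack([label_a, label_b])
    stack_failed = False
except RuntimeError as err:
    # torch.stack requires every tensor to have exactly the same shape
    stack_failed = True
    print(err)
```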

Here is the code for the collate_fn and pad_sequence functions:

import torch

def pad_sequence(batch):
    # Make all tensors in a batch the same length by padding with zeros
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)


def collate_fn(batch):
    # A data tuple has the form:
    # waveform,  label
    tensors, targets = [], []

    # Gather in lists (labels are already encoded as index tensors)
    for waveform, label in batch:
        tensors += [waveform]
        targets += [label]

    # Group the list of tensors into a batched tensor
    tensors = pad_sequence(tensors)
    targets = torch.stack(targets)

    return tensors, targets
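The audio helper above works because torch.nn.utils.rnn.pad_sequence pads along the first dimension of each tensor. The waveforms (assumed here to have shape [channels, time], as in the tutorial) are therefore transposed first and permuted back afterwards. A sketch with hypothetical random mono waveforms traces the shapes:

```python
import torch

def pad_sequence(batch):
    # Same helper as in the question; each item has shape [channels, time]
    batch = [item.t() for item in batch]  # -> [time, channels]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    # now [batch, max_time, channels]
    return batch.permute(0, 2, 1)  # -> [batch, channels, max_time]

# Hypothetical mono waveforms of different lengths
waveforms = [torch.randn(1, 8000), torch.randn(1, 12000), torch.randn(1, 10000)]
padded = pad_sequence(waveforms)
print(padded.shape)  # torch.Size([3, 1, 12000])
```

The first waveform is kept as-is and followed by 4000 zeros; only the labels still go through torch.stack, which is why the error comes from the targets.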

Answer 1

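Once I started working with pad_sequence directly, I realized how simple it is. In my case I only need to pass it the batch of encoded sentences (1-D tensors of token indices), and PyTorch automatically pads each one to the length of the longest sentence in the batch.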
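What pad_sequence does for 1-D tensors can be sketched by hand: pad every sequence on the right up to the longest length, after which torch.stack no longer complains. The helper manual_pad is hypothetical and only illustrates the idea; in practice pad_sequence is the simpler choice.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

def manual_pad(seqs, padding_value=0):
    # Hypothetical helper: right-pad each 1-D tensor to the longest length
    max_len = max(s.size(0) for s in seqs)
    padded = [F.pad(s, (0, max_len - s.size(0)), value=padding_value) for s in seqs]
    # All tensors now have equal size, so stacking succeeds
    return torch.stack(padded)

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5, 6])]
print(manual_pad(seqs))
# tensor([[1, 2, 3],
#         [4, 0, 0],
#         [5, 6, 0]])
```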

My code now looks like this:

import torch

def pad_AudioSequence(batch):
  # Make all tensors in a batch the same length by padding with zeros
  batch = [item.t() for item in batch]
  batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
  return batch.permute(0, 2, 1)

def pad_TextSequence(batch):
  # Pad the 1-D label tensors to the longest label in the batch
  return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0)

def collate_fn(batch):
  # A data tuple has the form:
  # waveform,  label
  tensors, targets = [], []
  # Gather in lists (labels are already encoded as index tensors)
  for waveform, label in batch:
      tensors += [waveform]
      targets += [label]
  # Group the list of tensors into a batched tensor
  tensors = pad_AudioSequence(tensors)
  targets = pad_TextSequence(targets)
  return tensors, targets
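To check the fixed collate_fn end to end, a sketch with a hypothetical in-memory dataset (a plain Python list of (waveform, label) pairs with made-up shapes and token indices) can be fed through a DataLoader:

```python
import torch
from torch.utils.data import DataLoader

def pad_AudioSequence(batch):
    # [channels, time] -> [time, channels], pad, then back to [batch, channels, max_time]
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)

def pad_TextSequence(batch):
    # Pad 1-D label tensors to the longest label in the batch
    return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0)

def collate_fn(batch):
    tensors = [waveform for waveform, _ in batch]
    targets = [label for _, label in batch]
    return pad_AudioSequence(tensors), pad_TextSequence(targets)

# Hypothetical dataset: (waveform [1, time], encoded label [tokens]) pairs
data = [
    (torch.randn(1, 16000), torch.tensor([11, 94, 21, 94])),
    (torch.randn(1, 12000), torch.tensor([27, 94])),
    (torch.randn(1, 14000), torch.tensor([30, 94, 31])),
]
loader = DataLoader(data, batch_size=3, shuffle=False, collate_fn=collate_fn)
waveforms, labels = next(iter(loader))
print(waveforms.shape)  # torch.Size([3, 1, 16000])
print(labels.shape)     # torch.Size([3, 4])
```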

For anyone who still doesn't understand how it works, here is a small example:

from torch.nn.utils.rnn import pad_sequence

encDecClass2 = dummyEncoderDecoder()
sent1 = audioWorkerClass.sentences[4] # wie viel Prozent hat der Akku noch?
sent2 = audioWorkerClass.sentences[5] # Wie spät ist es?
sent3 = audioWorkerClass.sentences[6] # Mach einen Timer für 5 Sekunden.

# encode sentences into tensor of numbers, representing words, using my own enc-dec class
sent1 = encDecClass2.encode(sent1) # tensor([11, 94, 21, 94, 22, 94, 23, 94, 24, 94, 25, 94, 26, 94, 15, 94])
sent2 = encDecClass2.encode(sent2) # tensor([27, 94, 28, 94, 12, 94, 29, 94, 15, 94])
sent3 = encDecClass2.encode(sent3) # tensor([30, 94, 31, 94, 32, 94, 33, 94, 34, 94, 35, 94, 19, 94])

print(sent1.shape) # torch.Size([16])
print(sent2.shape) # torch.Size([10])
print(sent3.shape) # torch.Size([14])

batch = []
# add sentences to the batch as separate arrays
batch += [sent1]
batch += [sent2]
batch += [sent3]

output = pad_sequence(batch, batch_first=True, padding_value=0)

print(f"{output}\n{output.shape}")

#############################################################################
# output:
# tensor([[11, 94, 21, 94, 22, 94, 23, 94, 24, 94, 25, 94, 26, 94, 15, 94],
#         [27, 94, 28, 94, 12, 94, 29, 94, 15, 94,  0,  0,  0,  0,  0,  0],
#         [30, 94, 31, 94, 32, 94, 33, 94, 34, 94, 35, 94, 19, 94,  0,  0]])
# torch.Size([3, 16])
#############################################################################

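As you can see, all arrays are padded with zeros to the length of the longest of the three. The output shape is 3x16 because there are three sentences and the longest one in the batch has 16 elements.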
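One caveat: after padding, the zeros are indistinguishable from a real token with index 0, and the original lengths are gone. A sketch, assuming you want to keep them, records the lengths before padding and builds a boolean mask from them; some losses, such as torch.nn.CTCLoss, take target lengths as an argument.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Encoded sentences from the example above
batch = [
    torch.tensor([11, 94, 21, 94, 22, 94, 23, 94, 24, 94, 25, 94, 26, 94, 15, 94]),
    torch.tensor([27, 94, 28, 94, 12, 94, 29, 94, 15, 94]),
    torch.tensor([30, 94, 31, 94, 32, 94, 33, 94, 34, 94, 35, 94, 19, 94]),
]
# Record the true lengths before padding; they cannot be recovered reliably afterwards
lengths = torch.tensor([len(s) for s in batch])
padded = pad_sequence(batch, batch_first=True, padding_value=0)
# True where a position holds a real token, False where it is padding
mask = torch.arange(padded.size(1)).unsqueeze(0) < lengths.unsqueeze(1)
print(lengths)          # tensor([16, 10, 14])
print(mask.sum(dim=1))  # tensor([16, 10, 14])
```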


speech-recognition

pytorch