Decoder always predicts the same token

Question

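I have the following decoder for machine translation, which after a few steps only predicts the EOS token. I cannot even overfit it on a tiny dummy dataset, so there seems to be a serious error somewhere in the code.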

Decoder(
  (embedding): Embeddings(
    (word_embeddings): Embedding(30002, 768, padding_idx=3)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (ffn1): FFN(
    (dense): Linear(in_features=768, out_features=512, bias=False)
    (layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (activation): GELU()
  )
  (rnn): GRU(512, 512, batch_first=True, bidirectional=True)
  (ffn2): FFN(
    (dense): Linear(in_features=1024, out_features=512, bias=False)
    (layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (activation): GELU()
  )
  (selector): Sequential(
    (0): Linear(in_features=512, out_features=30002, bias=True)
    (1): LogSoftmax(dim=-1)
  )
)

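The forward method is relatively straightforward (see what I did there?): pass the input_ids through the embedding and an FFN, then use that representation in the RNN with the given sembedding as the initial hidden state. Pass the output through another FFN and apply a log-softmax. Return the resulting log-probabilities (called logits in the code) and the last hidden state of the RNN. In the next step, use that hidden state as the new hidden state and the highest-scoring predicted token as the new input.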

def forward(self, input_ids, sembedding):
    embedded = self.embedding(input_ids)
    output = self.ffn1(embedded)
    output, hidden = self.rnn(output, sembedding)
    output = self.ffn2(output)
    logits = self.selector(output)

    return logits, hidden
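The step-by-step decoding described above can be sketched roughly as follows. This is only an illustration: TinyDecoder and greedy_decode are hypothetical names, and the model is a stripped-down, unidirectional stand-in for the decoder in the question, not the actual code.

```python
import torch
from torch import nn


class TinyDecoder(nn.Module):
    """Hypothetical, much smaller stand-in for the decoder in the question."""

    def __init__(self, vocab_size=20, hidden_size=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.selector = nn.Sequential(
            nn.Linear(hidden_size, vocab_size), nn.LogSoftmax(dim=-1)
        )

    def forward(self, input_ids, hidden):
        output, hidden = self.rnn(self.embedding(input_ids), hidden)
        return self.selector(output), hidden


def greedy_decode(model, hidden, bos_id, max_len):
    """Feed the top prediction of each step back in as the next input."""
    batch_size = hidden.size(1)  # hidden: (num_layers * num_directions, batch, hidden)
    decoder_input = torch.full((batch_size, 1), bos_id, dtype=torch.long)
    predicted = []
    with torch.no_grad():
        for _ in range(max_len):
            log_probs, hidden = model(decoder_input, hidden)  # (batch, 1, vocab)
            decoder_input = log_probs.argmax(dim=-1)          # (batch, 1)
            predicted.append(decoder_input)
    return torch.cat(predicted, dim=1)                        # (batch, max_len)


torch.manual_seed(0)
model = TinyDecoder()
initial_hidden = torch.zeros(1, 3, 8)  # plays the role of sembedding
tokens = greedy_decode(model, initial_hidden, bos_id=1, max_len=5)
```

Here the hidden state returned at one step is passed straight into the next step, which is the same pattern the training loop below follows.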

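sembedding is the initial hidden_state for the RNN. This is similar to an encoder-decoder architecture, except that here we do not train the encoder; we only have access to pretrained encoder representations.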

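In my training loop I start each batch with a SOS token and feed every top predicted token into the next step until target_len is reached. I also switch randomly between training with and without teacher forcing.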

def step(self, batch, teacher_forcing_ratio=0.5):
    batch_size, target_len = batch["input_ids"].size()[:2]
    # Init first decoder input with SOS (BOS) token
    decoder_input = torch.tensor([[self.tokenizer.bos_token_id]] * batch_size).to(self.device)
    batch["input_ids"] = batch["input_ids"].to(self.device)

    # Init first decoder hidden_state: one zero'd second embedding in case the RNN is bidirectional
    decoder_hidden = torch.stack((batch["sembedding"],
                                  torch.zeros(*batch["sembedding"].size()))
                                 ).to(self.device) if self.model.num_directions == 2 \
        else batch["sembedding"].unsqueeze(0).to(self.device)

    loss = torch.tensor([0.]).to(self.device)

    use_teacher_forcing = random.random() < teacher_forcing_ratio
    # contains tuples of predicted and correct words
    tokens = []
    for i in range(target_len):
        # overwrite previous decoder_hidden
        output, decoder_hidden = self.model(decoder_input, decoder_hidden)
        batch_correct_ids = batch["input_ids"][:, i]

        # NLLLoss computes the loss between predicted classes (bs x classes) and correct classes for _this word_
        # set to ignore the padding index
        loss += self.criterion(output[:, 0, :], batch_correct_ids)

        batch_predicted_ids = output.topk(1).indices.squeeze(1).detach()

        # with teacher forcing: use the current correct word as input for the next prediction
        # without teacher forcing: use the current prediction as input for the next prediction
        decoder_input = batch_correct_ids.unsqueeze(1) if use_teacher_forcing else batch_predicted_ids

    return loss, loss.item() / target_len

I also clip the gradients after every step:

clip_grad_norm_(self.model.parameters(), 1.0)
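For context, here is a minimal sketch of where such a clipping call usually sits in a single optimization step: after backward() and before optimizer.step(). The toy nn.Linear model, the SGD optimizer and the deliberately large loss are assumptions made for illustration and are not taken from the question.

```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)
model = nn.Linear(4, 2)  # toy model, assumption for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Large inputs on purpose, so that the gradients are likely to need clipping.
loss = model(torch.ones(3, 4) * 100).pow(2).sum()

optimizer.zero_grad()
loss.backward()

# Rescales all gradients in place so that their combined L2 norm is at most 1.0.
# The return value is the total norm measured before clipping.
total_norm = clip_grad_norm_(model.parameters(), 1.0)

# Recompute the combined norm to confirm that clipping took effect.
clipped_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))

optimizer.step()
```

Logging total_norm over time is a cheap way to see whether the gradients explode or vanish before the predictions collapse.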

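At first, subsequent predictions are already fairly similar to each other, but after some iterations there is a bit more variation. Relatively quickly, however, ALL predictions turn into a different word (but always the same one), and eventually into EOS tokens (edit: after changing the activation to ReLU, a different token is always predicted; it seems to be a random token that simply gets repeated). Note that this already happens after 80 steps (batch_size 128).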
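One way to put a number on this collapse is to track how much of a batch of predictions is taken up by the single most frequent token. The helper below is a hypothetical sketch (dominant_token_share is not part of the code in this question) and the example ids are made up.

```python
import torch


def dominant_token_share(predicted_ids):
    """Return (most frequent token id, its share of all predictions)."""
    values, counts = torch.unique(predicted_ids, return_counts=True)
    top = counts.argmax()
    return values[top].item(), counts[top].item() / predicted_ids.numel()


# Made-up predictions for a batch of 2 sequences with 3 steps each.
predicted = torch.tensor([[2, 2, 2], [2, 5, 2]])
token_id, share = dominant_token_share(predicted)
```

A share that climbs towards 1.0 during training indicates that the decoder has collapsed onto a single token.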

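I found that the hidden state returned by the RNN contains a lot of zeros. I am not sure whether that is the problem, but it seems like it could be related.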

tensor([[[  3.9874e-02,  -6.7757e-06,   2.6094e-04,  ...,  -1.2708e-17,
            4.1839e-02,   7.8125e-03],
         [ -7.8125e-03,  -2.5341e-02,   7.8125e-03,  ...,  -7.8125e-03,
           -7.8125e-03,  -7.8125e-03],
         [ -0.0000e+00, -1.0610e-314,   0.0000e+00,  ...,   0.0000e+00,
            0.0000e+00,   0.0000e+00],
         [  0.0000e+00,   0.0000e+00,   0.0000e+00,  ...,   0.0000e+00,
           -0.0000e+00,  1.0610e-314]]], device='cuda:0', dtype=torch.float64,
       grad_fn=<CudnnRnnBackward>)
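When inspecting a hidden state like the one above, it can help to summarize it instead of reading the raw values. The sketch below uses a hypothetical helper, summarize_hidden, with an arbitrarily chosen tolerance; the example tensor reuses a few of the values printed above.

```python
import torch


def summarize_hidden(hidden, tol=1e-12):
    """Report dtype, shape and the share of entries with magnitude below tol."""
    near_zero_share = (hidden.abs() < tol).float().mean().item()
    return {
        "dtype": hidden.dtype,
        "shape": tuple(hidden.shape),
        "near_zero_share": near_zero_share,
    }


# A few values taken from the printed tensor, including a subnormal one.
hidden = torch.tensor(
    [[[3.9874e-02, -6.7757e-06], [0.0, 1.0610e-314]]], dtype=torch.float64
)
summary = summarize_hidden(hidden)
```

The dtype field is worth a look as well: the tensor printed above is torch.float64, whereas freshly constructed PyTorch modules use torch.float32 parameters by default.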

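Although I suspect that the problem lies in my step function rather than in the model, I cannot see what might be going wrong. I have already tried adjusting the learning rate, disabling some layers (LayerNorm, dropout, ffn2), using pretrained embeddings and freezing or unfreezing them, disabling teacher forcing, and using a bidirectional versus a unidirectional GRU. The end result is always the same.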

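If you have any pointers, that would be very helpful. I have googled a lot about neural networks that always predict the same item, and I have tried all the suggestions I could find. Any new ones are welcome, no matter how crazy!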

Answer 1

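In my case the issue turned out to be that the dtype of the initial hidden state was double while the input was float. I do not quite understand why that is a problem, but casting the hidden state to float solved it. If you have any intuition about why this might be a problem for PyTorch, do let me know in the comments or, even better, on the official PyTorch forums.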
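A minimal sketch of that fix, assuming a small unidirectional GRU and a randomly generated sembedding rather than the real model and data: cast the initial hidden state to the dtype of the RNN parameters before calling the RNN. How the float64 tensor arose is not stated; one common source (an assumption here) is torch.from_numpy on a NumPy array, which defaults to float64.

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn = nn.GRU(4, 4, batch_first=True)  # parameters are float32 by default
inputs = torch.randn(2, 1, 4)         # float32: (batch, seq_len, features)

# Stand-in for the precomputed sentence embedding, deliberately float64.
sembedding = torch.randn(2, 4, dtype=torch.float64)

# Cast to whatever dtype the RNN parameters use instead of hard-coding float32.
param_dtype = next(rnn.parameters()).dtype
decoder_hidden = sembedding.unsqueeze(0).to(param_dtype)  # (1, batch, hidden)

output, decoder_hidden = rnn(inputs, decoder_hidden)
```

Reading the dtype from the parameters keeps the cast correct if the model is later switched to another precision.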

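EDIT: as that topic shows, this was a bug in PyTorch 1.6 that has been fixed in 1.7. In 1.7 you get an error message instead, which will hopefully save you the trouble of debugging all your code without finding the issue that causes the weird behavior.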
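If you are stuck on a version that fails silently, an explicit check before the RNN call gives a similar safety net. The helper below is a hypothetical sketch; its name and error message are made up and do not reproduce what PyTorch 1.7 prints.

```python
import torch


def check_rnn_dtypes(inputs, hidden):
    """Raise early if the RNN input and the initial hidden state disagree on dtype."""
    if inputs.dtype != hidden.dtype:
        raise TypeError(
            f"RNN input has dtype {inputs.dtype} but hidden state has dtype {hidden.dtype}"
        )


inputs = torch.randn(2, 1, 4)                        # float32
bad_hidden = torch.zeros(1, 2, 4, dtype=torch.float64)

try:
    check_rnn_dtypes(inputs, bad_hidden)
    mismatch_detected = False
except TypeError:
    mismatch_detected = True

# After casting, the check passes without raising.
check_rnn_dtypes(inputs, bad_hidden.to(inputs.dtype))
```

Calling such a check at the top of forward costs next to nothing and turns a silent numerical problem into an immediate, readable failure.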

Tags: recurrent-neural-network, encoder-decoder, pytorch