About correctly using dropout in RNNs (Keras)

Question

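I am confused about how to correctly use dropout with RNNs in Keras, specifically with GRU units. The Keras documentation refers to this paper (https://arxiv.org/abs/1512.05287), and I understand that the same dropout mask should be used for all time steps. This is achieved through the dropout argument when specifying the GRU layer itself. What I don't understand is: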

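Why do several examples on the internet, including Keras's own example (https://github.com/keras-team/keras/blob/master/examples/imdb_bidirectional_lstm.py) and the "Trigger word detection" assignment in Andrew Ng's Coursera Sequence Models course, explicitly add a dropout layer with "model.add(Dropout(0.5))"? As far as I understand, this applies a different mask at every time step.
The paper mentioned above suggests that doing this is inappropriate, and that we might lose the signal as well as the long-term memory because the dropout noise accumulates over all the time steps. But then, how are these models (which use a different dropout mask at every time step) able to learn and perform well?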

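I have trained a model myself that uses different dropout masks at every time step, and although I haven't gotten the results I wanted, the model is able to overfit the training data. As I understand it, this contradicts the "accumulation of noise" and "signal getting lost" over all the time steps (I have sequences of 1000 time steps being fed into the GRU layers).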

Any insights, explanations, or experience with this situation would be helpful. Thanks.

Update:

To make this clearer, here is an extract from the Keras documentation for the Dropout layer ("noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features)"). So, I believe, it can be seen that when the Dropout layer is used explicitly and the same mask is needed at every time step (as described in the paper), we need to set this noise_shape argument, which is not done in the examples I linked earlier.
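To illustrate the difference the quoted noise_shape argument makes, here is a minimal sketch. It assumes TensorFlow's bundled Keras (tf.keras) is installed; the tensor sizes are arbitrary values chosen only for the demonstration, and the input is all ones so that the mask is directly visible in the output.

```python
import numpy as np
import tensorflow as tf

# Arbitrary sizes, chosen only for this illustration.
batch_size, timesteps, features = 4, 10, 8
x = np.ones((batch_size, timesteps, features), dtype="float32")

# Default: an independent mask entry for every (sample, timestep, feature).
per_step = tf.keras.layers.Dropout(0.5)

# A 1 on the time axis: one mask per sample, broadcast over all timesteps.
shared = tf.keras.layers.Dropout(0.5, noise_shape=(batch_size, 1, features))

# training=True applies dropout even though we are not inside fit().
y_per_step = np.asarray(per_step(x, training=True))
y_shared = np.asarray(shared(x, training=True))

# With the shared mask, every timestep looks exactly like the first one.
shared_is_constant_over_time = bool(np.all(y_shared == y_shared[:, :1, :]))
print(shared_is_constant_over_time)
```

Because the input is all ones, each output entry is either 0 (dropped) or 2.0 (kept and rescaled by 1 / (1 - 0.5)), so comparing timesteps compares the masks themselves.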

Answer 1

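As Asterisk explained in his comment, there is a fundamental difference between dropout within a recurrent unit and dropout after the unit's output. This is the architecture from the Keras tutorial you linked in your question: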

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

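You are adding a dropout layer after the LSTM has finished its computation, meaning that there won't be any more recurrent passes in that unit. Think of this dropout layer as teaching the network not to rely on the output for a specific feature of a specific time step, but to generalize over information from different features and time steps. Dropout here is no different from dropout in feed-forward architectures.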
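One way to check this is to inspect the shape of the tensor that reaches the Dropout layer. The sketch below rebuilds the same stack with the functional API, assuming tf.keras is available; the values of max_features and maxlen are hypothetical small numbers, not the ones from the tutorial.

```python
import numpy as np
import tensorflow as tf

# Hypothetical small values, used only for this illustration.
max_features, maxlen = 1000, 20

inputs = tf.keras.Input(shape=(maxlen,), dtype="int32")
h = tf.keras.layers.Embedding(max_features, 128)(inputs)
h = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(h)

# Without return_sequences=True the LSTM returns only its final output,
# so the time axis is already gone here: (None, 128).
lstm_output_shape = tuple(h.shape)
print(lstm_output_shape)

h = tf.keras.layers.Dropout(0.5)(h)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(h)
model = tf.keras.Model(inputs, outputs)
```

The Dropout layer receives a 2-D tensor of shape (batch, 128), with forward and backward outputs concatenated, so there are no time steps left for a mask to differ across.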

What Gal & Ghahramani propose in their paper (which you linked in the question) is dropout within the recurrent unit. There, you're dropping input information between the time steps of a sequence. I found this blog post very helpful for understanding the paper and how it relates to the Keras implementation.
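To make "dropout within the recurrent unit" concrete, here is a rough NumPy sketch of the idea of sampling the masks once per sequence and reusing them at every time step. It is a plain tanh RNN rather than a GRU or LSTM, it is not Keras's implementation, and all names (dropout_mask, rnn_with_shared_masks, w_in, w_rec) and sizes are made up for illustration. In Keras, the corresponding knobs are the dropout and recurrent_dropout arguments of the recurrent layers.

```python
import numpy as np

def dropout_mask(rng, shape, rate):
    """Inverted-dropout mask: 0 with probability `rate`, else 1 / (1 - rate)."""
    keep = rng.random(shape) >= rate
    return keep.astype(np.float64) / (1.0 - rate)

def rnn_with_shared_masks(x, w_in, w_rec, rate_in, rate_rec, rng):
    """Tanh RNN whose dropout masks are sampled once per sequence."""
    batch, timesteps, features = x.shape
    units = w_rec.shape[0]
    # Both masks are drawn before the loop and reused at every time step.
    mask_in = dropout_mask(rng, (batch, features), rate_in)
    mask_rec = dropout_mask(rng, (batch, units), rate_rec)
    h = np.zeros((batch, units))
    for t in range(timesteps):
        h = np.tanh((x[:, t, :] * mask_in) @ w_in + (h * mask_rec) @ w_rec)
    return h, mask_in, mask_rec

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 1000, 3))            # 1000 time steps, as in the question
w_in = rng.normal(scale=0.1, size=(3, 5))    # hypothetical input weights
w_rec = rng.normal(scale=0.1, size=(5, 5))   # hypothetical recurrent weights
h_final, mask_in, mask_rec = rnn_with_shared_masks(x, w_in, w_rec, 0.25, 0.25, rng)
print(h_final.shape)
```

Drawing a fresh mask inside the loop instead would give the per-time-step variant that the paper argues against for the recurrent connections.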

machine-learning

deep-learning

keras

recurrent-neural-network

dropout