Why does the backprop algorithm store the inputs to the non-linearity of the hidden layers?

I have been reading the Deep Learning book by Ian Goodfellow, and Section 6.5.7 says:

The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.

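I understand that backprop stores gradients in a way similar to dynamic programming, so that they do not need to be recomputed. But I am confused about why it also stores the inputs.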

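Backpropagation is a special case of reverse mode automatic differentiation (AD). Compared with forward mode, the main advantage of reverse mode is that you can compute the derivative of an output w.r.t. all inputs in a single pass.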

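The drawback, however, is that you need to keep all intermediate results of the algorithm you are differentiating in a suitable data structure (such as a graph or a Wengert tape) for as long as you are computing its Jacobian with reverse mode AD, because you are essentially "working backwards" through the algorithm. The derivative of each nonlinearity has to be evaluated at the value it received during the forward pass, which is why the inputs to the nonlinearities are among the things that get stored.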
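To make the storage requirement concrete, here is a minimal sketch, assuming a toy network with a single scalar hidden unit, y = w2 * tanh(w1 * x). The names `forward`, `backward` and `tape` are made up for illustration and are not taken from the book or from any library.

```python
import math

def forward(x, w1, w2):
    """Forward pass of y = w2 * tanh(w1 * x), recording intermediates on a tape."""
    z = w1 * x        # input to the nonlinearity (pre-activation)
    h = math.tanh(z)  # hidden activation
    y = w2 * h
    # Everything the backward pass will need is kept here (hypothetical layout).
    tape = {"x": x, "z": z, "h": h, "w1": w1, "w2": w2}
    return y, tape

def backward(tape, dy=1.0):
    """Walk the tape in reverse and return (dy/dx, dy/dw1, dy/dw2)."""
    dh = dy * tape["w2"]
    dw2 = dy * tape["h"]
    # tanh'(z) = 1 - tanh(z)^2 has to be evaluated at the stored z.
    dz = dh * (1.0 - math.tanh(tape["z"]) ** 2)
    dw1 = dz * tape["x"]
    dx = dz * tape["w1"]
    return dx, dw1, dw2

y, tape = forward(0.5, 2.0, -1.5)
dx, dw1, dw2 = backward(tape)
```

A single call to `backward` yields the derivative with respect to every input, but only because `z`, `h` and `x` were kept around after the forward pass. Without the tape, those values would have to be recomputed.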

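Forward mode AD does not have this drawback, but you need to repeat the computation for every input, so it only makes sense when your algorithm has many more output variables than input variables.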
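For comparison, forward mode can be sketched with dual numbers. This is only an illustration under assumptions: the `Dual` class, `dual_tanh` and `forward_mode_grad` are hypothetical helpers written for this example, and the function being differentiated is a toy y = w2 * tanh(w1 * x).

```python
import math

class Dual:
    """A value together with its derivative w.r.t. one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __mul__(self, other):
        # Product rule, applied as the computation moves forward.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dual_tanh(a):
    t = math.tanh(a.val)
    return Dual(t, (1.0 - t * t) * a.dot)

def f(x, w1, w2):
    return w2 * dual_tanh(w1 * x)

def forward_mode_grad(func, inputs):
    """One full evaluation of func per input: seed that input's derivative with 1."""
    grads = []
    for i in range(len(inputs)):
        args = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(inputs)]
        grads.append(func(*args).dot)
    return grads

grads = forward_mode_grad(f, [0.5, 2.0, -1.5])  # three passes for three inputs
```

Nothing is stored between steps here, since each derivative travels along with its value. The price is one evaluation of `f` per input, which is impractical for a network with millions of weights.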