Why is the gradient of tanh in TensorFlow `grad = dy * (1 - y*y)`?

`tf.raw_ops.TanhGrad` computes `grad = dy * (1 - y*y)`, where `y = tanh(x)`.

But I think that since dy/dx = 1 - y*y, where y = tanh(x), the grad should be dy / (1 - y*y). Where am I wrong?

An expression like dy/dx is mathematical notation for the derivative; it is not an actual fraction. It makes no sense to move dy and dx around separately as if they were a numerator and a denominator.

Mathematically, it is known that d(tanh(x))/dx = 1 - (tanh(x))^2. TensorFlow computes gradients "backwards" (so-called backpropagation, or more generally reverse-mode automatic differentiation). That means that, in general, we reach the computation of the gradient of tanh(x) only after the step where we compute the gradient of an "outer" function g(tanh(x)). Here g represents all the operations that are applied to the output of tanh on the way to the value whose gradient is being computed.

By the chain rule, the derivative of this composition is d(g(tanh(x)))/dx = d(g(tanh(x)))/d(tanh(x)) * d(tanh(x))/dx. The first factor, d(g(tanh(x)))/d(tanh(x)), is the gradient accumulated backwards up to tanh, i.e. the derivative of all those subsequent operations, and it is the value called dy in the function's documentation. Therefore, you only need to compute d(tanh(x))/dx, which is (1 - y * y) since y = tanh(x), and multiply it by the given dy. The resulting value is then propagated further back to the operation that originally produced the input x of tanh, where it in turn becomes the dy value for that operation's gradient computation, and so on, until the source of the gradient is reached.
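The chain-rule flow described above can be sketched in plain Python. This is a minimal sketch, not TensorFlow itself: `tanh_grad` mirrors what `tf.raw_ops.TanhGrad` computes, and `g(y) = y**2` is a hypothetical outer function chosen for illustration. The result is checked against a finite-difference derivative of the whole composition g(tanh(x)):

```python
import math

def tanh_grad(y, dy):
    # Local gradient of tanh, as TanhGrad computes it:
    # d(tanh(x))/dx = 1 - tanh(x)^2 = 1 - y*y, scaled by the upstream dy.
    return dy * (1.0 - y * y)

x = 0.3
y = math.tanh(x)          # forward pass through tanh

# Hypothetical outer function g(y) = y**2, so the full graph is g(tanh(x)).
# Backward pass: the upstream gradient dy = dg/dy = 2*y reaches tanh first.
dy = 2.0 * y
grad = tanh_grad(y, dy)   # dg/dx via the chain rule: dy * (1 - y*y)

# Check against a central finite difference of g(tanh(x)).
eps = 1e-6
numeric = (math.tanh(x + eps) ** 2 - math.tanh(x - eps) ** 2) / (2 * eps)
print(abs(grad - numeric) < 1e-8)  # → True
```

Note that the multiplication by dy (rather than a division) is exactly the chain-rule product: each op multiplies the incoming upstream gradient by its own local derivative and passes the product further back.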