Why does softmax get a small gradient when the values are large in the paper 'Attention Is All You Need'?

nlp
deep-learning
softmax
attention-model

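Here is a screenshot from the original paper. As I understand it, the paper says that when the dot-product values are large, the gradient of the softmax becomes very small.
However, I tried computing the gradient of softmax with a cross-entropy loss and found that the gradient is not directly related to the magnitude of the values passed to softmax.
Even if a single value is large, it can still get a large gradient when the other values are also large. (Sorry, I don't know how to include the calculation here.)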

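Actually, with a one-hot encoded target, the gradient of the cross-entropy through softmax at the target class k is just d/dx_k [-log(softmax(x)_k)] = softmax(x)_k - 1, so its magnitude is 1 - softmax(x)_k (https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/). If the value passed to softmax for that class is much larger than the others, softmax outputs approximately 1 there, and the gradient is therefore approximately 0.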
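To make this concrete, here is a minimal pure-Python sketch (not code from the paper). The helper names `softmax`, `cross_entropy_grad`, `softmax_jacobian`, and `max_abs`, and the logits `[1.0, 2.0, 3.0]`, are my own choices for illustration. It checks two things: shifting every input by the same constant leaves softmax unchanged, so only the differences between inputs matter, and multiplying the inputs by a growing factor (which is roughly what large unscaled dot products do) pushes softmax toward a one-hot output, where both the Jacobian entries p_i * (delta_ij - p_j) and the cross-entropy gradient at the target class shrink toward 0.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(xs, target):
    """Gradient of -log(softmax(xs)[target]) w.r.t. xs: softmax(xs) - one_hot(target)."""
    p = softmax(xs)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def softmax_jacobian(xs):
    """Jacobian of softmax: J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(xs)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def max_abs(matrix):
    """Largest absolute entry of a nested-list matrix."""
    return max(abs(v) for row in matrix for v in row)

base = [1.0, 2.0, 3.0]  # arbitrary illustrative logits (an assumption)

# Adding the same constant to every input does not change the output,
# so "all values are large" is not the problematic case.
shifted = [x + 1000.0 for x in base]
print("shift-invariant:", softmax(base), softmax(shifted))

# Scaling the inputs widens the gaps between them and saturates softmax.
for scale in (1, 10, 100):
    logits = [scale * x for x in base]
    grad = cross_entropy_grad(logits, target=2)
    print(scale, max_abs(softmax_jacobian(logits)), grad)
```

As the scale grows, the printed largest Jacobian entry and the gradient at the target class both collapse toward zero. This would also explain the observation in the question: values that are all large but close to each other do not saturate softmax, whereas large gaps between them do. In attention the softmax is an intermediate layer, so any upstream gradient is multiplied by this Jacobian, which is why, as I read it, the paper scales the dot products by 1/sqrt(d_k).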
