Calculating the number of parameters of a GRU layer (Keras)

Question

Why is the number of parameters of the GRU layer 9600?

Shouldn't it be ((16+32)*32 + 32) * 3 * 2 = 9,408?

Or, rearranging,

32*(16+32+1)*3*2 = 9408

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=4500, output_dim=16, input_length=200),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Answer 1

The key is that TensorFlow keeps separate biases for the input and recurrent kernels when reset_after=True in GRUCell. You can see this in the GRUCell source code:

if self.use_bias:
    if not self.reset_after:
        bias_shape = (3 * self.units,)
    else:
        # separate biases for input and recurrent kernels
        # Note: the shape is intentionally different from CuDNNGRU biases
        # `(2 * 3 * self.units,)`, so that we can distinguish the classes
        # when loading and converting saved weights.
        bias_shape = (2, 3 * self.units)

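Taking the reset gate as an example, we usually see the following formula:

$r_t=\sigma(x_tW_r+h_{t-1}U_r+b_r)$

But if we set reset_after=True, the actual formula is:

$r_t=\sigma(x_tW_r+b_{xr}+h_{t-1}U_r+b_{hr})$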

As you can see, reset_after=True is the default for GRU in TensorFlow 2, whereas in TensorFlow 1.x the default is reset_after=False.

So in TensorFlow 2 the number of parameters of the GRU layer should be ((16+32)*32 + 32 + 32) * 3 * 2 = 9600.
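The arithmetic above can be checked with a small helper. This is only a sketch: `gru_param_count` is a hypothetical function written for this answer, not part of Keras, and it assumes `use_bias=True` and the bias shapes shown in the source snippet.

```python
def gru_param_count(input_dim, units, reset_after=True, bidirectional=False):
    """Count trainable parameters of a Keras-style GRU layer (use_bias=True).

    Hypothetical helper for illustration; not a Keras API.
    """
    kernel = input_dim * units * 3        # input weights for the z, r, h blocks
    recurrent_kernel = units * units * 3  # recurrent weights for the z, r, h blocks
    # bias_shape is (2, 3 * units) when reset_after=True, else (3 * units,)
    bias = (2 if reset_after else 1) * 3 * units
    total = kernel + recurrent_kernel + bias
    # A Bidirectional wrapper holds two independent copies of the layer.
    return 2 * total if bidirectional else total


print(gru_param_count(16, 32, reset_after=True, bidirectional=True))   # 9600
print(gru_param_count(16, 32, reset_after=False, bidirectional=True))  # 9408
```

The difference of 192 is exactly the extra bias row: 3 * 32 values per direction, times 2 directions.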

Answer 2

As a supplement to the accepted answer, I figured out a bit more about this. What Keras does in GRUCell.call() is:

$z_t=\sigma(x_tW_z+b_{xz}+h_{t-1}U_z+b_{hz})$

$r_t=\sigma(x_tW_r+b_{xr}+h_{t-1}U_r+b_{hr})$

With reset_after=False (the default in TensorFlow 1):

$h_t=z_t\odot h_{t-1}+(1-z_t)\odot \tanh(x_tW_h+b_{xh}+(r_t\odot h_{t-1})U_h+b_{hh})$

With reset_after=True (the default in TensorFlow 2):

$h_t=z_t\odot h_{t-1}+(1-z_t)\odot \tanh(x_tW_h+b_{xh}+r_t\odot(h_{t-1}U_h+b_{hh}))$

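With reset_after=False, b_xz and b_hz, b_xr and b_hr, and b_xh and b_hh each appear in the equations only as a sum, so (I assume) TensorFlow treats each of these pairs as a single parameter vector, just as the OP pointed out in a comment above. This matches the bias shape of (3 * units,) in that mode. With reset_after=True, however, that is no longer the case for b_xh and b_hh: b_hh is scaled by r_t while b_xh is not, so the two can and will differ and cannot be combined into one vector. That is why the total parameter count is higher.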
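To make the argument about the candidate-state biases concrete, here is a minimal sketch for a single unit with scalar weights. The function `candidate` and all the numbers are made up for illustration; it only evaluates the tanh term of the two h_t equations above, not the Keras implementation itself.

```python
import math


def candidate(x, h_prev, r, w, u, b_x, b_h, reset_after):
    """tanh term of h_t for one unit with scalar weights (illustrative only)."""
    if reset_after:
        # reset gate is applied after the recurrent product, so it scales b_h too
        return math.tanh(x * w + b_x + r * (h_prev * u + b_h))
    # reset gate is applied to h_prev before the recurrent product
    return math.tanh(x * w + b_x + (r * h_prev) * u + b_h)


x, h_prev, r, w, u = 0.5, -0.3, 0.25, 0.8, 1.1

# Two ways of splitting the same total bias (0.3) between b_x and b_h.
split_a = (0.1, 0.2)
split_b = (0.3, 0.0)

before_a = candidate(x, h_prev, r, w, u, *split_a, reset_after=False)
before_b = candidate(x, h_prev, r, w, u, *split_b, reset_after=False)
after_a = candidate(x, h_prev, r, w, u, *split_a, reset_after=True)
after_b = candidate(x, h_prev, r, w, u, *split_b, reset_after=True)

print(math.isclose(before_a, before_b))  # True: only the sum b_x + b_h matters
print(math.isclose(after_a, after_b))    # False: the split itself matters
```

With reset_after=False the two splits are indistinguishable, so one bias vector is enough; with reset_after=True they give different outputs, so both vectors carry information.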
