Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?

In the Keras documentation for the Embedding layer (https://keras.io/layers/embeddings/), the explanation of mask_zero reads:

mask_zero: Whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal |vocabulary| + 2).

Why does input_dim need to be 2 + the vocabulary size? Given that 0 is masked and cannot be used, shouldn't it just be 1 + the word count? What is the other extra entry for?

Because input_dim is already defined with a +1 relative to the vocabulary, you only need to add another +1 for the reserved 0 to get the +2:

input_dim: int > 0. Size of the vocabulary, ie. 1 + maximum integer index occurring in the input data.

I think the documentation is somewhat misleading there. In the normal case you map your n input data indices [0, 1, 2, ..., n-1] to vectors, so your input_dim should be exactly as large as the number of elements you have:

input_dim = len(vocabulary_indices)

An equivalent (but slightly confusing) way of saying this, and the way the docs put it, is

1 + maximum integer index occurring in the input data.

input_dim = max(vocabulary_indices) + 1
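
As a concrete sketch of this normal, unmasked case (using tf.keras; the 4-word vocabulary and the names vocab and emb are just for illustration):

import numpy as np
from tensorflow.keras.layers import Embedding

# Hypothetical 4-word vocabulary, indexed 0..3 with no index reserved.
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}

# input_dim = len(vocab) = 4, equivalently max index (3) + 1.
emb = Embedding(input_dim=len(vocab), output_dim=8)

x = np.array([[0, 1, 2, 3]])  # every index is < input_dim
print(emb(x).shape)           # (1, 4, 8): batch, timesteps, output_dim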

If you enable masking, the value 0 is treated differently, so you increment your n indices by one, and the index range becomes [0, 1, 2, ..., n-1, n]; thus you need

input_dim = len(vocabulary_indices) + 1

or

input_dim = max(vocabulary_indices) + 2
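
And a matching sketch of the masked case (same hypothetical vocabulary, shifted up by one so that 0 is free to serve as padding):

import numpy as np
from tensorflow.keras.layers import Embedding

# Same hypothetical vocabulary, now indexed 1..4; 0 is reserved for padding.
vocab = {"the": 1, "cat": 2, "sat": 3, "down": 4}

# input_dim = len(vocab) + 1 = 5, equivalently max shifted index (4) + 1.
emb = Embedding(input_dim=len(vocab) + 1, output_dim=8, mask_zero=True)

x = np.array([[1, 2, 3, 4, 0, 0]])  # the two trailing 0s are padding
print(emb.compute_mask(x))          # [[ True  True  True  True False False]]

Downstream mask-aware layers (e.g. recurrent layers) then skip the timesteps whose mask is False.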

The documentation becomes particularly confusing at this point, because it says

(input_dim should equal |vocabulary| + 2)

I would read |x| as the cardinality of a set (equivalent to len(x)), but what the authors actually seem to mean is

2 + maximum integer index occurring in the input data.
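
Under that reading the two formulas above agree; a quick sanity check with hypothetical numbers (a 10-word vocabulary indexed 0..9 before shifting):

vocabulary_indices = range(10)  # original 0-based indices, max index = 9

# With masking enabled, len + 1 and max + 2 name the same quantity.
assert len(vocabulary_indices) + 1 == max(vocabulary_indices) + 2 == 11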