Reason for adding 1 to word index for sequence modeling

I have noticed that in many tutorials 1 is added to the word_index. For example, consider this sample snippet inspired by TensorFlow's NMT tutorial https://www.tensorflow.org/tutorials/text/nmt_with_attention :

import tensorflow as tf
sample_input = ["sample sentence 1", "sample sentence 2"]
lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
lang_tokenizer.fit_on_texts(sample_input)
vocab_inp_size = len(lang_tokenizer.word_index)+1
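
For reference, printing the tokenizer's word_index for this toy input shows that the indices start at 1, not 0 (the ordering shown is only indicative; Keras orders entries by frequency and first appearance):

print(lang_tokenizer.word_index)
# e.g. {'sample': 1, 'sentence': 2, '1': 3, '2': 4}
print(vocab_inp_size)
# 5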

I do not understand the reason for adding 1 to the word_index dictionary. Won't adding a seemingly arbitrary 1 affect the predictions? Any suggestions would be helpful.

According to the documentation, the largest integer in the input to layers.Embedding should be smaller than the vocabulary size, i.e. input_dim:

input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.

That is why:

vocab_inp_size = len(inp_lang.word_index)  + 1
vocab_tar_size = len(targ_lang.word_index) + 1
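
Concretely, the Tokenizer assigns word indices starting from 1, and pad_sequences fills with 0, so index 0 never denotes a real word and the largest index that can appear in a sequence equals len(word_index). A minimal sketch reusing sample_input from the question:

import tensorflow as tf

sample_input = ["sample sentence 1", "sample sentence 2"]
lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
lang_tokenizer.fit_on_texts(sample_input)

seqs = lang_tokenizer.texts_to_sequences(sample_input)
print(seqs)  # [[1, 2, 3], [1, 2, 4]], word indices start at 1

padded = tf.keras.preprocessing.sequence.pad_sequences(seqs, maxlen=5)
print(padded)
# [[0 0 1 2 3]
#  [0 0 1 2 4]], 0 is the padding value, not a word
print(padded.max() + 1)                    # 5, the required input_dim
print(len(lang_tokenizer.word_index) + 1)  # 5, the same number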

For example, consider the following case,

import numpy as np
from tensorflow.keras.layers import Input, Embedding
from tensorflow.keras.models import Model

inp = np.array([
  [1, 0, 2, 0],
  [1, 1, 5, 0],
  [1, 1, 3, 0]
])
print(inp.shape, inp.max())  # (3, 4) 5

# The largest integer (i.e. word index) in the input
# should be no larger than the vocabulary size, i.e. input_dim, of the Embedding layer.

x = Input(shape=(4,))
e = Embedding(input_dim=inp.max() + 1, output_dim=5, mask_zero=False)(x)

m = Model(inputs=x, outputs=e)
print(m.predict(inp).shape)  # (3, 4, 5)
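
As a quick negative test (my own sketch, continuing from the snippet above): shrinking input_dim below inp.max() + 1 makes the lookup fail. On CPU this typically raises an InvalidArgumentError; on GPU an out-of-range lookup may instead silently return undefined values.

import tensorflow as tf

bad = Embedding(input_dim=inp.max(), output_dim=5)(x)  # valid indices are [0, 5), but inp contains 5
try:
    Model(inputs=x, outputs=bad).predict(inp)
except tf.errors.InvalidArgumentError as err:
    print(err.message)  # e.g. indices[1,2] = 5 is not in [0, 5)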

So the input_dim of the Embedding layer has to be greater than inp.max(), otherwise an error is raised, as shown above. Besides, mask_zero is False by default, but if it is set to True, then index 0 can no longer be used in the vocabulary. According to the doc:

mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

So, if we set mask_zero to True in the example above, then the input_dim of the Embedding layer becomes

Embedding(input_dim = inp.max() + 2 , output_dim = 5, mask_zero=True)
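
To make that concrete, here is a sketch (the shift of the indices by 1 is my own illustration, not part of the original example): with mask_zero=True, index 0 is reserved for padding, so every real token has to move up by one, the largest index becomes inp.max() + 1, and input_dim must therefore be inp.max() + 2.

shifted = np.where(inp > 0, inp + 1, 0)  # keep 0 as padding, shift real tokens by 1

x2 = Input(shape=(4,))
e2 = Embedding(input_dim=inp.max() + 2, output_dim=5, mask_zero=True)(x2)
m2 = Model(inputs=x2, outputs=e2)
print(m2.predict(shifted).shape)  # (3, 4, 5)

# the Embedding layer also produces a mask that downstream layers can consume
print(m2.layers[1].compute_mask(shifted).numpy())
# [[ True False  True False]
#  [ True  True  True False]
#  [ True  True  True False]]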