Keras image captioning model not compiling because of concatenate layer when mask_zero=True in a previous layer

Question

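I am new to Keras and I am trying to implement a model for an image captioning project.

I am trying to reproduce the model from the image captioning pre-inject architecture (the picture is taken from this paper: Where to put the image in an image captioning generator), with one small difference: I generate a word at every time step instead of only a single word at the end. In this architecture, the input to the LSTM at the first time step is the embedded CNN features. The LSTM should support variable-length input, so I padded all the sequences with zeros so that each of them has maxlen time steps.

The code for the model I have right now is the following: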
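The zero-padding step described above can be sketched in plain Python. This is only an illustration, not the asker's preprocessing code: the helper name `pad_captions` and the sample word indices are hypothetical, and it assumes real words use indices starting at 1 so that 0 is free to mean "padding".

```python
def pad_captions(sequences, maxlen, pad_value=0):
    """Right-pad (or truncate) each list of word indices to exactly maxlen items."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[:maxlen]  # drop anything past maxlen
        padded.append(trimmed + [pad_value] * (maxlen - len(trimmed)))
    return padded


# Hypothetical captions encoded as word indices (1-based, 0 reserved for padding).
captions = [[4, 9, 2], [7, 1, 3, 8, 5]]
print(pad_captions(captions, maxlen=5))
```

Reserving index 0 for padding is what makes mask_zero=True meaningful later on, and it is why the vocabulary size passed to Embedding grows by one (voc_size + 1).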

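# Imports assumed for standalone Keras 2.x (adjust if you use tensorflow.keras)
from keras.layers import (Input, BatchNormalization, Dense, RepeatVector,
                          Embedding, LSTM, TimeDistributed, concatenate)
from keras.models import Model
from keras.optimizers import Adam

def get_model(model_name, batch_size, maxlen, voc_size, embed_size,
        cnn_feats_size, dropout_rate):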

    # create input layer for the cnn features
    cnn_feats_input = Input(shape=(cnn_feats_size,))

    # normalize CNN features 
    normalized_cnn_feats = BatchNormalization(axis=-1)(cnn_feats_input)

    # embed CNN features to have same dimension with word embeddings
    embedded_cnn_feats = Dense(embed_size)(normalized_cnn_feats)

    # add time dimension so that this layer output shape is (None, 1, embed_size)
    final_cnn_feats = RepeatVector(1)(embedded_cnn_feats)

    # create input layer for the captions (each caption has max maxlen words)
    caption_input = Input(shape=(maxlen,))

    # embed the captions
    embedded_caption = Embedding(input_dim=voc_size,
                                 output_dim=embed_size,
                                 input_length=maxlen)(caption_input)

    # concatenate CNN features and the captions.
    # Output shape should be (None, maxlen + 1, embed_size)
    img_caption_concat = concatenate([final_cnn_feats, embedded_caption], axis=1)

    # now feed the concatenation into a LSTM layer (many-to-many)
    lstm_layer = LSTM(units=embed_size,
                      input_shape=(maxlen + 1, embed_size),   # one additional time step for the image features
                      return_sequences=True,
                      dropout=dropout_rate)(img_caption_concat)

    # create a fully connected layer to make the predictions
    pred_layer = TimeDistributed(Dense(units=voc_size))(lstm_layer)

    # build the model with CNN features and captions as input and 
    # predictions output
    model = Model(inputs=[cnn_feats_input, caption_input], 
                  outputs=pred_layer)

    optimizer = Adam(lr=0.0001, 
                     beta_1=0.9, 
                     beta_2=0.999, 
                     epsilon=1e-8)

    model.compile(loss='categorical_crossentropy',optimizer=optimizer)
    model.summary()

    return model

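The model (as shown above) compiles without any errors (see: model summary), and I managed to train it on my data. However, it does not take into account that my sequences are zero-padded, so the results will not be accurate. When I try to change the Embedding layer to support masking (also making sure that I use voc_size + 1 instead of voc_size, as mentioned in the documentation), like this: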

embedded_caption = Embedding(input_dim=voc_size + 1,
                             output_dim=embed_size,
                             input_length=maxlen, mask_zero=True)(caption_input)

I get the following error:

Traceback (most recent call last):
  File "/export/home/.../py3_env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1567, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 200 and 1. Shapes are [200] and [1]. for 'concatenate_1/concat_1' (op: 'ConcatV2') with input shapes: [?,1,200], [?,25,1], [] and with computed input tensors: input[2] = <1>

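I don't know why it says the shape of the second array is [?, 25, 1], since I print its shape just before the concatenation and it is [?, 25, 200] (as it should be). I don't understand why a model that compiles and works fine without that parameter would have a problem with it, but I assume I am missing something.

I have also been thinking about using a Masking layer instead of mask_zero=True, but it would have to come before the Embedding, and the documentation says that the Embedding layer should be the first layer in a model (after the input).

Is there anything I could change to fix this, or is there a workaround?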

Answer 1

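The unequal-shape error refers to the masks rather than the tensors/inputs. Because concatenate supports masking, it has to handle mask propagation. Your final_cnn_feats has no mask (None), while your embedded_caption has a mask of shape (?, 25). You can check this with: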

print(embedded_caption._keras_history[0].compute_mask(caption_input))

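Since final_cnn_feats has no mask, concatenate gives it an all-True (all non-zero) mask so that mask propagation still works. That is correct in principle, but the mask takes the same shape as final_cnn_feats, i.e. (?, 1, 200) rather than (?, 1): it covers every feature at every time step instead of just every time step. That is where the unequal-shape error comes from ((?, 1, 200) vs. (?, 25), the latter showing up as [?, 25, 1] in the traceback once it is expanded to three dimensions).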
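The mask-shape mismatch described above can be reproduced without Keras by tracking shapes only. The sketch below is an imitation of the behaviour the answer describes, not Keras's own code; the function name `concat_mask_shape` is hypothetical, and `None` stands for the unknown batch size that the traceback prints as `?`.

```python
def concat_mask_shape(input_shapes, mask_shapes, axis=1):
    """Shape-only imitation of merging masks when concatenating along `axis`.

    An entry of None in mask_shapes means "this input has no mask".
    """
    if all(mask_shape is None for mask_shape in mask_shapes):
        return None  # nothing to propagate

    prepared = []
    for in_shape, mask_shape in zip(input_shapes, mask_shapes):
        if mask_shape is None:
            # Unmasked input: stand-in mask of all True, shaped like the input.
            prepared.append(tuple(in_shape))
        elif len(mask_shape) < len(in_shape):
            # Per-time-step mask: add a trailing axis of size 1.
            prepared.append(tuple(mask_shape) + (1,))
        else:
            prepared.append(tuple(mask_shape))

    reference = prepared[0]
    for shape in prepared[1:]:
        for dim, (a, b) in enumerate(zip(reference, shape)):
            if dim != axis and a != b:
                raise ValueError(
                    "cannot concatenate masks of shapes %s and %s" % (reference, shape))

    merged = list(reference)
    merged[axis] = sum(shape[axis] for shape in prepared)
    return tuple(merged[:-1])  # reducing over the last axis leaves one flag per time step


batch = None  # unknown batch size
inputs = [(batch, 1, 200), (batch, 25, 200)]

try:
    # Image features unmasked, captions masked: (?, 1, 200) vs. (?, 25, 1)
    concat_mask_shape(inputs, [None, (batch, 25)])
except ValueError as err:
    print("fails:", err)

# Both inputs carry a per-time-step mask: the merge succeeds.
print(concat_mask_shape(inputs, [(batch, 1), (batch, 25)]))
```

The first call fails on the last dimension (200 vs. 1), which matches the numbers in the error message. The second call shows that once both inputs carry one flag per time step, the merged mask covers all 26 steps.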

To fix it, you need to give final_cnn_feats a correct/matching mask. Now, I am not familiar with your project here. One option is to apply a Masking layer to final_cnn_feats, since it is designed to mask time step(s).

final_cnn_feats = Masking()(RepeatVector(1)(embedded_cnn_feats))

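This is only correct if the 200 features in final_cnn_feats are not all zero, i.e. there is always at least one non-zero value in final_cnn_feats. Under that condition, the Masking layer provides a (?, 1) mask and does not mask the single time step in final_cnn_feats.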
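To make the condition above concrete, here is a small pure-Python sketch of how a per-time-step mask can be derived from feature values, with a time step kept when at least one of its features differs from the mask value. The helper `timestep_mask` and the toy data (4 features instead of 200) are hypothetical, and this illustrates the rule rather than the Masking layer's actual code.

```python
def timestep_mask(batch, mask_value=0.0):
    """One boolean per time step: True if any feature differs from mask_value."""
    return [[any(v != mask_value for v in step) for step in sample]
            for sample in batch]


# Two hypothetical samples, one time step each, 4 features instead of 200.
image_feats = [
    [[0.3, 0.0, -1.2, 0.0]],  # at least one non-zero feature: step is kept
    [[0.0, 0.0, 0.0, 0.0]],   # every feature is zero: step would be masked
]
image_mask = timestep_mask(image_feats)
print(image_mask)  # [[True], [False]]

# Caption masks derived from zero-padded word indices (0 means padding).
caption_ids = [[4, 9, 0], [7, 0, 0]]
caption_mask = [[idx != 0 for idx in ids] for ids in caption_ids]

# With one flag per time step on both sides, the masks join along the time axis.
full_mask = [img + cap for img, cap in zip(image_mask, caption_mask)]
print(full_mask)  # [[True, True, True, False], [False, True, False, False]]
```

The second sample is the corner case the answer warns about: if every image feature were exactly zero, the image time step itself would be masked out.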
