How can LSTM attention have variable-length input?

Question

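The attention mechanism of an LSTM is a straight softmax feed-forward network that takes in the hidden states of each time step of the encoder and the decoder's current state.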

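These two steps seem to contradict each other, and I can't wrap my head around it: 1) the number of inputs to a feed-forward network has to be predefined, and 2) the number of encoder hidden states is variable (it depends on the number of time steps during encoding).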

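Am I misunderstanding something? Also, would training be the same as for a regular encoder/decoder network, or would I have to train the attention mechanism separately?

Thanks in advance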

Answer 1

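I asked myself the same thing today and found this question. I have never implemented an attention mechanism myself, but from this paper it seems to be a little more than just a straight softmax. For each output y_i of the decoder network, a context vector c_i is computed as a weighted sum of the encoder hidden states h_1, ..., h_T: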

c_i = α_{i1} h_1 + ... + α_{iT} h_T

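The number of time steps T can differ from sample to sample because the coefficients α_{ij} are not a vector of fixed size. In fact, they are computed as softmax(e_{i1}, ..., e_{iT}), where each e_{ij} is the output of a neural network whose inputs are the encoder hidden state h_j and the decoder hidden state s_{i-1}: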

e_{ij} = f(s_{i-1}, h_j)

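Thus, before y_i is computed, this neural network must be evaluated T times, producing the T weights α_{i1}, ..., α_{iT}. Each evaluation takes a fixed-size input (one s_{i-1} and one h_j), so only the number of evaluations changes with the sequence length, not the shape of the network. Also, this tensorflow implementation might be useful.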
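To make the point about T concrete, here is a minimal pure-Python sketch. It assumes an additive scoring function f(s, h) = v · tanh(W_s s + W_h h), which is only one possible choice of f; the answer itself just says f is a neural network. The names matvec, score, context_vector and the toy dimensions are illustrative assumptions, not taken from the paper or the linked implementation. The parameter shapes depend only on the hidden sizes, so the same weights are reused for any number of encoder states.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of any length."""
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score(s_prev, h_j, W_s, W_h, v):
    """e_ij = f(s_{i-1}, h_j); f is assumed to be v . tanh(W_s s + W_h h)."""
    a = matvec(W_s, s_prev)
    b = matvec(W_h, h_j)
    return sum(vk * math.tanh(ak + bk) for vk, ak, bk in zip(v, a, b))

def context_vector(s_prev, encoder_states, W_s, W_h, v):
    """Evaluate f once per encoder state, then return (c_i, alphas)."""
    e = [score(s_prev, h_j, W_s, W_h, v) for h_j in encoder_states]
    alphas = softmax(e)
    dim = len(encoder_states[0])
    c = [sum(a * h[k] for a, h in zip(alphas, encoder_states)) for k in range(dim)]
    return c, alphas

def rand_matrix(rng, rows, cols):
    """Random matrix with entries in [-1, 1], for the toy example only."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumptions): hidden size 4 for encoder and decoder, attention size 3.
rng = random.Random(0)
W_s, W_h = rand_matrix(rng, 3, 4), rand_matrix(rng, 3, 4)
v = [rng.uniform(-1, 1) for _ in range(3)]
s_prev = [rng.uniform(-1, 1) for _ in range(4)]

# The same parameters are reused for sequences of length 5 and length 9.
for T in (5, 9):
    states = rand_matrix(rng, T, 4)
    c, alphas = context_vector(s_prev, states, W_s, W_h, v)
    print(T, len(alphas), len(c))
```

The loop prints the sequence length, the number of attention weights, and the size of the context vector: the weight count follows T while the context vector keeps the encoder's hidden size.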

Answer 2

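import tensorflow as tf  # written against the TensorFlow 1.x API
from tensorflow.contrib import layers

L2_REG = 1e-4  # assumed value; the original snippet does not define L2_REG

def attention(inputs, size, scope):
    # inputs has shape [batch, time, features].
    with tf.variable_scope(scope or 'attention') as scope:
        # Learned vector that every projected time step is scored against.
        attention_context_vector = tf.get_variable(name='attention_context_vector',
                                             shape=[size],
                                             regularizer=layers.l2_regularizer(scale=L2_REG),
                                             dtype=tf.float32)
        # fully_connected acts on the last axis, so each time step is projected independently.
        input_projection = layers.fully_connected(inputs, size,
                                            activation_fn=tf.tanh,
                                            weights_regularizer=layers.l2_regularizer(scale=L2_REG))
        # One scalar score per time step: [batch, time, 1].
        vector_attn = tf.reduce_sum(tf.multiply(input_projection, attention_context_vector), axis=2, keep_dims=True)
        # Normalize the scores over the time axis.
        attention_weights = tf.nn.softmax(vector_attn, dim=1)
        # Weighted sum of the original inputs over time: [batch, features].
        weighted_projection = tf.multiply(inputs, attention_weights)
        outputs = tf.reduce_sum(weighted_projection, axis=1)

    return outputs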

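I hope this code helps you understand how attention works. I use this function in my document-classification jobs, where the model is an LSTM-attention model, which is different from your encoder-decoder model.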
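For readers without a TensorFlow 1.x setup, the same computation can be sketched for a single sequence in plain Python. This is a rough restatement under assumptions, not the answerer's code: attention_pool and the hand-picked values of W, b and u are hypothetical stand-ins for the projection weights, bias and attention_context_vector. None of these parameters has a dimension tied to the number of time steps, which is why the function accepts sequences of different lengths.

```python
import math

def attention_pool(inputs, W, b, u):
    """inputs: list of T feature vectors; returns (pooled_vector, weights)."""
    scores = []
    for x in inputs:
        # tanh projection, applied to each time step separately.
        proj = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bk)
                for row, bk in zip(W, b)]
        # Dot product with the context vector u gives one score per step.
        scores.append(sum(p * uk for p, uk in zip(proj, u)))
    # Softmax over the time axis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the original inputs, mirroring the function above.
    dim = len(inputs[0])
    pooled = [sum(wt * x[k] for wt, x in zip(weights, inputs)) for k in range(dim)]
    return pooled, weights

# Hypothetical parameters: input size 2, attention size 3.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
b = [0.0, 0.1, -0.1]
u = [1.0, -1.0, 0.5]

short_seq = [[1.0, 0.0], [0.0, 1.0]]
long_seq = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
for seq in (short_seq, long_seq):
    pooled, weights = attention_pool(seq, W, b, u)
    print(len(seq), len(weights), len(pooled))
```

The pooled vector always has the input feature size, so a classifier placed on top of it sees a fixed-size input regardless of document length.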

text-processing

machine-learning

neural-network

lstm

recurrent-neural-network