Tensorflow - Decoder for Machine Translation
I am working through Tensorflow's tutorial on neural machine translation with an attention mechanism.
The decoder code looks like this:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)

        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # output shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights
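For context, here is a quick shape check of this decoder that reproduces the shapes in the comments above (a sketch only; it assumes the BahdanauAttention class from the tutorial is already defined, and the hyperparameters below are illustrative, not the tutorial's exact values):

import tensorflow as tf

batch_size, max_length = 64, 16
vocab_size, embedding_dim, units = 5000, 256, 1024

decoder = Decoder(vocab_size, embedding_dim, units, batch_size)

token   = tf.random.uniform((batch_size, 1), maxval=vocab_size, dtype=tf.int32)
hidden  = tf.zeros((batch_size, units))              # decoder hidden state (attention query)
enc_out = tf.zeros((batch_size, max_length, units))  # all encoder hidden states

logits, state, attention_weights = decoder(token, hidden, enc_out)
print(logits.shape)             # (64, 5000)  -> (batch_size, vocab_size)
print(state.shape)              # (64, 1024)  -> (batch_size, dec_units)
print(attention_weights.shape)  # (64, 16, 1) -> (batch_size, max_length, 1)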
What I don't understand here is that the decoder's GRU is not connected to the encoder by initializing it with the encoder's last hidden state:
output, state = self.gru(x)
# Why is it not initialized with the hidden state of the encoder?
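What I expected was something along these lines (just a sketch; a Keras GRU layer does accept an initial_state argument when it is called):

# Sketch of the explicit connection I expected: seed the decoder GRU with the
# state passed into call (at the first step, the encoder's final hidden state).
output, state = self.gru(x, initial_state=hidden)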
As I understand it, the encoder and decoder are only connected if the decoder is initialized with the "thought vector", i.e. the encoder's last hidden state.
Why is this missing from Tensorflow's official tutorial? Is it a bug, or am I missing something here?
Can anyone help me understand this?
This detailed NMT guide sums this up well; it compares the classic seq2seq NMT with the attention-based encoder-decoder NMT architecture:
Vanilla seq2seq: The decoder also needs to have access to the source information, and one simple way to achieve that is to initialize it with the last hidden state of the encoder, encoder_state.
Attention-based encoder-decoder: Remember that in the vanilla seq2seq model, we pass the last source state from the encoder to the decoder when starting the decoding process. This works well for short and medium-length sentences; however, for long sentences, the single fixed-size hidden state becomes an information bottleneck. Instead of discarding all of the hidden states computed in the source RNN, the attention mechanism provides an approach that allows the decoder to peek at them (treating them as a dynamic memory of the source information). By doing so, the attention mechanism improves the translation of longer sentences.
In both cases, you can use teacher forcing to train the model more effectively.
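To make the teacher forcing part concrete, here is a rough sketch of one training step in the style of the tutorial's train step (names such as encoder, optimizer, loss_function, targ_lang and BATCH_SIZE are assumed to exist as in the tutorial; details may differ slightly):

# Sketch of one teacher-forced training step, loosely following the tutorial.
loss = 0
with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    # The encoder's final hidden state is still used: it becomes the decoder's
    # first attention query (the 'hidden' argument), not the GRU's initial_state.
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

    for t in range(1, targ.shape[1]):
        predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
        loss += loss_function(targ[:, t], predictions)
        # Teacher forcing: feed the ground-truth target token as the next decoder input.
        dec_input = tf.expand_dims(targ[:, t], 1)

variables = encoder.trainable_variables + decoder.trainable_variables
gradients = tape.gradient(loss, variables)
optimizer.apply_gradients(zip(gradients, variables))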
TL;DR: the attention mechanism is what lets the decoder "peek" into the encoder, rather than you explicitly passing what the encoder has computed to the decoder.
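For reference, this is roughly what the BahdanauAttention layer used by the decoder does (a sketch reconstructed along the lines of the tutorial; treat the exact layer names as illustrative). The query is the decoder's previous hidden state and the values are all of the encoder's hidden states, so the source information flows in at every decoding step rather than only through an initial state:

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the encoder outputs (values)
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder hidden state (query)
        self.V = tf.keras.layers.Dense(1)       # scores each source position

    def call(self, query, values):
        # query shape  == (batch_size, hidden_size)
        # values shape == (batch_size, max_length, hidden_size)
        query_with_time_axis = tf.expand_dims(query, 1)

        # score shape == (batch_size, max_length, 1): additive (Bahdanau) scoring
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query_with_time_axis)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context_vector: weighted sum over all encoder hidden states,
        # shape == (batch_size, hidden_size)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights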