How do I prepare data for a seq2seq model?

Question

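I am building a machine translation system (English to French) using a sequence-to-sequence LSTM model.

I have looked at the Keras seq2seq LSTM example, but I cannot work out how the data is prepared from the text. This is the for loop that prepares the data, and there are a few things in it I don't understand.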

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.

Why do we need three different arrays: encoder_input, decoder_input and decoder_output?

for t, char in enumerate(target_text):
    decoder_input_data[i, t, target_token_index[char]] = 1.
    if t > 0:
        # decoder_target_data will be ahead by one timestep
        # and will not include the start character.
        decoder_target_data[i, t - 1, target_token_index[char]] = 1.
        # why is it t - 1? shouldn't it be t + 1?

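It says the decoder target will be ahead by one timestep. What does "ahead" mean here? Shouldn't that be "t + 1" rather than "t - 1"? I have also read that "we have to offset decoder_target_data by one timestep." What does that mean?

If possible, could you explain this for loop in full, along with any points I should keep in mind when preparing data for future seq2seq models? In other words, how do we prepare data for the model? I find it very confusing.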

Answer 1

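OK, I assume you have read lines 11 to 34 of the example ("Summary of the Algorithm"), so you know the basic idea behind this particular sequence-to-sequence model. First the encoder produces 2 "state vectors" (a latent "something"). These are then fed to a decoder, which... anyway, let's go through it step by step (lines 127-132):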

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

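There are two "states" for an LSTM; see https://keras.io/layers/recurrent/ under "Output shape". They are the internal state after the input sequence has been processed or, for a batch, an array of states with one row per sequence. The output that is produced is ignored. latent_dim is the number of LSTM units (line 60: it is 256), and it also determines the size of each state vector.

Next: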

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

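First, note that this model is not Sequential; it uses the functional API: https://keras.io/models/model/ . So the inputs are both the encoder input and the decoder input, and the output is the decoder output.

What is the size of the decoder output? num_decoder_tokens is the size of the dictionary (not the length of the output sequence)! Given the "history" and the current input, the decoder should produce a probability distribution over the next character of the output sequence. The "history" (the initial internal state) is the final state of the encoder after it has processed the input sequence.

Note: at inference time the decoder is initialized with the encoder's final state. Then, after each character is sampled, the updated state is used for the next prediction step, together with a new "input": a one-hot vector for the last predicted character.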
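To make the control flow of that inference loop concrete, here is a minimal sketch that does not use Keras at all. `toy_decoder_step` and `greedy_decode` are hypothetical names made up for illustration; the "decoder" here just reads characters from a fixed script, so only the loop structure (start character, state carried forward, prediction fed back, stop character) should be taken from it.

```python
# Toy greedy decoding loop. `toy_decoder_step` is a hypothetical stand-in
# for one call to a trained decoder; it is NOT a Keras API.
START, STOP = '\t', '\n'

def toy_decoder_step(prev_char, states):
    """Pretend decoder step: returns (next_char, updated_states).

    A real decoder would use `prev_char` (as a one-hot vector) together with
    the states; this toy ignores it and simply replays a stored script.
    """
    script = states['script']      # what the "encoder" remembered
    pos = states['pos']            # how far decoding has progressed
    next_char = script[pos] if pos < len(script) else STOP
    return next_char, {'script': script, 'pos': pos + 1}

def greedy_decode(encoder_states, max_len=20):
    states = encoder_states        # decoder starts from the encoder's final states
    prev_char = START              # the first decoder input is the start character
    decoded = []
    while len(decoded) < max_len:
        next_char, states = toy_decoder_step(prev_char, states)
        if next_char == STOP:      # stop once the end character is produced
            break
        decoded.append(next_char)
        prev_char = next_char      # feed the prediction back in as the next input
    return ''.join(decoded)

result = greedy_decode({'script': 'Bonjour', 'pos': 0})
print(result)
```

In the real model the states would be the two LSTM state vectors and the next character would be picked from the softmax output, but the loop has the same shape.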

Now, to your question. I think you want to understand why the training data looks the way it does.

First (lines 104-112):

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

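The encoder training set is a 3-D array. The first dimension is the number of samples, len(input_texts). The next dimension is the maximum sequence length, and the third is the token ("character") index, covering every character found in the texts (num_encoder_tokens for the English character set, and num_decoder_tokens for the French character set plus '\t' as the start-of-sentence marker).

So let's illustrate it with strings first, and then show the small difference. That's all there is to it.

Say the decoder output sequence is 'Bonjour' (I don't speak French, sorry), and assume max_decoder_seq_length == 10. Then: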
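As a rough illustration of where those three dimensions come from, the sketch below builds the vocabularies and allocates the arrays for two made-up sentence pairs. The toy sentences are my own, and the way the character sets and index dictionaries are built here is an assumption about the part of the script not quoted above; the variable names follow the quoted code.

```python
import numpy as np

# Toy data (made up); the real script reads sentence pairs from a file.
input_texts = ['Hi.', 'Run!']
# Assumption: targets are wrapped in '\t' (start) and '\n' (end).
target_texts = ['\tSalut.\n', '\tCours!\n']

# One token per distinct character seen in each language.
input_characters = sorted(set(''.join(input_texts)))
target_characters = sorted(set(''.join(target_texts)))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max(len(txt) for txt in input_texts)
max_decoder_seq_length = max(len(txt) for txt in target_texts)

# Character -> column index in the one-hot (third) dimension.
input_token_index = {c: i for i, c in enumerate(input_characters)}
target_token_index = {c: i for i, c in enumerate(target_characters)}

# (samples, timesteps, tokens)
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

print(encoder_input_data.shape)
print(decoder_input_data.shape)
```

Every sentence gets the same number of timesteps (the longest one in the data set), which is why shorter sentences have to be padded.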

decoder_input_data = 'Bonjour   '  # 3 spaces, to fill up to 10
decoder_output_data = 'onjour    ' # 4 spaces, to fill up to 10

However, this is not stored as a plain string. It is actually one-hot encoded: 0 means "not this character" and 1 means "this character".

So it is more like:
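A small sketch of that encoding for a single padded string may help. The vocabulary here is a toy one built only from this string (an assumption for illustration, not how the full script builds it), and `one_hot` is a hypothetical variable name.

```python
import numpy as np

max_decoder_seq_length = 10
text = 'Bonjour'
padded = text.ljust(max_decoder_seq_length)   # pad with spaces up to length 10

# Hypothetical toy vocabulary: just the characters of this one string.
chars = sorted(set(padded))
token_index = {c: i for i, c in enumerate(chars)}

# One row per timestep, one column per character in the vocabulary.
one_hot = np.zeros((max_decoder_seq_length, len(chars)), dtype='float32')
for t, char in enumerate(padded):
    one_hot[t, token_index[char]] = 1.0       # exactly one 1 in each row

# Taking the argmax of each row recovers the original characters.
decoded = ''.join(chars[k] for k in one_hot.argmax(axis=1))
print(repr(decoded))
```

The arrays in the question are the same thing with one extra leading dimension for the sample index `i`.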

decoder_input_data[0]['B'] = 1  # and decoder_input_data[0][anything_else] == 0
decoder_input_data[1]['o'] = 1  # HERE: t == 1
decoder_input_data[2]['n'] = 1
# ... 
decoder_input_data[6]['r'] = 1
decoder_input_data[7:10][' '] = 1  # the padding

And the decoder output must be shifted by 1 "to the left":

# for t == 0, the `decoder_output_data` is not touched (`if t > 0`)

# decoder_output_data[t-1]['o'] = 1  # t-1 == 0
decoder_output_data[0]['o'] = 1  # t == 1
decoder_output_data[1]['n'] = 1  # t == 2
decoder_output_data[2]['j'] = 1  # t == 3
# ...
decoder_output_data[6:10][' '] = 1  # output padding with spaces, longer by 1 than input padding

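So that is basically the answer to "why t-1".

Now, "why do we need 3 input data"?

Well, that is the idea behind the seq2seq approach:

We need the decoder to learn to produce the correct next French character given the previous one (and the initial state). That is why it learns from the shifted output sequence.

But which sequence should it produce in the first place? That is what the encoder is for: it produces a single final state in which it has "remembered" everything from reading the input sequence. Through training, we make this state (2 vectors per sequence, each of 256 floats) guide the decoder to produce the output sequence.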
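To check the shift numerically, the sketch below runs the decoder part of the loop from the question on the single string 'Bonjour' with max_decoder_seq_length == 10. The toy vocabulary and the `to_text` helper are my own additions for illustration; like the walk-through above, it leaves out the start character to keep things short.

```python
import numpy as np

target_text = 'Bonjour'
max_decoder_seq_length = 10

# Toy vocabulary: the characters of this string plus the padding space.
chars = sorted(set(target_text + ' '))
target_token_index = {c: i for i, c in enumerate(chars)}

decoder_input_data = np.zeros(
    (1, max_decoder_seq_length, len(chars)), dtype='float32')
decoder_target_data = np.zeros_like(decoder_input_data)

i = 0  # only one sample
for t, char in enumerate(target_text):
    decoder_input_data[i, t, target_token_index[char]] = 1.
    if t > 0:
        # the character fed at step t is the one to predict at step t - 1
        decoder_target_data[i, t - 1, target_token_index[char]] = 1.
# pad the remaining timesteps with spaces
decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
decoder_target_data[i, t:, target_token_index[' ']] = 1.

def to_text(one_hot):
    """Hypothetical helper: turn a (timesteps, tokens) one-hot array back into a string."""
    return ''.join(chars[k] for k in one_hot.argmax(axis=1))

input_str = to_text(decoder_input_data[0])
target_str = to_text(decoder_target_data[0])
# target at step t-1 equals input at step t, for every t >= 1
shift_ok = np.array_equal(decoder_target_data[0, :-1], decoder_input_data[0, 1:])
print(repr(input_str), repr(target_str), shift_ok)
```

Writing to index `t - 1` of the target is what makes the target "ahead": at any given position the target already holds the character that the input will only show one step later.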
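Seen as plain strings, the training signal is just a list of (character fed in, character expected) pairs. The sketch below assumes the target is wrapped in '\t' and '\n' as start and end markers; it simplifies the real arrays slightly, since the loop in the question also feeds the final '\n' and the padding to the decoder.

```python
# Assumption: '\t' marks the start and '\n' the end of the target sentence.
target_text = '\tBonjour\n'

decoder_input = target_text[:-1]    # what the decoder is fed at each step
decoder_target = target_text[1:]    # what it should predict at that step

pairs = list(zip(decoder_input, decoder_target))
for fed, expected in pairs:
    print(repr(fed), '->', repr(expected))
```

The first pair teaches the decoder what to emit right after the start character, and the last pair teaches it when to stop.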
