Teacher-Student System: Training Student with Top-k Hypotheses List

I want to set up a teacher-student system, where a teacher seq2seq model generates a top-k list of hypotheses which is then used to train a student seq2seq model.

My plan is to batch the teacher's hypotheses, meaning the teacher outputs a tensor with a batch-axis length of k * B, where B is the batch-axis length of the input. The output batch tensor now contains k hypotheses for every sequence of the input batch tensor, ordered by the position of the corresponding input sequence in the input batch.
This tensor is set as the student's training target. However, the student's batch tensors still have a batch-axis length of B, so I use tf.repeat to repeat the sequences of the student's encoder output tensor k times, and then feed that tensor into the student's decoder.
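
To illustrate the plan, here is a minimal standalone sketch (plain TensorFlow, hypothetical sizes, k=3) of how tf.repeat along the batch axis turns a (B, T, D) encoder output into a (k*B, T, D) tensor whose rows line up with a stack of k hypotheses per input sequence:

import tensorflow as tf

k, B, T, D = 3, 2, 5, 4              # hypothetical sizes
enc = tf.random.normal([B, T, D])    # batch-major encoder output: (B, T, D)

# Repeat every sequence k times along the batch axis: (B, T, D) -> (k*B, T, D).
# The row order becomes [seq0, seq0, seq0, seq1, seq1, seq1], matching a teacher
# output stacked as [hyp-0-1, hyp-0-2, hyp-0-3, hyp-1-1, hyp-1-2, hyp-1-3].
enc_rep = tf.repeat(enc, repeats=k, axis=0)
print(enc_rep.shape)                 # (6, 5, 4)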

For debugging purposes, I simplified this to repeating only the teacher's single best hypothesis for now, before I implement the top-k list selection.

Here is an excerpt of my config file:

[...]

# Variables:

student_target = "teacher_hypotheses_stack"

[...]

# Custom repeat function:

def repeat(source, src_name="source", **kwargs):
    import tensorflow as tf

    input = source(0)
    input = tf.Print(input, [src_name, "in", input, tf.shape(input)])

    output = tf.repeat(input, repeats=3, axis=1)
    output = tf.Print(output, [src_name, "out", output, tf.shape(output)])

    return output

def repeat_t(source, **kwargs):
    return repeat(source, "teacher")


def repeat_s(source, **kwargs):
    return repeat(source, "student")


[...]

# Configuration of the teacher + repeating of its output

**teacher_network(), # The teacher_network is an encoder-decoder seq2seq model. The teacher performs search during training and is not trainable.
"teacher_stack": {
    "class": "eval", "from": ["teacher_decision"], "eval": repeat_t,
    "trainable": False
    # "register_as_extern_data": "teacher_hypotheses_stack"
},
"teacher_stack_reinterpreter": { # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
    "class": "reinterpret_data",
    "set_axes": {"B": 1, "T": 0},
    "enforce_time_major": True,
    "from": ["teacher_stack"],
    "trainable": False,
    "register_as_extern_data": "teacher_hypotheses_stack"
}

[...]

# Repeating of the student's encoder output + configuration of its decoder

"student_encoder": {"class": "copy", "from": ["student_lstm6_fw", "student_lstm6_bw"]},  # dim: EncValueTotalDim
"student_encoder_repeater": {"class": "eval", "from": ["student_encoder"], "eval": repeat},
"student_encoder_stack": {  # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
    "class": "reinterpret_data",
    "set_axes": {"B": 1, "T": 0},
    "enforce_time_major": True,
    "from": ["student_encoder_repeater"]
},

"student_enc_ctx": {"class": "linear", "activation": None, "with_bias": True, "from": ["student_encoder_stack"], "n_out": EncKeyTotalDim},  # preprocessed_attended in Blocks
"student_inv_fertility": {"class": "linear", "activation": "sigmoid", "with_bias": False, "from": ["student_encoder_stack"], "n_out": AttNumHeads},
"student_enc_value": {"class": "split_dims", "axis": "F", "dims": (AttNumHeads, EncValuePerHeadDim), "from": ["student_encoder_stack"]},  # (B, enc-T, H, D'/H)

"model1_output": {"class": "rec", "from": [], 'cheating': config.bool("cheating", False), "unit": {
    'output': {'class': 'choice', 'target': student_target, 'beam_size': beam_size, 'cheating': config.bool("cheating", False), 'from': ["model1_output_prob"], "initial_output": 0},
    "end": {"class": "compare", "from": ["output"], "value": 0},
    'model1_target_embed': {'class': 'linear', 'activation': None, "with_bias": False, 'from': ['output'], "n_out": target_embed_size, "initial_output": 0},  # feedback_input
    "model1_weight_feedback": {"class": "linear", "activation": None, "with_bias": False, "from": ["prev:model1_accum_att_weights"], "n_out": EncKeyTotalDim, "dropout": 0.3},
    "model1_s_transformed": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_s"], "n_out": EncKeyTotalDim, "dropout": 0.3},
    "model1_energy_in": {"class": "combine", "kind": "add", "from": ["base:student_enc_ctx", "model1_weight_feedback", "model1_s_transformed"], "n_out": EncKeyTotalDim},
    "model1_energy_tanh": {"class": "activation", "activation": "tanh", "from": ["model1_energy_in"]},
    "model1_energy": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_energy_tanh"], "n_out": AttNumHeads},  # (B, enc-T, H)
    "model1_att_weights": {"class": "softmax_over_spatial", "from": ["model1_energy"]},  # (B, enc-T, H)
    "model1_accum_att_weights": {"class": "eval", "from": ["prev:model1_accum_att_weights", "model1_att_weights", "base:student_inv_fertility"],
                                 "eval": "source(0) + source(1) * source(2) * 0.5", "out_type": {"dim": AttNumHeads, "shape": (None, AttNumHeads)}},
    "model1_att0": {"class": "generic_attention", "weights": "model1_att_weights", "base": "base:student_enc_value"},  # (B, H, V)
    "model1_att": {"class": "merge_dims", "axes": "except_batch", "from": ["model1_att0"]},  # (B, H*V)
    "model1_s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:model1_target_embed", "prev:model1_att"], "n_out": 1000, "dropout": 0.3},  # transform
    "model1_readout_in": {"class": "linear", "from": ["model1_s", "prev:model1_target_embed", "model1_att"], "activation": None, "n_out": 1000, "dropout": 0.3},  # merge + post_merge bias
    "model1_readout": {"class": "reduce_out", "mode": "max", "num_pieces": 2, "from": ["model1_readout_in"]},
    "model1_output_prob": {
        "class": "softmax", "from": ["model1_readout"], "dropout": 0.3,
        "target": student_target,
        "loss": "ce", "loss_opts": {"label_smoothing": 0.1}
    }
}, "target": student_target},

[...]

Running this config prints the following error message to the console:

[...]

Create Adam optimizer.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [<tf.Variable 'optimize/beta1_power:0' shape=() dtype=float32_ref>, <tf.Variable 'optimize/beta2_power:0' shape=() dtype=float32_ref>].
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
TensorFlow exception: assertion failed: [x.shape[0] != y.shape[0]] [69 17] [23]
     [[node objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

[...]

Execute again to debug the op inputs...
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1_1:0' shape=(1,) dtype=int32> = shape (1,), dtype int32, min/max 23/23, ([23])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0_1:0' shape=() dtype=string> = bytes(b'x.shape[0] != y.shape[0]')
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_2:0' shape=(2,) dtype=int32> = shape (2,), dtype int32, min/max 17/69, ([69 17])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All_1:0' shape=() dtype=bool> = bool_(False)
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
Op inputs:
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All:0' shape=() dtype=bool>: bool_(False)
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0:0' shape=() dtype=string>: bytes(b'x.shape[0] != y.shape[0]')
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape:0' shape=(2,) dtype=int32>: shape (2,), dtype int32, min/max 17/69, ([69 17])
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1:0' shape=(1,) dtype=int32>: shape (1,), dtype int32, min/max 23/23, ([23])
Step meta information:
{'seq_idx': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
 'seq_tag': ['seq-0','seq-1','seq-2','seq-3','seq-4','seq-5','seq-6','seq-7','seq-8','seq-9','seq-10','seq-11','seq-12','seq-13','seq-14','seq-15','seq-16','seq-17','seq-18','seq-19','seq-20','seq-21','seq-22']}
Feed dict:
  <tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 80) dtype=float32>: shape (23, 42, 80), dtype float32, min/max -0.5/0.4, mean/stddev -0.050000004/0.28722814, Data(name='data', shape=(None, 80), batch_shape_meta=[B,T|'time:var:extern_data:data',F|80])
  <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 42/42, ([42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42])
  <tf.Tensor 'extern_data/placeholders/source_text/source_text:0' shape=(?, ?, 512) dtype=float32>: shape (23, 13, 512), dtype float32, min/max -0.5/0.4, mean/stddev -0.050011758/0.28722063, Data(name='source_text', shape=(None, 512), available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:source_text',F|512])
  <tf.Tensor 'extern_data/placeholders/source_text/source_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 13/13, ([13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13])
  <tf.Tensor 'extern_data/placeholders/target_text/target_text:0' shape=(?, ?) dtype=int32>: shape (23, 17), dtype int32, min/max 6656/6694, Data(name='target_text', shape=(None,), dtype='int32', sparse=True, dim=35209, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:target_text'])
  <tf.Tensor 'extern_data/placeholders/target_text/target_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 17/17, ([17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17])
  <tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>: bool(True)
EXCEPTION

[...]
File "home/philipp/Documents/bachelor-thesis/returnn/repository/TFUtil.py", line 4374, in sparse_labels_with_seq_lens
    x = check_dim_equal(x, 0, seq_lens, 0)
[...]

So, the network is constructed without errors, but on the first training step it crashes with this assertion error. It looks to me like RETURNN or TensorFlow validates the batch length against its original value somewhere. But I don't know where or why, so I have no idea what to do about it.

What am I doing wrong? Is my idea even realizable with RETURNN this way?

EDIT (10 June 2020): To clarify: my final goal is that the teacher generates a top-k list of hypotheses for every input sequence, which is then used to train the student. So, for every input sequence of the student, there are k solution/target sequences. To train the student, it has to predict the probability of each hypothesis, and the cross-entropy loss is then computed over these to determine the update gradients. But if there are k target sequences per input sequence, the student has to decode the encoder states k times, each time targeting a different target sequence. That is why I want to repeat the encoder states k times, to make the data parallel for the student's decoder, and then use RETURNN's default cross-entropy loss implementation:

input-seq-1 --- teacher-hyp-1-1; 
input-seq-1 --- teacher-hyp-1-2; 
...; 
input-seq-1 --- teacher-hyp-1-k; 
input-seq-2 --- teacher-hyp-2-1; 
... 
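
In other words (a minimal sketch with hypothetical shapes, not my actual config): once the batch is expanded to k*B, the standard frame-wise sparse cross-entropy applies unchanged:

import tensorflow as tf

k, B, T, V = 3, 2, 7, 10
logits = tf.random.normal([k * B, T, V])                            # student decoder outputs
targets = tf.random.uniform([k * B, T], maxval=V, dtype=tf.int32)   # stacked teacher hypotheses
# Frame-wise cross-entropy over the repeated batch, shape (k*B, T).
ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)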

Is there a more appropriate way to achieve this goal?

EDIT (12 June 2020 #1): Yes, I am aware that the teacher's DecisionLayer already selects the best hypothesis, and that this way I only repeat that single best hypothesis k times. I do this as an intermediate step towards the final goal. Later, I want to fetch the top-k list from the teacher's ChoiceLayer somehow, but that felt like a whole different construction site.
But Albert, you say that RETURNN would somehow extend the data on the batch dimension automatically? How am I supposed to picture that?

EDIT (12 June 2020 #2): OK, now I select the top-k (k=4 this time) hypotheses list from the teacher's choice layer (or output layer):

"teacher_hypotheses": {
    "class": "copy", "from": ["extra.search:teacherMT_output"],
    "register_as_extern_data": "teacher_hypotheses_stack"
}

But using this data as the training target of the student leads to the error:

TensorFlow exception: assertion failed: [shape[0]:] [92] [!=] [dim:] [23]
     [[node studentMT_output/rec/subnet_base/check_seq_len_batch_size/check_input_dim/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

I assume this happens because the batch-axis length of the student's target data (the hypotheses list) is k=4 times larger than the batch-axis length of the student's input data / encoder states. Doesn't the student's encoder-state data need to be extended/repeated here to match the target data?

EDIT (12 June 2020 #3): I consider the initial issue solved. The overall topic is continued here.

It does not only verify the batch length. It will collapse batch and time (it uses flatten_with_seq_len_mask, see Loss.init and the code of that function) and then calculate the loss on that flattened tensor. So the seq lengths also need to match. This might be a problem, but I am not sure. As you have the same target also for the rec layer itself, it should have the same seq lengths in training.
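
For intuition, a minimal sketch (plain TensorFlow, hypothetical shapes) of that kind of flattening: padded frames are dropped via a sequence-length mask and the rest is concatenated into one axis, which is why the seq lengths of output and target have to agree:

import tensorflow as tf

B, T = 3, 5
values = tf.random.normal([B, T])      # stand-in for any (B, T, ...) tensor
seq_lens = tf.constant([5, 3, 4])

# Conceptually what flatten_with_seq_len_mask does before the loss is computed:
# mask out padding and flatten batch and time into one axis of length sum(seq_lens).
mask = tf.sequence_mask(seq_lens, maxlen=T)   # (B, T) bool
flat = tf.boolean_mask(values, mask)          # shape (12,) here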

You can debug this by carefully checking the output of debug_print_layer_output_template, i.e. by checking the Data (batch-shape-meta) output and verifying that the axes are all as you expect them. (debug_print_layer_output_template can and should always be enabled. It will not make it slower.) You can also temporarily enable debug_print_layer_output_shape, which will really print the shapes of all tensors. That way you can verify how it looks.
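
For reference, this is how the two options would be set in the config:

# In the RETURNN config:
debug_print_layer_output_template = True  # print each layer's Data template (incl. batch_shape_meta); cheap, keep it enabled
debug_print_layer_output_shape = True     # additionally print the actual runtime shapes; enable only temporarily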

Your usage of ReinterpretDataLayer looks very wrong. You should never explicitly set the axes by integer (like "set_axes": {"B": 1, "T": 0}). Why are you doing that at all? This is probably the reason it gets messed up in the end.

Your repeat function is also not very generic. You are using hard-coded axis integers there as well. You should never do that. Instead, you would write it like this:

input_data = source(0, as_data=True)
input = input_data.placeholder
...
output = tf.repeat(input, repeats=3, axis=input_data.batch_dim_axis)
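
Put together as a self-contained eval function (just a sketch; the name repeat_generic and repeats=3 are placeholders from this discussion):

def repeat_generic(source, **kwargs):
    import tensorflow as tf
    input_data = source(0, as_data=True)  # get the RETURNN Data object, not just the raw tensor
    x = input_data.placeholder
    # Repeat along whatever axis is the batch axis of this input,
    # never a hard-coded integer axis.
    return tf.repeat(x, repeats=3, axis=input_data.batch_dim_axis)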

Do I understand correctly that this is what you want to do, i.e. repeat on the batch axis? In that case, you also need to adapt the seq-length information of the output of that layer. You cannot simply use that function as-is in an EvalLayer. You also need to define out_type as a function which returns the correct Data template. E.g. like this:

def repeat_out(out):
    import tensorflow as tf
    out = out.copy()
    out.size_placeholder[0] = tf.repeat(out.size_placeholder[0], axis=0, repeats=3)
    return out

...
"student_encoder_repeater": {
    "class": "eval", "from": ["student_encoder"], "eval": repeat,
    "out_type": lambda sources, **kwargs: repeat_out(sources[0].output)
}

Now you have the additional problem that every time you call this repeat_out, you get yet another set of seq-length information. RETURNN will not be able to tell whether these seq lengths are all the same or different (at compile time), and that will lead to errors or strange effects. To solve this, you should reuse the same seq lengths. E.g. like this:

"teacher_stack_": {
    "class": "eval", "from": "teacher_decision", "eval": repeat
},
"teacher_stack": {
    "class": "reinterpret_data", "from": "teacher_stack_", "size_base": "student_encoder_repeater"
}
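
Putting the pieces together (a sketch only, using the layer names from above and the repeat_generic/repeat_out helpers sketched earlier), so that both repeated tensors share one set of seq lengths:

"student_encoder_repeater": {
    "class": "eval", "from": "student_encoder", "eval": repeat_generic,
    "out_type": lambda sources, **kwargs: repeat_out(sources[0].output)
},
"teacher_stack_": {
    "class": "eval", "from": "teacher_decision", "eval": repeat_generic
},
"teacher_stack": {
    "class": "reinterpret_data", "from": "teacher_stack_",
    "size_base": "student_encoder_repeater",  # reuse the repeated seq-length info
    "register_as_extern_data": "teacher_hypotheses_stack"
},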

Btw, why do you do this repetition at all? What is the idea behind it? You repeat both the student and the teacher 3 times? So wouldn't simply increasing your learning rate by a factor of 3 do the same?

Edit: It seems this is done to match the top-k list. In that case, this is all wrong, because RETURNN should already take care of such repetition automatically. You should not do this manually.

Edit: To understand how the repetition (and beam-search resolving in general) works, the first thing you should do is look at the log output (you must have debug_print_layer_output_template enabled, but you should always have that anyway). You will see the output of every layer, especially its output Data object. This is already useful to check whether the shapes are all as you expect (check batch_shape_meta in the log). However, this is only the static shape at compile time, so the batch-dim is just a marker there. You will also see the search beam information. This keeps track of whether the batch originates from some beam search (basically any ChoiceLayer), and carries the beam and the beam size. Now, in the code, check SearchChoices.translate_to_common_search_beam and its usages. When you follow the code, you will see SelectSearchSourcesLayer, and effectively your case will end up with output.copy_extend_with_beam(search_choices.get_beam_info()).

End of edit.

Edit: Regarding the repetition, this is done automatically. You do not need to call copy_extend_with_beam manually.

If you expect to get the top-k list from the teacher, you are also likely doing it wrong, as I see that you use "teacher_decision" as input. I guess that comes from a DecisionLayer? In that case, it already took only the first-best out of the top-k beam.

Edit: Now I understand that you ignore this, and instead want to take only the first-best, and then also repeat that. I would recommend not to do that, because you are making it unnecessarily complicated, and you are kind of fighting RETURNN, which knows what the batch-dim should be and will get confused. (You can get it to work with what I wrote above, but really, it is just unnecessarily complex.)

Btw, there is no point in setting "trainable": False on an EvalLayer. That has no effect; an eval layer has no parameters anyway.