Teacher-Student System: Training Student with Top-k Hypotheses List

I want to set up a teacher-student system, where a teacher seq2seq model generates a top-k list of hypotheses which is then used to train a student seq2seq model.

My plan is to batch the teacher's hypotheses, meaning the teacher outputs a tensor with a batch-axis length of k * B, where B is the batch-axis length of the input. The output batch tensor now contains k hypotheses for every sequence of the input batch tensor, ordered by the position of the corresponding input sequence in the input batch.
This tensor is set as the student's training target. However, the student's batch tensors still have a batch-axis length of B, so I use tf.repeat to repeat the sequences of the student's encoder output tensor k times, and then feed that tensor into the student's decoder.
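
To illustrate the plan, here is a minimal standalone sketch (plain TensorFlow, hypothetical sizes, k=3) of how tf.repeat along the batch axis turns a (B, T, D) encoder output into a (k*B, T, D) tensor whose rows line up with a stack of k hypotheses per input sequence:

import tensorflow as tf

k, B, T, D = 3, 2, 5, 4              # hypothetical sizes
enc = tf.random.normal([B, T, D])    # batch-major encoder output: (B, T, D)

# Repeat every sequence k times along the batch axis: (B, T, D) -> (k*B, T, D).
# The row order becomes [seq0, seq0, seq0, seq1, seq1, seq1], matching a teacher
# output stacked as [hyp-0-1, hyp-0-2, hyp-0-3, hyp-1-1, hyp-1-2, hyp-1-3].
enc_rep = tf.repeat(enc, repeats=k, axis=0)
print(enc_rep.shape)                 # (6, 5, 4)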

For debugging purposes, I simplified this to repeating only the teacher's single best hypothesis for now, before I implement the top-k list selection.

Here is an excerpt of my config file:

[...]

# Variables:

student_target = "teacher_hypotheses_stack"

[...]

# Custom repeat function:

def repeat(source, src_name="source", **kwargs):
    import tensorflow as tf

    input = source(0)
    input = tf.Print(input, [src_name, "in", input, tf.shape(input)])

    output = tf.repeat(input, repeats=3, axis=1)
    output = tf.Print(output, [src_name, "out", output, tf.shape(output)])

    return output

def repeat_t(source, **kwargs):
    return repeat(source, "teacher")


def repeat_s(source, **kwargs):
    return repeat(source, "student")


[...]

# Configuration of the teacher + repeating of its output

**teacher_network(), # The teacher_network is an encoder-decoder seq2seq model. The teacher performs search during training and is not trainable.
"teacher_stack": {
    "class": "eval", "from": ["teacher_decision"], "eval": repeat_t,
    "trainable": False
    # "register_as_extern_data": "teacher_hypotheses_stack"
},
"teacher_stack_reinterpreter": { # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
    "class": "reinterpret_data",
    "set_axes": {"B": 1, "T": 0},
    "enforce_time_major": True,
    "from": ["teacher_stack"],
    "trainable": False,
    "register_as_extern_data": "teacher_hypotheses_stack"
}

[...]

# Repeating of the student's encoder output + configuration of its decoder

"student_encoder": {"class": "copy", "from": ["student_lstm6_fw", "student_lstm6_bw"]},  # dim: EncValueTotalDim
"student_encoder_repeater": {"class": "eval", "from": ["student_encoder"], "eval": repeat},
"student_encoder_stack": {  # This is an attempt to explicitly (re-)select the batch axis. It is probably unecessary...
    "class": "reinterpret_data",
    "set_axes": {"B": 1, "T": 0},
    "enforce_time_major": True,
    "from": ["student_encoder_repeater"]
},

"student_enc_ctx": {"class": "linear", "activation": None, "with_bias": True, "from": ["student_encoder_stack"], "n_out": EncKeyTotalDim},  # preprocessed_attended in Blocks
"student_inv_fertility": {"class": "linear", "activation": "sigmoid", "with_bias": False, "from": ["student_encoder_stack"], "n_out": AttNumHeads},
"student_enc_value": {"class": "split_dims", "axis": "F", "dims": (AttNumHeads, EncValuePerHeadDim), "from": ["student_encoder_stack"]},  # (B, enc-T, H, D'/H)

"model1_output": {"class": "rec", "from": [], 'cheating': config.bool("cheating", False), "unit": {
    'output': {'class': 'choice', 'target': student_target, 'beam_size': beam_size, 'cheating': config.bool("cheating", False), 'from': ["model1_output_prob"], "initial_output": 0},
    "end": {"class": "compare", "from": ["output"], "value": 0},
    'model1_target_embed': {'class': 'linear', 'activation': None, "with_bias": False, 'from': ['output'], "n_out": target_embed_size, "initial_output": 0},  # feedback_input
    "model1_weight_feedback": {"class": "linear", "activation": None, "with_bias": False, "from": ["prev:model1_accum_att_weights"], "n_out": EncKeyTotalDim, "dropout": 0.3},
    "model1_s_transformed": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_s"], "n_out": EncKeyTotalDim, "dropout": 0.3},
    "model1_energy_in": {"class": "combine", "kind": "add", "from": ["base:student_enc_ctx", "model1_weight_feedback", "model1_s_transformed"], "n_out": EncKeyTotalDim},
    "model1_energy_tanh": {"class": "activation", "activation": "tanh", "from": ["model1_energy_in"]},
    "model1_energy": {"class": "linear", "activation": None, "with_bias": False, "from": ["model1_energy_tanh"], "n_out": AttNumHeads},  # (B, enc-T, H)
    "model1_att_weights": {"class": "softmax_over_spatial", "from": ["model1_energy"]},  # (B, enc-T, H)
    "model1_accum_att_weights": {"class": "eval", "from": ["prev:model1_accum_att_weights", "model1_att_weights", "base:student_inv_fertility"],
                                 "eval": "source(0) + source(1) * source(2) * 0.5", "out_type": {"dim": AttNumHeads, "shape": (None, AttNumHeads)}},
    "model1_att0": {"class": "generic_attention", "weights": "model1_att_weights", "base": "base:student_enc_value"},  # (B, H, V)
    "model1_att": {"class": "merge_dims", "axes": "except_batch", "from": ["model1_att0"]},  # (B, H*V)
    "model1_s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:model1_target_embed", "prev:model1_att"], "n_out": 1000, "dropout": 0.3},  # transform
    "model1_readout_in": {"class": "linear", "from": ["model1_s", "prev:model1_target_embed", "model1_att"], "activation": None, "n_out": 1000, "dropout": 0.3},  # merge + post_merge bias
    "model1_readout": {"class": "reduce_out", "mode": "max", "num_pieces": 2, "from": ["model1_readout_in"]},
    "model1_output_prob": {
        "class": "softmax", "from": ["model1_readout"], "dropout": 0.3,
        "target": student_target,
        "loss": "ce", "loss_opts": {"label_smoothing": 0.1}
    }
}, "target": student_target},

[...]

Running this config prints the following error message to the console:

[...]

Create Adam optimizer.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [<tf.Variable 'optimize/beta1_power:0' shape=() dtype=float32_ref>, <tf.Variable 'optimize/beta2_power:0' shape=() dtype=float32_ref>].
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
TensorFlow exception: assertion failed: [x.shape[0] != y.shape[0]] [69 17] [23]
     [[node objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

[...]

Execute again to debug the op inputs...
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1_1:0' shape=(1,) dtype=int32> = shape (1,), dtype int32, min/max 23/23, ([23])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0_1:0' shape=() dtype=string> = bytes(b'x.shape[0] != y.shape[0]')
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_2:0' shape=(2,) dtype=int32> = shape (2,), dtype int32, min/max 17/69, ([69 17])
FetchHelper(0): <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All_1:0' shape=() dtype=bool> = bool_(False)
[teacher][in][[6656 6657 6658...]...][17 23]
[teacher][out][[6656 6656 6656...]...][17 69]
Op inputs:
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/All:0' shape=() dtype=bool>: bool_(False)
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/assert_equal_1/Assert/Assert/data_0:0' shape=() dtype=string>: bytes(b'x.shape[0] != y.shape[0]')
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape:0' shape=(2,) dtype=int32>: shape (2,), dtype int32, min/max 17/69, ([69 17])
  <tf.Tensor 'objective/loss/error/sparse_labels/check_dim_equal/Shape_1:0' shape=(1,) dtype=int32>: shape (1,), dtype int32, min/max 23/23, ([23])
Step meta information:
{'seq_idx': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
 'seq_tag': ['seq-0','seq-1','seq-2','seq-3','seq-4','seq-5','seq-6','seq-7','seq-8','seq-9','seq-10','seq-11','seq-12','seq-13','seq-14','seq-15','seq-16','seq-17','seq-18','seq-19','seq-20','seq-21','seq-22']}
Feed dict:
  <tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 80) dtype=float32>: shape (23, 42, 80), dtype float32, min/max -0.5/0.4, mean/stddev -0.050000004/0.28722814, Data(name='data', shape=(None, 80), batch_shape_meta=[B,T|'time:var:extern_data:data',F|80])
  <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 42/42, ([42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42])
  <tf.Tensor 'extern_data/placeholders/source_text/source_text:0' shape=(?, ?, 512) dtype=float32>: shape (23, 13, 512), dtype float32, min/max -0.5/0.4, mean/stddev -0.050011758/0.28722063, Data(name='source_text', shape=(None, 512), available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:source_text',F|512])
  <tf.Tensor 'extern_data/placeholders/source_text/source_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 13/13, ([13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13])
  <tf.Tensor 'extern_data/placeholders/target_text/target_text:0' shape=(?, ?) dtype=int32>: shape (23, 17), dtype int32, min/max 6656/6694, Data(name='target_text', shape=(None,), dtype='int32', sparse=True, dim=35209, available_for_inference=False, batch_shape_meta=[B,T|'time:var:extern_data:target_text'])
  <tf.Tensor 'extern_data/placeholders/target_text/target_text_dim0_size:0' shape=(?,) dtype=int32>: shape (23,), dtype int32, min/max 17/17, ([17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17])
  <tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>: bool(True)
EXCEPTION

[...]
File "home/philipp/Documents/bachelor-thesis/returnn/repository/TFUtil.py", line 4374, in sparse_labels_with_seq_lens
    x = check_dim_equal(x, 0, seq_lens, 0)
[...]

So, the network is constructed without errors, but on the first training step it crashes with this assertion error. It looks to me like RETURNN or TensorFlow validates the batch length against its original value somewhere. But I don't know where or why, so I have no idea what to do about it.

What am I doing wrong? Is my idea even realizable with RETURNN this way?

EDIT (10 June 2020): To clarify: my final goal is that the teacher generates a top-k list of hypotheses for every input sequence, which is then used to train the student. So, for every input sequence of the student, there are k solution/target sequences. To train the student, it has to predict the probability of each hypothesis, and the cross-entropy loss is then computed over these to determine the update gradients. But if there are k target sequences per input sequence, the student has to decode the encoder states k times, each time targeting a different target sequence. That is why I want to repeat the encoder states k times, to make the data parallel for the student's decoder, and then use RETURNN's default cross-entropy loss implementation:

input-seq-1 --- teacher-hyp-1-1; 
input-seq-1 --- teacher-hyp-1-2; 
...; 
input-seq-1 --- teacher-hyp-1-k; 
input-seq-2 --- teacher-hyp-2-1; 
... 
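
In other words (a minimal sketch with hypothetical shapes, not my actual config): once the batch is expanded to k*B, the standard frame-wise sparse cross-entropy applies unchanged:

import tensorflow as tf

k, B, T, V = 3, 2, 7, 10
logits = tf.random.normal([k * B, T, V])                            # student decoder outputs
targets = tf.random.uniform([k * B, T], maxval=V, dtype=tf.int32)   # stacked teacher hypotheses
# Frame-wise cross-entropy over the repeated batch, shape (k*B, T).
ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)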

Is there a more appropriate way to achieve this goal?

EDIT (12 June 2020 #1): Yes, I am aware that the teacher's DecisionLayer already selects the best hypothesis, and that this way I only repeat that single best hypothesis k times. I do this as an intermediate step towards the final goal. Later, I want to fetch the top-k list from the teacher's ChoiceLayer somehow, but that felt like a whole different construction site.
But Albert, you say that RETURNN would somehow extend the data on the batch dimension automatically? How am I supposed to picture that?

EDIT (12 June 2020 #2): OK, now I select the top-k (k=4 this time) hypotheses list from the teacher's choice layer (or output layer):

"teacher_hypotheses": {
    "class": "copy", "from": ["extra.search:teacherMT_output"],
    "register_as_extern_data": "teacher_hypotheses_stack"
}

But using this data as the training target of the student leads to the error:

TensorFlow exception: assertion failed: [shape[0]:] [92] [!=] [dim:] [23]
     [[node studentMT_output/rec/subnet_base/check_seq_len_batch_size/check_input_dim/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

I assume this happens because the batch-axis length of the student's target data (the hypotheses list) is k=4 times larger than the batch-axis length of the student's input data / encoder states. Doesn't the student's encoder-state data need to be extended/repeated here to match the target data?

EDIT (12 June 2020 #3): I consider the initial issue solved. The overall topic is continued here.

It does not only verify the batch length. It will collapse batch and time (it uses flatten_with_seq_len_mask, see Loss.init and the code of that function) and then calculate the loss on that flattened tensor. So the seq lengths also need to match. This might be a problem, but I am not sure. As you have the same target also for the rec layer itself, it should have the same seq lengths in training.
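
For intuition, a minimal sketch (plain TensorFlow, hypothetical shapes) of that kind of flattening: padded frames are dropped via a sequence-length mask and the rest is concatenated into one axis, which is why the seq lengths of output and target have to agree:

import tensorflow as tf

B, T = 3, 5
values = tf.random.normal([B, T])      # stand-in for any (B, T, ...) tensor
seq_lens = tf.constant([5, 3, 4])

# Conceptually what flatten_with_seq_len_mask does before the loss is computed:
# mask out padding and flatten batch and time into one axis of length sum(seq_lens).
mask = tf.sequence_mask(seq_lens, maxlen=T)   # (B, T) bool
flat = tf.boolean_mask(values, mask)          # shape (12,) here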

You can debug this by carefully checking the output of debug_print_layer_output_template, i.e. by checking the Data (batch-shape-meta) output and verifying that the axes are all as you expect them. (debug_print_layer_output_template can and should always be enabled. It will not make it slower.) You can also temporarily enable debug_print_layer_output_shape, which will really print the shapes of all tensors. That way you can verify how it looks.
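
For reference, this is how the two options would be set in the config:

# In the RETURNN config:
debug_print_layer_output_template = True  # print each layer's Data template (incl. batch_shape_meta); cheap, keep it enabled
debug_print_layer_output_shape = True     # additionally print the actual runtime shapes; enable only temporarily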

Your usage of ReinterpretDataLayer looks very wrong. You should never explicitly set the axes by integer (like "set_axes": {"B": 1, "T": 0}). Why are you doing that at all? This is probably the reason it gets messed up in the end.

Your repeat function is also not very generic. You are using hard-coded axis integers there as well. You should never do that. Instead, you would write it like this:

input_data = source(0, as_data=True)
input = input_data.placeholder
...
output = tf.repeat(input, repeats=3, axis=input_data.batch_dim_axis)
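
Put together as a self-contained eval function (just a sketch; the name repeat_generic and repeats=3 are placeholders from this discussion):

def repeat_generic(source, **kwargs):
    import tensorflow as tf
    input_data = source(0, as_data=True)  # get the RETURNN Data object, not just the raw tensor
    x = input_data.placeholder
    # Repeat along whatever axis is the batch axis of this input,
    # never a hard-coded integer axis.
    return tf.repeat(x, repeats=3, axis=input_data.batch_dim_axis)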

Do I understand correctly that this is what you want to do, i.e. repeat on the batch axis? In that case, you also need to adapt the seq-length information of the output of that layer. You cannot simply use that function as-is in an EvalLayer. You also need to define out_type as a function which returns the correct Data template. E.g. like this:

def repeat_out(out):
    import tensorflow as tf
    out = out.copy()
    out.size_placeholder[0] = tf.repeat(out.size_placeholder[0], axis=0, repeats=3)
    return out

...
"student_encoder_repeater": {
    "class": "eval", "from": ["student_encoder"], "eval": repeat,
    "out_type": lambda sources, **kwargs: repeat_out(sources[0].output)
}

Now you have the additional problem that every time you call this repeat_out, you get yet another set of seq-length information. RETURNN will not be able to tell whether these seq lengths are all the same or different (at compile time), and that will lead to errors or strange effects. To solve this, you should reuse the same seq lengths. E.g. like this:

"teacher_stack_": {
    "class": "eval", "from": "teacher_decision", "eval": repeat
},
"teacher_stack": {
    "class": "reinterpret_data", "from": "teacher_stack_", "size_base": "student_encoder_repeater"
}
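
Putting the pieces together (a sketch only, using the layer names from above and the repeat_generic/repeat_out helpers sketched earlier), so that both repeated tensors share one set of seq lengths:

"student_encoder_repeater": {
    "class": "eval", "from": "student_encoder", "eval": repeat_generic,
    "out_type": lambda sources, **kwargs: repeat_out(sources[0].output)
},
"teacher_stack_": {
    "class": "eval", "from": "teacher_decision", "eval": repeat_generic
},
"teacher_stack": {
    "class": "reinterpret_data", "from": "teacher_stack_",
    "size_base": "student_encoder_repeater",  # reuse the repeated seq-length info
    "register_as_extern_data": "teacher_hypotheses_stack"
},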

Btw, why do you do this repetition at all? What is the idea behind it? You repeat both the student and the teacher 3 times? So wouldn't simply increasing your learning rate by a factor of 3 do the same?

Edit: It seems this is done to match the top-k list. In that case, this is all wrong, because RETURNN should already take care of such repetition automatically. You should not do this manually.

Edit: To understand how the repetition (and beam-search resolving in general) works, the first thing you should do is look at the log output (you must have debug_print_layer_output_template enabled, but you should always have that anyway). You will see the output of every layer, especially its output Data object. This is already useful to check whether the shapes are all as you expect (check batch_shape_meta in the log). However, this is only the static shape at compile time, so the batch-dim is just a marker there. You will also see the search beam information. This keeps track of whether the batch originates from some beam search (basically any ChoiceLayer), and carries the beam and the beam size. Now, in the code, check SearchChoices.translate_to_common_search_beam and its usages. When you follow the code, you will see SelectSearchSourcesLayer, and effectively your case will end up with output.copy_extend_with_beam(search_choices.get_beam_info()).

End of edit.

Edit: Regarding the repetition, this is done automatically. You do not need to call copy_extend_with_beam manually.

If you expect to get the top-k list from the teacher, you are also likely doing it wrong, as I see that you use "teacher_decision" as input. I guess that comes from a DecisionLayer? In that case, it already took only the first-best out of the top-k beam.

Edit: Now I understand that you ignore this, and instead want to take only the first-best, and then also repeat that. I would recommend not to do that, because you are making it unnecessarily complicated, and you are kind of fighting RETURNN, which knows what the batch-dim should be and will get confused. (You can get it to work with what I wrote above, but really, it is just unnecessarily complex.)

Btw, there is no point in setting "trainable": False on an EvalLayer. That has no effect; an eval layer has no parameters anyway.