What is the attention penalty in the Speech-Transformer paper? (updated)

Question

github: https://github.com/sephiroce/tfsr/tree/exprimental

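I am trying to reproduce the recognition accuracy reported in the Speech-Transformer paper [1]. The attention penalty is one technique I could not fully understand. This is how the paper describes it: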

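“In addition, we encouraged the model attending to closer positions by adding bigger penalty on the attention weights of more distant position-pairs.”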

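My understanding is that, for every attention layer except the first multi-head attention in the decoder, this means adding increasingly negative values to the scaled attention logits (before masking) the farther a position pair lies from the diagonal.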

This is the code snippet that computes the attention weights.

  # Q * trans(K): (..., seq_len_q, seq_len_k)
  matmul_qk = tf.matmul(query, key, transpose_b=True)

  # scaled matmul_qk: ( Q * trans(K) ) / sqrt(d_k)
  dimension_of_key = tf.cast(tf.shape(key)[-1], tf.float32)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dimension_of_key)

  # add the mask to the scaled tensor
  if mask is not None:
    scaled_attention_logits += (mask * -1e9)

  # softmax is normalized on the last axis (seq_len_k) so that the scores
  # add up to 1.
  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

  # Adding penalty to attention weights and linearly re-normalize it.
  if attention_penalty is not None and att_penalty_scale > 0:
    attention_weights += (attention_penalty * att_penalty_scale)
    attention_weights += tf.math.abs(tf.math.reduce_min(attention_weights))
    inv_sum = 1 / tf.math.reduce_sum(attention_weights, axis=-1)
    attention_weights = tf.einsum('ijlm,ijl->ijlm', attention_weights, inv_sum)
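
Note that the snippet above adds the penalty after the softmax and then renormalizes linearly, whereas the understanding stated earlier is to add it to the scaled logits before masking. For comparison, here is a minimal NumPy sketch of the logit-level variant. The names `softmax` and `penalized_attention_weights` are hypothetical helpers, the penalty is assumed to be non-positive, and whether this matches what the paper's authors did is not confirmed. In TensorFlow the same idea would amount to adding the scaled penalty to `scaled_attention_logits` before calling `tf.nn.softmax`.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def penalized_attention_weights(logits, penalty, scale, mask=None):
    """Add a (non-positive) penalty to the logits, then mask, then softmax.

    logits:  (..., len_q, len_k) scaled attention logits
    penalty: array broadcastable to logits, 0 near the diagonal and
             more negative farther away (assumption of this sketch)
    mask:    optional array with 1 at positions that must be ignored
    """
    logits = logits + scale * penalty
    if mask is not None:
        logits = logits + mask * -1e9
    return softmax(logits, axis=-1)


# Toy example: uniform logits, so any preference comes from the penalty.
len_q = len_k = 4
logits = np.zeros((1, 1, len_q, len_k))
idx = np.arange(len_q)
penalty = -np.abs(idx[:, None] - idx[None, :]).astype(np.float64)
weights = penalized_attention_weights(logits, penalty, scale=1.0)
```

Because the penalty enters before the softmax, each row still sums to 1 without any extra renormalization step.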

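The code below creates the attention penalty matrix. I could not find an efficient way to build a penalty matrix for the second multi-head attention in the decoder, because its attention map is not diagonal. So, as a first step, I am applying the attention penalty only to the encoder. The code assigns linearly larger penalties to elements farther from the diagonal.
There are two hyperparameters: attention_penalty_scale (similar to the penalty_values that Jindřich suggested) and the width of the diagonal band.
I might add an option such as stripe_step_size. At the moment stripe_step_size is effectively 1.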

def create_attention_penalty(inp_len, tar_len, num_heads, attention_penalty_width):
  max_inp_len = tf.cast(tf.math.reduce_max(inp_len), tf.int32)
  n_batch = tf.shape(inp_len)[0]

  enc_att_penalty = tf.ones([n_batch, num_heads, max_inp_len, max_inp_len])

  accum = tf.zeros(([n_batch, num_heads, max_inp_len, max_inp_len]))
  for i in range(attention_penalty_width - 1, max_inp_len - 1):
    accum += tf.linalg.band_part(enc_att_penalty, i, i, name=None) - 1

  enc_att_penalty = accum

  return enc_att_penalty, None

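Although I implemented the penalty as I understood it, I did not get any accuracy improvement. The implementation has another drawback as well: training became slower.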
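
One likely contributor to the slowdown (an assumption, not something measured here) is the Python loop, which builds one full [n_batch, num_heads, max_inp_len, max_inp_len] tensor per iteration. For attention_penalty_width >= 1 the loop appears to reduce to a closed form: an element at distance d from the diagonal ends up with -max(0, d - (width - 1)). The NumPy sketch below checks that reading against a loop that mimics `tf.linalg.band_part(ones, i, i) - 1`. Both function names are hypothetical, and the result is a single [L, L] matrix that can be broadcast over batch and heads.

```python
import numpy as np


def loop_penalty(length, width):
    # NumPy re-creation of the band_part loop, for comparison only.
    idx = np.arange(length)
    dist = np.abs(idx[:, None] - idx[None, :])
    accum = np.zeros((length, length))
    for i in range(width - 1, length - 1):
        band = (dist <= i).astype(np.float64)  # like band_part(ones, i, i)
        accum += band - 1.0                    # 0 inside the band, -1 outside
    return accum


def closed_form_penalty(length, width):
    # Same values without a loop; assumes width >= 1.
    idx = np.arange(length)
    dist = np.abs(idx[:, None] - idx[None, :])
    return -np.maximum(0, dist - (width - 1)).astype(np.float64)


penalty = closed_form_penalty(6, 2)
# Broadcast over batch and heads instead of materializing copies.
penalty_4d = penalty[None, None, :, :]
```

The same expression can be written in TensorFlow with `tf.range`, `tf.abs` and `tf.maximum`, which avoids both the loop and the per-batch, per-head copies.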

Q) How can this attention penalty be applied efficiently to both square and non-square attention weights?

Reference
[1] Linhao Dong, Shuang Xu, Bo Xu, Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition, ICASSP 2018, https://ieeexplore.ieee.org/document/8462506

Answer 1

I think you understand it well. They probably used a stripe around the diagonal, something like:

# band_part must be applied to a tensor of ones to get a 0/1 stripe mask
attention_penalty = (1 - tf.linalg.band_part(tf.ones_like(scaled_attention_logits), stripe_size, stripe_size)) * penalty

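However, you will probably need to experiment with what stripe_size and penalty_values should be, because the paper does not say much about them. Alternatively, you could try writing to the authors.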
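
To make the stripe idea concrete for non-square attention maps as well, here is a minimal NumPy sketch. It assumes, purely for illustration, that query position q aligns with key position q * len_k / len_q, which is a rough linear-alignment guess and not something stated in the paper. The function name `stripe_penalty` and its parameters are hypothetical.

```python
import numpy as np


def stripe_penalty(len_q, len_k, stripe_size, penalty):
    """Return a (len_q, len_k) matrix: 0 inside the stripe, -penalty outside.

    The stripe is centred on the rescaled diagonal, so for len_q == len_k
    it coincides with a band of half-width stripe_size around the diagonal.
    """
    centre = np.arange(len_q)[:, None] * (len_k / len_q)  # assumed alignment
    key_pos = np.arange(len_k)[None, :]
    inside = np.abs(key_pos - centre) <= stripe_size
    return np.where(inside, 0.0, -float(penalty))


square = stripe_penalty(5, 5, stripe_size=1, penalty=2.0)
rect = stripe_penalty(3, 6, stripe_size=1, penalty=2.0)
# Add a leading batch and head axis so it broadcasts against the logits.
rect_4d = rect[None, None, :, :]
```

The resulting matrix is meant to be added to the scaled logits before the softmax. For speech-to-text cross-attention the true alignment is rarely exactly linear, so the stripe would probably need to be fairly wide.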

speech-recognition

transformer

deep-learning

tensorflow

tf.keras
