Why is my Transformer implementation losing to a BiLSTM?

I'm working on a sequence labelling problem, and I'm using a single Transformer encoder to obtain logits for each element of the sequence. After experimenting with both a Transformer and a BiLSTM, the BiLSTM seems to work better in my case, so I'm wondering whether there might be something wrong with my Transformer implementation... Below is my implementation of the Transformer encoder, together with the related functions for creating the padding mask and the positional embeddings:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def create_mask(src, lengths):
    """Create a key padding mask hiding the padded steps of each sequence.
    Parameters:
        src (tensor): the source tensor of shape [batch_size, number_of_steps, features_dimensions]
        lengths (list): a list of integers giving the true length (i.e. number_of_steps) of each sample in the batch."""
    max_len = src.shape[1]
    mask = []
    for index in range(src.shape[0]):
        # True marks a padded step that must be ignored, False marks a real step,
        # matching the convention of src_key_padding_mask in nn.TransformerEncoder
        mask.append([(step + 1) > lengths[index] for step in range(max_len)])
    return torch.tensor(mask, dtype=torch.bool)
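
Just to illustrate the intended behaviour of the mask (toy tensors and shapes, not my real data), a quick sanity check:

# Illustrative check of create_mask with a toy batch of two sequences
toy_src = torch.zeros(2, 4, 8)    # batch_size=2, number_of_steps=4, features_dimensions=8
toy_lengths = [4, 2]              # the second sample has two padded steps
print(create_mask(toy_src, toy_lengths))
# tensor([[False, False, False, False],
#         [False, False,  True,  True]])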

class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000, device = 'cpu'):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.device = device
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        x = x + self.pe[:, :x.size(1), :].to(self.device)
        return self.dropout(x)
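
Similarly, a minimal check (again with toy shapes, just for illustration) showing that the positional encoding adds a fixed sinusoidal signal and keeps the input shape unchanged:

# Illustrative check of PositionalEncoding (toy shapes assumed)
pos_enc = PositionalEncoding(d_model=8, dropout=0.0)
toy_input = torch.zeros(2, 4, 8)               # [batch_size, number_of_steps, d_model]
encoded = pos_enc(toy_input)
print(encoded.shape)                           # torch.Size([2, 4, 8])
print(torch.allclose(encoded[0], encoded[1]))  # True: the same encoding is added to every sample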

class Transformer(nn.Module):
    """Class implementing a Transformer encoder, partially based on
    https://pytorch.org/tutorials/beginner/transformer_tutorial.html"""
    def __init__(self, in_dim, h_dim, n_heads, n_layers, dropout=0.2, drop_out=0.0, batch_first=True, device='cpu', positional_encoding=True):
        super().__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(in_dim, dropout, device=device)
        # batch_first is forwarded so inputs of shape [batch_size, number_of_steps, features_dimensions] are handled correctly
        encoder_layers = nn.TransformerEncoderLayer(in_dim, n_heads, h_dim, dropout, batch_first=batch_first)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, n_layers, norm=nn.LayerNorm(in_dim))
        self.in_dim = in_dim
        self.drop_out = drop_out  # extra dropout applied to the encoder output, separate from the internal dropout
        self.positional_encoding = positional_encoding
    
        
    def forward(self, src, mask=None, line_len=None):
        src = src * math.sqrt(self.in_dim)
        if self.positional_encoding:
            src = self.pos_encoder(src)
        # Build the key padding mask from the sequence lengths only if no mask was passed explicitly
        if line_len is not None and mask is None:
            mask = create_mask(src, line_len).to(src.device)
        output = self.transformer_encoder(src, src_key_padding_mask=mask)
        if self.drop_out:
            output = F.dropout(output, p=self.drop_out)
        return src, output

As can be seen, the network above outputs the hidden states, which I then pass through an additional linear layer, and I train with a CrossEntropy loss over two classes and the Adam optimizer. I've tried several combinations of hyperparameters, but the BiLSTM still performs better. Can anyone spot anything wrong with my Transformer, or explain why I'm getting this counter-intuitive result?
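
For completeness, here is a stripped-down sketch of the training setup described above; the shapes, hyperparameters and data are placeholders, my real pipeline differs:

# Sketch of the training setup (placeholder shapes, hyperparameters and random data)
model = Transformer(in_dim=64, h_dim=256, n_heads=4, n_layers=2, batch_first=True)
classifier = nn.Linear(64, 2)            # maps each step's hidden state to 2 class logits
optimizer = torch.optim.Adam(list(model.parameters()) + list(classifier.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()

batch = torch.randn(8, 20, 64)           # [batch_size, number_of_steps, features_dimensions]
lengths = [20] * 8                       # placeholder: all sequences at full length
labels = torch.randint(0, 2, (8, 20))    # one label per step

optimizer.zero_grad()
_, hidden = model(batch, line_len=lengths)
logits = classifier(hidden)              # [batch_size, number_of_steps, 2]
loss = criterion(logits.reshape(-1, 2), labels.reshape(-1))
loss.backward()
optimizer.step()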

This may be surprising, but Transformers don't always beat LSTMs. For example, Language Models with Transformers states:

Transformer architectures are suboptimal for language model itself. Neither self-attention nor the positional encoding in the Transformer is able to efficiently incorporate the word-level sequential context crucial to language modeling.

If you run the Transformer tutorial code itself (which your code is based on), you'll also see that the LSTM does better there. For more discussion on this topic, see this thread on stats.SE (disclaimer: both the question and the answer there are mine).