Outputting attention for bert-base-uncased with huggingface/transformers (torch)

I am following a paper on BERT-based lexical substitution (specifically, trying to implement Equation (2) - if someone already has the whole paper implemented, that would also be great). Thus, I would like to obtain both the last hidden layers (the only thing I am unsure about is the ordering of the layers in the output: last first or first first?) and the attention from a basic BERT model (bert-base-uncased).

However, I am somewhat unsure whether the huggingface/transformers library actually outputs the attention for bert-base-uncased (I am using torch, but am open to using TF instead).

From what I had read, I should get a tuple of (logits, hidden_states, attentions), but with the example below (which runs e.g. in Google Colab), I instead get something of length 2.

Am I misinterpreting what I am getting, or going about this the wrong way? I did the obvious sanity check and used output_attention=False instead of output_attention=True (whereas output_hidden_states=True does indeed add the hidden states, as expected), and nothing changed in the output I got. That is clearly a sign that I am misunderstanding the library, or an indication of an issue.

import numpy as np
import torch
!pip install transformers

from transformers import (AutoModelWithLMHead, 
                          AutoTokenizer, 
                          BertConfig)

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attention=True) # Nothing changes when I switch to output_attention=False
bert_model = AutoModelWithLMHead.from_config(config)

sequence = "We went to an ice cream cafe and had a chocolate ice cream."
bert_tokenized_sequence = bert_tokenizer.tokenize(sequence)

indexed_tokens = bert_tokenizer.encode(bert_tokenized_sequence, return_tensors='pt')

predictions = bert_model(indexed_tokens)

########## Now let's have a look at what the predictions look like #############
print(len(predictions)) # Length is 2, I expected 3: logits, hidden_layers, attention

print(predictions[0].shape) # torch.Size([1, 16, 30522]) - seems to be the logits (shape is 1 x sequence length x vocabulary size)

print(len(predictions[1])) # Length is 13 - the hidden layers?! There are meant to be 12, right? Is one somehow the attention?

for k in range(len(predictions[1])):
  print(predictions[1][k].shape) # These all seem to be torch.Size([1, 16, 768]), so presumably the hidden layers?

Inspired by the accepted answer, here is an explanation of what finally worked:

import numpy as np
import torch
!pip install transformers

from transformers import BertModel, BertConfig, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True)
model = BertModel.from_pretrained('bert-base-uncased', config=config)
sequence = "We went to an ice cream cafe and had a chocolate ice cream."
tokenized_sequence = tokenizer.tokenize(sequence)
indexed_tokens = tokenizer.encode(tokenized_sequence, return_tensors='pt')
outputs = model(indexed_tokens)
print( len(outputs) ) # 4: last hidden state, pooler output, hidden states, attentions
print( outputs[0].shape ) # 1, 16, 768 - the last hidden state
print( outputs[1].shape ) # 1, 768 - the pooler output
print( len(outputs[2]) ) # 13 = input embedding (index 0) + 12 hidden layers (indices 1 to 12)
print( outputs[2][0].shape ) # for each of these 13: 1, 16, 768 = batch, index of each input id in sequence, size of hidden layer
print( len(outputs[3]) ) # 12 (= attention for each layer)
print( outputs[3][0].shape ) # index 0 = first layer; 1, 12, 16, 16 = batch, attention heads, index of each input id in sequence, index of each input id in sequence
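
To settle the layer-ordering doubt from the question: the hidden-states tuple starts with the embedding output and ends with the last encoder layer. A quick sanity check (a minimal sketch, assuming the snippet above has just been run):

# hidden_states are ordered from the embedding output up to the last layer,
# so the final element should coincide with outputs[0], the last hidden state.
print(torch.equal(outputs[2][-1], outputs[0]))  # expected: True
print(torch.equal(outputs[2][0], outputs[0]))   # expected: False (embedding output, not the last layer)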

The reason is that you are using AutoModelWithLMHead, which is a wrapper around the actual model. It calls the BERT model (i.e., an instance of BertModel) and then uses the embedding matrix as the weight matrix for word prediction. In between, the underlying model does indeed return the attentions, but the wrapper does not care and only returns the logits.
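
To illustrate the embedding-matrix reuse mentioned above, here is a small check (my sketch, not part of the original answer; it relies on the standard get_input_embeddings/get_output_embeddings helpers and should print True when weight tying is enabled, which is the default):

# The LM head's output projection shares its parameters with the input
# embedding matrix - the reuse described in the answer.
print(bert_model.get_output_embeddings().weight is bert_model.get_input_embeddings().weight)  # True when tied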

You can get the BERT model directly by calling AutoModel. Note that this model does not return the logits, but the hidden states.

bert_model = AutoModel.from_config(config)
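
Note that the questioner's follow-up above also switches to the plural output_attentions flag, which is what actually enables the attention outputs, and to from_pretrained, so the weights are not randomly initialized. A minimal sketch combining those points with AutoModel (my combination, not literally the answer's code):

from transformers import AutoModel, AutoTokenizer, BertConfig

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
config = BertConfig.from_pretrained("bert-base-uncased",
                                    output_hidden_states=True,
                                    output_attentions=True)  # plural, unlike the question's snippet
model = AutoModel.from_pretrained("bert-base-uncased", config=config)

indexed_tokens = tokenizer.encode("We went to an ice cream cafe and had a chocolate ice cream.",
                                  return_tensors="pt")
outputs = model(indexed_tokens)
print(len(outputs))         # 4: last_hidden_state, pooler_output, hidden_states, attentions
print(len(outputs[3]))      # 12 attention tensors, one per layer
print(outputs[3][0].shape)  # (batch, num_heads, seq_len, seq_len)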

Or you can get it from the BertWithLMHead object by calling:

wrapped_model = bert_model.base_model
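
A quick illustrative call (my sketch, assuming indexed_tokens from the question's snippet is still in scope):

# The unwrapped base model returns hidden states rather than vocabulary-sized logits.
wrapped_outputs = wrapped_model(indexed_tokens)
print(wrapped_outputs[0].shape)  # torch.Size([1, 16, 768]) instead of torch.Size([1, 16, 30522])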

I guess it is too late to answer here, but with the updates to huggingface Transformers, I think we can use this:

import torch
from transformers import BertConfig, BertModel, BertTokenizer

config = BertConfig.from_pretrained('bert-base-uncased',
                                    output_hidden_states=True,
                                    output_attentions=True)
bert_model = BertModel.from_pretrained('bert-base-uncased', config=config)

# input_ids for a sample sentence, obtained from the matching tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
input_ids = tokenizer.encode("We went to an ice cream cafe and had a chocolate ice cream.",
                             return_tensors='pt')

with torch.no_grad():
  out = bert_model(input_ids)
  last_hidden_states = out.last_hidden_state
  pooler_output = out.pooler_output
  hidden_states = out.hidden_states
  attentions = out.attentions
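
As a follow-up sketch (assuming the snippet above has run), this shows how the per-layer attention weights can be pulled out of the returned object; averaging over heads here is only an illustration, not something the question's paper prescribes:

# Each element of `attentions` is one layer's weights with shape
# (batch_size, num_heads, seq_len, seq_len).
print(len(attentions))                        # 12, one tensor per layer
last_layer_attention = attentions[-1]         # attention of the final layer
avg_heads = last_layer_attention.mean(dim=1)  # average over the 12 heads
print(last_layer_attention.shape, avg_heads.shape)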