Can BERT output be fixed in shape, irrespective of string size?

I am confused about using huggingface BERT models and how to make them produce predictions of a fixed shape, regardless of the input size (i.e. the input string length).

I tried calling the tokenizer with the arguments padding=True, truncation=True, max_length=15, but the prediction output dimensions for inputs = ["a", "a"*20, "a"*100, "abcede"*20000] are not fixed. What am I missing here?

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*20000]
for input in inputs:
  inputs = tokenizer(input, padding=True, truncation=True, max_length = 15, return_tensors="pt")
  outputs = model(**inputs)
  print(outputs.last_hidden_state.shape, input, len(input))

Output:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
torch.Size([1, 3, 768]) a 1
torch.Size([1, 12, 768]) aaaaaaaaaaaaaaaaaaaa 20
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
torch.Size([1, 3, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcededeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeab....deabbcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 120000

BERT produces one embedding per token, and your input strings produce different numbers of tokens. (I'm not sure why the last string produces so few; most likely the whole 120000-character string contains no whitespace, so the WordPiece tokenizer treats it as a single over-long word and maps it to one [UNK] token.)
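
You can check this directly with the tokenizer; a minimal sketch (the exact subword pieces are an assumption, but a single [UNK] for the long string is consistent with the 3-token output above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# The long string contains no whitespace, so the pre-tokenizer sees one 120000-character "word";
# WordPiece gives up on words longer than its per-word character limit and emits a single [UNK].
print(tokenizer.tokenize("abcede" * 20000))  # expected: ['[UNK]']
print(tokenizer.tokenize("a" * 20))          # expected: several 'a...'/'##a...' subword pieces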

If you want a single embedding for the whole input text from this model, there are two approaches, depending on how it was trained:

  1. If one of the tasks the model was trained on was e.g. next-sentence prediction, you should take the embedding of whichever token was fed into that task. That is usually the first or the last one, so outputs.last_hidden_state[:, 0, :] or outputs.last_hidden_state[:, -1, :].
  2. If that is not the case, you should probably just take the mean of all token embeddings, something like np.mean(outputs.last_hidden_state, axis=1) (see the sketch after this list).
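
For concreteness, a minimal sketch of both options in PyTorch (note that the mean below is taken over every position, padding included; for padded batches you may want to weight it by the attention mask):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("an example sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded)

# Option 1: take the embedding of the first ([CLS]) token -> shape (batch, 768)
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Option 2: average all token embeddings -> shape (batch, 768)
mean_embedding = outputs.last_hidden_state.mean(dim=1)

print(cls_embedding.shape, mean_embedding.shape)  # both torch.Size([1, 768])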

I don't actually know how the model you are using was trained, so I can't say which of the two is best.

I would suggest simply using a model that is designed to embed whole sentences at once, such as the ones at https://www.sbert.net/docs/pretrained_models.html.
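
For example, a sketch assuming the sentence-transformers package is installed (all-MiniLM-L6-v2 is just one of the checkpoints listed on that page):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# encode() returns one fixed-size vector per input string, regardless of its length
embeddings = model.encode(["a", "a" * 20, "a" * 100, "abcede" * 200])
print(embeddings.shape)  # (4, 384) for this particular checkpoint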

When you call the tokenizer with only one sentence and padding=True, truncation=True, max_length=15, it pads the output sequence to the longest input sequence in the batch and truncates when necessary. Since you pass in only a single sentence, there is nothing for the tokenizer to pad, because that sentence is already the longest sequence in the batch. That means you can achieve what you want in two ways:

  1. Provide a batch:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*200]
inputs = tokenizer(inputs, padding=True, truncation=True, max_length=15, return_tensors="pt")
print(inputs["input_ids"])
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Output:

tensor([[  101,  1037,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
          2050,   102,     0,     0,     0],
        [  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
         11057, 11057, 11057, 11057,   102],
        [  101,   100,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0]])
torch.Size([4, 15, 768])
  2. Set padding="max_length":
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = ["a", "a"*20, "a"*100, "abcede"*200]
for i in inputs:
  inputs = tokenizer(i, padding='max_length', truncation=True, max_length = 15, return_tensors="pt")
  print(inputs["input_ids"])
  outputs = model(**inputs)
  print(outputs.last_hidden_state.shape, i, len(i))

Output:

tensor([[ 101, 1037,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0]])
torch.Size([1, 15, 768]) a 1
tensor([[  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
          2050,   102,     0,     0,     0]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaa 20
tensor([[  101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
         11057, 11057, 11057, 11057,   102]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
tensor([[101, 100, 102,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0]])
torch.Size([1, 15, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 1200