Can BERT output be fixed in shape, irrespective of string size?
I'm confused about how to use Hugging Face BERT models so that they produce predictions of a fixed shape, regardless of the input size (i.e., the length of the input string).
I tried calling the tokenizer with the arguments padding=True, truncation=True, max_length=15, but the prediction output dimensions for inputs = ["a", "a"*20, "a"*100, "abcede"*20000] are not fixed. What am I missing here?
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = ["a", "a"*20, "a"*100, "abcede"*20000]
for input in inputs:
    inputs = tokenizer(input, padding=True, truncation=True, max_length=15, return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape, input, len(input))
Output:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
torch.Size([1, 3, 768]) a 1
torch.Size([1, 12, 768]) aaaaaaaaaaaaaaaaaaaa 20
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
torch.Size([1, 3, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcededeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeab....deabbcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 120000
BERT produces one embedding per token, and your input strings produce different numbers of tokens. (The last string yields so few because it is one single 120,000-character "word"; BERT's WordPiece tokenizer gives up on words longer than its max_input_chars_per_word limit, 100 characters by default, and emits a single [UNK], so the sequence is just [CLS] [UNK] [SEP].)
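As a quick sanity check (a minimal sketch reusing the bert-base-uncased tokenizer from the question), printing the token ids shows the long string collapsing to a single [UNK] (id 100):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for s in ["a", "a"*20, "a"*100, "abcede"*20000]:
    ids = tokenizer(s, truncation=True, max_length=15)["input_ids"]
    # chars in string, resulting token count, first few token ids
    print(len(s), len(ids), ids[:6])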
If you want a single embedding for the whole input text from this model, there are two approaches, depending on how it was trained:
- If one of the tasks the model was trained on was, e.g., next sentence prediction, then the embedding of one designated input token was fed into that task. That is usually the first or the last token, so outputs.last_hidden_state[:, 0, :] or outputs.last_hidden_state[:, -1, :].
- If that is not the case, you should probably just take the mean of all token embeddings, something like outputs.last_hidden_state.mean(dim=1) (the torch equivalent of np.mean(outputs.last_hidden_state, axis=1)). A sketch of both options follows this list.
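For concreteness, a minimal sketch of both options, reusing the outputs variable from the question's loop above:

# Option 1: take one designated token, e.g. the first ([CLS]) or the last.
cls_embedding = outputs.last_hidden_state[:, 0, :]    # shape (batch, 768)
last_embedding = outputs.last_hidden_state[:, -1, :]  # shape (batch, 768)

# Option 2: average over all token embeddings.
mean_embedding = outputs.last_hidden_state.mean(dim=1)  # shape (batch, 768)

print(cls_embedding.shape, last_embedding.shape, mean_embedding.shape)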
I don't actually know how the model you're using was trained, so I can't say which is best.
I suggest just using a model that is designed to embed whole sentences at once, such as the ones at https://www.sbert.net/docs/pretrained_models.html.
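For example (a sketch using the sentence-transformers package; all-MiniLM-L6-v2 is one of the pretrained models listed on that page and returns 384-dimensional vectors):

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# encode() returns one fixed-size vector per input string,
# regardless of how long each string is.
embeddings = model.encode(["a", "a"*20, "a"*100, "abcede"*20000])
print(embeddings.shape)  # (4, 384)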
When you call the tokenizer with just one sentence and padding=True, truncation=True, max_length=15, it pads the output sequence to the longest input sequence and truncates when necessary. Since you only passed a single sentence, the tokenizer has nothing to pad against: that sentence is already the longest sequence in the batch. That means you can achieve what you want in two ways:
- Provide a batch:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = ["a", "a"*20, "a"*100, "abcede"*200]
inputs = tokenizer(inputs, padding=True, truncation=True, max_length = 15, return_tensors="pt")
print(inputs["input_ids"])
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
Output:
tensor([[ 101, 1037, 102, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0],
[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
2050, 102, 0, 0, 0],
[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
11057, 11057, 11057, 11057, 102],
[ 101, 100, 102, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0]])
torch.Size([4, 15, 768])
- Set padding="max_length":
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = ["a", "a"*20, "a"*100, "abcede"*200]
for i in inputs:
    inputs = tokenizer(i, padding='max_length', truncation=True, max_length=15, return_tensors="pt")
    print(inputs["input_ids"])
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape, i, len(i))
Output:
tensor([[ 101, 1037, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0]])
torch.Size([1, 15, 768]) a 1
tensor([[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
2050, 102, 0, 0, 0]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaa 20
tensor([[ 101, 13360, 11057, 11057, 11057, 11057, 11057, 11057, 11057, 11057,
11057, 11057, 11057, 11057, 102]])
torch.Size([1, 15, 768]) aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 100
tensor([[101, 100, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0]])
torch.Size([1, 15, 768]) abcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcedeabcede 1200
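A related caveat: with padding="max_length", the padding positions get embeddings too, so a plain mean over the sequence dimension also averages in the [PAD] vectors. A minimal sketch of mean pooling weighted by the attention mask (assuming the inputs and outputs variables from the loop above):

# Masked mean pooling: average only over real tokens, not padding.
mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, 768)
counts = mask.sum(dim=1).clamp(min=1)                   # (batch, 1)
mean_pooled = summed / counts                           # (batch, 768), fixed shape
print(mean_pooled.shape)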