Unable to load SpanBert model with transformers package
I have a question about loading SpanBert with the transformers package.
I downloaded the pre-trained files from the SpanBert GitHub repo, and the vocab.txt from Bert. This is the code I use for loading:
model = BertModel.from_pretrained(config_file=config_file,
                                  pretrained_model_name_or_path=model_file,
                                  vocab_file=vocab_file)
model.to("cuda")
where
config_file -> config.json
model_file -> pytorch_model.bin
vocab_file -> vocab.txt
However, the code above raises a UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I also tried loading SpanBert with the method mentioned here, but it returns OSError: file SpanBERT/spanbert-base-cased not found.
Do you have any suggestions on how to load the pre-trained model correctly? Any advice is much appreciated. Thanks!
- Download the pre-trained weights from the GitHub page.
https://github.com/facebookresearch/SpanBERT
SpanBERT (base & cased): 12-layer, 768-hidden, 12-heads , 110M parameters
SpanBERT (large & cased): 24-layer, 1024-hidden, 16-heads, 340M parameters
Extract them into a folder; for example, I extracted mine into a spanbert_hf_base folder, which contains a .bin
file and a config.json
file.
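Note that from_pretrained should be pointed at this folder, not at the .bin file itself; passing the binary weights file as pretrained_model_name_or_path is most likely what triggered your UnicodeDecodeError. As a quick sanity check, here is a minimal sketch (it assumes the folder is named spanbert_hf_base and sits in the current working directory, as above) that confirms the folder contains the files from_pretrained looks for:
import os

folder = 'spanbert_hf_base'   # the folder the tarball was extracted into
print(os.listdir(folder))     # expect to see 'config.json' and 'pytorch_model.bin'

# from_pretrained(directory) reads config.json and pytorch_model.bin from this folder,
# so rename the weights file to pytorch_model.bin if it was extracted under another name.
assert os.path.isfile(os.path.join(folder, 'config.json'))
assert os.path.isfile(os.path.join(folder, 'pytorch_model.bin'))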
You can then load the model with AutoModel and a plain BERT tokenizer. From their repo:
These models have the same format as the HuggingFace BERT models, so you can easily replace them with our SpanBERT models.
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained('spanbert_hf_base/') # the path to .bin and config.json
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
b = torch.tensor(tokenizer.encode('hi this is me, mr. meeseeks', add_special_tokens=True, max_length = 512)).unsqueeze(0)
out = model(b)
Output:
(tensor([[[-0.1204, -0.0806, -0.0168, ..., -0.0599, -0.1932, -0.0967],
[-0.0851, -0.0980, 0.0039, ..., -0.0563, -0.1655, -0.0156],
[-0.1111, -0.0318, 0.0141, ..., -0.0518, -0.1068, -0.1271],
[-0.0317, -0.0441, -0.0306, ..., -0.1049, -0.1940, -0.1919],
[-0.1200, 0.0277, -0.0372, ..., -0.0930, -0.0627, 0.0143],
[-0.1204, -0.0806, -0.0168, ..., -0.0599, -0.1932, -0.0967]]],
grad_fn=<NativeLayerNormBackward>),
tensor([[-9.7530e-02, 1.6328e-01, 9.3202e-03, 1.1010e-01, 7.3047e-02,
-1.7635e-01, 1.0046e-01, -1.4826e-02, 9.2583e-
............
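As a short follow-up, the tuple returned above can be unpacked into the per-token hidden states and the pooled output. The sketch below is not from the original answer and rests on two assumptions: an older transformers version that returns plain tuples (as the output above suggests), and swapping in the 'bert-base-cased' vocabulary instead of 'bert-base-uncased', since the SpanBERT checkpoints are cased.
import torch
from transformers import AutoModel, BertTokenizer

model = AutoModel.from_pretrained('spanbert_hf_base/')
# The SpanBERT checkpoints are cased, so the cased BERT vocabulary is arguably the closer match.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

ids = torch.tensor(tokenizer.encode('hi this is me, mr. meeseeks',
                                    add_special_tokens=True, max_length=512)).unsqueeze(0)
out = model(ids)

last_hidden_state = out[0]   # (batch_size, sequence_length, 768) for the base model
pooled_output = out[1]       # (batch_size, 768)
print(last_hidden_state.shape, pooled_output.shape)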