ValueError: The state dictionary of the model you are training to load is corrupted. Are you sure it was properly saved?

Goal: modify this notebook to use the albert-base-v2 model.

Kernel: conda_pytorch_p36.

Section 1.2 instantiates the model from the files in the ./MRPC/ directory.

However, I believe those files are for a BERT model, not ALBERT. So I downloaded an ALBERT config.json file from here. It is this change that causes the error.

What else do I need to do to instantiate the ALBERT model?


The ./MRPC/ directory:

!curl https://download.pytorch.org/tutorial/MRPC.zip --output MRPC.zip
!unzip -n MRPC.zip
from os import listdir
from os.path import isfile, join

mypath = './MRPC/'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
onlyfiles
---

['tokenizer_config.json',
 'special_tokens_map.json',
 'pytorch_model.bin',
 'config.json',
 'training_args.bin',
 'added_tokens.json',
 'vocab.txt']

Configuration:

# The output directory for the fine-tuned model, $OUT_DIR.
configs.output_dir = "./MRPC/"

# The data directory for the MRPC task in the GLUE benchmark, $GLUE_DIR/$TASK_NAME.
configs.data_dir = "./glue_data/MRPC"

# The model name or path for the pre-trained model.
configs.model_name_or_path = "albert-base-v2"
# The maximum length of an input sequence
configs.max_seq_length = 128

# Prepare GLUE task.
configs.task_name = "MRPC".lower()
configs.processor = processors[configs.task_name]()
configs.output_mode = output_modes[configs.task_name]
configs.label_list = configs.processor.get_labels()
configs.model_type = "albert".lower()
configs.do_lower_case = True

# Set the device, batch size, topology, and caching flags.
configs.device = "cpu"
configs.eval_batch_size = 1
configs.n_gpu = 0
configs.local_rank = -1
configs.overwrite_cache = False

Model:

model = AlbertForSequenceClassification.from_pretrained(configs.output_dir)  # !
model.to(configs.device)

Traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-0936fd8cbb17> in <module>
      1 # load model
----> 2 model = AlbertForSequenceClassification.from_pretrained(configs.output_dir)
      3 model.to(configs.device)
      4 
      5 # quantize model

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   1460                     pretrained_model_name_or_path,
   1461                     ignore_mismatched_sizes=ignore_mismatched_sizes,
-> 1462                     _fast_init=_fast_init,
   1463                 )
   1464 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/modeling_utils.py in _load_state_dict_into_model(cls, model, state_dict, pretrained_model_name_or_path, ignore_mismatched_sizes, _fast_init)
   1601             if any(key in expected_keys_not_prefixed for key in loaded_keys):
   1602                 raise ValueError(
-> 1603                     "The state dictionary of the model you are training to load is corrupted. Are you sure it was "
   1604                     "properly saved?"
   1605                 )

ValueError: The state dictionary of the model you are training to load is corrupted. Are you sure it was properly saved?
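
For what it's worth, the error is consistent with the files disagreeing with each other: pytorch_model.bin in ./MRPC/ still holds the tutorial's fine-tuned BERT weights, while the swapped-in config.json describes an ALBERT architecture, and that mismatch appears to be what trips the consistency check inside from_pretrained. One quick way to confirm what the checkpoint actually contains is to inspect its state-dict key prefixes directly (a minimal sketch; the path assumes the tutorial's ./MRPC/ layout):

import torch

# Load the raw checkpoint without building a model around it.
state_dict = torch.load('./MRPC/pytorch_model.bin', map_location='cpu')
print(list(state_dict.keys())[:5])
# Keys prefixed with 'bert.' indicate a BERT checkpoint; an ALBERT
# checkpoint would carry 'albert.' prefixes instead.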

Exactly what I was looking for: textattack/albert-base-v2-MRPC, an albert-base-v2 model already fine-tuned on MRPC.

How to use it in the transformers library:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("textattack/albert-base-v2-MRPC")

model = AutoModelForSequenceClassification.from_pretrained("textattack/albert-base-v2-MRPC")
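
From there, inference follows the usual sequence-classification pattern. A minimal sketch (the example sentences are made up; for MRPC, label 1 means the pair is a paraphrase and 0 means it is not):

import torch

# Encode a sentence pair and classify it with the fine-tuned model.
inputs = tokenizer(
    "The company said its profits rose sharply.",
    "Profits increased sharply, the company said.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # 1 = paraphrase, 0 = not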

Or just clone the model repository:

git lfs install
git clone https://huggingface.co/textattack/albert-base-v2-MRPC
# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1
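
Once cloned, the local directory can be passed to from_pretrained in place of the hub name (assuming the clone sits next to your notebook):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("./albert-base-v2-MRPC")
model = AutoModelForSequenceClassification.from_pretrained("./albert-base-v2-MRPC")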