Huggingface Electra - Load model trained with google implementation error: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

I trained an ELECTRA model from scratch using the Google implementation code:

python run_pretraining.py --data-dir gc://bucket-electra/dataset/ --model-name greek_electra --hparams hparams.json

with this JSON of hyperparameters:

{
  "embedding_size": 768,
  "max_seq_length": 512,
  "train_batch_size": 128,
  "vocab_size": 100000,
  "model_size": "base",
  "num_train_steps": 1500000
}

After training the model, I used the convert_electra_original_tf_checkpoint_to_pytorch.py script from the transformers library to convert the checkpoint:

python convert_electra_original_tf_checkpoint_to_pytorch.py --tf_checkpoint_path output/models/transformer/greek_electra --config_file resources/hparams.json --pytorch_dump_path output/models/transformer/discriminator  --discriminator_or_generator "discriminator"

Now I am trying to load the model:

from transformers import ElectraForPreTraining

model = ElectraForPreTraining.from_pretrained('discriminator')

but I get the following error:

Traceback (most recent call last):
  File "~/.local/lib/python3.9/site-packages/transformers/configuration_utils.py", line 427, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "~/.local/lib/python3.9/site-packages/transformers/configuration_utils.py", line 510, in _dict_from_json_file
    text = reader.read()
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
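The error itself can be reproduced with the standard library alone. My assumption about the cause: from_pretrained expects a UTF-8 config.json in the model directory, but ends up reading binary data, so the JSON reader hits a non-UTF-8 byte such as 0x80:

```python
import json
import os
import tempfile

# Write binary (non-UTF-8) bytes where a config.json is expected.
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "wb") as f:
    f.write(b"binary checkpoint data: " + bytes([0x80, 0x02]))

# Reading it as UTF-8 JSON fails exactly like the traceback above.
reason = None
try:
    with open(path, encoding="utf-8") as f:
        json.load(f)
except UnicodeDecodeError as err:
    reason = err.reason

print(reason)  # -> invalid start byte
```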

Any idea what is causing this and how to fix it?

It seems @npit was right. The output of convert_electra_original_tf_checkpoint_to_pytorch.py does not include the config I provided (hparams.json), so I created an ElectraConfig object with the same parameters and passed it to the from_pretrained function. That solved the problem.
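A minimal sketch of that fix, assuming transformers is installed. The values mirror hparams.json above; note that the Google implementation's max_seq_length corresponds to max_position_embeddings in ElectraConfig:

```python
from transformers import ElectraConfig

# Rebuild the configuration with the same hyperparameters used for
# pretraining (max_seq_length in the Google hparams becomes
# max_position_embeddings here).
config = ElectraConfig(
    embedding_size=768,
    max_position_embeddings=512,
    vocab_size=100000,
)
```

With this in hand, the load becomes ElectraForPreTraining.from_pretrained('discriminator', config=config); passing the config explicitly means from_pretrained no longer tries to find and UTF-8-decode a config.json inside the checkpoint directory.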