How do I load a fine-tuned AllenNLP BERT-SRL model using BertPreTrainedModel.from_pretrained()?

I fine-tuned a BERT model for semantic role labeling using AllenNLP. This produces a model directory (a serialization directory, if I remember correctly?) containing the following:

best.th
config.json
meta.json
metrics_epoch_0.json
metrics_epoch_10.json
metrics_epoch_11.json
metrics_epoch_12.json
metrics_epoch_13.json
metrics_epoch_14.json
metrics_epoch_1.json
metrics_epoch_2.json
metrics_epoch_3.json
metrics_epoch_4.json
metrics_epoch_5.json
metrics_epoch_6.json
metrics_epoch_7.json
metrics_epoch_8.json
metrics_epoch_9.json
metrics.json
model_state_e14_b0.th
model_state_e15_b0.th
model.tar.gz
out.log
training_state_e14_b0.th
training_state_e15_b0.th
vocabulary

where vocabulary is a folder containing labels.txt and non_padded_namespaces.txt.

I would now like to use this fine-tuned BERT model as the initialization when learning a related task, event extraction, using this library: https://github.com/wilsonlau-uw/BERT-EE (i.e. I want to exploit some transfer learning). The config.ini file has a line fine_tuned_path where I can specify an already fine-tuned model to use here. I provided the path to the AllenNLP serialization directory and got the following error:

2022-04-05 13:07:28,112 -  INFO - setting seed 23
2022-04-05 13:07:28,113 -  INFO - loading fine tuned model in /data/projects/SRL/ser_pure_clinical_bert-large_thyme_and_ontonotes/
Traceback (most recent call last):
  File "main.py", line 65, in <module>
    model = BERT_EE()
  File "/data/projects/SRL/BERT-EE/model.py", line 88, in __init__
    self.__build(self.use_fine_tuned)
  File "/data/projects/SRL/BERT-EE/model.py", line 118, in __build
    self.__get_pretrained(self.fine_tuned_path)
  File "/data/projects/SRL/BERT-EE/model.py", line 110, in __get_pretrained
    self.__model = BERT_EE_model.from_pretrained(path)
  File "/home/richier/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_utils.py", line 1109, in from_pretrained
    f"Error no file named {[WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME + '.index', FLAX_WEIGHTS_NAME]} found in "
OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5', 'model.ckpt.index', 'flax_model.msgpack'] found in directory /data/projects/SRL/ser_pure_clinical_bert-large_thyme_and_ontonotes/ or `from_tf` and `from_flax` set to False.

Naturally, the serialization directory has none of those files, hence the error. I tried unpacking model.tar.gz, but it only contains:

config.json
weights.th
vocabulary/
vocabulary/.lock
vocabulary/labels.txt
vocabulary/non_padded_namespaces.txt
meta.json

Digging into the codebase of the GitHub repo I linked above, I can see that BERT_EE_model inherits from BertPreTrainedModel from the transformers library, so the trick seems to be getting the AllenNLP model into a format that BertPreTrainedModel.from_pretrained() can load...?

Any help would be greatly appreciated!

I believe I have figured it out. Basically, I had to re-load my model archive, access the underlying model and tokenizer, and then save them:

from allennlp.models.archival import load_archive
from allennlp_models.structured_prediction import SemanticRoleLabeler, srl, srl_bert

archive = load_archive('ser_pure_clinical_bert-large_thyme_and_ontonotes/model.tar.gz')

bert_model = archive.model.bert_model  # type is transformers.models.bert.modeling_bert.BertModel
bert_model.save_pretrained('ser_pure_clinical_bert-large_thyme_and_ontonotes_save_pretrained/')

bert_tokenizer = archive.dataset_reader.bert_tokenizer
bert_tokenizer.save_pretrained('ser_pure_clinical_bert-large_thyme_and_ontonotes_save_pretrained/')
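The reason this works is that save_pretrained() writes exactly the files from_pretrained() was complaining about. Here is a minimal, self-contained sketch of that round trip using a tiny randomly initialized BertModel so it runs offline; with the real archive you would call save_pretrained() on archive.model.bert_model instead, as above:

```python
import os
import tempfile

from transformers import BertConfig, BertModel

# A deliberately tiny config so this runs quickly; the real model's config
# comes from the AllenNLP archive instead.
config = BertConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)

with tempfile.TemporaryDirectory() as tmp:
    # Writes config.json plus a weights file (pytorch_model.bin or
    # model.safetensors, depending on your transformers version).
    model.save_pretrained(tmp)
    assert os.path.exists(os.path.join(tmp, "config.json"))

    # from_pretrained() now finds the weight file it previously could not.
    reloaded = BertModel.from_pretrained(tmp)
```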

(This last part probably isn't of interest to most people. Also, in the config.ini I mentioned, the directory 'ser_pure_clinical_bert-large_thyme_and_ontonotes_save_pretrained' needs to be passed to the line pretrained_model_name_or_path rather than to fine_tuned_path.)
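For reference, the config.ini change might look roughly like this. This is a hedged sketch: only the two keys below are confirmed to exist in the BERT-EE config; the section name and any surrounding keys are hypothetical and may differ in the actual file:

```ini
; hypothetical sketch -- section name may differ in the real BERT-EE config.ini
[model]
; leave fine_tuned_path empty and point pretrained_model_name_or_path
; at the directory produced by save_pretrained() above
pretrained_model_name_or_path = ser_pure_clinical_bert-large_thyme_and_ontonotes_save_pretrained
fine_tuned_path =
```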