Questions when training language models from scratch with Huggingface
I am following the guide here (https://github.com/huggingface/blog/blob/master/how-to-train.md, https://huggingface.co/blog/how-to-train) to train a RoBERTa-like model from scratch (with my own tokenizer and dataset).
However, when I run run_mlm.py (https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) to train my model on the masked language modeling task, the following message appears:
All model checkpoint weights were used when initializing RobertaForMaskedLM.
All the weights of RobertaForMaskedLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForMaskedLM for predictions without further training.
I am wondering whether this means I am training "from scratch" on top of RoBERTa's pretrained weights. If it is indeed training from the pretrained weights, is there a way to use randomly initialized weights instead?
==== Update 2021/10/26 ====
I am training the model on the Masked Language Modeling task with the following command:
python transformer_run_mlm.py \
--model_name_or_path roberta-base \
--config_name ./my_dir/ \
--tokenizer_name ./my_dir/ \
--no_use_fast_tokenizer \
--train_file ./my_own_training_file.txt \
--validation_split_percentage 10 \
--line_by_line \
--output_dir /my_output_dir/ \
--do_train \
--do_eval \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 16 \
--learning_rate 1e-4 \
--max_seq_length 1024 \
--seed 42 \
--num_train_epochs 100
./my_dir/ consists of three files:
config.json is generated by the following code:
from transformers import RobertaModel
model = RobertaModel.from_pretrained('roberta-base')
model.config.save_pretrained(MODEL_CONFIG_PATH)
The content is as follows:
{
"_name_or_path": "roberta-base",
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"transformers_version": "4.12.0.dev0",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 50265
}
vocab.json and merges.txt are generated by the following code:
from tokenizers.implementations import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=OUTPUT_DIR + "seed.txt", vocab_size=52_000, min_frequency=2, special_tokens=[
"<s>",
"<pad>",
"</s>",
"<unk>",
"<mask>",
])
# Save files to disk
tokenizer.save_model(MODEL_CONFIG_PATH)
Here is the content of vocab.json (excerpt):
{"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":12
Here is the content of merges.txt (excerpt):
#version: 0.2 - Trained by `huggingface/tokenizers`
e n
T o
k en
Ġ To
ĠTo ken
E R
V ER
VER B
a t
P R
PR O
P N
PRO PN
Ġ n
U N
N O
NO UN
E n
i t
t it
En tit
Entit y
b j
c o
Ġ a
I think you are mixing up two different actions.
- The first guide you posted explains how to create a model from scratch.
- The run_mlm.py script is used to fine-tune (see line 17 of the script) an already existing model.
So if you just want to create a model from scratch, step 1 should be enough. If you want to fine-tune the model you just created, you have to run step 2. Note that training a RoBERTa model from scratch already implies an MLM phase, so this step is only useful if, in the future, you obtain a different dataset and want to improve your model by further fine-tuning it.
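To make the "from scratch" option concrete, here is a minimal sketch (assuming the config you saved to ./my_dir/ in the update above) of building a RoBERTa MLM model with randomly initialized weights, i.e. without loading any checkpoint:
from transformers import RobertaConfig, RobertaForMaskedLM

# Load only the architecture definition (config.json) from your own directory;
# no checkpoint is downloaded here.
config = RobertaConfig.from_pretrained("./my_dir/")

# Instantiating the model directly from the config gives randomly initialized weights,
# unlike RobertaForMaskedLM.from_pretrained("roberta-base"), which loads the pretrained checkpoint.
model = RobertaForMaskedLM(config)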
However, you are not loading the model you just created; instead, you are loading the roberta-base model from the Huggingface repository: --model_name_or_path roberta-base \
The warning you see tells you that you loaded a model (roberta-base, to be clear) that has already been pretrained on the Masked Language Modeling (MaskedLM) task; in other words, you loaded a checkpoint of that model.
So, quoting:
If your task is similar to the task the model of the checkpoint was
trained on, you can already use RobertaForMaskedLM for predictions
without further training.
This means that if your task is MaskedLM, the model is ready to use as-is. If you want to use it for another task (for example, question answering), you should probably fine-tune it, because the model as-is would not give satisfactory results.
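As an illustration (the question-answering model class here is just an example, not part of your setup), loading the same checkpoint into a different task head looks like this; the encoder weights come from the checkpoint, but the task head is newly initialized and therefore still needs fine-tuning:
from transformers import RobertaForQuestionAnswering

# The encoder is initialized from the roberta-base checkpoint, but the QA head is new;
# Transformers will warn that some weights were newly initialized and should be trained.
model = RobertaForQuestionAnswering.from_pretrained("roberta-base")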
In conclusion: if you want to create a model from scratch to perform MLM, follow step 1. This creates a model that can perform MLM.
If you want to fine-tune an already existing model (see the Huggingface repository) on MLM, follow step 2.
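Concretely, for the command in your update, a likely fix (assuming the version of run_mlm.py linked above, which trains a new model from scratch when --model_name_or_path is omitted) is to drop --model_name_or_path roberta-base and pass --model_type roberta instead, keeping your own config and tokenizer, for example:
python transformer_run_mlm.py \
--model_type roberta \
--config_name ./my_dir/ \
--tokenizer_name ./my_dir/ \
--train_file ./my_own_training_file.txt \
--line_by_line \
--do_train \
--do_eval \
--output_dir /my_output_dir/
With --model_name_or_path omitted, that script builds the model with AutoModelForMaskedLM.from_config(config), so training starts from randomly initialized weights rather than from the roberta-base checkpoint.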