Questions when training language models from scratch with Huggingface

I am following the guide here (https://github.com/huggingface/blog/blob/master/how-to-train.md, https://huggingface.co/blog/how-to-train) to train a RoBERTa-like model from scratch (with my own tokenizer and dataset).

However, when I run run_mlm.py (https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) to train my model on the masked language modeling task, the following messages appear:

All model checkpoint weights were used when initializing RobertaForMaskedLM.

All the weights of RobertaForMaskedLM were initialized from the model checkpoint at roberta-base.

If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForMaskedLM for predictions without further training.

I would like to know whether this means I am training "from scratch" starting from RoBERTa's pretrained weights. If it is indeed training from the pretrained weights, is there a way to use randomly initialized weights instead?
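
To make the question concrete, what I am hoping for is something along these lines (just a sketch of my intent, not what run_mlm.py actually does; ./my_dir/ is my own config directory described below):

from transformers import RobertaConfig, RobertaForMaskedLM

# Build the model from the config alone: same architecture as roberta-base,
# but every weight is randomly initialized and no pretrained checkpoint is loaded.
config = RobertaConfig.from_pretrained("./my_dir/")
model = RobertaForMaskedLM(config)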

==== Update 2021/10/26 ====

I am training the model on the Masked Language Modeling task with the following command:

python transformer_run_mlm.py \
--model_name_or_path roberta-base  \
--config_name ./my_dir/ \
--tokenizer_name ./my_dir/ \
--no_use_fast_tokenizer \
--train_file ./my_own_training_file.txt \
--validation_split_percentage 10 \
--line_by_line \
--output_dir /my_output_dir/ \
--do_train \
--do_eval \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 16 \
--learning_rate 1e-4 \
--max_seq_length 1024 \
--seed 42 \
--num_train_epochs 100 

./my_dir/ consists of three files:

config.json was generated by the following code:

from transformers import RobertaModel

# Download roberta-base and save only its configuration to disk.
model = RobertaModel.from_pretrained('roberta-base')
model.config.save_pretrained(MODEL_CONFIG_PATH)

Its contents are as follows:

{
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.12.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
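
(For comparison, the how-to-train guide builds the config programmatically rather than copying it from roberta-base, roughly like this, so that parameters such as vocab_size can be chosen to match the newly trained tokenizer; the values below are the guide's, not mine:)

from transformers import RobertaConfig

# Construct a RoBERTa-style config directly (values taken from the how-to-train guide).
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
config.save_pretrained(MODEL_CONFIG_PATH)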

vocab.json and merges.txt were generated by the following code:

from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train a byte-level BPE tokenizer on the seed corpus, with the standard RoBERTa special tokens.
tokenizer.train(files=OUTPUT_DIR + "seed.txt", vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save vocab.json and merges.txt to disk
tokenizer.save_model(MODEL_CONFIG_PATH)
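
(A quick way to sanity-check the saved vocab.json and merges.txt is to load them back as a RoBERTa-style tokenizer, roughly like this:)

from transformers import RobertaTokenizerFast

# Load the freshly trained vocab.json / merges.txt from the same directory.
tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_CONFIG_PATH, max_len=512)
print(tokenizer.tokenize("Hello world!"))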

Here is the content of vocab.json (excerpt):

{"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":12

Here is the content of merges.txt (excerpt):

#version: 0.2 - Trained by `huggingface/tokenizers`
e n
T o
k en
Ġ To
ĠTo ken
E R
V ER
VER B
a t
P R
PR O
P N
PRO PN
Ġ n
U N
N O
NO UN
E n
i t
t it
En tit
Entit y
b j
c o
Ġ a

I think you are mixing up two different actions:

  1. The first guide you posted explains how to create a model from scratch.
  2. The run_mlm.py script is used to fine-tune an already existing model (see line 17 of the script).

So, if you only want to create a model from scratch, step 1 is enough. If you want to fine-tune the model you just created, you then have to run step 2. Note that training a RoBERTa model from scratch already includes the MLM phase, so step 2 only makes sense if, in the future, you have a different dataset and want to improve the model by fine-tuning it further.

However, you are not loading the model you just created; instead, you are loading the roberta-base model from the Huggingface repository: --model_name_or_path roberta-base \
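
If I read run_mlm.py correctly, the model-loading logic boils down to something like the sketch below: passing --model_name_or_path pulls in the pretrained checkpoint, while omitting it (and keeping only --config_name and --tokenizer_name) makes the script build a randomly initialized model from your config.

from transformers import AutoConfig, AutoModelForMaskedLM

# Roughly what the script does when --model_name_or_path roberta-base is given:
# the pretrained checkpoint is downloaded and all of its weights are reused.
model_pretrained = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Roughly what it does when --model_name_or_path is omitted: the model is built
# from the config alone, so all weights start out randomly initialized.
config = AutoConfig.from_pretrained("./my_dir/")
model_from_scratch = AutoModelForMaskedLM.from_config(config)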


The warning tells you that you loaded a model (roberta-base, to be clear) that has already been pretrained on the Masked Language Modeling (MaskedLM) task, i.e. you loaded a checkpoint of that model. So, quoting:

If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForMaskedLM for predictions without further training.

This means that if your goal is the MaskedLM task, the model is already usable as is. If you want to use it for another task (for example, question answering), you should probably fine-tune it, because the model used as is will not give satisfactory results.
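
As an illustration, loading this checkpoint for a different task gives you a freshly initialized task head on top of the pretrained encoder, and that head is precisely what fine-tuning has to train (a sketch; RobertaForQuestionAnswering is just one possible task class):

from transformers import RobertaForQuestionAnswering

# The encoder weights come from the roberta-base checkpoint, but the
# question-answering head is newly (randomly) initialized, so the model
# must be fine-tuned before it gives useful QA predictions.
model = RobertaForQuestionAnswering.from_pretrained("roberta-base")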


Finally: if you want to create a model from scratch that performs MLM, follow step 1. This will create a model capable of MLM.

If you want to fine-tune an already existing model (see the Huggingface repository) on MLM, follow step 2.