Weights of pre-trained BERT model not initialized

I am using the Language Interpretability Toolkit (LIT) to load and analyze a BERT model that I pre-trained on an NER task.

However, when I start the LIT script and pass it the path to my pre-trained model, it cannot initialize the weights and tells me:

    modeling_utils.py:648] loading weights file bert_remote/examples/token-classification/Data/Models/results_21_03_04_cleaned_annotations/04.03._8_16_5e-5_cleaned_annotations/04-03-2021 (15.22.23)/pytorch_model.bin
    modeling_utils.py:739] Weights of BertForTokenClassification not initialized from pretrained model: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
    modeling_utils.py:745] Weights from pretrained model not used in BertForTokenClassification: ['bert.embeddings.position_ids']

It then simply uses the bert-base-german-cased version of BERT, which of course does not have my custom labels and therefore cannot predict anything. I suspect this has something to do with PyTorch, but I cannot find the error.

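To check whether the checkpoint weights actually end up in the loaded model, one can compare them against the saved state dict directly; a minimal sketch (the checkpoint path is a placeholder for the directory shown in the log above):

    import torch
    import transformers

    ckpt_dir = "path/to/my/checkpoint"  # placeholder for the model directory from the log
    state = torch.load(f"{ckpt_dir}/pytorch_model.bin", map_location="cpu")
    model = transformers.AutoModelForTokenClassification.from_pretrained(ckpt_dir)

    # If the fine-tuned classification head was loaded, it is identical to the tensor in the file.
    print(torch.equal(model.classifier.weight, state["classifier.weight"]))
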
In case it is relevant, here is how I load my dataset into the CoNLL 2003 format (a modification of the dataloader script found here):

    def __init__(self):

        # Read ConLL Test Files

        self._examples = []

        data_path = "lit_remote/lit_nlp/examples/datasets/NER_Data"
        with open(os.path.join(data_path, "test.txt"), "r", encoding="utf-8") as f:
            lines = f.readlines()

        for line in lines[:2000]:
            if line != "\n":
                token, label = line.strip().split(" ")  # strip the trailing newline before splitting
                self._examples.append({
                    'token': token,
                    'label': label,
                })
            else:
                self._examples.append({
                    'token': "\n",
                    'label': "O"
                })

    def spec(self):
        return {
            'token': lit_types.Tokens(),
            'label': lit_types.SequenceTags(align="token"),
        }

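For context, the snippet above leaves out the imports and the class declaration; assuming LIT's usual dataset API, the surrounding scaffolding would look roughly like this (the class name NERData is a placeholder):

    import os

    from lit_nlp.api import dataset as lit_dataset
    from lit_nlp.api import types as lit_types


    class NERData(lit_dataset.Dataset):
        """CoNLL-2003-style NER test split, one token/label pair per line."""

        # __init__ and spec as shown above.
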
And this is how I initialize the model and start the LIT server (a modification of the simple_pytorch_demo.py script found here):

    def __init__(self, model_name_or_path):
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            model_name_or_path)
        model_config = transformers.AutoConfig.from_pretrained(
            model_name_or_path,
            num_labels=15,  # FIXME CHANGE
            output_hidden_states=True,
            output_attentions=True,
        )
        # This is a just a regular PyTorch model.
        self.model = _from_pretrained(
            transformers.AutoModelForTokenClassification,
            model_name_or_path,
            config=model_config)
        self.model.eval()

    ## Some omitted snippets here

    def input_spec(self) -> lit_types.Spec:
        return {
            "token": lit_types.Tokens(),
            "label": lit_types.SequenceTags(align="token")
        }

    def output_spec(self) -> lit_types.Spec:
        return {
            "tokens": lit_types.Tokens(),
            "probas": lit_types.MulticlassPreds(parent="label", vocab=self.LABELS),
            "cls_emb": lit_types.Embeddings()

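For completeness, the server launch itself (omitted above) follows the pattern of simple_pytorch_demo.py; a rough sketch, where NERModel and NERData are placeholder names for the wrapper classes shown above:

    from absl import app

    from lit_nlp import dev_server
    from lit_nlp import server_flags


    def main(_):
        # NERModel / NERData are the model and dataset wrappers sketched above.
        models = {"ner": NERModel("path/to/my/checkpoint")}  # placeholder path
        datasets = {"ner_test": NERData()}
        lit_demo = dev_server.Server(models, datasets, **server_flags.get_flags())
        return lit_demo.serve()


    if __name__ == "__main__":
        app.run(main)
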
This actually seems to be expected behavior. In the documentation of the GPT models, the HuggingFace team writes:

This will issue a warning about some of the pretrained weights not being used and some weights being randomly initialized. That’s because we are throwing away the pretraining head of the BERT model to replace it with a classification head which is randomly initialized.

So the fine-tuning does not seem to be the problem. In the use case described above, it also works fine despite the warning.
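
To see exactly which tensors the warning refers to, one can diff the keys in the checkpoint against the keys the model expects; a minimal sketch (the path is again a placeholder):

    import torch
    import transformers

    ckpt_dir = "path/to/my/checkpoint"  # placeholder
    state = torch.load(f"{ckpt_dir}/pytorch_model.bin", map_location="cpu")
    model = transformers.AutoModelForTokenClassification.from_pretrained(ckpt_dir)

    expected = set(model.state_dict().keys())
    saved = set(state.keys())
    print(expected - saved)  # e.g. the pooler weights, which are randomly re-initialized
    print(saved - expected)  # e.g. bert.embeddings.position_ids, which is simply ignored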