TFBertMainLayer gets less accuracy compared to TFBertModel

Question

I had a problem saving the weights of a TFBertModel wrapped in Keras. The problem is described here in a GitHub issue and here in Stack Overflow. The solution proposed in both cases is to use

 config = BertConfig.from_pretrained(transformer_model_name)
 bert = TFBertMainLayer(config=config,trainable=False)

instead of

 bert = TFBertModel.from_pretrained(transformer_model_name, trainable=False)

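The problem is that when I change my model to the former code, the accuracy drops by 10 percent, while the parameter count is the same in both cases. What is the reason, and how can I prevent it?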

Answer 1

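It seems that the performance regression in the snippet that instantiates the MainLayer directly occurs because the pre-trained weights are not being loaded. You can load the weights in either of two ways: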

1. Calling TFBertModel.from_pretrained and grabbing the MainLayer from the loaded TFBertModel.

2. Creating the MainLayer directly, then loading the weights in a similar way to from_pretrained.

Why this happens

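When you call TFBertModel.from_pretrained, it uses the function TFPreTrainedModel.from_pretrained (via inheritance), which handles a few things, including downloading, caching, and loading the model weights.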

class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin):
    ...
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        ...
        # Load model
        if pretrained_model_name_or_path is not None:
            if os.path.isfile(os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)):
                # Load from a TF 2.0 checkpoint
                archive_file = os.path.join(pretrained_model_name_or_path, TF2_WEIGHTS_NAME)
            ...
            resolved_archive_file = cached_path(
                    archive_file,
                    cache_dir=cache_dir,
                    force_download=force_download,
                    proxies=proxies,
                    resume_download=resume_download,
                    local_files_only=local_files_only,
            )
            ...
            model.load_weights(resolved_archive_file, by_name=True)

(If you read the actual code, a lot has been elided with ... above.)

However, when you instantiate TFBertMainLayer directly, it does not do any of this setup work.

@keras_serializable
class TFBertMainLayer(tf.keras.layers.Layer):
    config_class = BertConfig

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        self.num_hidden_layers = config.num_hidden_layers
        self.initializer_range = config.initializer_range
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
        self.return_dict = config.use_return_dict
        self.embeddings = TFBertEmbeddings(config, name="embeddings")
        self.encoder = TFBertEncoder(config, name="encoder")
        self.pooler = TFBertPooler(config, name="pooler")
   
    # ... rest of the class

Basically, you need to make sure that these weights are actually being loaded.
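To make the difference concrete, here is a toy sketch in plain Python. It is not the transformers API: ToyMainLayer and ToyModel are hypothetical stand-ins for TFBertMainLayer and TFBertModel, and the "checkpoint" is just a list of numbers. It only illustrates why the parameter count can match while the values do not.

```python
import random


class ToyMainLayer:
    """Hypothetical stand-in for a main layer: the constructor only random-initialises."""

    def __init__(self, size, seed=None):
        rng = random.Random(seed)
        self.weights = [rng.gauss(0.0, 0.02) for _ in range(size)]


class ToyModel:
    """Hypothetical stand-in for a model class that owns a main layer."""

    def __init__(self, size):
        self.main_layer = ToyMainLayer(size)

    @classmethod
    def from_pretrained(cls, checkpoint):
        model = cls(len(checkpoint))
        # This is the step that a bare constructor call never performs.
        model.main_layer.weights = list(checkpoint)
        return model


checkpoint = [0.5, -1.25, 2.0]  # made-up "pre-trained" values
loaded = ToyModel.from_pretrained(checkpoint).main_layer
fresh = ToyMainLayer(len(checkpoint), seed=0)

print(len(fresh.weights) == len(loaded.weights))  # True: same parameter count
print(fresh.weights == loaded.weights)            # False: different values
```

The same reasoning would explain the question's observation: both variants report the same number of parameters, but only one of them starts from the pre-trained values.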

Solutions

(1) Using TFAutoModel.from_pretrained

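You can rely on transformers.TFAutoModel.from_pretrained to load the model, then just grab the MainLayer field from the specific subclass of TFPreTrainedModel. For example, if you wanted to access a distilbert main layer, it would look like: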

    import transformers
    from transformers import TFDistilBertModel

    model = transformers.TFAutoModel.from_pretrained("distilbert-base-uncased")
    assert isinstance(model, TFDistilBertModel)
    main_layer = model.distilbert

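You can see in modeling_tf_distilbert.html that the MainLayer is in fact a field of the model. This is less code and less duplication, but it has a few disadvantages. It is harder to change the pre-trained model you are going to use, because you now depend on the field name: if you change the model type, you have to change the field name as well (for example, in TFAlbertModel the MainLayer field is called albert). In addition, this does not seem to be the intended way to use huggingface, so it could change under your nose, and your code could break with huggingface updates.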

class TFDistilBertModel(TFDistilBertPreTrainedModel):
    def __init__(self, config, *inputs, **kwargs):
        super().__init__(config, *inputs, **kwargs)
        self.distilbert = TFDistilBertMainLayer(config, name="distilbert")  # Embeddings

    @add_start_docstrings_to_callable(DISTILBERT_INPUTS_DOCSTRING)
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint="distilbert-base-uncased",
        output_type=TFBaseModelOutput,
        config_class=_CONFIG_FOR_DOC,
    )
    def call(self, inputs, **kwargs):
        outputs = self.distilbert(inputs, **kwargs)
        return outputs
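One possible way to soften the field-name dependency is to look the field up through the model's base_model_prefix class attribute instead of hard-coding it. This is only a sketch: get_main_layer is a hypothetical helper, and it assumes that the model stores its main layer under an attribute named after base_model_prefix (as the bert, distilbert and albert examples suggest), which is worth verifying for any other model type. The fake class below exists only so the sketch runs without downloading anything.

```python
def get_main_layer(model):
    """Return the main layer of a loaded model (hypothetical helper).

    Assumes the model exposes `base_model_prefix` and keeps its main layer
    under an attribute of that name, e.g. "bert", "distilbert" or "albert".
    """
    prefix = getattr(model, "base_model_prefix", "")
    if not prefix or not hasattr(model, prefix):
        raise AttributeError(
            f"{type(model).__name__} has no main layer under "
            f"base_model_prefix={prefix!r}"
        )
    return getattr(model, prefix)


class FakeDistilBertModel:
    """Stand-in with the same attribute layout as the excerpt above."""

    base_model_prefix = "distilbert"

    def __init__(self):
        self.distilbert = object()  # placeholder for the real main layer


fake = FakeDistilBertModel()
main_layer = get_main_layer(fake)
print(main_layer is fake.distilbert)  # True
```

With a real model this would be called as get_main_layer(transformers.TFAutoModel.from_pretrained("distilbert-base-uncased")). It still relies on an internal naming convention, so the caveat about breaking on library updates applies here too.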

(2) Re-implementing the weight loading logic from `from_pretrained`

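You can do this by essentially copy/pasting the parts of from_pretrained that are relevant to loading weights. This also has a serious disadvantage: you will be duplicating logic that can fall out of sync with the huggingface library. On the other hand, you could probably write it in a way that is more flexible and more robust to changes in the underlying model name.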
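A lighter-weight variant, short of copying library internals, might be to create the MainLayer directly and then copy the weights over from a model loaded with from_pretrained, using the standard Keras get_weights / set_weights methods. The helper below is a hypothetical sketch, not the library's own loading logic. It assumes both layers were created from the same config and that the target layer has already been built (for example by calling it once on dummy inputs) so that its variables exist. FakeLayer only mimics those two methods so the sketch runs on its own.

```python
def copy_layer_weights(source_layer, target_layer):
    """Copy weights between two layers exposing get_weights/set_weights.

    Returns the number of weight arrays copied. Raises ValueError when the
    layers do not hold the same number of weight arrays, which usually means
    the target was not built yet or was created from a different config.
    """
    source_weights = source_layer.get_weights()
    expected = len(target_layer.get_weights())
    if len(source_weights) != expected:
        raise ValueError(
            f"source has {len(source_weights)} weight arrays, target has {expected}"
        )
    target_layer.set_weights(source_weights)
    return len(source_weights)


class FakeLayer:
    """Minimal stand-in that mimics the two Keras methods used above."""

    def __init__(self, weights):
        self._weights = list(weights)

    def get_weights(self):
        return list(self._weights)

    def set_weights(self, weights):
        self._weights = list(weights)


pretrained = FakeLayer([1.0, 2.0, 3.0])  # plays the loaded main layer
fresh = FakeLayer([0.0, 0.0, 0.0])       # plays the directly created layer
copied = copy_layer_weights(pretrained, fresh)
print(copied, fresh.get_weights())  # 3 [1.0, 2.0, 3.0]
```

Note that this still needs a from_pretrained model as the source of the weights, so it avoids duplicating the download and caching logic but is not a full re-implementation.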

Conclusion

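Ideally this is something the huggingface team fixes internally, either by providing a standard function to create a MainLayer, by wrapping the weight loading logic into its own function that can be called, or by supporting serialization on the model class.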
