BertModel and BertForMaskedLM weights count

Question

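I want to understand the BertForMaskedLM model. In the Hugging Face GitHub code, BertForMaskedLM is a BERT model with 2 additional linear layers of shape (in 768, out 768) and (in 768, out 30522). So the count of all weights should be the weights of BertModel + 768 * 768 + 768 * 30522, but when I check, the numbers don't match.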

from transformers import BertModel, BertForMaskedLM
import torch

bertmodel = BertModel.from_pretrained('bert-base-uncased')
bertForMaskedLM = BertForMaskedLM.from_pretrained('bert-base-uncased')

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(bertmodel)
#output 109482240
count_parameters(bertForMaskedLM)
#output 109514298

109482240 + 768 * 768 + 768 * 30522 != 109514298

What am I doing wrong?

Answer 1

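Summing numel() over model.parameters() counts every parameter tensor only once, even when that tensor is shared between layers, so the result can disagree with a layer-by-layer count. That is what happens in your case. For a per-layer breakdown, try the following instead: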

from torchinfo import summary

print(summary(bertmodel))

Output:

print(summary(bertForMaskedLM))

Output:

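From the above output we can see that the total trainable parameters for the two models are:
bertmodel: 109,482,240
bertForMaskedLM: 132,955,194

To understand the difference, let's look at the last module of the two models (the rest of the base model is exactly the same):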

bertmodel:

(pooler): BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)

bertForMaskedLM:

(cls): BertOnlyMLMHead(
  (predictions): BertLMPredictionHead(
    (transform): BertPredictionHeadTransform(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    )
    (decoder): Linear(in_features=768, out_features=30522, bias=True)
  )
)

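The dense layer in the pooler and the dense layer in the transform have the same size (768 * 768 + 768 parameters), so they cancel out. The only additions are the LayerNorm layer (2 * 768 parameters, the layer's gamma and beta) and the decoder layer (769 * 30522 parameters: for y = A*X + B, where A has size (n x m) and B has size (n x 1), the total is n x (m+1)).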
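The per-layer counts above can be checked with plain arithmetic. This is only a minimal sketch; the helper names linear_params and layer_norm_params are made up for illustration and are not part of transformers or torch:

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a fully connected layer y = A*X + B."""
    weight = out_features * in_features          # A has size (n x m)
    return weight + (out_features if bias else 0)  # B has size (n x 1)


def layer_norm_params(normalized_dim: int) -> int:
    """LayerNorm with elementwise affine: one gamma and one beta per feature."""
    return 2 * normalized_dim


hidden, vocab = 768, 30522

decoder = linear_params(hidden, vocab)     # 769 * 30522
layer_norm = layer_norm_params(hidden)     # 2 * 768
dense = linear_params(hidden, hidden)      # same for pooler.dense and transform.dense

print(decoder)     # 23471418
print(layer_norm)  # 1536
print(dense)       # 590592
```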

Parameters of bertForMaskedLM = 109482240 + 2 * 768 + 769 * 30522 = 132955194
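This also accounts for the 109,514,298 reported in the question. In the Hugging Face implementation the decoder weight is, by default, tied to the input word-embedding matrix, so model.parameters() yields that 768 x 30522 tensor only once, while a per-layer summary counts it in both places. A quick arithmetic check (the variable names are illustrative, and the BertModel total is taken from the question rather than recomputed):

```python
hidden, vocab = 768, 30522
bert_model_total = 109_482_240      # BertModel count reported in the question

layer_norm = 2 * hidden             # gamma and beta
decoder_weight = hidden * vocab     # shared with the word embeddings when tied
decoder_bias = vocab

# Layer-by-layer count: the tied matrix appears in the embeddings and in the decoder.
per_layer_total = bert_model_total + layer_norm + decoder_weight + decoder_bias

# Unique-tensor count: the tied matrix is counted once.
unique_total = per_layer_total - decoder_weight

print(per_layer_total)  # 132955194
print(unique_total)     # 109514298
```

So both numbers describe the same model; they differ only in whether the shared matrix is counted once or twice.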


