Does BertForSequenceClassification classify on the CLS vector?

Question

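I'm using the Huggingface Transformers package and BERT with PyTorch. I'm trying to do 4-way sentiment classification and am using BertForSequenceClassification to build a model that ends in a 4-way softmax.

My understanding from reading the BERT paper is that the final dense vector for the input CLS token serves as a representation of the whole text string: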

The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.

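So, does BertForSequenceClassification actually train and use this vector to perform the final classification?

The reason I ask is that when I print(model), it is not obvious to me that the CLS vector is being used.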

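from transformers import BertForSequenceClassification

# model_config holds the pretrained model name or path; num_labels is 4 here
model = BertForSequenceClassification.from_pretrained(
    model_config,
    num_labels=num_labels,
    output_attentions=False,
    output_hidden_states=False
)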

print(model)

Here is the bottom of the output:

        (11): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=4, bias=True)

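I see that there is a pooling layer, BertPooler, leading to a Dropout, leading to a Linear layer that presumably performs the final 4-way softmax. However, the role of BertPooler is not clear to me. Does it operate only on the hidden state of CLS, or does it do some kind of pooling over the hidden states of all the input tokens?

Thanks for any help.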

Answer 1

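Short answer: yes, you are correct. The [CLS] token (and only that token) is what BertForSequenceClassification classifies on.

Looking at the implementation of BertPooler shows that it uses the first hidden state, which corresponds to the [CLS] token. I briefly checked another model (RoBERTa) to see whether this is consistent across models. There, too, classification is based only on the [CLS] token, although it is less obvious (see lines 539-542 here).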
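
To make the point concrete, here is a minimal sketch of the pooler, dropout, and classifier chain shown in the printed model. It is an illustrative stand-in, not the library source: the class names PoolerSketch and HeadSketch are hypothetical, the weights are random, and the input is a random tensor standing in for the encoder's final hidden states. It shows that, given those final hidden states, only position 0 reaches the classifier.

```python
import torch
from torch import nn


class PoolerSketch(nn.Module):
    """Hypothetical stand-in for BertPooler: dense + tanh on the first token only."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states has shape (batch, seq_len, hidden_size).
        # Position 0 is the [CLS] token; all other positions are ignored.
        first_token = hidden_states[:, 0]
        return self.activation(self.dense(first_token))


class HeadSketch(nn.Module):
    """Hypothetical pooler -> dropout -> linear chain, as in the printed model."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 4, dropout: float = 0.1):
        super().__init__()
        self.pooler = PoolerSketch(hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Returns raw logits of shape (batch, num_labels).
        return self.classifier(self.dropout(self.pooler(hidden_states)))


torch.manual_seed(0)
head = HeadSketch().eval()  # eval() disables dropout so the comparison is deterministic

hidden = torch.randn(2, 16, 768)          # fake final hidden states: 2 sequences, 16 tokens
altered = hidden.clone()
altered[:, 1:] = torch.randn(2, 15, 768)  # overwrite every non-[CLS] position

with torch.no_grad():
    logits = head(hidden)
    logits_altered = head(altered)

print(logits.shape)                            # torch.Size([2, 4])
print(torch.allclose(logits, logits_altered))  # True
```

Note that this only concerns the classification head. In the full model the [CLS] hidden state itself still depends on every input token through self-attention in the encoder layers, so the other tokens influence the prediction indirectly.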

python

machine-learning

pytorch

bert-language-model

huggingface-transformers