How does BertForSequenceClassification classify on the CLS vector?

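Background:

When BERT is used to classify a sequence, the model uses the "[CLS]" token to represent the sequence for the classification task. According to the paper: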

The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.

Looking at the Hugging Face repo, their BertForSequenceClassification uses the BertPooler module:

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

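We can see that they take the first token (CLS) and use it as a representation of the whole sentence. Specifically, they do hidden_states[:, 0], which looks a lot like it takes the first element from each state rather than the first token's hidden state?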
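
A quick shape check may help with the indexing doubt. The sketch below assumes the encoder output has shape [batch_size, seq_len, hidden_size], which is how BertModel lays out its last hidden state; the sizes and the dummy tensor are made up purely for illustration.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
batch_size, seq_len, hidden_size = 2, 5, 8

# Stand-in for the encoder output: one hidden vector per token per sequence.
hidden_states = torch.arange(
    batch_size * seq_len * hidden_size, dtype=torch.float32
).reshape(batch_size, seq_len, hidden_size)

# Index 0 on the second (token) dimension keeps the full hidden vector
# of the first token of every sequence in the batch.
first_token_tensor = hidden_states[:, 0]
print(first_token_tensor.shape)

# For contrast: this would take the first feature of every token instead.
first_feature = hidden_states[:, :, 0]
print(first_feature.shape)
```

Under that layout, hidden_states[:, 0] is shorthand for hidden_states[:, 0, :], so it selects the first token's whole hidden state rather than the first element of each state.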

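My question:

What I don't understand is how do they encode the information from the entire sentence into this token? Is the CLS token a regular token which has its own embedding vector that "learns" the sentence-level representation? Why can't we just use the average of the hidden states (the output of the encoder) and use this to classify?

Edit: After thinking about it a little: because we use the CLS token's hidden state to predict, is the CLS token's embedding being trained on the classification task, since this is the token being used to classify (and thus the major contributor to the error that gets propagated to its weights)?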

Is the CLS token a regular token which has its own embedding vector that "learns" the sentence level representation?

Yes:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Look up the vocabulary id of the [CLS] token
clsToken = tokenizer.convert_tokens_to_ids('[CLS]')
print(clsToken)
# or
print(tokenizer.cls_token, tokenizer.cls_token_id)

# [CLS] has its own row in the input embedding matrix, like any other token
print(model.get_input_embeddings()(torch.tensor(clsToken)))

Output:

101
[CLS] 101
tensor([ 1.3630e-02, -2.6490e-02, -2.3503e-02, -7.7876e-03,  8.5892e-03,
        -7.6645e-03, -9.8808e-03,  6.0184e-03,  4.6921e-03, -3.0984e-02,
         1.8883e-02, -6.0093e-03, -1.6652e-02,  1.1684e-02, -3.6245e-02,
         ...
         5.4162e-03, -3.0037e-02,  8.6773e-03, -1.7942e-03,  6.6826e-03,
        -1.1929e-02, -1.4076e-02,  1.6709e-02,  1.6860e-03, -3.3842e-03,
         8.6805e-03,  7.1340e-03,  1.5147e-02], grad_fn=<EmbeddingBackward>)

You can get a list of all the other special tokens of your model with:

print(tokenizer.all_special_tokens)

Output:

['[CLS]', '[UNK]', '[PAD]', '[SEP]', '[MASK]']

What I don't understand is how do they encode the information from the entire sentence into this token?

Because we use the CLS tokens hidden state to predict, is the CLS tokens embedding being trained on the task of classification as this is the token being used to classify (thus being the major contributor to the error which gets propagated to its weights?)

Also yes. As you have already stated in your question, BertForSequenceClassification utilizes the BertPooler to train the linear layer on top of BERT:

#outputs contains the output of BertModel and the second element is the pooler output
pooled_output = outputs[1]

pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)

#...loss calculation based on logits and the given labels
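
To make the gradient flow concrete, here is a small sketch. It is not BERT: it assumes a toy stand-in built from a single nn.TransformerEncoderLayer, and the vocabulary size, ids and layer sizes are all invented for illustration. It only shows that a loss computed from the first token's hidden state sends gradients back to the embedding row of that token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes; cls_id = 0 stands in for the [CLS] id.
vocab_size, hidden_size, num_labels, cls_id = 20, 16, 2, 0

embedding = nn.Embedding(vocab_size, hidden_size)
encoder = nn.TransformerEncoderLayer(
    d_model=hidden_size, nhead=2, dim_feedforward=32,
    dropout=0.0, batch_first=True,
)
# Same structure as BertPooler: linear layer followed by tanh.
pooler = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
classifier = nn.Linear(hidden_size, num_labels)

# Two toy sequences, each starting with the stand-in CLS id.
input_ids = torch.tensor([[cls_id, 5, 7, 9], [cls_id, 3, 4, 6]])
labels = torch.tensor([1, 0])

hidden_states = encoder(embedding(input_ids))
logits = classifier(pooler(hidden_states[:, 0]))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()

grad = embedding.weight.grad
# The CLS row receives a gradient from the classification loss.
print(grad[cls_id].abs().sum().item())
# Other tokens in the batch also receive gradients, through self-attention.
print(grad[5].abs().sum().item())
# A token id that never appeared in the batch gets no gradient.
print(grad[10].abs().sum().item())
```

The second print hints at how sentence information ends up in the CLS position: self-attention mixes every token into the first token's hidden state, so the loss trains the whole encoder and not only the CLS embedding.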

Why can't we just use the average of the hidden states (the output of the encoder) and use this to classify?

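I can't really answer this in general, but why do you think this would be easier or better than a linear layer? You also need to train the hidden layers to produce an output where the average maps to your class. Therefore you also need an "average layer" to be the major contributor to your loss. In general, when you can show that it leads to better results than the current approach, nobody will reject it.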
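
If you want to try it, a minimal sketch of mean pooling could look like this. mean_pool is a hypothetical helper, not part of transformers, and the random tensor stands in for the encoder output; it assumes an attention mask with 1 for real tokens and 0 for padding, so that padding does not dilute the average.

```python
import torch
import torch.nn as nn


def mean_pool(hidden_states, attention_mask):
    """Average the hidden states of the non-padding tokens.

    hidden_states: [batch_size, seq_len, hidden_size]
    attention_mask: [batch_size, seq_len], 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


torch.manual_seed(0)
hidden_states = torch.randn(2, 4, 8)              # stand-in encoder output
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])     # second sequence is padded

pooled = mean_pool(hidden_states, attention_mask)
classifier = nn.Linear(8, 3)                      # hypothetical 3-class head
logits = classifier(pooled)
print(pooled.shape, logits.shape)
```

The classifier on top is still a linear layer that has to be trained, so whether this beats the pooled CLS output is something you would have to measure on your own task.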