Are special tokens [CLS] [SEP] absolutely necessary while fine-tuning BERT?

Question

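I am following the tutorial at https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/ to do named entity recognition with BERT.

During fine-tuning, before feeding the tokens to the model, the author does: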

input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                          maxlen=MAX_LEN, dtype="long", value=0.0,
                          truncating="post", padding="post")

According to my tests, this does not add the special tokens to the IDs. So am I missing something, or is it not always necessary to include [CLS] (101) and [SEP] (102)?

Answer 1

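I am working through this tutorial as well. It worked for me without adding these tokens. However, I found in another tutorial (https://vamvas.ch/bert-for-ner) that it is better to add them, because the model was trained with input in this format.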

[Update] Actually, I just checked, and accuracy improved by 20% after adding the tokens. Note, however, that I am using it on a different dataset.
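
For reference, here is a minimal sketch of how the tokens could be added before the convert_tokens_to_ids step from the question. The helper name add_special_tokens and the "PAD" label given to the two special positions are my own assumptions, not something taken from either tutorial; use whatever label your tag set reserves for positions that should not count as entities.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def add_special_tokens(tokens: List[str], labels: List[str], max_len: int,
                       special_label: str = "PAD") -> Tuple[List[str], List[str]]:
    """Wrap one tokenized sentence in [CLS] ... [SEP], keeping labels aligned.

    Hypothetical helper: `special_label` is an assumed placeholder tag for
    the two special positions.
    """
    # Reserve two positions for the special tokens before truncating,
    # so that [SEP] is not cut off by a later "post" truncation.
    tokens = tokens[: max_len - 2]
    labels = labels[: max_len - 2]
    return [CLS] + tokens + [SEP], [special_label] + labels + [special_label]


tokens, labels = add_special_tokens(
    ["john", "lives", "in", "paris"], ["B-per", "O", "O", "B-geo"], max_len=8
)
print(tokens)  # ['[CLS]', 'john', 'lives', 'in', 'paris', '[SEP]']
print(labels)  # ['PAD', 'B-per', 'O', 'O', 'B-geo', 'PAD']
```

The wrapped token lists can then go through tokenizer.convert_tokens_to_ids and pad_sequences exactly as in the question, which should give IDs that start with 101 and end with 102 before the padding. The label list has to be extended the same way, otherwise tokens and tags are shifted by one position.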

named-entity-recognition

cls

bert-language-model