What is so special about special tokens?

Question

What exactly is the difference between a "token" and a "special token"?

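I understand the following:

What a typical token is
What typical special tokens are: MASK, UNK, SEP, etc.
When you would add a token (when you want to expand your vocabulary)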

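What I don't understand is in what situations you would want to create a new special token. Are there any examples of what we need one for, and of when we would want a special token other than the default ones? And if an example uses a special token, why can't a normal token achieve the same objective?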

tokenizer.add_tokens(['[EOT]'], special_tokens=True)

I also don't quite understand the following description in the source documentation. What difference does it make to our model if we set add_special_tokens to False?

add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.

Answer 1

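Special tokens are called special because they are not derived from your input. They are added for a specific purpose and are independent of the particular input.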

What I don't understand is, under what kind of capacity will you want to create a new special token, any examples what we need it for and when we want to create a special token other than those default special tokens?

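For example, in extractive conversational question answering it is not unusual to add the question and answer of the previous dialog turn to your input, to give your model some context. Those previous dialog turns are separated from the current question with special tokens. Sometimes people reuse the model's separator token, and sometimes they introduce a new special token. The following is an example with a new special token [Q]: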

#first dialog turn - no conversation history
[CLS] current question [SEP] text [EOS]
#second dialog turn - with previous question to have some context
[CLS] previous question [Q] current question [SEP] text [EOS]
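The layout above can be assembled with a small helper. This is only a sketch under stated assumptions: build_input is a hypothetical function, not part of any library, and it works on plain strings rather than token ids. In practice [Q] would also have to be registered with the tokenizer, in the same way as the add_tokens call with special_tokens=True shown in the question.

```python
def build_input(current_question, text, previous_questions=()):
    """Hypothetical helper: join earlier questions and the current one with [Q]."""
    # Earlier dialog turns come first, each separated by the new special token [Q].
    questions = " [Q] ".join([*previous_questions, current_question])
    # [CLS], [SEP] and [EOS] frame the sequence exactly as in the layout above.
    return f"[CLS] {questions} [SEP] {text} [EOS]"


# First dialog turn: no conversation history.
print(build_input("current question", "text"))
# Second dialog turn: the previous question is prepended as context.
print(build_input("current question", "text", ["previous question"]))
```

The marker [Q] never appears in user text by accident, which is what lets the model learn it as a boundary between turns rather than as part of a question.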

And I also don't quite understand the following description in the source documentation. What difference does it make to our model if we set add_special_tokens to False?

from transformers import RobertaTokenizer
t = RobertaTokenizer.from_pretrained("roberta-base")

t("this is an example")
#{'input_ids': [0, 9226, 16, 41, 1246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}

t("this is an example", add_special_tokens=False)
#{'input_ids': [9226, 16, 41, 1246], 'attention_mask': [1, 1, 1, 1]}

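As you can see here, the second input is missing two tokens (the special tokens). Those special tokens carry meaning for your model, since it was trained with them. Because these two tokens are missing, the last_hidden_state will be different, which can lead to different results for your downstream task.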
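To make the difference between the two calls concrete, here is a toy re-creation of the wrapping step for a single sequence. It does not call the transformers library. wrap_single_sequence is a hypothetical name, and the ids 0 and 2 are the ones visible in the output above (in roberta-base they correspond to the start token <s> and the end token </s>).

```python
BOS_ID = 0  # id of <s>, taken from the tokenizer output above
EOS_ID = 2  # id of </s>, taken from the tokenizer output above


def wrap_single_sequence(token_ids, add_special_tokens=True):
    """Toy stand-in for the tokenizer's final step on one sequence."""
    if add_special_tokens:
        # The model-specific start and end tokens are added around the input.
        input_ids = [BOS_ID, *token_ids, EOS_ID]
    else:
        # Only the ids derived from the text itself are kept.
        input_ids = list(token_ids)
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


text_ids = [9226, 16, 41, 1246]  # "this is an example", as in the output above
print(wrap_single_sequence(text_ids))
print(wrap_single_sequence(text_ids, add_special_tokens=False))
```

The ids that come from the text are identical in both cases. Only the two framing tokens differ, and those are the tokens the model saw around every sequence during pre-training.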

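Some tasks, such as sequence classification, often use the [CLS] token to make their predictions. When you remove it, a model that was pre-trained with a [CLS] token will struggle.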

nlp

tokenize

bert-language-model

huggingface-transformers

huggingface-tokenizers