What is the difference between tokenizer.encode and tokenizer.encode_plus in Hugging Face?

Question

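Below is an example of using a model for sequence classification to determine whether two sequences are paraphrases of each other. The two examples give two different results. Can you help me understand why tokenizer.encode and tokenizer.encode_plus give different results?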

Example 1 (.encode_plus()):

paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase)[0]
not_paraphrase_classification_logits = model(**not_paraphrase)[0]

Example 2 (.encode()):

paraphrase = tokenizer.encode(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer.encode(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]

Answer 1

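The main difference comes from the additional information that encode_plus provides. If you read the documentation for the two functions, the descriptions differ slightly. For encode():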

Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary. Same as doing self.convert_tokens_to_ids(self.tokenize(text)).

and the description of encode_plus():

Returns a dictionary containing the encoded sequence or sequence pair and additional information: the mask for sequence classification and the overflowing elements if a max_length is specified.

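Depending on the model you specify and the input sentences, the difference lies in the additionally encoded information, specifically the input mask (for BERT, the token_type_ids entry of the returned dictionary). Because you feed in two sentences at once, BERT (and probably other model variants) expects some form of masking that lets the model tell the two sequences apart, see here. Since encode_plus provides this information and encode does not, you get different output results.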
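To make the point above concrete, here is a minimal pure-Python sketch of the two return shapes. It is not the Hugging Face implementation: toy_ids, toy_encode, toy_encode_plus, the whitespace tokenization, and the tiny vocabulary are all hypothetical stand-ins. Only the BERT-style pair layout ([CLS] A [SEP] B [SEP]) and the key names input_ids, token_type_ids, and attention_mask are taken from what a BERT tokenizer returns.

```python
# Toy illustration only: these helpers are hypothetical and do NOT reproduce
# the real tokenizer. They mimic the *shape* of what encode and encode_plus
# return for a sentence pair with a BERT-style layout.

CLS, SEP = 0, 1  # placeholder ids for [CLS] and [SEP] in the toy vocabulary


def toy_ids(text, vocab):
    """Map whitespace-separated words to ids, growing the vocab as needed."""
    return [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]


def toy_encode(text_a, text_b, vocab):
    """Like encode: return only the flat list of ids for the pair."""
    return [CLS] + toy_ids(text_a, vocab) + [SEP] + toy_ids(text_b, vocab) + [SEP]


def toy_encode_plus(text_a, text_b, vocab):
    """Like encode_plus: return the ids plus the extra per-token information."""
    ids_a = toy_ids(text_a, vocab)
    ids_b = toy_ids(text_b, vocab)
    input_ids = [CLS] + ids_a + [SEP] + ids_b + [SEP]
    return {
        "input_ids": input_ids,
        # 0 for [CLS] + first sequence + its [SEP], 1 for the second sequence + final [SEP]
        "token_type_ids": [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1),
        # no padding in this toy example, so every position is attended to
        "attention_mask": [1] * len(input_ids),
    }


vocab = {"[CLS]": CLS, "[SEP]": SEP}
ids = toy_encode("the cat sat", "a cat rested", vocab)
enc = toy_encode_plus("the cat sat", "a cat rested", vocab)

print(ids)                    # flat list: no hint of where sequence 2 starts
print(enc["token_type_ids"])  # marks which sequence each token belongs to
```

In the question's Example 1, model(**paraphrase) forwards all of these entries to the model. In Example 2, model(paraphrase) passes only the ids, so the model has to fall back on its defaults for the segment information (BERT-style models in transformers typically treat every token as belonging to the first sequence), which is consistent with the different logits.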

huggingface-transformers