Problem with batch_encode_plus method of tokenizer
I am running into a strange issue with the tokenizer's batch_encode_plus
method. I recently switched from transformers version 3.3.0 to 4.5.1 (I am building my data batches for NER).
I have 2 sentences that need to be encoded, and I have a case where the sentences are already tokenized. Since the two sentences have different lengths, I need to pad the shorter one
with [PAD] so that my batch has a uniform length.
Below is the code I used with transformers version 3.3.0:
from transformers import AutoTokenizer
pretrained_model_name = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, add_prefix_space=True)
sentences = ["He is an uninvited guest.", "The host of the party didn't sent him the invite."]
# here we have the complete sentences
encodings = tokenizer.batch_encode_plus(sentences, max_length=20, padding=True)
batch_token_ids, attention_masks = encodings["input_ids"], encodings["attention_mask"]
print(batch_token_ids[0])
print(tokenizer.convert_ids_to_tokens(batch_token_ids[0]))
# And the output
# [101, 1124, 1110, 1126, 8362, 1394, 5086, 1906, 3648, 119, 102, 0, 0, 0, 0]
# ['[CLS]', 'He', 'is', 'an', 'un', '##in', '##vi', '##ted', 'guest', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
# here we have the already tokenized sentences
encodings = tokenizer.batch_encode_plus(batch_token_ids, max_length=20, padding=True, truncation=True, is_split_into_words=True, add_special_tokens=False, return_tensors="pt")
batch_token_ids, attention_masks = encodings["input_ids"], encodings["attention_mask"]
print(batch_token_ids[0])
print(tokenizer.convert_ids_to_tokens(batch_token_ids[0]))
# And the output
# tensor([ 101, 1124, 1110, 1126, 8362, 1394, 5086, 1906, 3648, 119, 102, 0, 0, 0, 0])
# ['[CLS]', 'He', 'is', 'an', 'un', '##in', '##vi', '##ted', 'guest', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
But if I try to reproduce the same behaviour with transformers version 4.5.1, I get a different output:
from transformers import AutoTokenizer
pretrained_model_name = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, add_prefix_space=True)
sentences = ["He is an uninvited guest.", "The host of the party didn't sent him the invite."]
# here we have the complete sentences
encodings = tokenizer.batch_encode_plus(sentences, max_length=20, padding=True)
batch_token_ids, attention_masks = encodings["input_ids"], encodings["attention_mask"]
print(batch_token_ids[0])
print(tokenizer.convert_ids_to_tokens(batch_token_ids[0]))
# And the output
#[101, 1124, 1110, 1126, 8362, 1394, 5086, 1906, 3648, 119, 102, 0, 0, 0, 0]
#['[CLS]', 'He', 'is', 'an', 'un', '##in', '##vi', '##ted', 'guest', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
# here we have the already tokenized sentences. Note that we cannot pass the batch_token_ids
# to the batch_encode_plus method in the newer version, so we need to convert them to tokens first
tokens1 = tokenizer.tokenize(sentences[0], add_special_tokens=True)
tokens2 = tokenizer.tokenize(sentences[1], add_special_tokens=True)
encodings = tokenizer.batch_encode_plus([tokens1, tokens2], max_length=20, padding=True, truncation=True, is_split_into_words=True, add_special_tokens=False, return_tensors="pt")
batch_token_ids, attention_masks = encodings["input_ids"], encodings["attention_mask"]
print(batch_token_ids[0])
print(tokenizer.convert_ids_to_tokens(batch_token_ids[0]))
# And the output (not the desired one)
# tensor([ 101, 1124, 1110, 1126, 8362, 108, 108, 1107, 108, 108,
#          191, 1182, 108, 108, 21359, 1181, 3648, 119, 102])
# ['[CLS]', 'He', 'is', 'an', 'un', '#', '#', 'in', '#', '#', 'v', '##i', '#', '#', 'te', '##d', 'guest', '.', '[SEP]']
I am not sure how to handle this, or what I am doing wrong here.
I am writing this here because I cannot comment on the question itself. I would suggest looking at the output of each tokenization (tokens1 and tokens2) and comparing it with batch_token_ids. Strangely, the output does not contain the tokens of the second sentence. Maybe something is wrong there.
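For reference, a minimal way to run that comparison, reusing the variables from the 4.5.1 snippet in the question (purely a debugging sketch, not a fix):
# Debugging sketch: compare the pre-tokenized inputs with what ends up in the batch.
print(tokens1)  # subword tokens of the first sentence
print(tokens2)  # subword tokens of the second sentence
print(tokenizer.convert_ids_to_tokens(batch_token_ids[0]))  # first row of the encoded batch
print(tokenizer.convert_ids_to_tokens(batch_token_ids[1]))  # second row of the encoded batch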
You need the non-fast tokenizer in order to pass lists of integer token ids.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, add_prefix_space=True, use_fast=False)
The use_fast flag is enabled by default in later versions.
From the HuggingFace documentation:
batch_encode_plus(batch_text_or_text_pairs: ...)
batch_text_or_text_pairs (List[str], List[Tuple[str, str]],
List[List[str]], List[Tuple[List[str], List[str]]], and for not-fast
tokenizers, also List[List[int]], List[Tuple[List[int], List[int]]])
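Putting this together, here is a minimal sketch of the fix under transformers 4.5.1. It reuses the code from the question; the only intended change is use_fast=False (add_prefix_space is dropped since, as far as I can tell, it is a GPT-2/RoBERTa-style option with no effect on a WordPiece tokenizer like DistilBERT's):
from transformers import AutoTokenizer

pretrained_model_name = 'distilbert-base-cased'
# use_fast=False loads the slow (Python) tokenizer, which still accepts List[List[int]]
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name, use_fast=False)
print(tokenizer.is_fast)  # False

sentences = ["He is an uninvited guest.", "The host of the party didn't sent him the invite."]

# First pass: encode the raw sentences, padding to the longest sequence in the batch
encodings = tokenizer.batch_encode_plus(sentences, max_length=20, padding=True)
batch_token_ids = encodings["input_ids"]

# Second pass: feed the already encoded ids back in. The slow tokenizer keeps integer
# ids as-is instead of re-tokenizing them as text, so the '##' pieces are not split again.
encodings = tokenizer.batch_encode_plus(batch_token_ids, max_length=20, padding=True,
                                        truncation=True, is_split_into_words=True,
                                        add_special_tokens=False, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encodings["input_ids"][0]))
# expected (same as the 3.3.0 output shown in the question):
# ['[CLS]', 'He', 'is', 'an', 'un', '##in', '##vi', '##ted', 'guest', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']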