Hugging Face - Efficient tokenization of unknown tokens in GPT2
I am trying to train a dialogue system using GPT2. For tokenization, I use the following configuration to add the special tokens.
from transformers import (
    AdamW,
    AutoConfig,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)

SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]", "[TRIPLE]", "[SEP]", "[Q]", "[DOM]"],
}
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
tokenizer.add_special_tokens(SPECIAL_TOKENS)
Next, when I tokenize a sequence (a dialogue utterance) and later convert the tokens to IDs, some of the most important tokens in my sequence are mapped to the unknown token. Their IDs end up the same as those of bos and eos, because all of them map to <|endoftext|>, as in GPT2's source code.
Here is a working example -
tokenized_sequence = ['[PRED]', 'name', '[SUB]', 'frankie_and_bennys', '[PRED]', 'address', '[SUB]', 'cambridge_leisure_park_clifton_way_cherry_hinton', '[PRED]', 'area', '[SUB]', 'south', '[PRED]', 'food', '[SUB]', 'italian', '[PRED]', 'phone', '[SUB]', '01223_412430', '[PRED]', 'pricerange', '[SUB]', 'expensive', '[PRED]', 'postcode', '[SUB]', 'cb17dy']
important_tokens = ['frankie_and_bennys','cambridge_leisure_park_clifton_way_cherry_hinton','italian','postcode', 'cb17dy']
tokens_to_ids = [50262, 3672, 50261, 50256, 50262, 21975, 50261, 50256, 50262, 20337, 50261, 35782, 50262, 19425, 50261, 50256, 50262, 4862, 50261, 50256, 50262, 50256, 50261, 22031, 50262, 50256, 50261, 50256]
ids_to_tokens = [PRED]name[SUB]<|endoftext|>[PRED]address[SUB]<|endoftext|>[PRED]area[SUB]south[PRED]food[SUB]<|endoftext|>[PRED]phone[SUB]<|endoftext|>[PRED]<|endoftext|>[SUB]expensive[PRED]<|endoftext|>[SUB]<|endoftext|>
As you can see, the important_tokens are all mapped to ID 50256 (i.e. <|endoftext|>), so the model can never see or learn these important tokens and therefore generates very poor, often hallucinated responses.
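To confirm this, I check which of my important tokens fall back to the tokenizer's unk_token id (a small check, reusing the tokenizer and important_tokens defined above):

# GPT2 has no separate [UNK] token; its unk_token is <|endoftext|>, so any token
# that is missing from the vocabulary collapses onto id 50256.
unk_id = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
missing = [t for t in important_tokens
           if tokenizer.convert_tokens_to_ids(t) == unk_id]
print(missing)  # the tokens the model cannot tell apart from <|endoftext|>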
Is there a quick and efficient way to fix this?
For important_tokens which contain several actual words (like frankie_and_bennys), you can replace the underscores with spaces and feed them in normally, or add them as a single special token. I prefer the first option, because this way you can use the pretrained embeddings of their subtokens. For those which are not actual words (like cb17dy), you have to add them as special tokens.
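For the first option, the preprocessing can be as simple as the sketch below (normalize_value is a hypothetical helper name, not something from the question):

def normalize_value(value: str) -> str:
    # "frankie_and_bennys" -> "frankie and bennys", so GPT2's BPE splits it into
    # ordinary subwords that already have pretrained embeddings.
    return value.replace("_", " ")

print(normalize_value("frankie_and_bennys"))  # frankie and bennys

The second option is shown in the example below, with the space-separated form included for comparison: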
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
your_string = '[PRED] name [SUB] frankie and bennys frankie_and_bennys [PRED] cb17dy'
SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]", "[TRIPLE]", "[SEP]", "[Q]", "[DOM]", "frankie_and_bennys", "cb17dy"],
}
tokenizer.add_special_tokens(SPECIAL_TOKENS)
print(tokenizer(your_string)['input_ids'])
print(tokenizer.convert_ids_to_tokens(tokenizer(your_string)['input_ids']))
Output
[50262, 1438, 220, 50261, 14346, 494, 290, 275, 1697, 893, 220, 50268, 220, 50262, 220, 220, 50269]
['[PRED]', 'Ġname', 'Ġ', '[SUB]', 'Ġfrank', 'ie', 'Ġand', 'Ġb', 'enn', 'ys', 'Ġ', 'frankie_and_bennys', 'Ġ', '[PRED]', 'Ġ', 'Ġ', 'cb17dy']
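One more thing worth noting: whichever tokens you add, the model's embedding matrix also has to be resized to the new tokenizer length before training, otherwise the new ids point outside the embedding table. A minimal sketch, assuming a plain GPT2 LM head model:

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
# Grow the embedding (and output) matrix to cover the newly added special tokens.
model.resize_token_embeddings(len(tokenizer))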