HuggingFace Tokenizer: how to get a token for unicode strings?
The following code does not produce a token for the unicode string '\uf0b7':
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)
test_words = ['crazy', 'character', '\uf0b7']
input_ids = tokenizer(test_words, is_split_into_words=True)
print(f'token ids: {input_ids["input_ids"]}')
# token ids: [101, 4689, 2839, 102]  # <- where is the token for the third word?
print(f'word ids: {input_ids.word_ids()}')
# word ids: [None, 0, 1, None]  # <- where is the third word (index 2)?
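Calling the tokenizer on the character alone shows the same behaviour; a quick check (reusing the bert-base-uncased tokenizer from above, with the expected output inferred from the ids printed above):

# The private-use character appears to be dropped during BERT's text cleanup,
# so it yields no sub-tokens at all (not even [UNK]).
print(tokenizer.tokenize('\uf0b7'))
# expected: []
print(tokenizer('\uf0b7', add_special_tokens=False)['input_ids'])
# expected: []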
Is there a way to tell the tokenizer to assign a token to that unicode word (for example the unknown [UNK] token, or anything else)?
I tried adding a normalizer, but the output is the same:
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
normalizer = normalizers.Sequence([NFD(), StripAccents()])
tokenizer.normalizer = normalizer
input_ids = tokenizer(test_words, is_split_into_words=True)
print(f'token ids: {input_ids["input_ids"]}')
# token ids: [101, 4689, 2839, 102]
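Note that for a fast tokenizer loaded via AutoTokenizer, the normalization pipeline lives on the Rust backend, so assigning to tokenizer.normalizer on the transformers wrapper likely has no effect. A minimal sketch of attaching the normalizer to the backend instead (an assumption about where it would need to go; it does not by itself make '\uf0b7' tokenizable):

# Set the normalizer on the underlying tokenizers.Tokenizer object
# (tokenizer.backend_tokenizer) rather than on the transformers wrapper.
tokenizer.backend_tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])
print(tokenizer(test_words, is_split_into_words=True)['input_ids'])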
You could add the Unicode character you want as a special token:
special_tokens_dict = {'additional_special_tokens': ['\uf0b7']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
test_words = ['crazy', 'character', '\uf0b7']
tokenizer(test_words, is_split_into_words=True)
Output:
{'input_ids': [101, 4689, 2839, 30522, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
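The new id 30522 belongs to the token that was just added (bert-base-uncased's original vocabulary covers ids 0-30521). If the tokenizer is used together with a model, the embedding matrix also needs to grow to cover the new id; a minimal sketch, assuming a standard transformers model:

from transformers import AutoModel

model = AutoModel.from_pretrained('bert-base-uncased')
# Resize the embedding matrix so the newly added id (30522) is valid input.
model.resize_token_embeddings(len(tokenizer))

If the character should simply be tokenizable rather than treated as a special token (special tokens are never split and can be skipped when decoding), tokenizer.add_tokens(['\uf0b7']) is the regular-token variant of the same idea.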