HuggingFace Tokenizer: how to get a token for unicode strings?

The following code does not produce a token for the unicode string '\uf0b7':

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)
test_words = ['crazy', 'character', '\uf0b7']
input_ids = tokenizer(test_words, is_split_into_words=True)
print(f'token ids: {input_ids["input_ids"]}')
# token ids: [101, 4689, 2839, 102]  # <- where is the token for the third word?
print(f'word ids:  {input_ids.word_ids()}')
# word ids:  [None, 0, 1, None]   # <- where is the third word (index 2)?

Is there a way to tell the tokenizer to assign a token to the unicode word (for example the unknown [UNK] token, or anything else)?

I tried adding a normalizer, but the output is the same:

from transformers import AutoTokenizer
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
normalizer = normalizers.Sequence([NFD(), StripAccents()])
tokenizer.normalizer = normalizer
input_ids = tokenizer(test_words, is_split_into_words=True)
print(f'token ids: {input_ids["input_ids"]}')
# token ids: [101, 4689, 2839, 102]
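
A likely reason the assignment has no visible effect (my reading, not stated above): AutoTokenizer returns a fast, Rust-backed tokenizer here, and setting tokenizer.normalizer only attaches a plain Python attribute to the wrapper; the normalizer the backend actually runs is exposed as tokenizer.backend_tokenizer.normalizer. A minimal sketch of inspecting and replacing it, with no guarantee that this alone makes '\uf0b7' survive down to WordPiece:

    from transformers import AutoTokenizer
    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

    # inspect what the backend normalizer does to the character
    print(repr(tokenizer.backend_tokenizer.normalizer.normalize_str('\uf0b7')))

    # replace the backend normalizer (not just an attribute on the Python wrapper)
    tokenizer.backend_tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])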

You can add the unicode character you want as a special token:

    special_tokens_dict = {'additional_special_tokens': ['\uf0b7']}
    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
    test_words = ['crazy', 'character', '\uf0b7']
    tokenizer(test_words, is_split_into_words=True)

Output:

{'input_ids': [101, 4689, 2839, 30522, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
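
One follow-up worth noting: the output shows the new token got id 30522, which is one past bert-base-uncased's original vocabulary, so if these ids are fed to a model, its embedding matrix has to be resized. A minimal sketch, assuming a standard transformers model (AutoModel here is just for illustration):

    from transformers import AutoModel

    model = AutoModel.from_pretrained('bert-base-uncased')
    # len(tokenizer) counts the added special token, so this grows the
    # embedding matrix from 30522 to 30523 rows
    model.resize_token_embeddings(len(tokenizer))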