Using AutoTokenizer for a question answering task

I trained this tokenizer myself.

I have a question answering task that uses T5, and I need to tokenize the question and the context the way T5Tokenizer does, i.e. as question_ids</s>context_ids</s><pad>. I did the following:

from transformers import AutoConfig, AutoTokenizer

# Load the trained tokenizer together with the model config
tokenizer = AutoTokenizer.from_pretrained(
    path_to_tokenizer_json_file,
    config=AutoConfig.from_pretrained(path_to_model_config),
)

query_tok = 'Do you go to school?'
doc_tok = 'What could be a second sequence to the uninteresting text, What could \
           be a second sequence to the uninteresting text'

# Keep at most the first 5 query words and the first 9 context words
query_tok = " ".join(query_tok.split()[:min(5, len(query_tok.split()))])
doc_tok = " ".join(doc_tok.split()[:min(9, len(doc_tok.split()))])
print("query_tok: ", query_tok)
print("doc_tok: ", doc_tok)

ids = tokenizer(
    query_tok + '</s>',
    doc_tok + '</s>',
    max_length=25,
    padding='max_length',
    truncation="only_second",  # need to experiment
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt",
)
print("ids: ", ids)
print("ids shape: ", ids['input_ids'].shape)

print("Len ids: ", len(ids['input_ids'].flatten()), len(ids['input_ids'].flatten()) - 9)

print(tokenizer.decode([id for id in ids['input_ids'].flatten()]))

The result is

> query_tok:  Do you go to school?
> doc_tok:  What could be a second sequence to the uninteresting
> ids:  {'input_ids': tensor([[   3,   11,   36,  316,  475,   16,  156,  187,    1,  219,  312,   63,
>             3,    8,  114, 1906,   16,    4,  352, 6347,  483,   38,    0,    0,
>             0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
>          0]])}
> ids shape:  torch.Size([1, 25])
> Len ids:  25 16
> do you go to school?</s> what could be a second sequence to the uninteresting<pad><pad><pad>

How can I get the tokenizer to add the </s> to the end of the context before the padding (or after truncation)?
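For reference, one route may be to let the tokenizer do this itself: fast (tokenizers-backed) tokenizers apply a post-processor after truncation and before padding, so attaching a TemplateProcessing that appends </s> to each segment should give exactly the question_ids</s>context_ids</s><pad> layout. A minimal sketch, assuming the tokenizer loaded above is a fast one with eos_token set (the ids below follow the stock T5 vocab, where </s> has id 1):

from tokenizers.processors import TemplateProcessing

eos = tokenizer.eos_token        # "</s>"
eos_id = tokenizer.eos_token_id  # 1 in the stock T5 vocab

# Append </s> after each segment; ":1" gives the second segment's
# tokens token_type_id 1. This runs after truncation, before padding.
tokenizer.backend_tokenizer.post_processor = TemplateProcessing(
    single=f"$A {eos}",
    pair=f"$A {eos} $B:1 {eos}:1",
    special_tokens=[(eos, eos_id)],
)

# No manual '</s>' concatenation needed any more:
ids = tokenizer(query_tok, doc_tok, max_length=25, padding='max_length',
                truncation="only_second", return_tensors="pt")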

OK, for those who want to use a pretrained tokenizer for a question answering task, it can be done as in the following example:

# T5's defaults: <pad> has id 0 and </s> has id 1
# (tokenizer.pad_token_id / tokenizer.eos_token_id in general).
if (ids['input_ids'] == 0).nonzero(as_tuple=True)[1].nelement() == 0:
    # No padding at all: the context was truncated, so overwrite the last token
    ids['input_ids'][0][-1] = 1
else:
    # Replace the first <pad> with </s> and mark it as a real token
    idx = (ids['input_ids'] == 0).nonzero(as_tuple=True)[1][0]
    ids['input_ids'][0][idx] = 1
    ids['attention_mask'][0][idx] = 1
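As a quick sanity check (a sketch against the ids produced above; the expected line assumes the same output as before), decoding again should now show the second </s> right before the padding:

print(tokenizer.decode(ids['input_ids'].flatten()))
# expected: do you go to school?</s> what could be a second sequence to the uninteresting</s><pad><pad>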