Using AutoTokenizer for a question answering task
I trained this tokenizer myself.
I have a question answering task that uses T5, and I need to tokenize the question and the context the way T5Tokenizer does, i.e. question_ids</s>context_ids</s><pad>.
I did the following:
from transformers import AutoTokenizer, AutoConfig

# Load the custom-trained tokenizer together with the model config
tokenizer = AutoTokenizer.from_pretrained(path_to_tokenizer_json_file, config=AutoConfig.from_pretrained(path_to_model_config))

query_tok = 'Do you go to school?'
doc_tok = 'What could be a second sequence to the uninteresting text, What could \
be a second sequence to the uninteresting text'

# Keep at most 5 words of the query and 9 words of the context
query_tok = " ".join(query_tok.split()[:min(5, len(query_tok.split()))])
doc_tok = " ".join(doc_tok.split()[:min(9, len(doc_tok.split()))])
print("query_tok: ", query_tok)
print("doc_tok: ", doc_tok)

# Append </s> by hand and tokenize the (question, context) pair
ids = tokenizer(
    query_tok + '</s>',
    doc_tok + '</s>',
    max_length=15,
    padding='max_length',
    truncation="only_second",  # need to experiment
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
)
print("ids: ", ids)
print("ids shape: ", ids['input_ids'].shape)
print("Len ids: ", len(ids['input_ids'].flatten()), len(ids['input_ids'].flatten()) - 9)
print(tokenizer.decode([id for id in ids['input_ids'].flatten()]))
The result is
> query_tok: Do you go to school?
> doc_tok: What could be a second sequence to the uninteresting
> ids: {'input_ids': tensor([[ 3, 11, 36, 316, 475, 16, 156, 187, 1, 219, 312, 63,
> 3, 8, 114, 1906, 16, 4, 352, 6347, 483, 38, 0, 0,
> 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
> 0]])}
> ids shape: torch.Size([1, 25])
> Len ids: 25 16
> do you go to school?</s> what could be a second sequence to the uninteresting<pad><pad><pad>
>
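For reference, this is the layout I am after: the stock T5 tokenizer appends </s> to both segments of a pair automatically. A minimal comparison sketch, assuming the standard t5-small checkpoint is available (it is not my custom tokenizer):

from transformers import T5TokenizerFast

# Assumption: standard "t5-small" vocabulary, just to illustrate the pair layout
t5_tok = T5TokenizerFast.from_pretrained("t5-small")
enc = t5_tok("Do you go to school?", "What could be a second sequence to the uninteresting")
print(t5_tok.decode(enc["input_ids"]))
# roughly: Do you go to school?</s> What could be a second sequence to the uninteresting</s>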
How can I get the tokenizer to add </s> to the end of the context before padding, or after truncation?
OK, for those who want to use a pretrained tokenizer for a question answering task, it can be done as in the following example:
# In this tokenizer, 0 is the <pad> id and 1 is the </s> (eos) id (see the output above).
# If there is no padding (the context was truncated), overwrite the last token with </s>;
# otherwise overwrite the first <pad> with </s> and mark that position in the attention mask.
if (ids['input_ids'] == 0).nonzero(as_tuple=True)[1].nelement() == 0:
    ids['input_ids'][0][-1] = 1
else:
    idx = (ids['input_ids'] == 0).nonzero(as_tuple=True)[1][0]
    ids['input_ids'][0][idx] = 1
    ids['attention_mask'][0][idx] = 1
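Alternatively, if the tokenizer was saved as a fast tokenizer (a tokenizer.json usable by the tokenizers library), the </s> handling can be pushed into the tokenizer itself with a TemplateProcessing post-processor, which is applied after truncation and before padding. This is only a sketch under that assumption; it replaces any post-processor the tokenizer already has, looks up the </s> id instead of hard-coding it, and makes the manual '</s>' concatenation unnecessary:

from tokenizers.processors import TemplateProcessing

# Assumption: `tokenizer` is a PreTrainedTokenizerFast, so the underlying
# tokenizers.Tokenizer is exposed via `backend_tokenizer`.
eos_id = tokenizer.convert_tokens_to_ids("</s>")
tokenizer.backend_tokenizer.post_processor = TemplateProcessing(
    single="$A </s>",            # question</s>
    pair="$A </s> $B </s>",      # question</s> context</s>
    special_tokens=[("</s>", eos_id)],
)

# The pair can now be tokenized without appending '</s>' by hand;
# </s> is inserted after truncation and before the <pad> tokens.
ids = tokenizer(
    query_tok,
    doc_tok,
    max_length=15,
    padding='max_length',
    truncation="only_second",
    return_attention_mask=True,
    return_tensors="pt",
)

With this in place the index-patching workaround above is not needed, because the post-processor guarantees the context segment ends with </s> before padding is added.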