How to get a probability distribution over tokens in a huggingface model?
I am following this tutorial on how to make predictions for masked words. The reason I am using it is that it appears to handle multiple masked words at once, whereas the other approaches I tried could only handle one masked token at a time.
Code:
from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

sentence = "Tom has fully ___ ___ ___ illness."

def get_prediction(sent):
    token_ids = tokenizer.encode(sent, return_tensors='pt')
    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_position]

    with torch.no_grad():
        output = model(token_ids)

    last_hidden_state = output[0].squeeze()

    list_of_list = []
    for index, mask_index in enumerate(masked_pos):
        mask_hidden_state = last_hidden_state[mask_index]
        idx = torch.topk(mask_hidden_state, k=5, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip() for i in idx]
        list_of_list.append(words)
        print("Mask ", index + 1, "Guesses : ", words)

    best_guess = ""
    for j in list_of_list:
        best_guess = best_guess + " " + j[0]

    return best_guess

print("Original Sentence: ", sentence)
sentence = sentence.replace("___", "<mask>")
print("Original Sentence replaced with mask: ", sentence)
print("\n")

predicted_blanks = get_prediction(sentence)
print("\nBest guess for fill in the blank :::", predicted_blanks)
How can I get the probability distribution over the 5 tokens instead of just their indices? That is, something like this approach (which I used before, but which raised an error once I switched to multiple masked tokens), which returns scores as output:
from transformers import pipeline
# Initialize MLM pipeline
mlm = pipeline('fill-mask')
# Get mask token
mask = mlm.tokenizer.mask_token
# Get result for particular masked phrase
phrase = f'Read the rest of this {mask} to understand things in more detail'
result = mlm(phrase)
# Print result
print(result)
[{
'sequence': 'Read the rest of this article to understand things in more detail',
'score': 0.35419148206710815,
'token': 1566,
'token_str': ' article'
},...
The variable last_hidden_state[mask_index] holds the logits for the prediction of the masked token. So to get token probabilities you can apply a softmax over it, i.e.
probs = torch.nn.functional.softmax(last_hidden_state[mask_index], dim=-1)
You can then get the probabilities of the top-k tokens with
word_probs = [probs[i] for i in idx]
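Equivalently, a slightly more compact sketch is to apply torch.topk directly to the softmaxed logits, which returns the top-5 probabilities and their token ids in one call (variable names here reuse mask_hidden_state and tokenizer from the question's loop):

probs = torch.nn.functional.softmax(mask_hidden_state, dim=-1)
top_probs, top_ids = torch.topk(probs, k=5)
words = [tokenizer.decode(i.item()).strip() for i in top_ids]
print(list(zip(words, top_probs.tolist())))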
PS I assume you are aware that you should use <mask> rather than ___, i.e. sent = "Tom has fully <mask> <mask> <mask> illness.". With that, I get the following:
Mask 1 Guesses : ['recovered', 'returned', 'cleared', 'recover', 'healed']
[tensor(0.9970), tensor(0.0007), tensor(0.0003), tensor(0.0003), tensor(0.0002)]
Mask 2 Guesses : ['from', 'his', 'with', 'to', 'the']
[tensor(0.5066), tensor(0.2048), tensor(0.0684), tensor(0.0513), tensor(0.0399)]
Mask 3 Guesses : ['his', 'the', 'mental', 'serious', 'this']
[tensor(0.5152), tensor(0.2371), tensor(0.0407), tensor(0.0257), tensor(0.0199)]
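For completeness, here is a minimal end-to-end sketch that folds the softmax into the loop so each guess is printed together with its probability. It assumes the same roberta-base model as in the question; the function name get_prediction_with_probs is mine, not part of the original code:

from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

def get_prediction_with_probs(sent, k=5):
    token_ids = tokenizer.encode(sent, return_tensors='pt')
    # positions of all <mask> tokens in the input
    masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero().squeeze(-1).tolist()

    with torch.no_grad():
        logits = model(token_ids)[0].squeeze()  # shape: (seq_len, vocab_size)

    results = []
    for i, pos in enumerate(masked_pos):
        probs = torch.nn.functional.softmax(logits[pos], dim=-1)
        top_probs, top_ids = torch.topk(probs, k=k)
        guesses = [(tokenizer.decode(t.item()).strip(), round(p.item(), 4))
                   for t, p in zip(top_ids, top_probs)]
        print("Mask", i + 1, "Guesses:", guesses)
        results.append(guesses)
    return results

get_prediction_with_probs("Tom has fully <mask> <mask> <mask> illness.")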