How to calculate perplexity of a sentence using huggingface masked language models?
I have several masked language models (mainly Bert, Roberta, Albert, Electra). I also have a dataset of sentences. How can I get the perplexity of each sentence?
In the huggingface documentation here they mention that perplexity "is not well defined for masked language models like BERT", though I still see people somehow calculate it.
For example, in this question they calculated it using the function:
def score(model, tokenizer, sentence, mask_token_id=103):
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, 103)
    labels = repeat_input.masked_fill(masked_input != 103, -100)
    loss, _ = model(masked_input, masked_lm_labels=labels)
    result = np.exp(loss.item())
    return result
score(model, tokenizer, '我爱你') # returns 45.63794545581973
However, when I try to use that code I get TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'.
I tried it with a couple of my models:
from transformers import pipeline, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
import torch
1)
tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-cased-v1.0")
model = BertForMaskedLM.from_pretrained("bioformers/bioformer-cased-v1.0")
2)
tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
model = ElectraForMaskedLM.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
This SO question also used masked_lm_labels as an input and it seemed to work somehow.
There is a paper, Masked Language Model Scoring, that explores pseudo-perplexity from masked language models and shows that, while pseudo-perplexity is not theoretically well justified, it still performs well for comparing the "naturalness" of texts.
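Conceptually, pseudo-perplexity is obtained by masking each token in turn, collecting the log-probability the model assigns to the true token at that position, and exponentiating the negative average. Here is a minimal loop-based sketch of that idea (slower but more explicit than the batched version below; pseudo_perplexity is just an illustrative name):

import torch
import numpy as np

def pseudo_perplexity(model, tokenizer, sentence):
    # mask one position at a time and record log P(true token | rest)
    input_ids = tokenizer.encode(sentence, return_tensors='pt')[0]
    log_probs = []
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.inference_mode():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs.append(torch.log_softmax(logits, dim=-1)[input_ids[i]].item())
    # pseudo-perplexity = exp of the negative mean pseudo-log-likelihood
    return np.exp(-np.mean(log_probs))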
As for the code, your snippet is perfectly correct but for one detail: in recent Huggingface BERT implementations, masked_lm_labels has been renamed to simply labels, to make the interfaces of the various models more compatible. I have also replaced the hard-coded 103 with the generic tokenizer.mask_token_id. So the snippet below should work:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np
model_name = 'cointegrated/rubert-tiny'
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
def score(model, tokenizer, sentence):
    # encode once, then build a batch where each row masks a different token
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    # shifted diagonal so that [CLS] and [SEP] are never masked
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    # -100 makes the loss ignore every position except the masked one
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.inference_mode():
        loss = model(masked_input, labels=labels).loss
    return np.exp(loss.item())
print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer))
# 4.541251105675365
print(score(sentence='London is the capital of South America.', model=model, tokenizer=tokenizer))
# 6.162017238332462
You can try this code out by running this gist in Google Colab.
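Since the question asks about a whole dataset of sentences, the simplest (if unoptimized) approach is to apply the score function above to each sentence in a loop; the sentences list here is just a hypothetical placeholder:

# score every sentence in a (hypothetical) list and print the results
sentences = [
    'London is the capital of Great Britain.',
    'London is the capital of South America.',
]
perplexities = [score(model=model, tokenizer=tokenizer, sentence=s) for s in sentences]
for s, ppl in zip(sentences, perplexities):
    print(f'{ppl:.3f}\t{s}')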