Input/output format for Fine Tuning Huggingface RobertaForQuestionAnswering

I am trying to fine-tune RobertaForQuestionAnswering on my custom dataset and I am confused about the input parameters it expects. Here is the sample code.

>>> from transformers import RobertaTokenizer, RobertaForQuestionAnswering
>>> import torch

>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> model = RobertaForQuestionAnswering.from_pretrained('roberta-base')

>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors='pt')
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])

>>> outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
>>> loss = outputs.loss
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits

I am not able to understand the variables start_positions & end_positions, which are given as input to the model, and the variables start_scores & end_scores, which are being generated.

A QA bot is basically a DL model that creates the answer by extracting a part of the context (called text in your case). This means that the goal of the QA bot is to identify the start and the end of the answer.


The basic functioning of a QA bot:

First of all, every word of both the question and the context is tokenized. This means that it is (possibly split into characters/subwords and then) converted into a number. How exactly this happens depends on the type of tokenizer (which means it depends on the model you are using, since you will use the matching tokenizer; that is what the third line of your code does). I suggest this very useful guide.
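For instance, here is a minimal sketch of what that tokenization step produces (the decoded tokens in the last comment are only illustrative of RoBERTa's subword behaviour, not an exact output):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
inputs = tokenizer(question, text, return_tensors='pt')

# input_ids are the numbers the model actually receives
print(inputs['input_ids'])

# converting them back shows how question and context were split into subword tokens
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist()))
# something like: ['<s>', 'Who', 'Ġwas', 'ĠJim', 'ĠH', 'enson', '?', '</s>', '</s>', 'Jim', ..., '</s>']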

Then, the tokenized question + text is passed into the model, which performs its internal operations. Remember when I said at the beginning that the model has to identify the start and the end of the answer? Well, it does so by computing, for every token of the question + text, the probability that that particular token is the start of the answer. This probability is the softmaxed version of the start_logits. Afterwards, the same operation is performed for the end token.

So, this is the meaning of start_scores and end_scores: for each token, they are the pre-softmax scores of that token being, respectively, the start and the end of the answer.
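Reusing model, tokenizer and inputs from your snippet, a quick sketch of how to inspect those scores (note that when no start/end positions are passed, no loss is computed):

import torch

outputs = model(**inputs)  # no labels given, so outputs.loss is None

# one score per token of the question + text
print(outputs.start_logits.shape)  # (batch_size, sequence_length)
print(outputs.end_logits.shape)

# softmaxing the logits gives, for each token, the probability of being the start/end of the answer
start_probs = torch.softmax(outputs.start_logits, dim=-1)
end_probs = torch.softmax(outputs.end_logits, dim=-1)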


So, what are start_positions and end_positions?

As stated here, they are:

start_positions (torch.LongTensor of shape (batch_size,), optional) – Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequence are not taken into account for computing the loss.

end_positions (torch.LongTensor of shape (batch_size,), optional) – Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequence are not taken into account for computing the loss.
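In practice, for your custom dataset this means converting the character span of each answer into token positions. A possible sketch of how to do that (assuming your data gives the answer as a substring of text; char_to_token requires a "fast" tokenizer):

import torch
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

question = "Who was Jim Henson?"
text = "Jim Henson was a nice puppet"
answer = "a nice puppet"

# character span of the answer inside the context
answer_start_char = text.find(answer)
answer_end_char = answer_start_char + len(answer) - 1

encoding = tokenizer(question, text, return_tensors='pt')

# map character positions of the second sequence (the context) to token positions
start_token = encoding.char_to_token(0, answer_start_char, sequence_index=1)
end_token = encoding.char_to_token(0, answer_end_char, sequence_index=1)

# these are the labels you pass to the model to get the loss
start_positions = torch.tensor([start_token])
end_positions = torch.tensor([end_token])
outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)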


Moreover, the model you are using (roberta-base, see the model on the HuggingFace repository and the RoBERTa official paper) is not fine-tuned for QuestionAnswering. It is "just" a model trained with MaskedLanguageModeling, which means that it has a general understanding of the English language, but it is not suited for question answering. You can use it, of course, but it would probably give sub-optimal results.

I suggest you use the same model in a version already fine-tuned specifically on question answering: roberta-base-squad2 (hosted on the Hub as deepset/roberta-base-squad2), see it on HuggingFace.

In practice, you have to replace the lines where you load the model and tokenizer with:

# the fine-tuned checkpoint lives under the "deepset" namespace on the Hub
tokenizer = RobertaTokenizer.from_pretrained('deepset/roberta-base-squad2')
model = RobertaForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2')

This will give more accurate results.
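As a quick sanity check, here is a sketch of the fine-tuned model answering the question from your example (the exact output string may differ):

import torch
from transformers import RobertaTokenizer, RobertaForQuestionAnswering

tokenizer = RobertaTokenizer.from_pretrained('deepset/roberta-base-squad2')
model = RobertaForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
inputs = tokenizer(question, text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# take the most probable start and end token and decode the span between them
start_index = torch.argmax(outputs.start_logits).item()
end_index = torch.argmax(outputs.end_logits).item()
answer = tokenizer.decode(inputs['input_ids'][0][start_index:end_index + 1])
print(answer)  # expected to be something like "a nice puppet"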

Bonus read: what fine-tuning is and how it works