Input/output format for Fine Tuning Huggingface RobertaForQuestionAnswering

Question

I am trying to fine-tune "RobertaForQuestionAnswering" on my custom dataset, and I am confused about the input parameters it takes. Here is the sample code.

>>> from transformers import RobertaTokenizer, RobertaForQuestionAnswering
>>> import torch

>>> tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
>>> model = RobertaForQuestionAnswering.from_pretrained('roberta-base')

>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors='pt')
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])

>>> outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
>>> loss = outputs.loss
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits

I cannot understand the variables start_positions & end_positions, which are passed to the model as input, and the variables start_scores & end_scores, which the model produces.

Answer 1

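A question answering bot is basically a DL model that creates an answer by extracting part of the context (in your case, what is called text). This means that the goal of the QAbot is to identify the start and the end of the answer.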

Basic functioning of a QAbot:

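First of all, every word of the question and of the context is tokenized. This means it is (possibly split into characters/subwords and then) converted to a number. How exactly depends on the type of tokenizer, which in turn depends on the model you are using, since you load the tokenizer that matches it (this is what the third line of your code does). I suggest this very useful guide.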

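Then, the tokenized question + text is passed into the model, which performs its internal operations. Remember when I said at the beginning that the model identifies the start and the end of the answer? Well, it does so by computing, for every token of question + text, a score for that particular token being the start of the answer. The corresponding probability is the softmaxed version of start_logits. After that, the same is done for the end token.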
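As a minimal sketch of that step, the snippet below turns a made-up list of start logits into probabilities with a softmax. The logit values and the softmax helper are illustrative assumptions, not output from the real model.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical start logits, one per token of question + text.
start_logits = [0.1, 2.5, -1.0, 0.3]
start_probs = softmax(start_logits)

# The most likely start token has the highest probability,
# which is also the token with the highest logit.
best_start = max(range(len(start_probs)), key=start_probs.__getitem__)
print(best_start, start_probs)
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly gives the same token as taking the argmax of the probabilities.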

So, this is what start_scores and end_scores are: the pre-softmax scores for every token being, respectively, the start and the end of the answer.
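To make this concrete, here is a rough sketch of how an answer span could be read off the two score lists: pick the (start, end) pair with the highest combined score, subject to start <= end. The best_span helper, the max_answer_len limit, the simplified tokens, and the scores are all hypothetical; with the real model you would use outputs.start_logits[0], outputs.end_logits[0], and the tokenizer's own tokens.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return the (start, end) token indices with the highest combined score,
    subject to start <= end and a maximum span length."""
    best = None
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Hypothetical, simplified tokens for "Who was Jim?" + context
# (real RoBERTa tokens look different).
tokens = ["<s>", "Who", "was", "Jim", "?", "</s>", "</s>",
          "Jim", "was", "a", "nice", "puppet", "</s>"]
# Made-up scores, one per token.
start_scores = [0.0, -1.0, -1.0, -0.5, -1.0, -2.0, -2.0, 0.2, 0.1, 3.0, 1.0, 0.5, -2.0]
end_scores = [0.0, -1.0, -1.0, -0.5, -1.0, -2.0, -2.0, -0.3, -0.2, 0.1, 0.8, 3.5, -2.0]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)
```

Searching over pairs rather than taking two independent argmaxes avoids spans where the predicted end comes before the predicted start.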

So, what are start_positions and end_positions?

As stated here, they are:

start_positions (torch.LongTensor of shape (batch_size,), optional) – Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequence are not taken into account for computing the loss.

end_positions (torch.LongTensor of shape (batch_size,), optional) – Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequence are not taken into account for computing the loss.
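In other words, these are the gold token indices of the answer, and they are only needed during training. For a custom dataset you usually know the answer as a character span in the context, so it has to be mapped to token indices first. The sketch below shows one way to do that and how the two labels enter the loss. The helper names and the word-level offsets are hypothetical. With a fast tokenizer, similar offsets can be requested with return_offsets_mapping=True. The loss shown is the average of the start and end cross-entropy terms, which is how the transformers QA heads combine them to my knowledge.

```python
import math

def char_span_to_token_span(offsets, answer_start, answer_end):
    """Map the character span [answer_start, answer_end) of the context to
    (start_token, end_token). offsets holds one (char_start, char_end) pair
    per token, or None for tokens outside the context."""
    start_token = end_token = None
    for i, off in enumerate(offsets):
        if off is None:
            continue
        tok_start, tok_end = off
        if start_token is None and tok_start <= answer_start < tok_end:
            start_token = i
        if tok_start < answer_end <= tok_end:
            end_token = i
    return start_token, end_token

def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

context = "Jim Henson was a nice puppet"
answer = "a nice puppet"
answer_start = context.index(answer)
answer_end = answer_start + len(answer)

# Hypothetical offsets: 8 question/special tokens (None), then one token
# per context word, then the final special token.
offsets = [None] * 8 + [(0, 3), (4, 10), (11, 14), (15, 16), (17, 21), (22, 28), None]

start_position, end_position = char_span_to_token_span(offsets, answer_start, answer_end)

# Made-up uniform logits, one per token, standing in for the model output.
start_logits = [0.0] * len(offsets)
end_logits = [0.0] * len(offsets)
loss = (cross_entropy(start_logits, start_position)
        + cross_entropy(end_logits, end_position)) / 2
print(start_position, end_position, loss)
```

The resulting start_position and end_position are the values you would wrap in torch.tensor([...]) and pass as start_positions and end_positions, one entry per example in the batch.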

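Moreover, the model you are using (roberta-base, see the model on the HuggingFace repository and the RoBERTa official paper) has NOT been fine-tuned for QuestionAnswering. It is "just" a model trained with MaskedLanguageModeling, which means that it has a general understanding of the English language, but it is not suited to answering questions: the question answering head placed on top of it starts out randomly initialized. You can of course use it, but until it has been fine-tuned it will probably give poor results.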

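I suggest you use the same model in a version specifically fine-tuned on question answering: roberta-base-squad2 (hosted on the HuggingFace Hub under the deepset namespace as deepset/roberta-base-squad2), see it on HuggingFace.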

In practice, you have to replace the lines where you load the model and the tokenizer with:

tokenizer = RobertaTokenizer.from_pretrained('deepset/roberta-base-squad2')
model = RobertaForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2')

This will give much more accurate results.

Bonus read: what fine-tuning is and how it works

