Which huggingface model is the best for sentence as input and a word from that sentence as the output?

Question

What is the best Hugging Face model to fine-tune for this type of task:

Example input 1:

If there's one person you don't want to interrupt in the middle of a sentence it's a judge.

Example output 1:

sentence

Example input 2:

A good baker will rise to the occasion, it's the yeast he can do.

Example output 2:

yeast

Answer 1

Architecture

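This looks like a question-answering type of task, where the input is a sentence and the output is a span of that input sentence. In transformers, this corresponds to the AutoModelForQuestionAnswering class. See the question-answering fine-tuning illustration in the original BERT paper.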

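The only difference is that the input will consist of the "question" alone. In other words, you won't have a question, a [SEP] token, and a paragraph, as that figure shows.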

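Without knowing much about your task, you might instead want to model it as a token classification task. Here, the output word would be labelled with some positive tag, and the remaining words would be labelled with some other negative tag. If that makes more sense for you, take a look at the AutoModelForTokenClassification class. I will base the rest of my discussion on question answering, but the concepts can easily be adapted.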
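
To make the token-classification framing concrete, here is a minimal sketch of how per-word labels could be built from an (input, output) pair. The function name `word_labels`, the regex-based word splitting, and the 1/0 label scheme are assumptions for illustration, not something the question or the library prescribes; a real pipeline would still need to align these word labels with the model's subword tokens.

```python
import re


def word_labels(sentence, target):
    """Split a sentence into words and punctuation marks, then tag each
    piece with 1 if it equals the target word and 0 otherwise."""
    words = re.findall(r"\w+|[^\w\s]", sentence)
    # Note: every occurrence of the target gets the positive label.
    labels = [1 if word == target else 0 for word in words]
    return words, labels


words, labels = word_labels(
    "A good baker will rise to the occasion, it's the yeast he can do.",
    "yeast",
)
```

If the target word can appear more than once in a sentence, this simple matching tags every occurrence, so the dataset would need an explicit position for the intended one.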

Model

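Since you seem to be dealing with English sentences, you can probably use a pre-trained model such as bert-base-uncased. Depending on your data distribution, your choice of language model may change.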

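I'm not sure exactly what task you are doing, but unless a fine-tuned model that already performs it is available (you can try searching the HuggingFace model hub), you will have to fine-tune your own model. To do so, you need a dataset of sentences labelled with the start and end indices of the answer span. See the documentation for more information on how to train.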
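
As a rough sketch of what "labelled with start and end indices" could look like in practice: the first helper below finds the character span of the answer word, and the second maps that span onto token positions given per-token character offsets. Both function names are hypothetical. The offsets in the example are written by hand for illustration; with a fast tokenizer they would typically come from calling it with `return_offsets_mapping=True`, and special tokens are assumed to appear as `(0, 0)`.

```python
def answer_char_span(sentence, answer):
    """Return (start, end) character indices of the first occurrence of
    `answer` in `sentence`; `end` is exclusive."""
    start = sentence.find(answer)
    if start == -1:
        raise ValueError(f"{answer!r} does not occur in the sentence")
    return start, start + len(answer)


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive (first, last) token indices.

    `offsets` is a list of (start, end) character offsets, one per token.
    Tokens with an empty span (assumed to be special tokens) are skipped.
    """
    token_start = token_end = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue
        if token_start is None and tok_end > start_char:
            token_start = i
        if tok_start < end_char:
            token_end = i
    return token_start, token_end


sentence = "it's the yeast"
start_char, end_char = answer_char_span(sentence, "yeast")

# Hand-written offsets standing in for a tokenizer's offset mapping:
# [CLS], "it", "'", "s", "the", "yeast", [SEP]
offsets = [(0, 0), (0, 2), (2, 3), (3, 4), (5, 8), (9, 14), (0, 0)]
token_span = char_span_to_token_span(offsets, start_char, end_char)
```

The resulting token indices are what a question-answering head would be trained against as start and end positions.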

Evaluation

Once you have a fine-tuned model, you just need to run your test sentences through it to extract the answer. The following code, adapted from the HuggingFace documentation, does that:

from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

# Placeholder: replace with the Hub ID or local path of your fine-tuned model.
name = "path/to/your-fine-tuned-model"

model = AutoModelForQuestionAnswering.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

sentence = "A good baker will rise to the occasion, it's the yeast he can do."
inputs = tokenizer(sentence, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end token positions; +1 makes the end exclusive for slicing.
start_index = int(torch.argmax(outputs.start_logits))
end_index = int(torch.argmax(outputs.end_logits)) + 1

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[start_index:end_index])
)  # "yeast", hopefully!
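
One caveat with taking the two argmaxes independently is that the predicted end can land before the predicted start, which yields an empty answer. A common workaround is to score start/end pairs jointly and keep the best valid one. The sketch below does this on plain Python lists; the function name `best_span` and the `max_answer_len` cap are assumptions for illustration, and the logits in the example are made up.

```python
def best_span(start_logits, end_logits, max_answer_len=5):
    """Return the inclusive (start, end) pair with the highest summed score,
    subject to start <= end and a span of at most `max_answer_len` tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Made-up logits where the independent argmaxes disagree:
# the best start is position 2, but the best end is position 0.
start_logits = [0.1, 0.2, 3.0, 0.0]
end_logits = [2.5, 0.1, 1.0, 0.3]
span = best_span(start_logits, end_logits)
```

With model outputs, the lists could be obtained from the start and end logit tensors via `.tolist()`, and since the task here expects a single word, a small `max_answer_len` seems reasonable.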
