Bert model output interpretation

Question

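I have searched a lot for this but still don't have a clear idea, so I hope you can help me:

I am trying to translate German texts into English! I used this code: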


from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")

batch = tokenizer(
    list(data_bert[:100]),
    padding=True,
    truncation=True,
    max_length=250,
    return_tensors="pt")["input_ids"]

results = model(batch)

This returns a size error! I fixed the problem (thanks to the community: https://github.com/huggingface/transformers/issues/5480) by switching the last line of code to:

results = model(input_ids=batch, decoder_input_ids=batch)

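Now my output looks like a very long array. What is this output precisely? Are these some sort of word embeddings? And if yes: how shall I go on with converting these embeddings to texts in the English language? Thanks a lot!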

Answer 1

I believe this question offers a possible answer to your problem.

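In practice, BERT's output gives you a vectorized representation of each word. That output is fairly easy to reuse for other tasks, but it is trickier to use for machine translation.

A good starting point for using a seq2seq model from the transformers library in the context of machine translation is the following notebook: https://github.com/huggingface/notebooks/blob/master/examples/translation.ipynb

The example above shows how to translate from English to Romanian.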

Answer 2

Adding to Timbus's answer,

What is this output precisely? Are these some sort of word embeddings?

results is of type <class 'transformers.modeling_outputs.Seq2SeqLMOutput'>, and you can run

results.__dict__.keys()

to check that results contains the following:

dict_keys(['loss', 'logits', 'past_key_values', 'decoder_hidden_states', 'decoder_attentions', 'cross_attentions', 'encoder_last_hidden_state', 'encoder_hidden_states', 'encoder_attentions'])

You can read more about this class in the huggingface documentation.
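To make the "are these word embeddings?" point concrete: the logits field holds, for every decoder position, one score per vocabulary entry (shape batch size x target length x vocabulary size), so it is a table of scores rather than embeddings. The sketch below uses a made-up four-word vocabulary and invented scores in plain Python, only to show how picking the highest score at each position yields token ids. With real tensors the equivalent step would be results.logits.argmax(dim=-1), but since the German input was also passed as decoder_input_ids, those ids would most likely not form a meaningful translation.

```python
# Toy stand-in for results.logits: 1 sentence, 3 decoder positions, 4-word vocabulary.
# The vocabulary and the scores are invented for illustration only.
vocab = ["<pad>", "I", "am", "here"]
logits = [[
    [0.1, 2.5, 0.3, 0.2],  # position 0: highest score at index 1
    [0.0, 0.1, 3.1, 0.4],  # position 1: highest score at index 2
    [0.2, 0.1, 0.5, 2.9],  # position 2: highest score at index 3
]]


def greedy_ids(batch_logits):
    """Pick the highest-scoring vocabulary index at every position."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in sentence]
        for sentence in batch_logits
    ]


token_ids = greedy_ids(logits)
# Map each id back to its (hypothetical) vocabulary entry.
words = [[vocab[i] for i in sentence] for sentence in token_ids]
print(token_ids, words)
```

In a real model the vocabulary has tens of thousands of entries, which is why the printed output looks like a very long array.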

How shall I go on with converting these embeddings to the texts in the english language?

To get the text in English, you can use model.generate; its output is easily decoded as follows:

predictions = model.generate(batch)
english_text = tokenizer.batch_decode(predictions)
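A slightly fuller sketch of the same idea, under a few assumptions of mine: the helper names translate and load_de_en are hypothetical, the whole tokenizer output (including attention_mask) is passed to generate so that padding is ignored, and skip_special_tokens=True is used so markers such as the padding token do not show up in the decoded text.

```python
def translate(texts, tokenizer, model, max_length=250):
    """Translate a list of strings with a seq2seq tokenizer/model pair."""
    # Keep the whole encoding (input_ids and attention_mask), not just input_ids.
    batch = tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    # generate() runs the decoder step by step and returns token ids.
    predictions = model.generate(**batch)
    # skip_special_tokens drops padding and end-of-sequence markers from the text.
    return tokenizer.batch_decode(predictions, skip_special_tokens=True)


def load_de_en():
    """Load the checkpoint from the question (downloads weights on first use)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "Helsinki-NLP/opus-mt-de-en"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    return tokenizer, model
```

It could be used as tokenizer, model = load_de_en() followed by english_text = translate(data_bert[:100], tokenizer, model), where data_bert is the German text collection from the question.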

translation

word-embedding

bert-language-model

huggingface-transformers