Bert Transformer "Size Error" during Machine Translation
I'm desperate because I have no idea what is going wrong here. I want to translate a list of sentences from German to English. This is my code:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")
batch = tokenizer(
    list(data_bert[:100]),
    padding=True,
    truncation=True,
    max_length=250,
    return_tensors="pt"
)
results = model(batch)
I get this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/miniconda3/envs/textmallet/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in __getattr__(self, item)
247 try:
--> 248 return self.data[item]
249 except KeyError:
KeyError: 'size'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
/tmp/ipykernel_26502/2652187977.py in <module>
14
15
---> 16 results = model(batch)
17
~/miniconda3/envs/textmallet/lib/python3.9/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1049 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1050 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051 return forward_call(*input, **kwargs)
1052 # Do not call functions when jit is used
1053 full_backward_hooks, non_full_backward_hooks = [], []
~/miniconda3/envs/textmallet/lib/python3.9/site-packages/transformers/models/marian/modeling_marian.py in forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1274 )
1275
-> 1276 outputs = self.model(
1277 input_ids,
1278 attention_mask=attention_mask,
I don't know what exactly the problem is here. I would be very grateful if someone could help me.
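According to the issue described here (credit to LysandreJik): https://github.com/huggingface/transformers/issues/5480, the problem appears to be that the model receives a dict-like object rather than a tensor. The traceback fits that reading: something in the forward pass asks the tokenizer output for a size attribute, the lookup in self.data fails with KeyError: 'size', and that is re-raised as an AttributeError.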
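To make the mechanism concrete, here is a minimal sketch in plain Python. FakeBatch, FakeTensor and forward are hypothetical stand-ins invented for this illustration, not classes from transformers or torch; the only part taken from the traceback is the __getattr__ lookup in self.data. It shows that passing the whole dict-like object positionally fails, while passing one of its tensors, or unpacking it with **, does not.

```python
from collections import UserDict


class FakeTensor:
    """Hypothetical stand-in for a tensor: it only knows its own size."""

    def __init__(self, rows):
        self.rows = rows

    def size(self):
        return (len(self.rows), len(self.rows[0]))


class FakeBatch(UserDict):
    """Hypothetical stand-in for the dict-like tokenizer output."""

    def __getattr__(self, item):
        # Mirrors the lookup in the traceback: unknown attributes are
        # searched among the dictionary keys.
        if item == "data":
            raise AttributeError(item)
        try:
            return self.data[item]
        except KeyError:
            raise AttributeError(item)


def forward(input_ids=None, attention_mask=None):
    # Assumption for illustration: the model asks its first argument
    # for its size, which only a tensor-like object can answer.
    return input_ids.size()


batch = FakeBatch(
    input_ids=FakeTensor([[1, 2, 3], [4, 5, 0]]),
    attention_mask=FakeTensor([[1, 1, 1], [1, 1, 0]]),
)

try:
    forward(batch)  # the whole batch lands in the input_ids slot
except AttributeError as err:
    print("AttributeError:", err)

print(forward(batch["input_ids"]))  # pass the tensor itself
print(forward(**batch))             # or unpack the keys as keyword arguments
```

Both working calls hand forward an object that has a size method; the failing call hands it the container instead.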
It may be that you need to change the tokenizer output from:
batch = tokenizer(
    list(data_bert[:100]),
    padding=True,
    truncation=True,
    max_length=250,
    return_tensors="pt"
)
to:
batch = tokenizer(
    list(data_bert[:100]),
    padding=True,
    truncation=True,
    max_length=250,
    return_tensors="pt"
)["input_ids"]
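If the goal is the translated text itself rather than the raw forward pass, the following sketch may be closer to what is needed. It rests on a few assumptions: that translated strings are what the question is after, that model.generate and tokenizer.batch_decode are the appropriate calls for this checkpoint (as far as I can tell, calling the model directly on a sequence-to-sequence model also expects decoder inputs), and that the example sentences are made up. The names translate and main are introduced here for illustration only.

```python
def translate(sentences, tokenizer, model, max_length=250):
    """Translate a list of sentences with a seq2seq tokenizer/model pair."""
    batch = tokenizer(
        list(sentences),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    # Unpack the dict-like batch so that input_ids and attention_mask
    # arrive as keyword arguments instead of one positional object.
    generated = model.generate(**batch)
    # Turn the generated token ids back into plain strings.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


def main():
    # Imported here so that translate() can be reused or tested without
    # loading the library; this downloads the checkpoint on first use.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    name = "Helsinki-NLP/opus-mt-de-en"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    # Made-up sample input; replace with list(data_bert[:100]).
    sentences = ["Ich habe Hunger.", "Das Wetter ist heute schön."]
    for line in translate(sentences, tokenizer, model):
        print(line)
```

Calling main() runs the example end to end. Unpacking with ** also passes the attention_mask along with input_ids, which is probably worth keeping once padding=True has added pad tokens to the shorter sentences.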