HuggingFace: ValueError: expected sequence of length 165 at dim 1 (got 128)
I'm trying to fine-tune a BERT language model on my own data. I've gone through the Hugging Face documentation, but none of their tasks seem to be quite what I need, since my ultimate goal is to embed text. Here is my code:
from datasets import load_dataset
from transformers import BertTokenizerFast, AutoModel, TrainingArguments, Trainer
import glob
import os
base_path = '../data/'
model_name = 'bert-base-uncased'
max_length = 512
checkpoints_dir = 'checkpoints'
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)
def tokenize_function(examples):
    return tokenizer(examples['text'], padding=True, truncation=True, max_length=max_length)

dataset = load_dataset(
    'text',
    data_files={
        'train': f'{base_path}train.txt',
        'test': f'{base_path}test.txt',
        'validation': f'{base_path}valid.txt'
    }
)
print('Tokenizing data. This may take a while...')
tokenized_dataset = dataset.map(tokenize_function, batched=True)
train_dataset = tokenized_dataset['train']
eval_dataset = tokenized_dataset['test']
model = AutoModel.from_pretrained(model_name)
training_args = TrainingArguments(checkpoints_dir)
print('Training the model...')
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
I get the following error:
File "train_lm_hf.py", line 44, in <module>
trainer.train()
...
File "/opt/conda/lib/python3.7/site-packages/transformers/data/data_collator.py", line 130, in torch_default_data_collator
batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 165 at dim 1 (got 128)
What am I doing wrong?
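The root cause, worth spelling out before the fix: with padding=True, the tokenizer pads each call only to the longest sequence in that call, and dataset.map(..., batched=True) tokenizes the data in chunks (1,000 examples per call by default). Different chunks therefore end up padded to different lengths (here 165 vs. 128), and the default data collator cannot stack those ragged lists into one tensor. A minimal sketch of the behavior, using two hypothetical inputs:

from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Each call pads only to its own longest sequence.
batch_a = tok(['a somewhat longer piece of text', 'short text'], padding=True)
batch_b = tok(['hi'], padding=True)
# The two calls yield different padded lengths, so their rows cannot be
# stacked into a single tensor by the default collator.
print(len(batch_a['input_ids'][0]), len(batch_b['input_ids'][0]))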
I fixed this by changing the tokenize function to:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_length)
(note the padding argument). With padding='max_length', every example is padded to the same length, so the batches stack cleanly. Additionally, I used a data collator like this:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
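One caveat to flag (an editorial addition, not part of the original answer): with mlm=True the collator generates labels for masked-token prediction, so the model passed to Trainer needs a masked-language-modeling head. The bare AutoModel used above loads BertModel, which returns no loss and would make Trainer fail; loading the model with an MLM head instead would look roughly like this:

from transformers import AutoModelForMaskedLM

# AutoModelForMaskedLM attaches the MLM head that the collator's labels require.
model = AutoModelForMaskedLM.from_pretrained(model_name)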
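Since the stated goal was embedding text, here is a brief sketch of how the fine-tuned weights could then be used for embeddings, assuming they were saved with trainer.save_model('checkpoints') or similar (the mean-pooling choice here is one common option, not the only one):

import torch
from transformers import AutoModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('checkpoints')  # hypothetical saved model dir
model = AutoModel.from_pretrained('checkpoints')  # drops the MLM head, keeps the encoder

inputs = tokenizer(['some text to embed'], padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool the last hidden state over non-padding tokens to get one vector per text.
mask = inputs['attention_mask'].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)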