Huggingface error: AttributeError: 'ByteLevelBPETokenizer' object has no attribute 'pad_token_id'
I am trying to use a WordLevel/BPE tokenizer to tokenize some numeric strings, create a data collator, and eventually use it in a PyTorch DataLoader to train a new model from scratch.
However, I hit the error
AttributeError: 'ByteLevelBPETokenizer' object has no attribute 'pad_token_id'
when running the following code:
from transformers import DataCollatorForLanguageModeling
from tokenizers import ByteLevelBPETokenizer
from tokenizers.pre_tokenizers import Whitespace
from torch.utils.data import DataLoader, TensorDataset
import torch

data = ['4814 4832 4761 4523 4999 4860 4699 5024 4788 <unk>']

# Tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(data, vocab_size=1000, min_frequency=1,
                              special_tokens=[
                                  "<s>",
                                  "</s>",
                                  "<unk>",
                                  "<mask>",
                              ])

# Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

train_dataset = TensorDataset(torch.tensor(tokenizer(data, ......)))

# DataLoader
train_dataloader = DataLoader(
    train_dataset,
    collate_fn=data_collator
)
Is this error happening because no pad_token_id is configured for the tokenizer? If so, how do we do that?
Thanks!
Error traceback:
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/anaconda3/envs/x/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/opt/anaconda3/envs/x/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/opt/anaconda3/envs/x/lib/python3.8/site-packages/transformers/data/data_collator.py", line 351, in __call__
if self.tokenizer.pad_token_id is not None:
AttributeError: 'ByteLevelBPETokenizer' object has no attribute 'pad_token_id'
Conda packages:
pytorch 1.7.0 py3.8_cuda10.2.89_cudnn7.6.5_0 pytorch
pytorch-lightning 1.2.5 pyhd8ed1ab_0 conda-forge
tokenizers 0.10.1 pypi_0 pypi
transformers 4.4.2 pypi_0 pypi
The error tells you that the tokenizer needs an attribute named pad_token_id. You can either wrap the ByteLevelBPETokenizer in a class of your own that has such an attribute (...and run into other missing attributes along the way), or use the wrapper class from the transformers library:
from transformers import PreTrainedTokenizerFast

# ... your training code from above ...
tokenizer.save(tokenizer_path)  # path to a tokenizer .json file
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)
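For completeness, below is a minimal end-to-end sketch of that approach, assuming the transformers 4.4 / tokenizers 0.10 versions listed above. Note that wrapping only gives you the pad_token_id attribute; its value stays None until you declare a padding token, so the sketch also reserves a <pad> special token during training. The file name tokenizer.json and the <pad> string are illustrative choices, not from the original post.

from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast, DataCollatorForLanguageModeling

data = ['4814 4832 4761 4523 4999 4860 4699 5024 4788 <unk>']

# Train the raw tokenizer as before, but also reserve a <pad> token
# so the collator can pad batches later.
raw_tokenizer = ByteLevelBPETokenizer()
raw_tokenizer.train_from_iterator(data, vocab_size=1000, min_frequency=1,
                                  special_tokens=["<s>", "<pad>", "</s>",
                                                  "<unk>", "<mask>"])
raw_tokenizer.save("tokenizer.json")

# Wrap it in the transformers fast-tokenizer class, which exposes
# pad_token_id, and tell it which string is the padding token.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
tokenizer.pad_token = "<pad>"

# The collator now finds pad_token_id and no longer raises AttributeError.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
batch = data_collator([tokenizer(text) for text in data])
print(batch["input_ids"].shape)

With the padding token set, the collator can both pad the batch and, since mlm=False, mask the padded positions out of the labels (it replaces them with -100).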