Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation
I'm learning NLP by following HuggingFace's sequence classification tutorial: https://huggingface.co/transformers/custom_datasets.html#sequence-classification-with-imdb-reviews
The original code runs without problems, but when I try to load a different tokenizer, such as the one from google/bert_uncased_L-4_H-256_A-4, I get the following warning:
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
from transformers import AutoTokenizer
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # use == for string comparison; `is` checks identity, not equality
            labels.append(0 if label_dir == "neg" else 1)
    return texts[:50], labels[:50]
if __name__ == '__main__':
    test_texts, test_labels = read_imdb_split('aclImdb/test')
    tokenizer = AutoTokenizer.from_pretrained('google/bert_uncased_L-4_H-256_A-4')
    test_encodings = tokenizer(test_texts, truncation=True, padding=True)
    for input_id in test_encodings["input_ids"]:
        print(len(input_id))
The output shows len = 1288 for every input_id, so it looks like they were all padded to 1288. But how do I specify the truncation target length, e.g. 512?
Specify model_max_length when loading the tokenizer:
tokenizer = AutoTokenizer.from_pretrained('google/bert_uncased_L-4_H-256_A-4', model_max_length=512)
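Alternatively, the limit can be passed per call instead of at load time. Below is a minimal sketch (the texts are placeholders standing in for test_texts from the question): with truncation=True and an explicit max_length, the tokenizer caps each sequence at 512 tokens, and padding=True then pads only up to the longest sequence in the batch.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/bert_uncased_L-4_H-256_A-4')

# Placeholder batch; in the question this would be test_texts.
texts = ["a very long movie review ...", "a short one"]

# truncation=True + max_length=512 truncates every sequence to at most 512 tokens;
# padding=True pads to the longest (already truncated) sequence in the batch.
encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)

for input_id in encodings["input_ids"]:
    print(len(input_id))  # never exceeds 512

Either way, the warning disappears because the tokenizer now has a concrete maximum length to truncate to.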