How to build a dataset for language modeling with the datasets library as with the old TextDataset from the transformers library
I am trying to load a custom dataset that I will then use for language modeling. The dataset consists of a text file with a whole document on each line, which means that every line exceeds the usual 512-token limit of most tokenizers.
I would like to understand how to build a text dataset that tokenizes each line, after first splitting the documents in the dataset into lines of a "tokenizable" size, the way the old TextDataset class did it: you simply ran the following, and a tokenized dataset with no text lost was ready to be passed to a DataCollator:
model_checkpoint = 'distilbert-base-uncased'
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
from transformers import TextDataset
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path/to/text_file.txt",
    block_size=512,
)
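For context, a tokenized dataset like this was typically handed to a data collator along these lines; the masking settings below are an assumption based on the standard masked-language-modeling setup, not something specified here:

from transformers import DataCollatorForLanguageModeling

# Batches the fixed-size blocks produced by TextDataset and applies random
# token masking (distilbert-base-uncased is a masked language model).
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)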
Instead of this soon-to-be-deprecated approach, I would like to use the datasets library. This is what I have at the moment, which of course fails because each line is longer than the maximum block size of the tokenizer:
import datasets
from transformers import AutoTokenizer

# Load the raw text file: one document per line.
dataset = datasets.load_dataset('text', data_files='path/to/text_file.txt')

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
So, what is the "standard" way of creating a dataset like this with the datasets library?
Thank you very much for your help :))
I received an answer to this question from @lhoestq on the HuggingFace Datasets forum:
Hi !
If you want to tokenize line by line, you can use this:
max_seq_length = 512
num_proc = 4

def tokenize_function(examples):
    # Remove empty lines
    examples["text"] = [line for line in examples["text"] if len(line) > 0 and not line.isspace()]
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=num_proc,
    remove_columns=["text"],
)
Note, though, that TextDataset did a different kind of processing: it concatenated all the texts and built blocks of size 512. If you need this behavior, you must apply an additional map function after the tokenization:
# Main data processing function that will concatenate all texts from
# our dataset and generate chunks of max_seq_length.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop,
    # you can customize this part to your needs.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated_examples.items()
    }
    return result

# Note that with `batched=True`, this map processes 1,000 texts together,
# so group_texts throws away a remainder for each of those groups of 1,000 texts.
# You can adjust that batch_size here but a higher value might be slower to preprocess.
tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    num_proc=num_proc,
)
This code comes from the data processing in the run_mlm.py example script of transformers.
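For completeness, here is a minimal end-to-end sketch of how these pieces fit together for masked language modeling. The load_dataset("text", ...) call and the Trainer/DataCollatorForLanguageModeling wiring are assumptions modeled on run_mlm.py rather than part of the forum answer. Note also that for the grouping variant, run_mlm.py tokenizes without truncation, so no text is lost when the blocks are built:

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

max_seq_length = 512
num_proc = 4

# One document per line in the raw text file.
dataset = load_dataset("text", data_files={"train": "path/to/text_file.txt"})

def tokenize_function(examples):
    # No truncation here: group_texts below repacks everything into 512-token blocks.
    return tokenizer(examples["text"], return_special_tokens_mask=True)

def group_texts(examples):
    # Concatenate all token lists and split them into blocks of max_seq_length,
    # dropping only the final remainder of each mapped batch.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated[list(examples.keys())[0]]) // max_seq_length) * max_seq_length
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }

tokenized = dataset.map(tokenize_function, batched=True, num_proc=num_proc, remove_columns=["text"])
tokenized = tokenized.map(group_texts, batched=True, num_proc=num_proc)

# The collator batches the blocks and applies random masking for MLM.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_output"),
    train_dataset=tokenized["train"],
    data_collator=data_collator,
)
trainer.train()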