How to use SciBERT in the best manner?
I am trying to use a BERT model for text classification. Since the texts are scientific, I plan to use the SciBERT pretrained model: https://github.com/allenai/scibert
I have run into some limitations and would like to know if there are solutions:
When I tokenize and batch the data, it only allows me a max_length of <= 512. Is there any way to use more tokens? Doesn't this 512-token limit mean that I am not actually using all of the text information during training? Is there any solution that uses all of the text?
I have tried to use this pretrained model with other architectures such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?
I know this is a general question, but are there any suggestions for improving my fine-tuning (from the data to the hyperparameters, etc.)? Currently I am getting around 75% accuracy. Thanks.
Code:
import torch
from torch.optim import AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from transformers import (BertTokenizer, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

# df_train, labels and batch_size are defined elsewhere in my script.
tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')

# truncation=True makes max_length effective; return_tensors='pt' yields
# tensors that TensorDataset can consume directly.
encoded_data_train = tokenizer.batch_encode_plus(
    df_train.text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df_train.label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

model = BertForSequenceClassification.from_pretrained('allenai/scibert_scivocab_uncased',
                                                      num_labels=len(labels),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

epochs = 1
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)
# num_warmup_steps is a required argument of get_linear_schedule_with_warmup.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
When I want to do tokenization and batching, it only allows me a max_length of <= 512. Is there any way to use more tokens? Doesn't this limitation of 512 mean that I am not actually using all the text information during training? Is there any solution to use all the text?
Yes, you are not using the full text. This is one of the limitations of the BERT and T5 models, which are restricted to 512 and 1024 tokens respectively, as far as I know.
I would suggest using Longformer, BigBird, or Reformer models, which can handle sequence lengths of up to 16k, 4096, and 64k tokens respectively. These are well suited to longer texts such as scientific documents.
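For example, a minimal sketch of swapping in Longformer for classification could look like the following (the checkpoint name and max_length here are assumptions, and it reuses df_train and labels from the question's code):

from transformers import LongformerTokenizer, LongformerForSequenceClassification

# allenai/longformer-base-4096 accepts sequences of up to 4096 tokens;
# the classification head is freshly initialised and still needs fine-tuning.
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerForSequenceClassification.from_pretrained(
    'allenai/longformer-base-4096',
    num_labels=len(labels))          # `labels` as in the question's code

encoded_data_train = tokenizer.batch_encode_plus(
    df_train.text.values.tolist(),   # same dataframe column as in the question
    padding='max_length',
    truncation=True,
    max_length=4096,                 # far beyond BERT's 512-token limit
    return_tensors='pt')

The rest of the training loop can stay essentially the same as for BERT; only the tokenizer, model class, and sequence length change.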
I have tried to use this pretrained library with other models such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?
SciBERT is really just a pretrained BERT model. For details, see this issue, which discusses the feasibility of converting BERT to RoBERTa:
Since you're working with a BERT model that was pre-trained, you unfortunately won't be able to change the tokenizer now from a WordPiece (BERT) to a Byte-level BPE (RoBERTa).
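As a quick sanity check (a sketch of my own, not from that issue): the SciBERT checkpoint identifies itself as a BERT model, so it has to be loaded with the BERT/Auto classes rather than the RoBERTa or DeBERTa ones:

from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained('allenai/scibert_scivocab_uncased')
print(config.model_type)  # expected to print 'bert': a WordPiece-based BERT checkpoint

# The Auto classes therefore resolve to the BERT implementations.
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'allenai/scibert_scivocab_uncased', num_labels=len(labels))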
I know this is a general question, but is there any suggestion for how I can improve my fine-tuning (from data to hyperparameters, etc.)? Currently, I'm getting ~79% accuracy.
I would first try tuning the most important hyperparameter, learning_rate. Then I would explore different values for the hyperparameters of the AdamW optimizer and for the scheduler's num_warmup_steps, as in the sketch below.
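A minimal sketch of such a sweep, with purely illustrative value grids (it reuses dataloader_train, labels and epochs from the question's code):

from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

# Illustrative grids; BERT fine-tuning learning rates are typically in the 1e-5..5e-5 range.
learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]
warmup_fractions = [0.0, 0.06, 0.1]

total_steps = len(dataloader_train) * epochs
for lr in learning_rates:
    for warmup_frac in warmup_fractions:
        # Reload the model so every configuration starts from the same pretrained weights.
        model = BertForSequenceClassification.from_pretrained(
            'allenai/scibert_scivocab_uncased', num_labels=len(labels))
        optimizer = AdamW(model.parameters(), lr=lr, eps=1e-8)
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(warmup_frac * total_steps),
            num_training_steps=total_steps)
        # ... run the training loop here, record validation accuracy,
        # and keep the (lr, warmup_frac) pair that scores best.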