How to use SciBERT in the best manner?
I am trying to use a BERT model for text classification. Since the texts are scientific, I plan to use the SciBERT pretrained model: https://github.com/allenai/scibert
I have run into some limitations and would like to know if there are solutions:
When I tokenize and batch the data, it only allows me a max_length of <= 512. Is there any way to use more tokens? Doesn't this 512-token limit mean that I am not actually using all of the text information during training? Is there any solution that uses all of the text?
I have tried to use this pretrained model with other architectures such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?
I know this is a general question, but are there any suggestions for improving my fine-tuning (from the data to the hyperparameters, etc.)? Currently I am getting around 75% accuracy. Thanks.
Code:
import torch
from torch.optim import AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from transformers import (BertTokenizer, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

# df_train, labels and batch_size are defined elsewhere in my script.
tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')

# truncation=True makes max_length effective; return_tensors='pt' yields
# tensors that TensorDataset can consume directly.
encoded_data_train = tokenizer.batch_encode_plus(
    df_train.text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df_train.label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

model = BertForSequenceClassification.from_pretrained('allenai/scibert_scivocab_uncased',
                                                      num_labels=len(labels),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

epochs = 1
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)
# num_warmup_steps is a required argument of get_linear_schedule_with_warmup.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
When I want to do tokenization and batching, it only allows me a max_length of <= 512. Is there any way to use more tokens? Doesn't this limitation of 512 mean that I am not actually using all the text information during training? Is there any solution to use all the text?
Yes, you are not using the full text. This is one of the limitations of the BERT and T5 models, which are restricted to 512 and 1024 tokens respectively, as far as I know.
I would suggest using Longformer, BigBird, or Reformer models, which can handle sequence lengths of up to 16k, 4096, and 64k tokens respectively. These are well suited to longer texts such as scientific documents.
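For example, a minimal sketch of swapping in Longformer for classification could look like the following (the checkpoint name and max_length here are assumptions, and it reuses df_train and labels from the question's code):

from transformers import LongformerTokenizer, LongformerForSequenceClassification

# allenai/longformer-base-4096 accepts sequences of up to 4096 tokens;
# the classification head is freshly initialised and still needs fine-tuning.
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerForSequenceClassification.from_pretrained(
    'allenai/longformer-base-4096',
    num_labels=len(labels))          # `labels` as in the question's code

encoded_data_train = tokenizer.batch_encode_plus(
    df_train.text.values.tolist(),   # same dataframe column as in the question
    padding='max_length',
    truncation=True,
    max_length=4096,                 # far beyond BERT's 512-token limit
    return_tensors='pt')

The rest of the training loop can stay essentially the same as for BERT; only the tokenizer, model class, and sequence length change.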
I have tried to use this pretrained library with other models such as DeBERTa or RoBERTa, but it doesn't let me; it has only worked with BERT. Is there any way I can do that?
SciBERT is really just a pretrained BERT model. For details, see this issue, which discusses the feasibility of converting BERT to RoBERTa:
Since you're working with a BERT model that was pre-trained, you unfortunately won't be able to change the tokenizer now from a WordPiece (BERT) to a Byte-level BPE (RoBERTa).
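As a quick sanity check (a sketch of my own, not from that issue): the SciBERT checkpoint identifies itself as a BERT model, so it has to be loaded with the BERT/Auto classes rather than the RoBERTa or DeBERTa ones:

from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained('allenai/scibert_scivocab_uncased')
print(config.model_type)  # expected to print 'bert': a WordPiece-based BERT checkpoint

# The Auto classes therefore resolve to the BERT implementations.
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'allenai/scibert_scivocab_uncased', num_labels=len(labels))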
I know this is a general question, but is there any suggestion for how I can improve my fine-tuning (from data to hyperparameters, etc.)? Currently, I'm getting ~79% accuracy.
I would first try tuning the most important hyperparameter, learning_rate. Then I would explore different values for the hyperparameters of the AdamW optimizer and for the scheduler's num_warmup_steps, as in the sketch below.
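A minimal sketch of such a sweep, with purely illustrative value grids (it reuses dataloader_train, labels and epochs from the question's code):

from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

# Illustrative grids; BERT fine-tuning learning rates are typically in the 1e-5..5e-5 range.
learning_rates = [1e-5, 2e-5, 3e-5, 5e-5]
warmup_fractions = [0.0, 0.06, 0.1]

total_steps = len(dataloader_train) * epochs
for lr in learning_rates:
    for warmup_frac in warmup_fractions:
        # Reload the model so every configuration starts from the same pretrained weights.
        model = BertForSequenceClassification.from_pretrained(
            'allenai/scibert_scivocab_uncased', num_labels=len(labels))
        optimizer = AdamW(model.parameters(), lr=lr, eps=1e-8)
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(warmup_frac * total_steps),
            num_training_steps=total_steps)
        # ... run the training loop here, record validation accuracy,
        # and keep the (lr, warmup_frac) pair that scores best.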