How to use BertForSequenceClassification for token max_length set at 1700?

I want to do author classification on the Reuters 50 50 dataset, where the maximum token length is 1600+ tokens and there are 50 classes/authors in total.

With max_length=1700 and batch_size=1 I get RuntimeError: CUDA out of memory. The error can be prevented by setting max_length=512, but this has the unwanted effect of truncating the texts.
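i.e. something along the lines of the following (depending on the transformers version, truncation=True may also be needed):

# Sketch: encode with BERT's 512-token limit instead of 1700
# (newer transformers versions may additionally require truncation=True).
token_ids = tokenizer.encode(text, add_special_tokens=True, max_length=512)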

Tokenizing and encoding:

from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

MAX_LEN = 1700

def get_encodings(texts):
    # Convert each text to BERT token ids, adding the [CLS]/[SEP] special tokens.
    token_ids = []
    for text in texts:
        token_id = tokenizer.encode(text, add_special_tokens=True, max_length=MAX_LEN)
        token_ids.append(token_id)
    return token_ids

def pad_encodings(encodings):
    # Pad (or truncate) every sequence to exactly MAX_LEN with token id 0.
    return pad_sequences(encodings, maxlen=MAX_LEN, dtype="long",
                         value=0, truncating="post", padding="post")

def get_attention_masks(padded_encodings):
    # Mark real tokens with 1 and padding (token id 0) with 0.
    attention_masks = []
    for encoding in padded_encodings:
        attention_mask = [int(token_id > 0) for token_id in encoding]
        attention_masks.append(attention_mask)
    return attention_masks


train_encodings = get_encodings(train_df.text.values)
train_encodings = pad_encodings(train_encodings)
train_attention_masks = get_attention_masks(train_encodings)

test_encodings = get_encodings(test_df.text.values)
test_encodings = pad_encodings(test_encodings)
test_attention_masks = get_attention_masks(test_encodings)

Packing into a Dataset and DataLoader:

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

X_train = torch.tensor(train_encodings)
y_train = torch.tensor(train_df.author_id.values)
train_masks = torch.tensor(train_attention_masks)

X_test = torch.tensor(test_encodings)
y_test = torch.tensor(test_df.author_id.values)
test_masks = torch.tensor(test_attention_masks)

batch_size = 1

# Create the DataLoader for our training set.
train_data = TensorDataset(X_train, train_masks, y_train)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(X_test, test_masks, y_test)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

Model setup:

from transformers import BertConfig, BertForSequenceClassification, AdamW

# Use the GPU if one is available, otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

config = BertConfig.from_pretrained(
    'bert-base-uncased',
    num_labels = 50,
    output_attentions = False,
    output_hidden_states = False,
    max_position_embeddings=MAX_LEN
)

model = BertForSequenceClassification(config)

model.to(device)


optimizer = AdamW(model.parameters(),
                  lr = 2e-5, 
                  eps = 1e-8 
                )

Training:

for epoch_i in range(0, epochs):

    model.train()

    for step, batch in enumerate(train_dataloader):

        b_texts = batch[0].to(device)
        b_attention_masks = batch[1].to(device)
        b_authors = batch[2].to(device)

        model.zero_grad()        

        outputs = model(b_texts, 
                        token_type_ids=None, 
                        attention_mask=b_attention_masks, 
                        labels=b_authors)  # <------- ERROR HERE

The error:

RuntimeError: CUDA out of memory. Tried to allocate 6.00 GiB (GPU 0; 7.93 GiB total capacity; 1.96 GiB already allocated; 5.43 GiB free; 536.50 KiB cached)

Unless you are training on a TPU, the chances that you have enough GPU RAM with any currently available GPU are extremely low.
For some of the BERT models, the model alone takes well over 10 GB of RAM, and doubling the sequence length beyond 512 tokens requires even more memory on top of that. For reference, a Titan RTX with 24 GB of GPU RAM (the most currently available in a single GPU) can barely fit 24 samples of 512-token length at the same time.
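A rough back-of-the-envelope estimate (my own, not an exact profile) shows why: the self-attention score matrices grow quadratically with sequence length, so 1700 tokens need roughly 11x the attention memory of 512 tokens.

# Rough, illustrative estimate for bert-base (12 layers, 12 heads, float32).
# Exact numbers depend on the implementation; only the quadratic scaling matters.
def attention_matrix_bytes(seq_len, layers=12, heads=12, bytes_per_val=4):
    return layers * heads * seq_len * seq_len * bytes_per_val

for seq_len in (512, 1700):
    gib = attention_matrix_bytes(seq_len) / 1024 ** 3
    print(f"{seq_len:>5} tokens -> ~{gib:.2f} GiB just for the attention scores")

# ~0.14 GiB at 512 tokens vs ~1.55 GiB at 1700 tokens per forward pass, and
# backpropagation keeps several such intermediates alive, which is why even
# batch_size=1 blows past an 8 GiB card.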

Fortunately, most networks still deliver very decent performance when the samples are truncated, though this is of course task-specific. Also keep in mind that, unless you are training from scratch, all of the pretrained models are generally trained with the 512-token limit. As far as I know, the only model currently supporting longer sequences is Bart, which allows lengths of up to 1024 tokens.
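As a sketch of the usual workaround (not a drop-in answer, and argument names can differ between transformers versions): keep the pretrained 512-token position embeddings, load the weights with from_pretrained instead of building the model from a fresh config, and let the tokenizer truncate the inputs.

# Sketch: reuse the pretrained 512-token model and truncate the inputs.
MAX_LEN = 512

def get_encodings(texts):
    token_ids = []
    for text in texts:
        token_ids.append(
            tokenizer.encode(text, add_special_tokens=True, max_length=MAX_LEN)
        )
    return token_ids

# Loading pretrained weights keeps the 512-token position embeddings
# (instantiating from a bare config, as above, starts from random weights).
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=50,
    output_attentions=False,
    output_hidden_states=False,
)
model.to(device)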