How to use BertForSequenceClassification for token max_length set at 1700?

I want to do author classification on the Reuters 50 50 dataset, where the maximum token length is 1600+ tokens and there are 50 classes/authors in total.

With max_length=1700 and batch_size=1 I get RuntimeError: CUDA out of memory. The error can be prevented by setting max_length=512, but this has the unwanted effect of truncating the texts.
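i.e. something along the lines of the following (depending on the transformers version, truncation=True may also be needed):

# Sketch: encode with BERT's 512-token limit instead of 1700
# (newer transformers versions may additionally require truncation=True).
token_ids = tokenizer.encode(text, add_special_tokens=True, max_length=512)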

Tokenizing and encoding:

from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

MAX_LEN = 1700

def get_encodings(texts):
    # Convert each text to BERT token ids, adding the [CLS]/[SEP] special tokens.
    token_ids = []
    for text in texts:
        token_id = tokenizer.encode(text, add_special_tokens=True, max_length=MAX_LEN)
        token_ids.append(token_id)
    return token_ids

def pad_encodings(encodings):
    # Pad (or truncate) every sequence to exactly MAX_LEN with token id 0.
    return pad_sequences(encodings, maxlen=MAX_LEN, dtype="long",
                         value=0, truncating="post", padding="post")

def get_attention_masks(padded_encodings):
    # Mark real tokens with 1 and padding (token id 0) with 0.
    attention_masks = []
    for encoding in padded_encodings:
        attention_mask = [int(token_id > 0) for token_id in encoding]
        attention_masks.append(attention_mask)
    return attention_masks


train_encodings = get_encodings(train_df.text.values)
train_encodings = pad_encodings(train_encodings)
train_attention_masks = get_attention_masks(train_encodings)

test_encodings = get_encodings(test_df.text.values)
test_encodings = pad_encodings(test_encodings)
test_attention_masks = get_attention_masks(test_encodings)

Packing into a Dataset and DataLoader:

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

X_train = torch.tensor(train_encodings)
y_train = torch.tensor(train_df.author_id.values)
train_masks = torch.tensor(train_attention_masks)

X_test = torch.tensor(test_encodings)
y_test = torch.tensor(test_df.author_id.values)
test_masks = torch.tensor(test_attention_masks)

batch_size = 1

# Create the DataLoader for our training set.
train_data = TensorDataset(X_train, train_masks, y_train)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(X_test, test_masks, y_test)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

Model setup:

from transformers import BertConfig, BertForSequenceClassification, AdamW

# Use the GPU if one is available, otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

config = BertConfig.from_pretrained(
    'bert-base-uncased',
    num_labels = 50,
    output_attentions = False,
    output_hidden_states = False,
    max_position_embeddings=MAX_LEN
)

model = BertForSequenceClassification(config)

model.to(device)


optimizer = AdamW(model.parameters(),
                  lr = 2e-5, 
                  eps = 1e-8 
                )

Training:

for epoch_i in range(0, epochs):

    model.train()

    for step, batch in enumerate(train_dataloader):

        b_texts = batch[0].to(device)
        b_attention_masks = batch[1].to(device)
        b_authors = batch[2].to(device)

        model.zero_grad()        

        outputs = model(b_texts, 
                        token_type_ids=None, 
                        attention_mask=b_attention_masks, 
                        labels=b_authors)  # <------- ERROR HERE

The error:

RuntimeError: CUDA out of memory. Tried to allocate 6.00 GiB (GPU 0; 7.93 GiB total capacity; 1.96 GiB already allocated; 5.43 GiB free; 536.50 KiB cached)

Unless you are training on a TPU, the chances that you have enough GPU RAM with any currently available GPU are extremely low.
For some of the BERT models, the model alone takes well over 10 GB of RAM, and doubling the sequence length beyond 512 tokens requires even more memory on top of that. For reference, a Titan RTX with 24 GB of GPU RAM (the most currently available in a single GPU) can barely fit 24 samples of 512-token length at the same time.
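A rough back-of-the-envelope estimate (my own, not an exact profile) shows why: the self-attention score matrices grow quadratically with sequence length, so 1700 tokens need roughly 11x the attention memory of 512 tokens.

# Rough, illustrative estimate for bert-base (12 layers, 12 heads, float32).
# Exact numbers depend on the implementation; only the quadratic scaling matters.
def attention_matrix_bytes(seq_len, layers=12, heads=12, bytes_per_val=4):
    return layers * heads * seq_len * seq_len * bytes_per_val

for seq_len in (512, 1700):
    gib = attention_matrix_bytes(seq_len) / 1024 ** 3
    print(f"{seq_len:>5} tokens -> ~{gib:.2f} GiB just for the attention scores")

# ~0.14 GiB at 512 tokens vs ~1.55 GiB at 1700 tokens per forward pass, and
# backpropagation keeps several such intermediates alive, which is why even
# batch_size=1 blows past an 8 GiB card.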

Fortunately, most networks still deliver very decent performance when the samples are truncated, though this is of course task-specific. Also keep in mind that, unless you are training from scratch, all of the pretrained models are generally trained with the 512-token limit. As far as I know, the only model currently supporting longer sequences is Bart, which allows lengths of up to 1024 tokens.
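As a sketch of the usual workaround (not a drop-in answer, and argument names can differ between transformers versions): keep the pretrained 512-token position embeddings, load the weights with from_pretrained instead of building the model from a fresh config, and let the tokenizer truncate the inputs.

# Sketch: reuse the pretrained 512-token model and truncate the inputs.
MAX_LEN = 512

def get_encodings(texts):
    token_ids = []
    for text in texts:
        token_ids.append(
            tokenizer.encode(text, add_special_tokens=True, max_length=MAX_LEN)
        )
    return token_ids

# Loading pretrained weights keeps the 512-token position embeddings
# (instantiating from a bare config, as above, starts from random weights).
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=50,
    output_attentions=False,
    output_hidden_states=False,
)
model.to(device)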