Bert Text Classification Loss is Nan
I am trying to build a model that classifies text into 3 classes (negative, neutral, positive).
I have a CSV file containing reviews of different apps together with their ratings.
First I import all the necessary libraries:
!pip install transformers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%tensorflow_version 2.x
import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer, DistilBertTokenizer, glue_convert_examples_to_features, InputExample, BertConfig, InputFeatures
from sklearn.model_selection import train_test_split
from tqdm import tqdm
%matplotlib inline
Then I download my CSV file:
!gdown --id 1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
!gdown --id 1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
df = pd.read_csv("reviews.csv")
print(df[['content','score']].head())
content score
0 Update: After getting a response from the deve... 1
1 Used it for a fair amount of time without any ... 1
2 Your app sucks now!!!!! Used to be good but no... 1
3 It seems OK, but very basic. Recurring tasks n... 1
4 Absolutely worthless. This app runs a prohibit... 1
Converting the scores to sentiments:
def to_sentiment(rating):
    rating = int(rating)
    if rating <= 2:
        return 0
    elif rating == 3:
        return 1
    else:
        return 2
df['sentiment'] = df.score.apply(to_sentiment)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=True)
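As a quick sanity check (my addition, not part of the original post), the mapped labels can be inspected to confirm they only take the values 0, 1 and 2. As an aside, do_lower_case=True is normally paired with an uncased checkpoint rather than 'bert-base-cased', though that mismatch alone would not produce a NaN loss.

# distribution of the mapped sentiment labels; should contain only 0, 1 and 2
print(df['sentiment'].value_counts())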
Creating helper methods to fit the data into the model:
def convert_example_to_feature(review):
    return tokenizer.encode_plus(
        review,
        add_special_tokens=True,
        max_length=160,  # truncates if len(s) > max_length
        return_token_type_ids=True,
        return_attention_mask=True,
        pad_to_max_length=True,  # pads to the right by default
    )
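Note that pad_to_max_length has since been deprecated in transformers; on versions 3.x and later the equivalent call would presumably be written with the explicit padding and truncation arguments (a sketch, same max_length as above):

def convert_example_to_feature(review):
    # same encoding on newer transformers versions (pad_to_max_length is deprecated)
    return tokenizer.encode_plus(
        review,
        add_special_tokens=True,
        max_length=160,
        truncation=True,        # truncate sequences longer than max_length
        padding='max_length',   # pad shorter sequences up to max_length
        return_token_type_ids=True,
        return_attention_mask=True,
    )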
def map_example_to_dict(input_ids, attention_mask, token_type_ids, label):
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids
    }, label
def encode_examples(ds):
    # prepare lists, so that we can build up the final TensorFlow dataset from slices
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    for index, row in tqdm(ds.iterrows()):
        bert_input = convert_example_to_feature(row['content'])
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([row['sentiment']])
    return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)
df_train, df_test = train_test_split(df, test_size=0.1)
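The post never shows how ds_train_encoded is built from these helpers; a minimal sketch consistent with the code above (the shuffle buffer and batch size are my assumptions):

# encode, shuffle and batch the training split; encode and batch the test split
ds_train_encoded = encode_examples(df_train).shuffle(10000).batch(32)
ds_test_encoded = encode_examples(df_test).batch(32)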
Creating the model:
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy()
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
history = model.fit(ds_train_encoded, epochs=1)
14/443 [..............................] - ETA: 3:58 - loss: nan - accuracy: 0.3438
If I change the number of sentiments so that there are only positive and negative, it works.
But using 3 or more labels produces this problem.
Answer: label class indices should start from 0, not 1.
TFBertForSequenceClassification requires labels in the range [0,1,...]
labels (tf.Tensor of shape (batch_size,), optional, defaults to None) – Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), if config.num_labels > 1 a classification loss is computed (Cross-Entropy).
Source: https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification
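The quoted requirement depends on config.num_labels, which defaults to 2 for TFBertForSequenceClassification, so with three sentiment classes the label 2 falls outside [0, config.num_labels - 1] and the cross-entropy loss turns NaN. A minimal sketch of the corresponding fix, assuming three classes (num_labels is the standard from_pretrained keyword):

# declare three output classes so that labels 0, 1 and 2 are all valid targets
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=3)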