Bert Text Classification Loss is Nan
I am trying to build a model that classifies text into 3 classes (negative, neutral, positive).
I have a CSV file containing reviews of different apps together with their ratings.
First I import all the necessary libraries:
!pip install transformers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%tensorflow_version 2.x
import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer, DistilBertTokenizer, glue_convert_examples_to_features, InputExample, BertConfig, InputFeatures
from sklearn.model_selection import train_test_split
from tqdm import tqdm
%matplotlib inline
Then I download my CSV file:
!gdown --id 1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
!gdown --id 1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
df = pd.read_csv("reviews.csv")
print(df[['content','score']].head())
content score
0 Update: After getting a response from the deve... 1
1 Used it for a fair amount of time without any ... 1
2 Your app sucks now!!!!! Used to be good but no... 1
3 It seems OK, but very basic. Recurring tasks n... 1
4 Absolutely worthless. This app runs a prohibit... 1
Converting the scores to sentiments:
def to_sentiment(rating):
    rating = int(rating)
    if rating <= 2:
        return 0
    elif rating == 3:
        return 1
    else:
        return 2
df['sentiment'] = df.score.apply(to_sentiment)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=True)
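As a quick sanity check (my addition, not part of the original post), the mapped labels can be inspected to confirm they only take the values 0, 1 and 2. As an aside, do_lower_case=True is normally paired with an uncased checkpoint rather than 'bert-base-cased', though that mismatch alone would not produce a NaN loss.

# distribution of the mapped sentiment labels; should contain only 0, 1 and 2
print(df['sentiment'].value_counts())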
Creating helper methods to fit the data into the model:
def convert_example_to_feature(review):
    return tokenizer.encode_plus(
        review,
        add_special_tokens=True,
        max_length=160,  # truncates if len(s) > max_length
        return_token_type_ids=True,
        return_attention_mask=True,
        pad_to_max_length=True,  # pads to the right by default
    )
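Note that pad_to_max_length has since been deprecated in transformers; on versions 3.x and later the equivalent call would presumably be written with the explicit padding and truncation arguments (a sketch, same max_length as above):

def convert_example_to_feature(review):
    # same encoding on newer transformers versions (pad_to_max_length is deprecated)
    return tokenizer.encode_plus(
        review,
        add_special_tokens=True,
        max_length=160,
        truncation=True,        # truncate sequences longer than max_length
        padding='max_length',   # pad shorter sequences up to max_length
        return_token_type_ids=True,
        return_attention_mask=True,
    )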
def map_example_to_dict(input_ids, attention_mask, token_type_ids, label):
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids
    }, label
def encode_examples(ds):
    # prepare lists, so that we can build up the final TensorFlow dataset from slices
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    for index, row in tqdm(ds.iterrows()):
        bert_input = convert_example_to_feature(row['content'])
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([row['sentiment']])
    return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)
df_train, df_test = train_test_split(df, test_size=0.1)
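The post never shows how ds_train_encoded is built from these helpers; a minimal sketch consistent with the code above (the shuffle buffer and batch size are my assumptions):

# encode, shuffle and batch the training split; encode and batch the test split
ds_train_encoded = encode_examples(df_train).shuffle(10000).batch(32)
ds_test_encoded = encode_examples(df_test).batch(32)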
Creating the model:
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy()
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
history = model.fit(ds_train_encoded, epochs=1)
14/443 [..............................] - ETA: 3:58 - loss: nan - accuracy: 0.3438
If I change the number of sentiments so that there are only positive and negative, it works.
But using 3 or more labels produces this problem.
Answer: label class indices should start from 0, not 1.
TFBertForSequenceClassification requires labels in the range [0,1,...]
labels (tf.Tensor of shape (batch_size,), optional, defaults to None) – Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), if config.num_labels > 1 a classification loss is computed (Cross-Entropy).
Source: https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification
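The quoted requirement depends on config.num_labels, which defaults to 2 for TFBertForSequenceClassification, so with three sentiment classes the label 2 falls outside [0, config.num_labels - 1] and the cross-entropy loss turns NaN. A minimal sketch of the corresponding fix, assuming three classes (num_labels is the standard from_pretrained keyword):

# declare three output classes so that labels 0, 1 and 2 are all valid targets
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=3)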