BERT prediction shape not equal to num_samples

I am trying to use BERT for text classification. Below is the code I am using. The model training code (below) works fine, but I am facing an issue with the prediction part.

from transformers import TFBertForSequenceClassification
import tensorflow as tf

# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 5e-5
nlabels = 26

# we will do just 1 epoch for illustration, though multiple epochs might be better as long as we do not overfit the model
number_of_epochs = 1


# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=nlabels,
                                                      output_attentions=False,
                                                      output_hidden_states=False)

# optimizer Adam
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)

# since we do not have one-hot vectors, we can use sparse categorical cross entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

bert_history = model.fit(ds_tr_encoded, epochs=number_of_epochs)

I am getting the output using the following:

import numpy as np

preds = model.predict(ds_te_encoded)
pred_labels_idx = np.argmax(preds['logits'], axis=1)

The problem I am facing is that the shape of pred_labels_idx is not the same as that of ds_te_encoded:
len(pred_labels_idx) #426820
tf.data.experimental.cardinality(ds_te_encoded) #<tf.Tensor: shape=(), dtype=int64, numpy=21341>

Not sure why this is happening.

Since ds_te_encoded is of type tf.data.Dataset and you call cardinality(...), the cardinality in your case is simply the (rounded) number of *batches*, not the number of samples. So I assume you are using a batch size of 20, since 426820 / 20 = 21341. That is probably the source of the confusion.
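A minimal sketch that reproduces the numbers from the question, assuming the test set was batched with batch_size=20 (that batch size is my inference from 426820 / 21341, not something stated in your code):

```python
import tensorflow as tf

# Toy stand-in for ds_te_encoded: 426820 "samples" batched by 20.
n_samples, batch_size = 426820, 20
ds = tf.data.Dataset.range(n_samples).batch(batch_size)

# cardinality(...) counts batches, not samples:
n_batches = int(tf.data.experimental.cardinality(ds))  # 21341

# To count the actual samples, sum the size of each batch:
n_total = int(ds.reduce(0, lambda acc, b: acc + tf.shape(b)[0]))  # 426820
```

So len(pred_labels_idx) (one prediction per sample) and cardinality(...) (one count per batch) will only agree when the dataset is unbatched.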