Inputting some data for a BERT model using tf.data.Dataset.from_tensor_slices

This is my model:

def build_classifier_model():
    # Batch of raw strings; the TF-Hub preprocessing layer does the tokenization
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='features')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']  # pooled sentence-level embedding
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(3, activation="softmax", name='classifier')(net)
    return tf.keras.Model(text_input, net)

For the preprocessing layer, I use the BERT preprocessor from TF-Hub.
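For completeness, here are the imports this code assumes, together with a matching preprocessor/encoder pair. The exact handles are not shown in the question, so the two below (taken from the TF-Hub BERT fine-tuning tutorial) are just one valid choice:

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Assumed handles: any matching preprocessor/encoder pair from TF-Hub works.
tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
tfhub_handle_encoder = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"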

I have already split the data into corpus_train, corpus_test, labels_train, labels_test (a sketch of such a split is shown below). The corpus is a pandas Series (a column of a DataFrame) containing the text that will be used as features, and the labels are a NumPy array.

corpus = df_speech_EN_merged["contents"]
corpus.shape
(1768,)

labels = np.asarray(df_speech_EN_merged["Classes"].astype("int"))
labels.shape
(1768,)
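The split itself is not shown above; for context, a call such as scikit-learn's train_test_split (hypothetical here, not from the question) would produce those four objects:

from sklearn.model_selection import train_test_split

# Hypothetical split call: corpus is a pandas Series of strings,
# labels a NumPy array of ints.
corpus_train, corpus_test, labels_train, labels_test = train_test_split(
    corpus, labels, test_size=0.2, stratify=labels, random_state=42)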

To create the training and test datasets, I used the following:

train_dataset = tf.data.Dataset.from_tensor_slices(
    {
        "features": tf.cast(corpus_train.values, tf.string),
        "labels": tf.cast(labels_train, tf.int32)  # labels is already an array, no need for .values
    }
)

test_dataset = tf.data.Dataset.from_tensor_slices(
    {
        "features": tf.cast(corpus_test.values, tf.string),
        "labels": tf.cast(labels_test, tf.int32)  # labels is already an array, no need for .values
    }
)
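For what it's worth, the resulting dataset yields one unbatched example at a time, i.e. scalar tensors:

print(train_dataset.element_spec)
# {'features': TensorSpec(shape=(), dtype=tf.string, name=None),
#  'labels': TensorSpec(shape=(), dtype=tf.int32, name=None)}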

The model builds and compiles without any error messages, but when I fit it with the following arguments:

classifier_model.fit(x=train_dataset,
                     validation_data=test_dataset,
                     epochs=2)

I get the following error:

ValueError: Could not find matching function to call loaded from the SavedModel. Got:
      Positional arguments (3 total):
        * Tensor("inputs:0", shape=(), dtype=string)
        * False
        * None
      Keyword arguments: {}

Expected these arguments to match one of the following 4 option(s):

Option 1:
  Positional arguments (3 total):
    * TensorSpec(shape=(None,), dtype=tf.string, name='sentences')
    * False
    * None
  Keyword arguments: {}

Option 2:
  Positional arguments (3 total):
    * TensorSpec(shape=(None,), dtype=tf.string, name='sentences')
    * True
    * None
  Keyword arguments: {}

Option 3:
  Positional arguments (3 total):
    * TensorSpec(shape=(None,), dtype=tf.string, name='inputs')
    * False
    * None
  Keyword arguments: {}

Option 4:
  Positional arguments (3 total):
    * TensorSpec(shape=(None,), dtype=tf.string, name='inputs')
    * True
    * None
  Keyword arguments: {}

I think this error occurs either because I am building train_dataset/test_dataset incorrectly, or because the text_input layer is being fed the wrong kind of data. Any help would be greatly appreciated.

When using tf.data.Dataset.from_tensor_slices, make sure to supply a batch size, because the BERT preprocessing layer expects a very specific shape: the error message shows that the model received a scalar string (shape ()), while all four accepted signatures expect a batch of strings (shape (None,)). Calling .batch() on the dataset adds that missing dimension. Note also that the dataset below yields (features, labels) tuples rather than a single dict, so Keras knows which tensor is the target. Here is a simplified working example based on the BERT model used in this tutorial and the specifics of your question:

def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='features')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(3, activation="softmax", name='classifier')(net)
    return tf.keras.Model(text_input, net)

sentences = tf.constant([
    "Improve the physical fitness of your goldfish by getting him a bicycle",
    "You are unsure whether or not to trust him but very thankful that you wore a turtle neck",
    "Not all people who wander are lost",
    "There is a reason that roses have thorns",
    "Charles ate the french fries knowing they would be his last meal",
    "He hated that he loved what she hated about hate",
])

labels = tf.random.uniform((6, ), minval=0, maxval=2, dtype=tf.dtypes.int32)

classifier_model = build_classifier_model()
classifier_model.compile(optimizer=tf.keras.optimizers.Adam(),
                         loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                         metrics=tf.keras.metrics.SparseCategoricalAccuracy())
BATCH_SIZE = 1
train_dataset = (tf.data.Dataset.from_tensor_slices((sentences, labels))
                 .shuffle(sentences.shape[0])
                 .batch(BATCH_SIZE))

classifier_model.fit(x=train_dataset, epochs=2)
Epoch 1/2
6/6 [==============================] - 7s 446ms/step - loss: 2.4348 - sparse_categorical_accuracy: 0.5000
Epoch 2/2
6/6 [==============================] - 3s 447ms/step - loss: 1.3977 - sparse_categorical_accuracy: 0.5000
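Applied to the data from the question, the same fix would look roughly like this (a sketch, assuming corpus_train, corpus_test, labels_train and labels_test as defined above; the batch size of 32 is arbitrary):

BATCH_SIZE = 32  # arbitrary; 1 above was only to keep the demo small

train_dataset = (tf.data.Dataset.from_tensor_slices(
                     (tf.cast(corpus_train.values, tf.string),
                      tf.cast(labels_train, tf.int32)))
                 .shuffle(len(corpus_train))
                 .batch(BATCH_SIZE))

test_dataset = (tf.data.Dataset.from_tensor_slices(
                    (tf.cast(corpus_test.values, tf.string),
                     tf.cast(labels_test, tf.int32)))
                .batch(BATCH_SIZE))

classifier_model.fit(x=train_dataset, validation_data=test_dataset, epochs=2)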