Problem with inputs when building a model with TFBertModel and AutoTokenizer from HuggingFace's transformers

Question

I am trying to build the model shown in the figure:

I obtained the pretrained BERT and the corresponding tokenizer from HuggingFace's transformers as follows:

from transformers import AutoTokenizer, TFBertModel
model_name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert = TFBertModel.from_pretrained(model_name)

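The model will be fed a series of Italian tweets and needs to determine whether they are ironic.

I am having trouble building the initial part of the model, which takes the input and feeds it to the tokenizer to obtain a representation that can be passed to BERT.

I can do it outside the model-building context: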

my_phrase = "Ciao, come va?"
# an equivalent version is tokenizer(my_phrase, other parameters)
bert_input = tokenizer.encode(my_phrase, add_special_tokens=True, return_tensors='tf', max_length=110, padding='max_length', truncation=True) 
attention_mask = bert_input > 0
outputs = bert(bert_input, attention_mask)['pooler_output']
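
A note on the `bert_input > 0` line above: it only yields a correct attention mask if the tokenizer's padding id is 0. The sketch below illustrates that with made-up token ids (the ids and `PAD_ID` are assumptions for illustration, not values taken from this tokenizer). Calling the tokenizer directly instead of `encode` returns an `attention_mask` entry, which avoids relying on the padding id.

```python
import numpy as np

PAD_ID = 0  # assumption: the tokenizer pads with id 0

# Hypothetical token ids for two sentences, already padded to length 6.
input_ids = np.array(
    [[102, 7, 8, 103, 0, 0],
     [102, 9, 103, 0, 0, 0]],
    dtype=np.int32,
)

# 1 marks a real token, 0 marks padding.
attention_mask = (input_ids != PAD_ID).astype(np.int32)
print(attention_mask)
```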

But I run into problems when building a model that does this. Below is the code for building such a model (the problem is in the first 4 lines):

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  encoder_inputs = tokenizer(text_input, return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)
  outputs = bert(encoder_inputs)
  net = outputs['pooler_output']
  
  X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(net)
  X = tf.keras.layers.Concatenate(axis=-1)([X, input_layer])
  X = tf.keras.layers.MaxPooling1D(20)(X)
  X = tf.keras.layers.SpatialDropout1D(0.4)(X)
  X = tf.keras.layers.Flatten()(X)
  X = tf.keras.layers.Dense(128, activation="relu")(X)
  X = tf.keras.layers.Dropout(0.25)(X)
  X = tf.keras.layers.Dense(2, activation='softmax')(X)

  model = tf.keras.Model(inputs=text_input, outputs = X) 
  
  return model

When I call the function that creates this model, I get this error:

text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

One thing that came to mind was that maybe I have to use the tokenizer.batch_encode_plus function, which works with lists of strings:

class BertPreprocessingLayer(tf.keras.layers.Layer):
  def __init__(self, tokenizer, maxlength):
    super().__init__()
    self._tokenizer = tokenizer
    self._maxlength = maxlength
  
  def call(self, inputs):
    print(type(inputs))
    print(inputs)
    tokenized = tokenizer.batch_encode_plus(inputs, add_special_tokens=True, return_tensors='tf', max_length=self._maxlength, padding='max_length', truncation=True)
    return tokenized

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  encoder_inputs = BertPreprocessingLayer(tokenizer, 100)(text_input)
  outputs = bert(encoder_inputs)
  net = outputs['pooler_output']
  # ... same as above

But I get this error:

batch_text_or_text_pairs has to be a list (got <class 'keras.engine.keras_tensor.KerasTensor'>)

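Apart from the fact that a quick Google search did not turn up a way to convert that tensor to a list, it seems odd that I would have to go in and out of TensorFlow this way.

I also looked at HuggingFace's documentation, but there is only one usage example, with a single phrase, and what they do is analogous to my "outside the model-building context" example.

Edit:

I also tried with Lambdas, like this: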

tf.executing_eagerly()

def tokenize_tensor(tensor):
  t = tensor.numpy()
  t = np.array([str(s, 'utf-8') for s in t])
  return tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name='text')
  
  encoder_inputs = tf.keras.layers.Lambda(tokenize_tensor, name='tokenize')(text_input)
  ...
  
  outputs = bert(encoder_inputs)

But I get the following error:

'Tensor' object has no attribute 'numpy'

Edit 2:

I also tried the approach suggested by @mdaoust of wrapping everything in a tf.py_function, and got this error.

def py_func_tokenize_tensor(tensor):
  return tf.py_function(tokenize_tensor, [tensor], Tout=[tf.int32, tf.int32, tf.int32])

eager_py_func() missing 1 required positional argument: 'Tout'

Then I defined Tout as the type of the value returned by the tokenizer:

transformers.tokenization_utils_base.BatchEncoding

and got the following error:

Expected DataType for argument 'Tout' not <class 'transformers.tokenization_utils_base.BatchEncoding'>

Finally, I unpacked the values in the BatchEncoding as follows:

def tokenize_tensor(tensor):
  t = tensor.numpy()
  t = np.array([str(s, 'utf-8') for s in t])
  dictionary = tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)
  #unpacking
  input_ids = dictionary['input_ids']
  tok_type = dictionary['token_type_ids']
  attention_mask = dictionary['attention_mask']
  return input_ids, tok_type, attention_mask

and got an error on the line below:

...
outputs = bert(encoder_inputs)

ValueError: Cannot take the length of shape with unknown rank.

Answer 1

text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

Solution to the error above:

Just use text_input = 'text'

instead of

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')

Answer 2

It looks like this is not TensorFlow compatible.

https://huggingface.co/dbmdz/bert-base-italian-xxl-cased#model-weights

Currently only PyTorch-Transformers compatible weights are available. If you need access to TensorFlow checkpoints, please raise an issue!

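But keep in mind that some things are easier if you don't use Keras's functional model API. That is what got <class 'keras.engine.keras_tensor.KerasTensor'> is complaining about.

Try passing a tf.Tensor to see if that works. What happens when you try: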

text_input = tf.constant('text')

Try writing your model as a subclass of Model.

Answer 3

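Yes, my first answer was wrong.

The problem is that TensorFlow has two kinds of tensors: eager tensors, which have a value, and "symbolic tensors" or "graph tensors", which have no value and are only used to build computations.

Your tokenize_tensor function expects an eager tensor. Only eager tensors have a .numpy() method.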

def tokenize_tensor(tensor):
  t = tensor.numpy()
  t = np.array([str(s, 'utf-8') for s in t])
  return tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)

But a Keras Input is a symbolic tensor:

text_input = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name='text')  
encoder_inputs = tf.keras.layers.Lambda(tokenize_tensor, name='tokenize')(text_input)
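
The eager versus symbolic distinction can be checked directly. The following is a minimal sketch that uses tf.function as a stand-in for the Keras functional API (an assumption made to keep the example small); the names `trace_only` and `traced_has_numpy` are hypothetical.

```python
import tensorflow as tf

# An eager tensor holds a concrete value, so .numpy() is available.
eager_text = tf.constant(["Ciao, come va?"])
eager_has_numpy = hasattr(eager_text, "numpy")

# Inside tf.function the argument is a symbolic (graph) tensor while tracing.
traced_has_numpy = []

@tf.function
def trace_only(text):
    # This Python code runs once, while the graph is being built.
    traced_has_numpy.append(hasattr(text, "numpy"))
    return text

trace_only(eager_text)
print(eager_has_numpy, traced_has_numpy)
```

Any function that calls .numpy() on its argument, such as tokenize_tensor, will therefore fail when it is handed the symbolic tensor.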

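To fix this, you can use tf.py_function. It works in graph mode, and will call the wrapped function with eager tensors when the graph is executed, instead of passing it graph tensors while the graph is being constructed.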

def py_func_tokenize_tensor(tensor):
  return tf.py_function(tokenize_tensor, [tensor])

...

encoder_inputs = tf.keras.layers.Lambda(py_func_tokenize_tensor, name='tokenize')(text_input)
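
The question's Edit 2 reports two further errors with this approach: a missing Tout argument, and "Cannot take the length of shape with unknown rank". The sketch below shows one way these are commonly handled: pass Tout as a list of tensor dtypes, and restore the static shape with set_shape, since tf.py_function outputs carry no shape information. It is not a confirmed fix for the asker's model. It runs the wrapped function through tf.data's map (which executes in graph mode) rather than a Lambda layer, and `toy_tokenize` is a hypothetical stand-in for the HuggingFace tokenizer so that nothing needs to be downloaded.

```python
import numpy as np
import tensorflow as tf

MAX_LEN = 8  # assumed fixed sequence length for this sketch

def toy_tokenize(texts):
    """Hypothetical stand-in for the real tokenizer: one fake id per word, 0 = padding."""
    ids = np.zeros((len(texts), MAX_LEN), dtype=np.int32)
    for row, text in enumerate(texts):
        words = text.split()[:MAX_LEN]
        ids[row, :len(words)] = [len(word) for word in words]
    mask = (ids > 0).astype(np.int32)
    return ids, mask

def tokenize_eager(text_tensor):
    # Called by tf.py_function with an eager tensor, so .numpy() works here.
    texts = [s.decode("utf-8") for s in text_tensor.numpy()]
    return toy_tokenize(texts)

def tokenize_graph(text_tensor):
    # Tout lists one TensorFlow dtype per returned tensor.
    ids, mask = tf.py_function(tokenize_eager, [text_tensor], Tout=[tf.int32, tf.int32])
    # py_function outputs have unknown rank; declare the static shape explicitly.
    ids.set_shape([None, MAX_LEN])
    mask.set_shape([None, MAX_LEN])
    return ids, mask

sentences = ["Ciao, come va?", "Tutto bene"]
dataset = tf.data.Dataset.from_tensor_slices(sentences).batch(2).map(tokenize_graph)
ids, mask = next(iter(dataset))
print(ids.shape, mask.shape)
```

With the real tokenizer, tokenize_eager would return the unpacked input_ids, token_type_ids and attention_mask tensors, and Tout would list three dtypes accordingly.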

Answer 4

For now I solved the problem by taking the tokenization step out of the model:

def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [],[],[]
    for sentence in sentences:
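        # padding='max_length' replaces the deprecated pad_to_max_length; truncation keeps every row at 128 tokens
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=128, padding='max_length', truncation=True, return_attention_mask=True, return_token_type_ids=True)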
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])        
        
    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')

The model takes two inputs, which are the first two values returned by the tokenize function.

def build_classifier_model():
   input_ids_in = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
   input_masks_in = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32') 

   embedding_layer = bert(input_ids_in, attention_mask=input_masks_in)[0]
...
   model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)

   for layer in model.layers[:3]:
     layer.trainable = False
   return model

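I would still like to know whether anyone has a solution for integrating the tokenization step into the model-building context, so that a user of the model can simply feed it phrases to get a prediction or to train the model.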
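
Short of putting the tokenizer inside the Keras graph, one workaround is to bundle the tokenization function and the model behind a small plain-Python wrapper, so that callers only ever pass raw strings. This is a minimal sketch of that idea and does not integrate tokenization into the model itself. `SentenceClassifier`, `fake_tokenize` and `fake_predict` are hypothetical names; the two fakes stand in for the tokenize function above and for the model's predict method so the example runs without a model.

```python
import numpy as np

class SentenceClassifier:
    """Pairs a tokenizing function with a predict function so callers pass raw strings."""

    def __init__(self, tokenize_fn, predict_fn):
        self._tokenize_fn = tokenize_fn
        self._predict_fn = predict_fn

    def predict(self, sentences):
        input_ids, input_masks = self._tokenize_fn(sentences)
        return self._predict_fn([input_ids, input_masks])

# Hypothetical stand-ins so the sketch runs without downloading a model.
def fake_tokenize(sentences, max_length=6):
    ids = np.zeros((len(sentences), max_length), dtype="int32")
    for row, sentence in enumerate(sentences):
        words = sentence.split()[:max_length]
        ids[row, :len(words)] = np.arange(1, len(words) + 1)
    return ids, (ids > 0).astype("int32")

def fake_predict(inputs):
    input_ids, input_masks = inputs
    # One "score" per sentence: the number of non-padding tokens.
    return input_masks.sum(axis=1)

classifier = SentenceClassifier(fake_tokenize, fake_predict)
scores = classifier.predict(["Ciao, come va?", "Tutto bene"])
print(scores)
```

In real use, tokenize_fn could be something like `lambda s: tokenize(s, tokenizer)[:2]` and predict_fn the Keras model's predict method.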

Tags: keras, tensorflow, bert-language-model, huggingface-transformers, huggingface-tokenizers