TensorFlow 2.0, Hugging Face Transformers, TFBertForSequenceClassification: unexpected output dimensions in inference
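Summary:
I want to fine-tune BERT for sentence classification on a custom dataset. I have followed some examples I found, such as this one, which was very helpful. I have also looked at this gist.
The problem is that when I run inference on some samples, the output dimensions are not what I expect.
When I run inference on 23 samples, I get a tuple containing a numpy array of shape (1472, 42), where 42 is the number of classes. I would expect shape (23, 42).
Code and other details:
I run inference on the trained model using Keras like this:
preds = model.predict(features)
where features is tokenized and converted to a Dataset: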
for sample, ground_truth in tests:
    test_examples.append(InputExample(text=sample, category_index=ground_truth))
features = convert_examples_to_tf_dataset(test_examples, tokenizer)
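where sample can be, for example, "A test sentence I want classified" and ground_truth can be, for example, 12, which is the encoded label. Since I am only running inference, the ground truth I supply of course does not matter.
The convert_examples_to_tf_dataset function looks as follows (I found it in this gist):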
def convert_examples_to_tf_dataset(
    examples: List[Tuple[str, int]],
    tokenizer,
    max_length=64,
):
    """
    Loads data into a tf.data.Dataset for finetuning a given model.
    Args:
        examples: List of tuples representing the examples to be fed
        tokenizer: Instance of a tokenizer that will tokenize the examples
        max_length: Maximum string length
    Returns:
        a ``tf.data.Dataset`` containing the condensed features of the provided sentences
    """
    features = []  # -> will hold InputFeatures to be converted later
    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text,
            add_special_tokens=True,
            max_length=max_length,  # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            pad_to_max_length=True,  # pads to the right by default
        )
        # input ids = token indices in the tokenizer's internal dict
        # token_type_ids = binary mask identifying different sequences in the model
        # attention_mask = binary mask indicating the positions of padded tokens so the model does not attend to them
        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])
        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.category_index
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )

with tf.device('/cpu:0'):
    train_data = convert_examples_to_tf_dataset(train_examples, tokenizer)
    train_data = train_data.shuffle(buffer_size=len(train_examples), reshuffle_each_iteration=True) \
        .batch(BATCH_SIZE) \
        .repeat(-1)
    val_data = convert_examples_to_tf_dataset(val_examples, tokenizer)
    val_data = val_data.shuffle(buffer_size=len(val_examples), reshuffle_each_iteration=True) \
        .batch(BATCH_SIZE) \
        .repeat(-1)
It works as I expect: running print(list(features.as_numpy_iterator())[1]) produces the following:
({'input_ids': array([ 101, 11639, 19962, 23288, 13264, 35372, 10410, 102, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0,
0], dtype=int32), 'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
dtype=int32), 'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
dtype=int32)}, 6705)
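So far everything looks as I would expect. The tokenizer seems to be working as it should: 3 arrays of length 64 (which matches the max length I set) and an integer label.
The model has been trained as follows: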
config = BertConfig.from_pretrained(
'bert-base-multilingual-cased',
num_labels=len(label_encoder.classes_),
output_hidden_states=False,
output_attentions=False
)
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', config=config)
# train_data is then a tf.data.Dataset we can pass to model.fit()
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-05, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')
model.compile(optimizer=optimizer,
loss=loss,
metrics=[metric])
model.summary()
history = model.fit(train_data,
epochs=EPOCHS,
steps_per_epoch=train_steps,
validation_data=val_data,
validation_steps=val_steps,
shuffle=True,
)
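Results
The problem now is that when running prediction, preds = model.predict(features), the output dimensions do not match what the documentation says: logits (Numpy array or tf.Tensor of shape (batch_size, config.num_labels)). What I get is a tuple containing a numpy array with shape (1472, 42).
42 makes sense, as that is my number of classes. I sent 23 samples for testing, and 23 x 64 = 1472. 64 is my max sentence length, so that looks familiar. Is this output incorrect? How can I convert it into an actual class prediction for each input sample? I get 1472 predictions where I would expect 23.
Please let me know if I can provide more details that would help solve this.
Here is my own example, where I predict on 3 text samples and get (3, 42) as the output shape: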
### imports
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, BertConfig, TFBertForSequenceClassification

### define model
config = BertConfig.from_pretrained(
    'bert-base-multilingual-cased',
    num_labels=42,
    output_hidden_states=False,
    output_attentions=False
)
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', config=config)
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-05, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=[metric])

### import tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

### utility functions for text encoding
def return_id(str1, str2, length):
    inputs = tokenizer.encode_plus(str1, str2,
                                   add_special_tokens=True,
                                   max_length=length)
    input_ids = inputs["input_ids"]
    input_masks = [1] * len(input_ids)
    input_segments = inputs["token_type_ids"]
    # pad all three lists on the right up to the requested length
    padding_length = length - len(input_ids)
    padding_id = tokenizer.pad_token_id
    input_ids = input_ids + ([padding_id] * padding_length)
    input_masks = input_masks + ([0] * padding_length)
    input_segments = input_segments + ([0] * padding_length)
    return [input_ids, input_masks, input_segments]

### encode 3 sentences
input_ids, input_masks, input_segments = [], [], []
for instance in ['hello hello', 'ciao ciao', 'marco marco']:
    ids, masks, segments = \
        return_id(instance, None, 100)
    input_ids.append(ids)
    input_masks.append(masks)
    input_segments.append(segments)

# each array has shape (3, 100): one row per sentence
input_ = [np.asarray(input_ids, dtype=np.int32),
          np.asarray(input_masks, dtype=np.int32),
          np.asarray(input_segments, dtype=np.int32)]

### make prediction
model.predict(input_).shape  # ===> (3,42)
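The question also asks how to turn the output into one class prediction per sample. Once the logits have shape (num_samples, num_labels), a minimal sketch is to take the argmax over the last axis. The random logits below are only a stand-in for the real model.predict output, and the names pred_ids and probs are illustrative rather than part of the original code.

```python
import numpy as np

# Stand-in for the logits from model.predict(...): 3 samples, 42 classes.
# If predict returns a tuple, the logits are its first element.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 42))

# One predicted class index per sample.
pred_ids = np.argmax(logits, axis=-1)

# Optional: numerically stable softmax to get per-class probabilities.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(pred_ids.shape)  # one entry per sample
# Assumption: if the labels were encoded with a scikit-learn LabelEncoder,
# label_encoder.inverse_transform(pred_ids) would map them back to names.
```

Softmax does not change which index is largest, so the argmax can be taken on the raw logits directly.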
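I found the problem: if you get unexpected dimensions when using a TensorFlow dataset (tf.data.Dataset), it may be because you did not run .batch.
So in my case, after:
features = convert_examples_to_tf_dataset(test_examples, tokenizer)
adding:
features = features.batch(BATCH_SIZE)
makes this work as I expected. So this is not a problem related to TFBertForSequenceClassification; it was simply caused by my input being incorrect. I also want to add a reference to this answer, which led me to the problem.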
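To see the effect of .batch without loading a model, the element shapes of the dataset can be inspected directly. The sketch below uses all-zero dummy arrays in place of the tokenized features, and BATCH_SIZE = 8 is an assumed value. Without batching, each element has shape (64,), which Keras presumably treats as a batch of 64 inputs, which would explain 23 x 64 = 1472 output rows.

```python
import numpy as np
import tensorflow as tf

NUM_SAMPLES, MAX_LENGTH = 23, 64
BATCH_SIZE = 8  # assumed value, not taken from the original post

# Dummy stand-in for the tokenized features: zeros of the right shape.
dummy = {
    "input_ids": np.zeros((NUM_SAMPLES, MAX_LENGTH), dtype=np.int32),
    "attention_mask": np.zeros((NUM_SAMPLES, MAX_LENGTH), dtype=np.int32),
    "token_type_ids": np.zeros((NUM_SAMPLES, MAX_LENGTH), dtype=np.int32),
}
labels = np.zeros((NUM_SAMPLES,), dtype=np.int64)

# One element per sample, like the dataset built by the generator.
unbatched = tf.data.Dataset.from_tensor_slices((dummy, labels))
batched = unbatched.batch(BATCH_SIZE)

unbatched_shape = unbatched.element_spec[0]["input_ids"].shape.as_list()
batched_shape = batched.element_spec[0]["input_ids"].shape.as_list()
print(unbatched_shape)  # no leading batch dimension
print(batched_shape)    # leading batch dimension added

# Total rows across all batches equals the number of samples.
rows = sum(int(x["input_ids"].shape[0]) for x, _ in batched)
print(rows)
```

The leading dimension added by .batch is what the model reads as batch_size, so the prediction output then has one row per sample.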