Are these normal speeds for BERT pretrained model inference in PyTorch?
I am testing the BERT base and distilled BERT models from Huggingface under 4 speed scenarios, with batch_size = 1:
1) bert-base-uncased: 154ms per request
2) bert-base-uncased with quantization: 94ms per request
3) distilbert-base-uncased: 86ms per request
4) distilbert-base-uncased with quantization: 69ms per request
I am using IMDB reviews as the experimental data and set max_length=512, so the inputs are fairly long. The CPU info on Ubuntu 18.04 is as follows:
cat /proc/cpuinfo | grep 'name'| uniq
model name : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
The machine also has 3 GPUs available:
Tesla V100-SXM2
This seems slow for a real-time application. Are these speeds normal for the BERT base model?
The test code is as follows:
import time
from datetime import timedelta

import pandas as pd
import torch
import torch.quantization
from transformers import AutoTokenizer, AutoModel, DistilBertTokenizer, DistilBertModel

def get_embedding(model, tokenizer, text):
    # Tokenize one text and run a forward pass.
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model(**inputs)
    # Last hidden state of the single sequence in the batch: (seq_len, hidden_size).
    output_tensors = outputs[0][0]
    output_numpy = output_tensors.detach().numpy()
    # Keep the embedding of the first ([CLS]) token.
    embedding = output_numpy.tolist()[0]
    return embedding

def process_text(model, tokenizer, text_lines):
    for index, line in enumerate(text_lines):
        embedding = get_embedding(model, tokenizer, line)
        if index % 100 == 0:
            print('Current index: {}'.format(index))

if __name__ == "__main__":
    df = pd.read_csv('../data/train.csv', sep='\t')
    df = df.head(1000)
    text_lines = df['review']
    text_line_count = len(text_lines)
    print('Text size: {}'.format(text_line_count))

    start = time.time()

    # 1) bert-base-uncased
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    process_text(model, tokenizer, text_lines)
    end = time.time()
    print('Total time spent with bert base: {}'.format(str(timedelta(seconds=end - start))))

    # 2) bert-base-uncased with dynamic quantization
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)
    end2 = time.time()
    print('Total time spent with bert base quantization: {}'.format(str(timedelta(seconds=end2 - end))))

    # 3) distilbert-base-uncased
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    process_text(model, tokenizer, text_lines)
    end3 = time.time()
    print('Total time spent with distilbert: {}'.format(str(timedelta(seconds=end3 - end2))))

    # 4) distilbert-base-uncased with dynamic quantization
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)
    end4 = time.time()
    print('Total time spent with distilbert quantization: {}'.format(str(timedelta(seconds=end4 - end3))))
EDIT: Based on the suggestions, I changed the code to the following:
inputs = tokenizer(text_batch, padding=True, return_tensors="pt")
outputs = model(**inputs)
where text_batch is a list of texts passed as input.
No, you can speed it up.
First of all, why are you testing with a batch size of 1? Both the tokenizer and the model accept batched input. Basically, you can pass a 2D array/list where each row contains a single sample. See the tokenizer documentation: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__ The same applies to the model.
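As a quick illustration (a sketch, not part of the original answer; the 32-review slice and [CLS] pooling are assumptions), a single batched call reusing the question's tokenizer, model and text_lines could look like this:
import torch

# A plain Python list of texts forms one batch.
text_batch = list(text_lines[:32])
inputs = tokenizer(text_batch, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# outputs[0] now has shape (batch_size, seq_len, hidden_size), so the
# single-example indexing outputs[0][0] becomes a per-row slice, e.g.
# the first-token ([CLS]) embedding of every sample in the batch:
cls_embeddings = outputs[0][:, 0, :]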
Moreover, even if you use a batch size larger than 1, your for loop still processes the batches sequentially. You could build a test dataset and then use the Trainer class with trainer.predict().
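A rough sketch of that approach, reusing the tokenizer, model and text_lines from the question; the ReviewDataset wrapper, the output_dir and the batch size are illustrative choices, not from the original post:
import torch
from transformers import Trainer, TrainingArguments

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps pre-tokenized encodings so Trainer can batch them."""
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        return {key: tensor[idx] for key, tensor in self.encodings.items()}
    def __len__(self):
        return len(self.encodings["input_ids"])

# Tokenize all reviews up front, padded to a common length.
encodings = tokenizer(list(text_lines), padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
dataset = ReviewDataset(encodings)

args = TrainingArguments(output_dir="bert_eval_tmp", per_device_eval_batch_size=32)
trainer = Trainer(model=model, args=args)
result = trainer.predict(dataset)
# result.predictions holds the model outputs gathered over the whole dataset
# (for a bare AutoModel, that is the hidden states rather than class logits).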
See also my discussion on the HF forum: https://discuss.huggingface.co/t/urgent-trainer-predict-and-model-generate-creates-totally-different-predictions/3426