Jetson NX: optimizing a TensorFlow model with TensorRT

Question

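I am trying to speed up a segmentation model (unet-mobilenet-512x512). I converted my TensorFlow model to TensorRT using FP16 precision mode, and the speed is lower than I expected. Before optimization I got 7 FPS when running inference with the frozen .pb graph. After the TensorRT optimization I get 14 FPS.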

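Here are the Jetson NX benchmark results from their website.
You can see that unet 256x256 segmentation runs at 146 FPS. I assumed that my unet 512x512 would be 4 times slower in the worst case.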
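
As a rough sketch of that estimate: it assumes throughput scales inversely with the number of input pixels, which is only an approximation, and the benchmark UNet may also differ from my unet-mobilenet. The helper name `scaled_fps` is just for illustration.

```python
def scaled_fps(reference_fps, reference_size, target_size):
    """Estimate FPS, assuming throughput is inversely proportional to pixel count."""
    ref_pixels = reference_size[0] * reference_size[1]
    target_pixels = target_size[0] * target_size[1]
    return reference_fps * ref_pixels / target_pixels


# 512x512 has 4x the pixels of 256x256, so the 146 FPS figure is divided by 4.
estimate = scaled_fps(146, (256, 256), (512, 512))
print(f"{estimate:.1f} FPS")  # prints "36.5 FPS"
```

So under this assumption the benchmark would suggest roughly 36 FPS, well above the 14 FPS I measure.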

Here is my code for optimizing the TensorFlow saved model with TensorRT:

import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt
import tensorflow as tf

params = trt.DEFAULT_TRT_CONVERSION_PARAMS
params = params._replace(
    max_workspace_size_bytes=(1<<32))
params = params._replace(precision_mode="FP16")
converter = tf.experimental.tensorrt.Converter(input_saved_model_dir='./model1', conversion_params=params)
converter.convert()

def my_input_fn():
  inp1 = np.random.normal(size=(1, 512, 512, 3)).astype(np.float32)
  yield [inp1]

converter.build(input_fn=my_input_fn)  # Generate corresponding TRT engines
output_saved_model_dir = "trt_graph2"
converter.save(output_saved_model_dir)  # Generated engines will be saved.


print("------------------------freezing the graph---------------------")


from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

saved_model_loaded = tf.saved_model.load(
    output_saved_model_dir, tags=[tf.compat.v1.saved_model.SERVING])
graph_func = saved_model_loaded.signatures[
    tf.compat.v1.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
frozen_func = convert_variables_to_constants_v2(
    graph_func)
frozen_func.graph.as_graph_def()

tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                logdir="./",
                name="unet_frozen_graphTensorRt.pb",
                as_text=False)

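I downloaded the repository that was used for the Jetson NX benchmarks (https://github.com/NVIDIA-AI-IOT/jetson_benchmarks), and the speed of unet256x256 really is ~146 FPS. However, it contains no pipeline for optimizing a model. How can I get similar results? I am looking for a solution that brings my model (unet-mobilenet-512x512) close to 30 FPS.
Maybe I should run inference in some other way (without TensorFlow) or change some conversion parameters?
Any suggestions are welcome, thanks.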

Answer 1

As far as I know, the repository you linked to uses command-line tools that use TensorRT (TRT) under the hood. Note that TensorRT is not the same as "TensorRT in TensorFlow", aka TensorFlow-TensorRT (TF-TRT), which is what you are using in your code. Both TF-TRT and TRT models run faster than regular TF models on a Jetson device, but TF-TRT models still tend to be slower than TRT ones (source 1, source 2).
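
For reference, here is a minimal sketch of what calling such a command-line tool could look like. I am assuming the tool is trtexec, which ships with TensorRT, and that the model has already been exported to ONNX. The file names are placeholders, and the default path is where JetPack usually installs the binary, so it may differ on your system.

```python
def trtexec_command(onnx_path, engine_path, fp16=True,
                    trtexec="/usr/src/tensorrt/bin/trtexec"):
    """Build the argument list for a trtexec run that creates an engine.

    The default trtexec location is an assumption about a typical JetPack install.
    """
    cmd = [trtexec, f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")  # allow FP16 kernels, as in the question
    return cmd


if __name__ == "__main__":
    cmd = trtexec_command("unet.onnx", "unet_fp16.engine")  # placeholder names
    print(" ".join(cmd))
    # On the Jetson itself you would run it with:
    #   import subprocess; subprocess.run(cmd, check=True)
```

trtexec also prints timing statistics for the engine it builds, which makes it convenient for a quick comparison against the TF-TRT numbers.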

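The downside of TRT is that the conversion to TRT needs to be done on the target device, and that it can be quite difficult to get working, because there are various TensorFlow operations that TRT does not support. In that case you need to write a custom plugin, or pray to God that someone on the internet has already done so, or use TensorRT only for part of your model and do the pre-/postprocessing in TensorFlow.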

There are basically two ways to convert a TensorFlow model to a TensorRT "engine", aka "plan file", and both use an intermediate format:

TF -> UFF -> TRT
TF -> ONNX -> TRT

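In both cases the graphsurgeon / onnx-graphsurgeon libraries can be used to modify the TF/ONNX graph so that its operations become compatible. As mentioned above, unsupported operations can be added by means of TensorRT plugins. (This really is the main challenge here: different graph file formats and different target GPUs support different graph operations.)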

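There is also a third way, TF -> Caffe -> TRT, and apparently a fourth one, where you use Nvidia's Transfer Learning Toolkit (TLT) (based upon TF/Keras) and a tool called tlt-converter, but I am not familiar with it. The latter link does mention converting a UNet model, though.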

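Note that the paths involving UFF and Caffe are now deprecated and support will be removed in TensorRT 9.0, so if you want something future-proof you should probably go for ONNX. That said, most of the sample code I have come across online still uses UFF, and TensorRT 9.0 is still some time away.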

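Anyway, I haven't tried converting a UNet to TensorRT yet, but the following repositories offer sample code that might give you an idea of how it works in principle: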

TF -> UFF -> TRT: jkjung-avt/tensorrt_demos, NVIDIA-AI-IOT/tf_to_trt_image_classification (the latter uses a bit of C++)
TF -> ONNX -> TRT: tensorflow-onnx, onnx-tensorrt
Keras -> ONNX -> TRT: Nvidia blog post (this one mentions converting a UNet to TRT!)

Note that even if you don't manage to get the ONNX to TRT conversion working for your model, running inference with the ONNX Runtime could potentially still give you a performance gain, especially when you're using the CUDA or the TensorRT execution provider, which will be enabled automatically provided you're on a Jetson device and running the correct ONNXRuntime build. (I'm not sure how it compares to TF-TRT or TRT, though, but it might still be worth a shot.)
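
A minimal sketch of the ONNX Runtime route, assuming an ONNX Runtime build with GPU support is installed on the Jetson. More recent ONNX Runtime releases expect the execution providers to be passed explicitly, so the sketch requests them in order of preference and falls back to the CPU. The helper names (`pick_providers`, `make_session`, `run_once`) are mine, not part of the library.

```python
# Execution providers in order of preference, fastest first.
PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that this ONNX Runtime build actually offers."""
    return [p for p in preferred if p in available]


def make_session(onnx_path):
    """Create an inference session with the best providers available."""
    import onnxruntime as ort  # imported lazily so the helper above has no dependency

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(onnx_path, providers=providers)


def run_once(session, image):
    """Run one forward pass; `image` is a float32 array shaped like the model input."""
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: image})
```

For the model in the question, `image` would be a float32 array of shape (1, 512, 512, 3), assuming the ONNX export keeps the original input layout.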

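Finally, for the sake of completeness, let me mention that at least my team has been toying with the idea of switching from TF to PyTorch, partly because the Nvidia support has been getting a lot better lately and Nvidia employees seem to gravitate towards PyTorch, too. In particular, there are now two separate ways to convert models to TRT: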

PyTorch -> ONNX -> TRT (used by dusty_nv)
PyTorch -> TRT (via torch2trt). It seems that quite a few Nvidia repositories use this direct conversion.

Answer 2

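Hi, can you share the errors you are getting? It should work with the following steps:

Convert the TensorFlow/Keras model to a .pb file.
Convert the .pb file to the ONNX format.
Create a TensorRT engine.
Run inference from the TensorRT engine.

I am not sure about UNet (I will check), but you may have some operations that ONNX does not support (please share your errors).

Here is an example for ResNet-50.

Convert to .pb: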

# Note: this example uses the TensorFlow 1.x session API with standalone Keras.
import tensorflow as tf
import keras
import keras.backend as K

# Put Keras in inference mode before the model is built.
K.set_learning_phase(0)

def keras_to_pb(model, output_filename, output_node_names):

   """
   This is the function to convert the Keras model to pb.

   Args:
      model: The Keras model.
      output_filename: The output .pb file name.
      output_node_names: The output nodes of the network. If None, then
      the function gets the last layer name as the output node.
   """

   # Get the names of the input and output nodes.
   in_name = model.layers[0].get_output_at(0).name.split(':')[0]

   if output_node_names is None:
       output_node_names = [model.layers[-1].get_output_at(0).name.split(':')[0]]

   sess = keras.backend.get_session()

   # Freeze the graph: convert_variables_to_constants takes a list of output node names.
   frozen_graph_def = tf.graph_util.convert_variables_to_constants(
       sess,
       sess.graph_def,
       output_node_names)

   sess.close()
   wkdir = ''
   tf.train.write_graph(frozen_graph_def, wkdir, output_filename, as_text=False)

   return in_name, output_node_names

# load the ResNet-50 model pretrained on imagenet
model = keras.applications.resnet.ResNet50(include_top=True, weights='imagenet', input_tensor=None, input_shape=None, pooling=None, classes=1000)

# Convert the Keras ResNet-50 model to a .pb file
in_tensor_name, out_tensor_names = keras_to_pb(model, "models/resnet50.pb", None)

Then you need to convert the .pb model to the ONNX format. To do this, you need to install tf2onnx. Example:

python -m tf2onnx.convert  --input /Path/to/resnet50.pb --inputs input_1:0 --outputs probs/Softmax:0 --output resnet50.onnx
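
If you start from a SavedModel directory rather than a frozen .pb file (as in the question, where the model lives in ./model1), tf2onnx can read it directly with --saved-model. Below is a small sketch that assembles the command. The output file name and the opset are placeholders I picked for illustration, not values taken from the question.

```python
import sys


def tf2onnx_command(saved_model_dir, onnx_path, opset=11):
    """Build the tf2onnx command line for converting a SavedModel to ONNX."""
    return [
        sys.executable, "-m", "tf2onnx.convert",
        "--saved-model", saved_model_dir,
        "--output", onnx_path,
        "--opset", str(opset),  # placeholder; pick one your TensorRT version parses
    ]


if __name__ == "__main__":
    cmd = tf2onnx_command("./model1", "unet.onnx")
    print(" ".join(cmd))
    # To actually run the conversion:
    #   import subprocess; subprocess.run(cmd, check=True)
```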

The last step is to create the TensorRT engine from the ONNX file:

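import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

# Note: builder.max_workspace_size and builder.build_cuda_engine belong to the
# older builder API; newer TensorRT releases configure the build through
# builder.create_builder_config() instead.

def build_engine(onnx_path, shape=(1, 224, 224, 3)):
   """
   Create the TensorRT engine from an ONNX file.

   Args:
      onnx_path: Path to the ONNX file.
      shape: Shape of the network input.
   """
   # Equivalent to the literal flag value 1: an explicit-batch network,
   # which the ONNX parser requires.
   explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
   with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network(explicit_batch) as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
       builder.max_workspace_size = 256 << 20  # 256 MiB
       with open(onnx_path, 'rb') as model:
           if not parser.parse(model.read()):
               # Report parser errors, e.g. unsupported operations.
               for i in range(parser.num_errors):
                   print(parser.get_error(i))
               return None
       network.get_input(0).shape = shape
       return builder.build_cuda_engine(network)

def save_engine(engine, file_name):
   """Serialize the engine to a plan file."""
   with open(file_name, 'wb') as f:
       f.write(engine.serialize())

def load_engine(trt_runtime, plan_path):
   """Deserialize an engine from a plan file."""
   with open(plan_path, 'rb') as f:
       engine_data = f.read()
   return trt_runtime.deserialize_cuda_engine(engine_data)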
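
To check whether a converted engine actually gets closer to the 30 FPS target from the question, it helps to time every variant (frozen graph, TF-TRT, TRT engine) in the same way. The sketch below is a generic timing helper, not part of TensorRT. `infer` is a hypothetical zero-argument callable that runs one inference and blocks until the result is available; the warm-up runs keep one-time costs such as engine building or memory allocation out of the measurement.

```python
import time


def measure_fps(infer, n_warmup=10, n_runs=50):
    """Return the average frames per second of `infer` over `n_runs` calls."""
    # Warm-up calls are executed but not timed.
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a call into your model.
    fps = measure_fps(lambda: sum(range(10000)))
    print(f"{fps:.1f} FPS")
```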

I suggest you check out this PyTorch TRT UNet implementation.
