TensorRT

Chieh Tsai

Is the meaning of the "RT" in TensorRT a mystery?

There is no definitive answer.

 

Most people explain that TensorRT stands for Tensor Runtime.

What is TensorRT?

TensorRT is NVIDIA's platform for optimizing deep learning inference, leveraging NVIDIA's GPU libraries.

It can accept models from many deep learning frameworks, including TensorFlow, PyTorch, MXNet, Caffe, and so on.

In general, training a neural network involves both forward and backward propagation.

TensorRT, however, only computes a single pass (i.e., forward propagation), because it only performs inference.

Advantages

1. TensorRT can greatly speed up computation by optimizing algorithms for the target GPU and reorganizing the neural network.

2. TensorRT can run inference without the original training framework, which saves a lot of memory.

How to optimize?

TensorRT Optimization

  • Layer & Tensor Fusion
  • Weight & Activation Precision Calibration
  • Kernel Auto-tuning
  • Dynamic Tensor Memory
  • Multi-Stream Execution

Layer & Tensor Fusion

TensorRT fuses layers and tensors that can be combined (for example, a convolution, bias, and ReLU into a single kernel), which reduces GPU resource usage and the number of memory reads and writes.

Precision Calibration

TensorRT converts the model's math to the specified precision (FP32, FP16, or INT8) for improved latency, throughput, and efficiency.

Of course, lower precision costs a little accuracy, but TensorRT is clever enough to determine, via calibration, where reduced precision is safe to apply.
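As a rough illustration, here is a minimal sketch of how reduced precision is requested through the TensorRT Python API. It assumes a TensorRT 7-era setup and an already-populated network definition; the workspace size and the calibrator are placeholders.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_fp16_engine(builder, network):
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30            # 1 GiB of scratch space for tactic selection
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)      # let TensorRT use FP16 kernels where it is safe
    # For INT8 you would additionally enable the INT8 flag and attach a calibrator
    # (a hypothetical IInt8EntropyCalibrator2 subclass fed with representative data):
    # config.set_flag(trt.BuilderFlag.INT8)
    # config.int8_calibrator = my_calibrator
    return builder.build_engine(network, config)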

Kernel Auto-Tuning

TensorRT automatically selects the best CUDA kernels for the target GPU platform, the algorithm, and the model.

Dynamic Tensor Memory

TensorRT allocates GPU memory for each tensor only while it is in use, which avoids repeated allocations, reduces memory consumption, and improves reuse.

Multi-Stream Execution

A scalable design that processes multiple input streams in parallel, optimized for the underlying GPU.

If model computation speeds up and memory usage drops, you can afford to deploy a more complex (i.e., better-performing) model instead of a cut-down one.

Another secret is asynchronous execution.
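Here is a minimal sketch of what asynchronous execution looks like with the TensorRT Python API and pycuda. It assumes an execution context plus pre-allocated host and device buffers; h_input, h_output, d_input, and d_output are placeholder names.

import pycuda.autoinit            # creates a CUDA context
import pycuda.driver as cuda

def infer_async(context, h_input, h_output, d_input, d_output):
    stream = cuda.Stream()
    # Enqueue the copy in, the inference, and the copy out on one stream;
    # none of these calls block the host.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async_v2(bindings=[int(d_input), int(d_output)],
                             stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()          # wait only when the result is actually needed
    return h_output

With one execution context and one stream per input source, several of these calls can overlap on the GPU, which is what multi-stream execution exploits.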

TensorRT Workflow

You can check the official NVIDIA TensorRT website for the full workflow.

The workflow is split into two parts: a build stage, where the graph is converted and the engine is built, and a deployment stage, where the engine runs inference.

  1. Get the trained model.

  2. Convert the .pb model (e.g., TensorFlow) to UFF format.

  3. Build the TensorRT engine.

  4. Execute TensorRT inference.
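As a sketch of the build stage, the snippet below parses a UFF file and builds an engine with the TensorRT 7-era Python API. The input/output node names and the input shape are placeholders; they depend on your model.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine_from_uff(uff_path):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network()
    parser = trt.UffParser()
    parser.register_input("Input", (3, 300, 300))    # placeholder input name and CHW shape
    parser.register_output("NMS")                    # placeholder output node name
    parser.parse(uff_path, network)
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30
    return builder.build_engine(network, config)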

Important components in TensorRT

  • Path - provides the file paths used by the other components.
  • Model - converts a model to UFF format; this is mainly where customized plugins come into play.
  • Engine - builds and deploys the TensorRT engine (see the sketch after this list).
  • Inference - uses the engine to run inference.
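For the engine and inference components, a typical pattern is to serialize the built engine once and deserialize it at deployment time. A minimal sketch (the file name is a placeholder):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def save_engine(engine, path="model.engine"):
    # Serialize the optimized engine so deployment does not need to rebuild it.
    with open(path, "wb") as f:
        f.write(engine.serialize())

def load_engine(path="model.engine"):
    # Deserialize the engine and create an execution context for inference.
    runtime = trt.Runtime(TRT_LOGGER)
    with open(path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    return engine, engine.create_execution_context()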

 

 

(If your model is trained with TensorFlow, you need to convert it to UFF format.)

There are two ways to convert a TensorFlow model:

  1. The convert-to-uff command-line tool.

  2. The UFF Parser API.

However, the conversion can sometimes be difficult because you have to write a custom plugin. An alternative is to convert the .pb model to ONNX format and then use the ONNX Parser API.
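A minimal sketch of the ONNX route, assuming the .pb model has already been exported to ONNX (for example with the tf2onnx tool) and a TensorRT version with explicit-batch networks; the file path is a placeholder.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine_from_onnx(onnx_path):
    builder = trt.Builder(TRT_LOGGER)
    flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flag)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))       # report unsupported ops or shape issues
            return None
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30
    return builder.build_engine(network, config)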

If your model isn't fully supported by TensorRT, you need to customize a plugin in the config.py file:

convert-to-uff frozen_inference_graph.pb -O NMS -p config.py

* If your model doesn't require a config.py, you can convert the .pb file directly.
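For reference, a config.py passed to convert-to-uff is a small graphsurgeon preprocessing script. The sketch below is modeled on NVIDIA's SSD sample; the namespace, plugin op, and parameters are illustrative placeholders that depend on your model and the plugins you have registered.

import graphsurgeon as gs

# Map an unsupported TensorFlow subgraph onto a TensorRT plugin node
# (all names and parameters here are placeholders).
NMS = gs.create_plugin_node(
    name="NMS",
    op="NMS_TRT",
    shareLocation=1,
    numClasses=91,
)

namespace_plugin_map = {
    "Postprocessor": NMS,     # TensorFlow namespace to collapse into the plugin
}

def preprocess(dynamic_graph):
    # convert-to-uff calls this hook before converting the graph to UFF.
    dynamic_graph.collapse_namespaces(namespace_plugin_map)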

Thanks!!